Module: Gitlab::EncodingHelper
- Extended by:
- EncodingHelper
- Included in:
- AbstractPathValidator, EncodingHelper, Git, Git::Blame, Git::Blob, Git::Commit, Git::Diff, Git::Ref, Git::Repository, Git::Tag, Git::Tree, GitalyClient::BlobService, GitalyClient::CommitService, GitalyClient::ConflictsService, GitalyClient::OperationService, GitalyClient::RefService, GitalyClient::RemoteService, GitalyClient::RepositoryService, GitalyClient::WikiPage, GitalyClient::WikiService, GrapeLogging::Formatters::LogrageWithTimestamp, Search::FoundBlob, Search::Query, MergeRequestContextCommitDiffFile, MergeRequestDiffFile, NamespacePathValidator, ProjectPathValidator
- Defined in:
- lib/gitlab/encoding_helper.rb
Constant Summary
- ENCODING_CONFIDENCE_THRESHOLD =
This threshold is tuned to prevent the use of encodings that CharlockHolmes detects with low confidence. If CharlockHolmes' confidence is low, we are better off sticking with UTF-8. Reason: git diff can return strings with invalid UTF-8 byte sequences if it truncates a diff in the middle of a multibyte character. In that case CharlockHolmes tries to guess the encoding and will likely suggest an obscure encoding with low confidence. More background is available in this merge request: gitlab.com/gitlab-org/gitlab_git/merge_requests/77#note_4754193
50
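The truncation scenario described above can be reproduced in plain Ruby, without CharlockHolmes. The sketch below is illustrative only: `use_detected_encoding?` is a hypothetical helper, the detection hashes are hand-written stand-ins shaped like the `detect` hashes used in the methods on this page, and the threshold is copied locally.

```ruby
# Local copy of the threshold, for illustration only.
ENCODING_CONFIDENCE_THRESHOLD = 50

# "é" occupies two bytes in UTF-8. Cutting after the first of them mimics a
# diff that was truncated in the middle of a multibyte character.
truncated = "caf\u00E9".byteslice(0, 4)
puts truncated.valid_encoding? # false

# Hypothetical helper: trust a detected encoding only above the threshold.
def use_detected_encoding?(detect)
  !!(detect && detect[:encoding] && detect[:confidence] > ENCODING_CONFIDENCE_THRESHOLD)
end

low_confidence  = { encoding: "ISO-8859-2", confidence: 20 } # made-up values
high_confidence = { encoding: "ISO-8859-1", confidence: 90 } # made-up values

puts use_detected_encoding?(low_confidence)  # false
puts use_detected_encoding?(high_confidence) # true
```

A guess at exactly the threshold is rejected as well, because the comparison is strict.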
Instance Method Summary
- #binary_io(str_or_io) ⇒ Object
- #detect_binary?(data, detect = nil) ⇒ Boolean
- #detect_libgit2_binary?(data) ⇒ Boolean
- #encode!(message) ⇒ Object
- #encode_binary(str) ⇒ Object
- #encode_utf8(message, replace: "") ⇒ Object
Instance Method Details
#binary_io(str_or_io) ⇒ Object
    # File 'lib/gitlab/encoding_helper.rb', line 79
    def binary_io(str_or_io)
      io = str_or_io.to_io.dup if str_or_io.respond_to?(:to_io)
      io ||= StringIO.new(str_or_io.to_s.freeze)

      io.tap { |io| io.set_encoding(Encoding::ASCII_8BIT) }
    end
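A possible way to try `binary_io` outside GitLab is to copy the method into a script, as sketched below. The copy and the sample input are for experimentation only. The block variable is renamed to `stream` to avoid shadowing the outer `io`.

```ruby
require "stringio"

# Standalone copy of the method above, for experimentation outside GitLab.
def binary_io(str_or_io)
  io = str_or_io.to_io.dup if str_or_io.respond_to?(:to_io)
  io ||= StringIO.new(str_or_io.to_s.freeze)

  io.tap { |stream| stream.set_encoding(Encoding::ASCII_8BIT) }
end

io = binary_io("h\u00E9llo")
puts io.external_encoding == Encoding::ASCII_8BIT # true
puts io.read.bytesize                             # 6 ("é" is two bytes)
```

Note that `to_s` returns the receiver itself for a `String`, so the `freeze` call also freezes the string that the caller passed in.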
#detect_binary?(data, detect = nil) ⇒ Boolean
    # File 'lib/gitlab/encoding_helper.rb', line 40
    def detect_binary?(data, detect = nil)
      detect ||= CharlockHolmes::EncodingDetector.detect(data)
      detect && detect[:type] == :binary && detect[:confidence] == 100
    end
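The effect of the `confidence == 100` requirement can be seen by passing a precomputed `detect` hash, as in the sketch below. The hashes are hand-written stand-ins, not real CharlockHolmes output. Because `detect` is supplied, the `||=` never evaluates its right-hand side and the gem does not need to be installed.

```ruby
# Standalone copy of the method above. Passing `detect` explicitly means
# CharlockHolmes is never called, so the sketch runs without the gem.
def detect_binary?(data, detect = nil)
  detect ||= CharlockHolmes::EncodingDetector.detect(data)
  detect && detect[:type] == :binary && detect[:confidence] == 100
end

certain   = { type: :binary, confidence: 100 }                  # made-up result
uncertain = { type: :binary, confidence: 80 }                   # made-up result
text      = { type: :text, encoding: "UTF-8", confidence: 100 } # made-up result

puts detect_binary?("\x00\x01".b, certain)   # true
puts detect_binary?("\x00\x01".b, uncertain) # false
puts detect_binary?("hello", text)           # false
```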
#detect_libgit2_binary?(data) ⇒ Boolean
    # File 'lib/gitlab/encoding_helper.rb', line 45
    def detect_libgit2_binary?(data)
      # EncodingDetector checks the first 1024 * 1024 bytes for NUL byte, libgit2 checks
      # only the first 8000 (https://github.com/libgit2/libgit2/blob/2ed855a9e8f9af211e7274021c2264e600c0f86b/src/filter.h#L15),
      # which is what we use below to keep a consistent behavior.
      detect = CharlockHolmes::EncodingDetector.new(8000).detect(data)
      detect && detect[:type] == :binary
    end
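The comment above explains why the scan window matters: a NUL byte beyond the first 8000 bytes is not seen when only that prefix is inspected. The sketch below illustrates just that prefix idea in plain Ruby. `nul_in_prefix?` is a hypothetical helper and a simplification, not the CharlockHolmes or libgit2 implementation.

```ruby
# Hypothetical helper: report whether a NUL byte occurs within the first
# `limit` bytes. A simplified stand-in, not CharlockHolmes or libgit2 code.
def nul_in_prefix?(data, limit = 8000)
  data.b.byteslice(0, limit).include?("\0".b)
end

early_nul = "\0" + "a" * 10_000
late_nul  = "a" * 9_000 + "\0"

puts nul_in_prefix?(early_nul)        # true
puts nul_in_prefix?(late_nul)         # false: the NUL lies past byte 8000
puts nul_in_prefix?(late_nul, 10_000) # true with a wider window
```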
#encode!(message) ⇒ Object
    # File 'lib/gitlab/encoding_helper.rb', line 18
    def encode!(message)
      message = force_encode_utf8(message)
      return message if message.valid_encoding?

      # return message if message type is binary
      detect = CharlockHolmes::EncodingDetector.detect(message)
      return message.force_encoding("BINARY") if detect_binary?(message, detect)

      if detect && detect[:encoding] && detect[:confidence] > ENCODING_CONFIDENCE_THRESHOLD
        # force detected encoding if we have sufficient confidence.
        message.force_encoding(detect[:encoding])
      end

      # encode and clean the bad chars
      message.replace clean(message)
    rescue ArgumentError => e
      return unless e.message.include?('unknown encoding name')

      encoding = detect ? detect[:encoding] : "unknown"
      "--broken encoding: #{encoding}"
    end
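The bang in `encode!` reflects that the string is modified in place through `force_encoding` and `String#replace`. The sketch below shows that in-place behavior in isolation. `encode_in_place!` is a hypothetical name, and `String#scrub` stands in for the `clean` helper, which is not documented on this page and may behave differently.

```ruby
# Hypothetical stand-in for the in-place part of encode!. `String#scrub`
# replaces the `clean` helper, which is not documented on this page.
def encode_in_place!(message)
  message.force_encoding(Encoding::UTF_8)
  return message if message.valid_encoding?

  # String#replace swaps the contents of the receiver, so every reference
  # to the original string sees the cleaned text.
  message.replace(message.scrub(""))
end

raw = "caf\xC3".b # the trailing byte is the first half of a UTF-8 "é"
encode_in_place!(raw)

puts raw.valid_encoding? # true
puts raw                 # caf
```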
#encode_binary(str) ⇒ Object
    # File 'lib/gitlab/encoding_helper.rb', line 73
    def encode_binary(str)
      return "" if str.nil?

      str.dup.force_encoding(Encoding::ASCII_8BIT)
    end
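As a quick illustration, the standalone copy below shows that `encode_binary` relabels a duplicate of the string without transcoding its bytes and leaves the input untouched. The sample input is made up for the example.

```ruby
# Standalone copy of the method above.
def encode_binary(str)
  return "" if str.nil?

  str.dup.force_encoding(Encoding::ASCII_8BIT)
end

original = "h\u00E9llo"
binary   = encode_binary(original)

puts original.encoding == Encoding::UTF_8    # true: the input is untouched
puts binary.encoding == Encoding::ASCII_8BIT # true
puts binary.bytes == original.bytes          # true: bytes are not transcoded
puts encode_binary(nil).empty?               # true
```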
#encode_utf8(message, replace: "") ⇒ Object
    # File 'lib/gitlab/encoding_helper.rb', line 53
    def encode_utf8(message, replace: "")
      message = force_encode_utf8(message)
      return message if message.valid_encoding?

      detect = CharlockHolmes::EncodingDetector.detect(message)
      if detect && detect[:encoding]
        begin
          CharlockHolmes::Converter.convert(message, detect[:encoding], 'UTF-8')
        rescue ArgumentError => e
          Gitlab::AppLogger.warn("Ignoring error converting #{detect[:encoding]} into UTF8: #{e.message}")

          ''
        end
      else
        clean(message, replace: replace)
      end
    rescue ArgumentError
      nil
    end
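The `replace:` keyword only matters in the final branch, where no encoding was detected and the invalid bytes are cleaned. The sketch below approximates that branch with the standard library. `scrub_to_utf8` is a hypothetical name, and `String#scrub` is an assumed stand-in for `clean`, whose implementation is not shown on this page.

```ruby
# Hypothetical approximation of the fallback branch of encode_utf8.
# `String#scrub` is a stand-in for the undocumented `clean` helper.
def scrub_to_utf8(message, replace: "")
  utf8 = message.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  utf8.scrub(replace)
end

broken = "caf\xC3".b # ends with half of a multibyte character

puts scrub_to_utf8(broken)               # caf
puts scrub_to_utf8(broken, replace: "?") # caf?
```

Unlike the in-place `encode!`, this returns a new string and leaves the argument unmodified.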