Method: RubyParser::Legacy::RubyParserStuff#handle_encoding
- Defined in:
- lib/ruby_parser/legacy/ruby_parser_extras.rb
#handle_encoding(str) ⇒ Object
Returns a UTF-8 encoded string after processing BOMs and magic encoding comments.
Holy crap… ok. Here goes:
Ruby’s file handling and encoding support is insane. We need to be able to lex a file. The lexer file is explicitly UTF-8 to make things cleaner. This allows us to deal with extended chars in class and method names. In order to do this, we need to encode all input source files as UTF-8. First, we look for a UTF-8 BOM by looking at the first line while forcing its encoding to ASCII-8BIT. If we find a BOM, we strip it and set the expected encoding to UTF-8. Then, we search for a magic encoding comment. If found, it overrides the BOM. Finally, we force the encoding of the input string to whatever was found, and then encode that to UTF-8 for compatibility with the lexer.
1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 |
# File 'lib/ruby_parser/legacy/ruby_parser_extras.rb', line 1023 def handle_encoding str str = str.dup has_enc = str.respond_to? :encoding encoding = nil header = str.each_line.first(2) header.map! { |s| s.force_encoding "ASCII-8BIT" } if has_enc first = header.first || "" encoding, str = "utf-8", str[3..-1] if first =~ /\A\xEF\xBB\xBF/ encoding = $1.strip if header.find { |s| s[/^#.*?-\*-.*?coding:\s*([^ ;]+).*?-\*-/, 1] || s[/^#.*(?:en)?coding(?:\s*[:=])\s*([\w-]+)/, 1] } if encoding then if has_enc then encoding.sub!(/utf-8-.+$/, 'utf-8') # HACK for stupid emacs formats hack_encoding str, encoding else warn "Skipping magic encoding comment" end else # nothing specified... ugh. try to encode as utf-8 hack_encoding str if has_enc end str end |