Class: Hermeneutics::Entities
- Inherits: Object
  - Object
  - Hermeneutics::Entities
- Defined in:
- lib/hermeneutics/escape.rb
Overview
Translate HTML and XML character entities: "&" to "&amp;" and vice versa.
What actually happens
HTML pages usually arrive with characters encoded as entities: &lt; for < and &euro; for €.
Further, they may contain a meta tag in the header like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta charset="utf-8" /> (HTML5)
or
<?xml version="1.0" encoding="UTF-8" ?> (XHTML)
When the charset is utf-8 and the file contains the byte sequence "\303\244"/"\xc3\xa4", the character "ä" is displayed.
When the charset is iso8859-15 and the file contains the byte "\344"/"\xe4", the character "ä" is displayed, too.
The sequence "&auml;" produces an "ä" in any case.
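The byte-level difference can be checked directly in Ruby. This is a minimal sketch using only core String methods; the variable names are illustrative and not part of the library.

```ruby
# The same character "ä" (U+00E4) in two encodings.
utf8   = "\u00e4".encode("UTF-8")
latin9 = "\u00e4".encode("ISO-8859-15")

utf8.bytes    #=> [195, 164]   i.e. "\xc3\xa4"
latin9.bytes  #=> [228]        i.e. "\xe4"

# Labelling the single byte as ISO-8859-15 and converting it to UTF-8
# yields the same character again.
roundtrip = "\xe4".dup.force_encoding("ISO-8859-15").encode("UTF-8")
roundtrip == utf8  #=> true
```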
What you should do
When generating your own HTML pages, you will always be safe if you only produce entities such as &auml; and &euro;, or their numeric forms &#x00e4; and &#x20ac; respectively.
What this module does
This module translates strings to an HTML-masked version. The encoding is not changed, and you may request that 8-bit characters be kept.
Examples
Entities.encode "<" #=> "&lt;"
Entities.decode "&lt;" #=> "<"
Entities.encode "äöü" #=> "&auml;&ouml;&uuml;"
Entities.decode "&auml;&ouml;&uuml;" #=> "äöü"
Constant Summary
- SPECIAL_ASC =
{ '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt", }
- RE_ASC =
/[#{SPECIAL_ASC.keys.map { |x| Regexp.quote x }.join}]/
- SPECIAL =
{
  "\u00a0" => "nbsp", "¡" => "iexcl", "¢" => "cent", "£" => "pound",
  "€" => "euro", "¥" => "yen", "Š" => "Scaron", "¤" => "curren",
  "¦" => "brvbar", "§" => "sect", "š" => "scaron", "©" => "copy",
  "ª" => "ordf", "«" => "laquo", "¬" => "not", "\u00ad" => "shy",
  "¨" => "uml", "®" => "reg", "¯" => "macr", "°" => "deg",
  "±" => "plusmn", "²" => "sup2", "³" => "sup3", "µ" => "micro",
  "¶" => "para", "´" => "acute", "·" => "middot", "¹" => "sup1",
  "º" => "ordm", "»" => "raquo", "Œ" => "OElig", "œ" => "oelig",
  "¸" => "cedil", "¼" => "frac14", "½" => "frac12", "Ÿ" => "Yuml",
  "¿" => "iquest", "¾" => "frac34", "À" => "Agrave", "Á" => "Aacute",
  "Â" => "Acirc", "Ã" => "Atilde", "Ä" => "Auml", "Å" => "Aring",
  "Æ" => "AElig", "Ç" => "Ccedil", "È" => "Egrave", "É" => "Eacute",
  "Ê" => "Ecirc", "Ë" => "Euml", "Ì" => "Igrave", "Í" => "Iacute",
  "Î" => "Icirc", "Ï" => "Iuml", "Ð" => "ETH", "Ñ" => "Ntilde",
  "Ò" => "Ograve", "Ó" => "Oacute", "Ô" => "Ocirc", "Õ" => "Otilde",
  "Ö" => "Ouml", "×" => "times", "Ø" => "Oslash", "Ù" => "Ugrave",
  "Ú" => "Uacute", "Û" => "Ucirc", "Ü" => "Uuml", "Ý" => "Yacute",
  "Þ" => "THORN", "ß" => "szlig", "à" => "agrave", "á" => "aacute",
  "â" => "acirc", "ã" => "atilde", "ä" => "auml", "å" => "aring",
  "æ" => "aelig", "ç" => "ccedil", "è" => "egrave", "é" => "eacute",
  "ê" => "ecirc", "ë" => "euml", "ì" => "igrave", "í" => "iacute",
  "î" => "icirc", "ï" => "iuml", "ð" => "eth", "ñ" => "ntilde",
  "ò" => "ograve", "ó" => "oacute", "ô" => "ocirc", "õ" => "otilde",
  "ö" => "ouml", "÷" => "divide", "ø" => "oslash", "ù" => "ugrave",
  "ú" => "uacute", "û" => "ucirc", "ü" => "uuml", "ý" => "yacute",
  "þ" => "thorn", "ÿ" => "yuml", "‚" => "bsquo", "‘" => "lsquo",
  "„" => "bdquo", "“" => "ldquo", "‹" => "lsaquo", "›" => "rsaquo",
  "–" => "ndash", "—" => "mdash", "‰" => "permil", "…" => "hellip",
  "†" => "dagger", "‡" => "Dagger",
}.update SPECIAL_ASC
- NAMES =
SPECIAL.invert
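How the constants relate can be illustrated with a reduced table. The sketch below uses hypothetical `*_SAMPLE` names and a handful of entries only; it mirrors the construction shown above without being the library's own code.

```ruby
# Hypothetical excerpts of the tables; the real constants are much larger.
ASC_SAMPLE     = { '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt" }
SPECIAL_SAMPLE = { "ä" => "auml", "€" => "euro" }.update(ASC_SAMPLE)

# A character class matching every key of the ASCII table.
RE_SAMPLE = /[#{ASC_SAMPLE.keys.map { |x| Regexp.quote x }.join}]/

# Inverting the character-to-name table gives a name-to-character table.
NAMES_SAMPLE = SPECIAL_SAMPLE.invert

SPECIAL_SAMPLE["€"]          #=> "euro"
NAMES_SAMPLE["euro"]         #=> "€"
NAMES_SAMPLE["bogus"]        #=> nil
"a < b & c".scan(RE_SAMPLE)  #=> ["<", "&"]
```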
Instance Attribute Summary
- #keep_8bit ⇒ Object
Class Method Summary
- .decode(str) ⇒ Object
  :call-seq: Entities.decode( str) -> str
- .encode(str) ⇒ Object
- .std ⇒ Object
Instance Method Summary
- #decode(str) ⇒ Object
- #encode(str) ⇒ Object
  :call-seq: ent.encode( str) -> str
- #initialize(keep_8bit = nil) ⇒ Entities
  constructor
  :call-seq: new( keep_8bit = nil) -> ent, new( :keep_8bit => val) -> ent
Constructor Details
#initialize(keep_8bit = nil) ⇒ Entities
:call-seq:
new( keep_8bit = nil) -> ent
new( :keep_8bit => val) -> ent
Creates an Entities converter.
The keep_8bit setting may be given as a plain value or as a hash with the key :keep_8bit.
ent = Entities.new true
ent = Entities.new :keep_8bit => true
# File 'lib/hermeneutics/escape.rb', line 129
def initialize keep_8bit = nil
  @keep_8bit = case keep_8bit
    when Hash then keep_8bit[ :keep_8bit]
    else           keep_8bit
  end
end
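The two calling forms of the constructor can be tried out with a small stand-in class. `FlagHolder` is a hypothetical name used only for this sketch; it copies the `case` logic of the constructor and nothing else.

```ruby
# Hypothetical stand-in that accepts the flag as a value or as a hash.
class FlagHolder
  attr_reader :keep_8bit

  def initialize(keep_8bit = nil)
    @keep_8bit = case keep_8bit
                 when Hash then keep_8bit[:keep_8bit]
                 else keep_8bit
                 end
  end
end

FlagHolder.new.keep_8bit                      #=> nil
FlagHolder.new(true).keep_8bit                #=> true
FlagHolder.new(:keep_8bit => true).keep_8bit  #=> true
FlagHolder.new(:other => true).keep_8bit      #=> nil
```

A hash that lacks the :keep_8bit key yields nil, which is the same as passing no argument at all.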
Instance Attribute Details
#keep_8bit ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 116
def keep_8bit
  @keep_8bit
end
Class Method Details
.decode(str) ⇒ Object
:call-seq:
Entities.decode( str) -> str
Replace HTML-style masks by normal characters:
Entities.decode "&lt;" #=> "<"
Entities.decode "&auml;&ouml;&uuml;" #=> "äöü"
Unmasked 8-bit characters ("ä" instead of "&auml;") are kept but translated to a single, uniform encoding.
s = "ä ö ü"
s.encode! "utf-8"
Entities.decode s #=> "ä ö ü"
s = "\xe4 &ouml; \xfc &euro;"
s.force_encoding "iso-8859-15"
Entities.decode s #=> "ä ö ü €"
(in iso8859-15)
# File 'lib/hermeneutics/escape.rb', line 207
def decode str
  str.gsub /&(.+?);/ do
    (named_decode $1) or (numeric_decode $1) or $&
  end
end
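The decoding strategy above (try a named entity, then a numeric one, otherwise leave the text untouched) can be sketched in a self-contained way. The bodies of named_decode and numeric_decode are not shown in this page, so the `*_demo` helpers below are assumptions, and `NAMES_DEMO` is a hypothetical, reduced table.

```ruby
# Hypothetical, reduced name table; the real NAMES constant is much larger.
NAMES_DEMO = { "lt" => "<", "gt" => ">", "amp" => "&", "quot" => '"',
               "auml" => "ä", "euro" => "€" }

# Assumed helper: look up a named entity, nil if unknown.
def named_decode_demo(name)
  NAMES_DEMO[name]
end

# Assumed helper: handles "#228" (decimal) and "#xe4" (hexadecimal),
# returns nil for anything else. No range checking is done here.
def numeric_decode_demo(ref)
  case ref
  when /\A#(\d+)\z/   then [$1.to_i].pack("U")
  when /\A#x(\h+)\z/i then [$1.hex].pack("U")
  end
end

# Same control flow as the method above: the first helper that
# returns a value wins, otherwise the matched text is kept.
def decode_demo(str)
  str.gsub(/&(.+?);/) do
    named_decode_demo($1) || numeric_decode_demo($1) || $&
  end
end

decode_demo("&lt;")                #=> "<"
decode_demo("&auml;&#246;&#xfc;")  #=> "äöü"
decode_demo("&bogus;")             #=> "&bogus;"
```

In this sketch, a bare "&" earlier in the text pairs with the next ";", so `decode_demo("a & b &lt;")` returns its input unchanged.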
.encode(str) ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 182
def encode str
  std.encode str
end
.std ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 178
def std
  @std ||= new
end
Instance Method Details
#decode(str) ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 170
def decode str
  self.class.decode str
end
#encode(str) ⇒ Object
:call-seq:
ent.encode( str) -> str
Create a string whose characters are masked in HTML style:
ent = Entities.new
ent.encode "&<\"" #=> "&amp;&lt;&quot;"
ent.encode "äöü" #=> "&auml;&ouml;&uuml;"
The result has the same encoding as the source, even if it contains no 8-bit characters. (8-bit characters can remain in the result only when keep_8bit is set.)
ent = Entities.new true
uml = "<ä>".encode "UTF-8"
ent.encode uml #=> "&lt;\xc3\xa4&gt;" in UTF-8
uml = "<ä>".encode "ISO-8859-1"
ent.encode uml #=> "&lt;\xe4&gt;" in ISO-8859-1
# File 'lib/hermeneutics/escape.rb', line 157
def encode str
  r = str.new_string
  r.gsub! RE_ASC do |x| "&#{SPECIAL_ASC[ x]};" end
  unless @keep_8bit then
    r.gsub! /[^\0-\x7f]/ do |c|
      c.encode! __ENCODING__
      s = SPECIAL[ c] || ("#x%04x" % c.ord)
      "&#{s};"
    end
  end
  r
end
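The two passes of the method above (ASCII specials first, then everything outside 7-bit ASCII unless keep_8bit is set) can be reproduced with plain Ruby. This is a sketch under assumptions: the `*_DEMO` tables are hypothetical excerpts, `str.dup` stands in for `new_string`, and the lookup converts each character to UTF-8 explicitly.

```ruby
# Hypothetical, reduced tables for illustration only.
ASC_DEMO     = { '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt" }
RE_ASC_DEMO  = /[#{ASC_DEMO.keys.map { |x| Regexp.quote x }.join}]/
SPECIAL_DEMO = { "ä" => "auml", "ö" => "ouml", "ü" => "uuml" }.update(ASC_DEMO)

def encode_demo(str, keep_8bit = false)
  r = str.dup                          # assumption: replaces new_string
  # Pass 1: mask the four ASCII specials.
  r.gsub!(RE_ASC_DEMO) { |x| "&#{ASC_DEMO[x]};" }
  unless keep_8bit
    # Pass 2: mask every character outside 7-bit ASCII.
    r.gsub!(/[^\0-\x7f]/) do |c|
      u = c.encode("UTF-8")            # table keys are UTF-8 strings
      s = SPECIAL_DEMO[u] || ("#x%04x" % u.ord)
      "&#{s};"
    end
  end
  r
end

encode_demo("&<\"")        #=> "&amp;&lt;&quot;"
encode_demo("äöü")         #=> "&auml;&ouml;&uuml;"
encode_demo("€")           #=> "&#x20ac;" (no name in the reduced table)
encode_demo("<ä>", true)   #=> "&lt;ä&gt;"
```

Because the replacements are pure ASCII, the result keeps the encoding of the input string, for example ISO-8859-1 when the input was ISO-8859-1.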