Class: Hermeneutics::Entities
- Inherits: Object
- Object
- Hermeneutics::Entities
- Defined in:
- lib/hermeneutics/escape.rb
Overview
Translate HTML and XML character entities: "&amp;" to "&" and vice versa.
What actually happens
HTML pages usually come in with characters encoded as entities: &lt; for < and &euro; for €.
Further, they may contain a meta tag in the header like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta charset="utf-8" /> (HTML5)
or
<?xml version="1.0" encoding="UTF-8" ?> (XHTML)
When charset is utf-8 and the file contains the byte sequence "\303\244" / "\xc3\xa4", the character "ä" is displayed.
When charset is iso8859-15 and the file contains the byte "\344" / "\xe4", the character "ä" is displayed, too.
The sequence "&auml;" will produce an "ä" in any case.
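The charset behaviour described above can be checked directly in Ruby. This is a minimal sketch using only core String methods; the variable names are illustrative and not part of the library.

```ruby
# Same character, two byte representations (illustrative variable names).
utf8_char   = "\xc3\xa4".dup.force_encoding("UTF-8")      # two bytes in UTF-8
latin9_char = "\xe4".dup.force_encoding("ISO-8859-15")    # one byte in ISO-8859-15

# Transcoding the ISO-8859-15 byte to UTF-8 yields the same character.
same_char = (latin9_char.encode("UTF-8") == utf8_char)

puts utf8_char.bytes.inspect     # [195, 164]
puts latin9_char.bytes.inspect   # [228]
puts same_char                   # true
```

The octal and hexadecimal spellings denote the same bytes, so "\303\244" and "\xc3\xa4" are equal strings.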
What you should do
When generating your own HTML pages, you will always be safe if you produce only entity tags such as &auml; and &euro;, or &#x00e4; and &#x20ac; respectively.
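As a rough illustration of this advice, the following sketch masks every non-ASCII character as a numeric entity, so the output is plain ASCII regardless of the page's charset. The helper name `numeric_mask` is hypothetical and not part of this library; the format string mirrors the one used in the `#encode` source.

```ruby
# Hypothetical helper: replace each non-ASCII character by a hexadecimal
# numeric character reference such as &#x00e4;.
def numeric_mask(str)
  str.gsub(/[^\0-\x7f]/) { |c| "&#x%04x;" % c.ord }
end

puts numeric_mask("ä and €")   # &#x00e4; and &#x20ac;
```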
What this module does
This module translates strings to an HTML-masked version. The encoding is not changed, and you may request that 8-bit characters be kept.
Examples
Entities.encode "<" #=> "&lt;"
Entities.decode "&lt;" #=> "<"
Entities.encode "äöü" #=> "&auml;&ouml;&uuml;"
Entities.decode "&auml;&ouml;&uuml;" #=> "äöü"
Constant Summary
- SPECIAL_ASC =
{ '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt", }
- RE_ASC =
/[#{SPECIAL_ASC.keys.map { |x| Regexp.quote x }.join}]/
- SPECIAL =
{
  "\u00a0" => "nbsp", "¡" => "iexcl", "¢" => "cent", "£" => "pound", "€" => "euro", "¥" => "yen",
  "Š" => "Scaron", "¤" => "curren", "¦" => "brvbar", "§" => "sect", "š" => "scaron", "©" => "copy",
  "ª" => "ordf", "«" => "laquo", "¬" => "not", "\u00ad" => "shy", "¨" => "uml", "®" => "reg",
  "¯" => "macr", "°" => "deg", "±" => "plusmn", "²" => "sup2", "³" => "sup3", "µ" => "micro",
  "¶" => "para", "´" => "acute", "·" => "middot", "¹" => "sup1", "º" => "ordm", "»" => "raquo",
  "Œ" => "OElig", "œ" => "oelig", "¸" => "cedil", "¼" => "frac14", "½" => "frac12", "Ÿ" => "Yuml",
  "¿" => "iquest", "¾" => "frac34",
  "À" => "Agrave", "Á" => "Aacute", "Â" => "Acirc", "Ã" => "Atilde", "Ä" => "Auml", "Å" => "Aring",
  "Æ" => "AElig", "Ç" => "Ccedil", "È" => "Egrave", "É" => "Eacute", "Ê" => "Ecirc", "Ë" => "Euml",
  "Ì" => "Igrave", "Í" => "Iacute", "Î" => "Icirc", "Ï" => "Iuml", "Ð" => "ETH", "Ñ" => "Ntilde",
  "Ò" => "Ograve", "Ó" => "Oacute", "Ô" => "Ocirc", "Õ" => "Otilde", "Ö" => "Ouml", "×" => "times",
  "Ø" => "Oslash", "Ù" => "Ugrave", "Ú" => "Uacute", "Û" => "Ucirc", "Ü" => "Uuml", "Ý" => "Yacute",
  "Þ" => "THORN", "ß" => "szlig",
  "à" => "agrave", "á" => "aacute", "â" => "acirc", "ã" => "atilde", "ä" => "auml", "å" => "aring",
  "æ" => "aelig", "ç" => "ccedil", "è" => "egrave", "é" => "eacute", "ê" => "ecirc", "ë" => "euml",
  "ì" => "igrave", "í" => "iacute", "î" => "icirc", "ï" => "iuml", "ð" => "eth", "ñ" => "ntilde",
  "ò" => "ograve", "ó" => "oacute", "ô" => "ocirc", "õ" => "otilde", "ö" => "ouml", "÷" => "divide",
  "ø" => "oslash", "ù" => "ugrave", "ú" => "uacute", "û" => "ucirc", "ü" => "uuml", "ý" => "yacute",
  "þ" => "thorn", "ÿ" => "yuml",
  "‚" => "bsquo", "‘" => "lsquo", "„" => "bdquo", "“" => "ldquo", "‹" => "lsaquo", "›" => "rsaquo",
  "–" => "ndash", "—" => "mdash", "‰" => "permil", "…" => "hellip", "†" => "dagger", "‡" => "Dagger",
}.update SPECIAL_ASC
- NAMES =
SPECIAL.invert
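How these constants relate to each other can be seen on a reduced table. The sketch below rebuilds the same three steps (character class from the ASCII keys, merged table, inverted lookup) with lowercase local names and only two non-ASCII entries; it is an illustration, not the library's full data.

```ruby
# Reduced stand-ins for SPECIAL_ASC, RE_ASC, SPECIAL and NAMES.
special_asc = { '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt" }

# Character class matching any of the four ASCII characters that need masking.
re_asc = /[#{special_asc.keys.map { |x| Regexp.quote x }.join}]/

# Character => entity name, including the ASCII entries.
special = { "ä" => "auml", "€" => "euro" }.update special_asc

# Entity name => character, used when decoding.
names = special.invert

puts re_asc.source          # ["&<>]
puts names["auml"]          # ä
puts "a < b".gsub(re_asc) { |x| "&#{special_asc[x]};" }   # a &lt; b
```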
Instance Attribute Summary
-
#keep_8bit ⇒ Object
Class Method Summary
-
.decode(str) ⇒ Object
:call-seq: Entities.decode( str) -> str.
- .encode(str) ⇒ Object
- .std ⇒ Object
Instance Method Summary
- #decode(str) ⇒ Object
-
#encode(str) ⇒ Object
:call-seq: ent.encode( str) -> str.
-
#initialize(keep_8bit: nil) ⇒ Entities
constructor
:call-seq: new( keep_8bit: bool) -> ent.
Constructor Details
Instance Attribute Details
#keep_8bit ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 116
def keep_8bit
  @keep_8bit
end
Class Method Details
.decode(str) ⇒ Object
:call-seq:
Entities.decode( str) -> str
Replace HTML-style masks by normal characters:
Entities.decode "&lt;" #=> "<"
Entities.decode "&auml;&ouml;&uuml;" #=> "äöü"
Unmasked 8-bit characters ("ä" instead of "&auml;") are kept but translated to a single encoding.
s = "ä ö ü"
s.encode! "utf-8"
Entities.decode s #=> "ä ö ü"
s = "\xe4 &ouml; \xfc &euro;"
s.force_encoding "iso-8859-15"
Entities.decode s #=> "ä ö ü €" (in iso8859-15)
# File 'lib/hermeneutics/escape.rb', line 200
def decode str
  str.gsub /&(.+?);/ do
    (named_decode $1) or (numeric_decode $1) or $&
  end
end
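The helpers `named_decode` and `numeric_decode` are not shown on this page. The following self-contained sketch shows one plausible shape for the same fallback chain (named entity, then numeric reference, then leave the text untouched). All `*_sketch` names and the reduced table are assumptions made for illustration; the library's real helpers may differ, in particular in how they handle the string's encoding.

```ruby
# Reduced name table, standing in for NAMES.
NAMES_SKETCH = {
  "lt" => "<", "gt" => ">", "amp" => "&", "quot" => '"', "auml" => "ä",
}.freeze

# Look up a named entity; returns nil when the name is unknown.
def named_decode_sketch(name)
  NAMES_SKETCH[name]
end

# Decode "#xHHHH" or "#DDDD"; returns nil for anything else.
def numeric_decode_sketch(ref)
  case ref
  when /\A#x(\h+)\z/i then [$1.hex].pack("U")
  when /\A#(\d+)\z/   then [$1.to_i].pack("U")
  end
end

# Try the named table first, then numeric references, else keep the match.
def decode_sketch(str)
  str.gsub(/&(.+?);/) do |whole|
    name = $1
    named_decode_sketch(name) || numeric_decode_sketch(name) || whole
  end
end

puts decode_sketch("&lt;b&gt; &auml; &#x20ac; &#228; &bogus;")
```

Unknown entities such as `&bogus;` pass through unchanged, which matches the `or $&` fallback in the source above.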
.encode(str) ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 175
def encode str
  std.encode str
end
.std ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 171
def std
  @std ||= new
end
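The `.std` and `.encode` pair follows a common Ruby pattern: a lazily created, memoized default instance that class-level convenience methods delegate to. A minimal sketch of that pattern, using a hypothetical `MaskerSketch` class with a deliberately trivial `#encode`:

```ruby
# Hypothetical class illustrating the memoized default-instance pattern.
class MaskerSketch
  class << self
    # Create the default instance on first use and reuse it afterwards.
    def std
      @std ||= new
    end

    # Class-level shortcut that delegates to the default instance.
    def encode(str)
      std.encode str
    end
  end

  # Trivial stand-in for the real masking logic.
  def encode(str)
    str.gsub("<", "&lt;")
  end
end

puts MaskerSketch.encode("<")                      # &lt;
puts MaskerSketch.std.equal?(MaskerSketch.std)     # true
```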
Instance Method Details
#decode(str) ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 163
def decode str
  self.class.decode str
end
#encode(str) ⇒ Object
:call-seq:
ent.encode( str) -> str
Create a string whose characters are masked HTML-style:
ent = Entities.new
ent.encode "&<\"" #=> "&amp;&lt;&quot;"
ent.encode "äöü" #=> "&auml;&ouml;&uuml;"
The result is in the same encoding as the source, even if it contains no 8-bit characters. (8-bit characters can only remain in the result when keep_8bit is set.)
ent = Entities.new keep_8bit: true
uml = "<ä>".encode "UTF-8"
ent.encode uml #=> "&lt;\xc3\xa4&gt;" in UTF-8
uml = "<ä>".encode "ISO-8859-1"
ent.encode uml #=> "&lt;\xe4&gt;" in ISO-8859-1
# File 'lib/hermeneutics/escape.rb', line 150
def encode str
  r = str.new_string
  r.gsub! RE_ASC do |x| "&#{SPECIAL_ASC[ x]};" end
  unless @keep_8bit then
    r.gsub! /[^\0-\x7f]/ do |c|
      c.encode! __ENCODING__
      s = SPECIAL[ c] || ("#x%04x" % c.ord)
      "&#{s};"
    end
  end
  r
end
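To see the effect of `keep_8bit` without loading the library, the two passes of the method above can be approximated in a standalone sketch. The names `encode_sketch`, `ASCII_SKETCH` and `NAMED_SKETCH` are hypothetical, the tables are reduced, and the sketch assumes UTF-8 input rather than handling arbitrary source encodings.

```ruby
# Reduced stand-ins for SPECIAL_ASC and SPECIAL.
ASCII_SKETCH = { '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt" }.freeze
NAMED_SKETCH = { "ä" => "auml", "€" => "euro" }.freeze

def encode_sketch(str, keep_8bit: false)
  # Pass 1: always mask the four ASCII characters.
  r = str.gsub(/["&<>]/) { |x| "&#{ASCII_SKETCH[x]};" }
  return r if keep_8bit

  # Pass 2: mask non-ASCII characters, by name when known, else numerically.
  r.gsub(/[^\0-\x7f]/) do |c|
    name = NAMED_SKETCH[c] || ("#x%04x" % c.ord)
    "&#{name};"
  end
end

puts encode_sketch("<ä>")                    # &lt;&auml;&gt;
puts encode_sketch("<ä>", keep_8bit: true)   # &lt;ä&gt;
puts encode_sketch("ß")                      # &#x00df;
```

Characters without an entry in the name table fall back to a hexadecimal reference, as "ß" does here only because the reduced table omits it.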