Class: FeedNormalizer::HtmlCleaner

Inherits:
Object
Defined in:
lib/html-cleaner.rb

Overview

Cleans and flattens HTML content from feeds: elements and attributes are filtered against whitelists, suspicious URIs are rejected, and text is re-escaped.

Constant Summary

HTML_ELEMENTS =

Allowed HTML elements.

%w(
  a abbr acronym address area b bdo big blockquote br button caption center
  cite code col colgroup dd del dfn dir div dl dt em fieldset font h1 h2 h3
  h4 h5 h6 hr i img ins kbd label legend li map menu ol optgroup p pre q s
  samp small span strike strong sub sup table tbody td tfoot th thead tr tt
  u ul var
)
HTML_ATTRS =

Allowed attributes.

%w(
  abbr accept accept-charset accesskey align alt axis border cellpadding
  cellspacing char charoff charset checked cite class clear cols colspan
  color compact coords datetime dir disabled for frame headers height href
  hreflang hspace id ismap label lang longdesc maxlength media method
  multiple name nohref noshade nowrap readonly rel rev rows rowspan rules
  scope selected shape size span src start summary tabindex target title
  type usemap valign value vspace width
)
HTML_URI_ATTRS =

Allowed attributes that can contain URIs; these require extra caution. NOTE: this is not a list of all URI attributes, only those that are allowed.

%w(
  href src cite usemap longdesc
)
DODGY_URI_SCHEMES =

URI schemes that are rejected as dangerous.

%w(
  javascript vbscript mocha livescript data
)

Class Method Summary

Class Method Details

.add_entities(str) ⇒ Object

Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; i.e. &#123; will NOT become &amp;#123;

This method could be improved by adding a whitelist of html entities.



# File 'lib/html-cleaner.rb', line 148

def add_entities(str)
  str.to_s.gsub(/\"/n, '&quot;').gsub(/>/n, '&gt;').gsub(/</n, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/nmi, '&amp;')
end
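The non-double-escaping behavior can be seen with a standalone copy of the method (the `/n` encoding flags are dropped here, since they only matter on pre-1.9 Rubies): a bare `&` is escaped, while existing entities are left alone.

```ruby
# Standalone copy of add_entities, for illustration only.
def add_entities(str)
  str.to_s.gsub(/"/, '&quot;').gsub(/>/, '&gt;').gsub(/</, '&lt;')
     .gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
end

add_entities('a & b &amp; <c>')  # => "a &amp; b &amp; &lt;c&gt;"
```

The negative lookahead is what distinguishes this from CGI.escapeHTML: `&amp;` already matches `\w{2,8};`, so its leading `&` is skipped.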

.clean(str) ⇒ Object

Does this:

  • Unescape HTML

  • Parse HTML into tree

  • Find ‘body’ if present, and extract tree inside that tag, otherwise parse whole tree

  • Each tag:

    • remove tag if not whitelisted

    • escape HTML tag contents

    • remove all attributes not on whitelist

    • extra-scrub URI attrs; see dodgy_uri?

Extra (i.e. unmatched) ending tags and comments are removed.



# File 'lib/html-cleaner.rb', line 59

def clean(str)
  str = unescapeHTML(str)

  doc = Hpricot(str, :fixup_tags => true)
  doc = subtree(doc, :body)

  # get all the tags in the document
  # Somewhere near hpricot 0.4.92, "*" started returning all elements,
  # including text nodes, instead of just tagged elements.
  tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq

  # Remove tags that aren't whitelisted.
  remove_tags!(doc, tags - HTML_ELEMENTS)
  remaining_tags = tags & HTML_ELEMENTS

  # Remove attributes that aren't on the whitelist, or are suspicious URLs.
  (doc/remaining_tags.join(",")).each do |element|
    element.raw_attributes.reject! do |attr,val|
      !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
    end

    element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
  end unless remaining_tags.empty?

  doc.traverse_text {|t| t.set(add_entities(t.to_html))}

  # Return the tree, without comments. Ugly way of removing comments,
  # but can't see a way to do this in Hpricot yet.
  doc.to_s.gsub(/<\!--.*?-->/mi, '')
end
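The attribute pass in the middle of clean can be sketched in isolation. This is a simplified stand-in, not the gem's API: the whitelist here is abbreviated, and a plain `javascript:` prefix check substitutes for dodgy_uri?.

```ruby
# Abbreviated whitelists, for illustration only.
ALLOWED_ATTRS = %w(href class title)
URI_ATTRS     = %w(href)

# Drop attributes that are not whitelisted, or that are URI attributes
# carrying a suspicious value (here crudely approximated).
def scrub_attributes(attrs)
  attrs.reject do |attr, val|
    !ALLOWED_ATTRS.include?(attr) ||
      (URI_ATTRS.include?(attr) && val =~ /\Ajavascript:/i)
  end
end

scrub_attributes("href" => "javascript:alert(1)",
                 "class" => "story", "onclick" => "evil()")
# => {"class"=>"story"}
```

Both rejection reasons fire here: `onclick` is not on the whitelist, and `href` is whitelisted but carries a dangerous URI, so only `class` survives.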

.dodgy_uri?(uri) ⇒ Boolean

Returns true if the given string contains a suspicious URL, e.g. a javascript link.

This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.

Returns:

  • (Boolean)


# File 'lib/html-cleaner.rb', line 113

def dodgy_uri?(uri)
  uri = uri.to_s

  # special case for poorly-formed entities (missing ';')
  # if these occur *anywhere* within the string, then throw it out.
  return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi)

  # Try escaping as both HTML or URI encodings, and then trying
  # each scheme regexp on each
  [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
    DODGY_URI_SCHEMES.each do |scheme|

      regexp = "#{scheme}:".gsub(/./) do |char|
        "([\000-\037\177\s]*)#{char}"
      end

      # regexp looks something like
      # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
      return true if (unesc_uri =~ %r{\A#{regexp}}mi)
    end
  end

  nil
end
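The scheme-matching loop needs no Hpricot and can be run on its own. This sketch reproduces the regexp construction: each character of the scheme is interleaved with an optional run of control characters and whitespace, so obfuscations like `jav\tascript:` are still caught.

```ruby
DODGY_URI_SCHEMES = %w(javascript vbscript mocha livescript data)

# Standalone sketch of the scheme check inside dodgy_uri?.
def dodgy_scheme?(uri)
  DODGY_URI_SCHEMES.each do |scheme|
    # Interleave each character with optional control/whitespace padding.
    regexp = "#{scheme}:".gsub(/./) { |char| "([\000-\037\177\s]*)#{char}" }
    return true if uri =~ %r{\A#{regexp}}mi
  end
  false
end

dodgy_scheme?(" jav\tascript:alert(1)")  # => true
dodgy_scheme?("http://example.com/")     # => false
```

Note that the real method also unescapes the URI as both HTML and URI encodings before matching, and returns nil (not false) for clean input.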

.flatten(str) ⇒ Object

For all other feed elements:

  • Unescape HTML.

  • Parse HTML into tree (taking ‘body’ as root, if present)

  • Takes text out of each tag, and escapes HTML.

  • Returns all text concatenated.



# File 'lib/html-cleaner.rb', line 95

def flatten(str)
  str.gsub!("\n", " ")
  str = unescapeHTML(str)

  doc = Hpricot(str, :xhtml_strict => true)
  doc = subtree(doc, :body)

  out = ""
  doc.traverse_text {|t| out << add_entities(t.to_html)}

  return out
end
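Since Hpricot is long unmaintained, the overall effect of flatten can be approximated with the standard library alone. This is a rough sketch, not the gem's implementation: a crude tag-stripping regexp stands in for the parse-tree text traversal, and CGI.escapeHTML stands in for add_entities.

```ruby
require 'cgi'

# Stdlib-only approximation of flatten: newlines become spaces, HTML is
# unescaped, tags are stripped, and the remaining text is re-escaped.
def flatten_sketch(str)
  str = str.tr("\n", " ")
  str = CGI.unescapeHTML(str.gsub("&apos;", "&#39;"))
  CGI.escapeHTML(str.gsub(/<[^>]+>/, ""))
end

flatten_sketch("<p>Hello &amp;\n<b>world</b></p>")  # => "Hello &amp; world"
```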

.unescapeHTML(str, xml = true) ⇒ Object

Unescapes HTML. If xml is true, also converts XML-only named entities to HTML.



# File 'lib/html-cleaner.rb', line 139

def unescapeHTML(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end
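The gsub exists because `&apos;` is an XML named entity, not an HTML one, so older versions of CGI.unescapeHTML would leave it untouched; rewriting it to the numeric form `&#39;` first makes it decode reliably. A standalone copy shows the effect:

```ruby
require 'cgi'

# Standalone copy of unescapeHTML, for illustration only.
def unescape_html(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end

unescape_html("it&apos;s &lt;b&gt;")  # => "it's <b>"
```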