Class: FeedNormalizer::HtmlCleaner

Inherits:
Object
Defined in:
lib/html-cleaner.rb

Overview

Constant Summary

HTML_ELEMENTS =

Allowed HTML elements.

%w(
  a abbr acronym address area b bdo big blockquote br button caption center
  cite code col colgroup dd del dfn dir div dl dt em fieldset font h1 h2 h3
  h4 h5 h6 hr i img ins kbd label legend li map menu ol optgroup p pre q s
  samp small span strike strong sub sup table tbody td tfoot th thead tr tt
  u ul var
)
HTML_ATTRS =

Allowed attributes.

%w(
  abbr accept accept-charset accesskey align alt axis border cellpadding
  cellspacing char charoff charset checked cite class clear cols colspan
  color compact coords datetime dir disabled for frame headers height href
  hreflang hspace id ismap label lang longdesc maxlength media method
  multiple name nohref noshade nowrap readonly rel rev rows rowspan rules
  scope selected shape size span src start summary tabindex target title
  type usemap valign value vspace width
)
HTML_URI_ATTRS =

Allowed attributes that can contain URIs, so extra caution is required. NOTE: This does not list all URI attributes, only the ones that are allowed.

%w(
  href src cite usemap longdesc
)
DODGY_URI_SCHEMES =
%w(
  javascript vbscript mocha livescript data
)

Class Method Summary

Class Method Details

.add_entities(str) ⇒ Object

Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; i.e. &#123; will NOT become &amp;#123;

This method could be improved by adding a whitelist of HTML entities.



# File 'lib/html-cleaner.rb', line 152

def add_entities(str)
  str.to_s.gsub(/\"/n, '&quot;').gsub(/>/n, '&gt;').gsub(/</n, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/nmi, '&amp;')
end
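
The behaviour described above can be tried in isolation. The following is a minimal sketch, not the library's own entry point: the name add_entities_sketch is hypothetical, and the /n (binary) regexp flag from the listing is left out, on the assumption that the input is ordinary UTF-8 text.

```ruby
# Standalone copy of the substitution chain, for experimentation only.
# add_entities_sketch is a hypothetical name; the /n flag is omitted.
def add_entities_sketch(str)
  str.to_s
     .gsub(/"/, '&quot;')
     .gsub(/>/, '&gt;')
     .gsub(/</, '&lt;')
     # Escape '&' only when it does not already start something that
     # looks like an entity (&#123; &#x7b; or a 2-8 character name).
     .gsub(/&(?!(\#\d+|\#x[0-9a-f]+|\w{2,8});)/i, '&amp;')
end

puts add_entities_sketch('Fish & chips <b>"now"</b>')
puts add_entities_sketch('&#123; &amp; &copy;')
puts add_entities_sketch('&bogus;')
```

The first call escapes the bare ampersand, the angle brackets and the quotes; the second leaves its input untouched because every ampersand already begins an entity. The third call also passes through unchanged, which is the gap a whitelist of known entity names would close.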

.clean(str) ⇒ Object

Cleans the given HTML string as follows:

  • Unescape HTML

  • Parse HTML into tree

  • Find ‘body’ if present, and extract tree inside that tag, otherwise parse whole tree

  • For each tag:

    • remove tag if not whitelisted

    • escape HTML tag contents

    • remove all attributes not on whitelist

    • extra-scrub URI attrs; see dodgy_uri?

Extra (i.e. unmatched) ending tags and comments are removed.



# File 'lib/html-cleaner.rb', line 60

def clean(str)
  str = unescapeHTML(str)

  doc = Hpricot(str, :fixup_tags => true)
  doc = subtree(doc, :body)

  # get all the tags in the document
  # Somewhere near hpricot 0.4.92 "*" starting to return all elements,
  # including text nodes instead of just tagged elements.
  tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq

  # Remove tags that aren't whitelisted.
  remove_tags!(doc, tags - HTML_ELEMENTS)
  remaining_tags = tags & HTML_ELEMENTS

  # Remove attributes that aren't on the whitelist, or are suspicious URLs.
  (doc/remaining_tags.join(",")).each do |element|
    next if element.raw_attributes.nil? || element.raw_attributes.empty?
    element.raw_attributes.reject! do |attr,val|
      !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
    end

    element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
  end unless remaining_tags.empty?
  
  doc.traverse_text do |t|
    t.swap(add_entities(t.to_html))
  end

  # Return the tree, without comments. Ugly way of removing comments,
  # but can't see a way to do this in Hpricot yet.
  doc.to_s.gsub(/<\!--.*?-->/mi, '')
end
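
The attribute-filtering step can be illustrated without Hpricot by applying the same reject condition to a plain Hash. This is a sketch under stated assumptions: AttrFilterSketch, its constants and the naive_check lambda are hypothetical, the attribute lists are small subsets of HTML_ATTRS and HTML_URI_ATTRS, and the URI check is a deliberately simple stand-in for dodgy_uri?.

```ruby
module AttrFilterSketch
  # Illustrative subsets of HTML_ATTRS and HTML_URI_ATTRS, not the full lists.
  ALLOWED_ATTRS = %w(href src title alt class)
  URI_ATTRS     = %w(href src)

  # Keep only whitelisted attributes, and drop URI attributes whose
  # value the supplied predicate flags as suspicious.
  def self.filter(attrs, suspicious)
    attrs.reject do |name, value|
      !ALLOWED_ATTRS.include?(name) ||
        (URI_ATTRS.include?(name) && suspicious.call(value))
    end
  end
end

# Deliberately naive stand-in for dodgy_uri?, for illustration only.
naive_check = lambda { |uri| uri.to_s.strip.downcase.start_with?("javascript:") }

attrs = {
  "href"    => "javascript:alert(1)",
  "title"   => "A link",
  "onclick" => "steal()"
}
p AttrFilterSketch.filter(attrs, naive_check)
```

Here only the title attribute survives: onclick is not on the whitelist, and href is whitelisted but carries a value the predicate rejects.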

.dodgy_uri?(uri) ⇒ Boolean

Returns true if the given string contains a suspicious URL, e.g. a javascript: link; otherwise it returns nil, which is falsy.

This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.

Returns:

  • (Boolean)


# File 'lib/html-cleaner.rb', line 117

def dodgy_uri?(uri)
  uri = uri.to_s

  # special case for poorly-formed entities (missing ';')
  # if these occur *anywhere* within the string, then throw it out.
  return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi)

  # Try unescaping as both HTML and URI encodings, then test
  # each scheme regexp against each result
  [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
    DODGY_URI_SCHEMES.each do |scheme|

      regexp = "#{scheme}:".gsub(/./) do |char|
        "([\000-\037\177\s]*)#{char}"
      end

      # regexp looks something like
      # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
      return true if (unesc_uri =~ %r{\A#{regexp}}mi)
    end
  end

  nil
end
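
The per-scheme pattern built in the loop above can be examined on its own. The sketch below is an approximation, not the library's code: scheme_pattern is a hypothetical helper, it writes the character class with \x escapes instead of literal control characters, and it skips the HTML and URI unescaping that dodgy_uri? performs first.

```ruby
# Hypothetical helper: build an anchored, case-insensitive pattern for one
# scheme that tolerates control characters and whitespace before each
# character of "scheme:".
def scheme_pattern(scheme)
  body = "#{scheme}:".gsub(/./) do |char|
    "[\\x00-\\x1f\\x7f\\s]*#{Regexp.escape(char)}"
  end
  Regexp.new("\\A#{body}", Regexp::IGNORECASE)
end

pattern = scheme_pattern("javascript")
p pattern.match?("javascript:alert(1)")
p pattern.match?(" Jav\tascript:alert(1)")
p pattern.match?("http://example.com/?q=javascript:")
```

The first two inputs match, including the one that hides the scheme behind a leading space and an embedded tab. The third does not, because the pattern is anchored to the start of the string.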

.flatten(str) ⇒ Object

For all other feed elements:

  • Unescape HTML.

  • Parse HTML into tree (taking ‘body’ as root, if present)

  • Takes text out of each tag, and escapes HTML.

  • Returns all text concatenated.



# File 'lib/html-cleaner.rb', line 99

def flatten(str)
  str.gsub!("\n", " ")
  str = unescapeHTML(str)

  doc = Hpricot(str, :xhtml_strict => true)
  doc = subtree(doc, :body)

  out = []
  doc.traverse_text {|t| out << add_entities(t.to_html)}

  return out.join
end
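
One detail of the listing above is worth noting: its first line uses gsub!, which modifies the string object passed in by the caller. The sketch below contrasts that with the non-mutating gsub; both helper names are hypothetical and exist only for this comparison.

```ruby
# Hypothetical helpers contrasting in-place and copying substitution.
def squash_in_place(str)
  str.gsub!("\n", " ")   # mutates the caller's object
  str
end

def squash_copy(str)
  str.gsub("\n", " ")    # returns a new string, caller's object untouched
end

original = "line one\nline two".dup

squash_copy(original)
p original.include?("\n")   # the newline is still there

squash_in_place(original)
p original.include?("\n")   # the newline has been replaced
```

Callers that need to keep their input unchanged, or that pass a frozen string, may therefore want to hand flatten a copy.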

.unescapeHTML(str, xml = true) ⇒ Object

Unescapes HTML. If xml is true, also converts the XML-only named entity &apos; to its numeric form (&#39;) before unescaping.



# File 'lib/html-cleaner.rb', line 143

def unescapeHTML(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end
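
A short usage sketch of this method follows. It is a standalone copy under the hypothetical name unescape_html_sketch, so it can be run without loading the library; it relies only on CGI.unescapeHTML from Ruby's standard library.

```ruby
require "cgi"

# Standalone copy of the method above, under a hypothetical name.
def unescape_html_sketch(str, xml = true)
  # Rewrite the XML-only &apos; as a numeric entity, then let CGI decode it.
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end

puts unescape_html_sketch("it&apos;s &lt;b&gt;bold&lt;/b&gt; &amp; more")
puts unescape_html_sketch("&amp;lt;")
```

The second call shows that unescaping is a single pass: a doubly escaped &amp;lt; comes back as &lt; and is not decoded further.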