Class: FeedNormalizer::HtmlCleaner

Inherits:
Object
Defined in:
lib/html-cleaner.rb

Overview

Cleans and flattens HTML content from feeds: elements and attributes are filtered against whitelists, suspicious URIs are rejected, and text is re-escaped.

Constant Summary

HTML_ELEMENTS =

Allowed HTML elements.

%w(
  a abbr acronym address area b bdo big blockquote br button caption center
  cite code col colgroup dd del dfn dir div dl dt em fieldset font h1 h2 h3
  h4 h5 h6 hr i img ins kbd label legend li map menu ol optgroup p pre q s
  samp small span strike strong sub sup table tbody td tfoot th thead tr tt
  u ul var
)
HTML_ATTRS =

Allowed attributes.

%w(
  abbr accept accept-charset accesskey align alt axis border cellpadding
  cellspacing char charoff charset checked cite class clear cols colspan
  color compact coords datetime dir disabled for frame headers height href
  hreflang hspace id ismap label lang longdesc maxlength media method
  multiple name nohref noshade nowrap readonly rel rev rows rowspan rules
  scope selected shape size span src start summary tabindex target title
  type usemap valign value vspace width
)
HTML_URI_ATTRS =

Allowed attributes that can contain URIs; these require extra caution. NOTE: this is not a list of all URI attributes, only those that are allowed.

%w(
  href src cite usemap longdesc
)
DODGY_URI_SCHEMES =

URI schemes that are rejected as dangerous.

%w(
  javascript vbscript mocha livescript data
)

Class Method Summary

Class Method Details

.add_entities(str) ⇒ Object

Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; i.e. &#123; will NOT become &amp;#123;

This method could be improved by adding a whitelist of html entities.



# File 'lib/html-cleaner.rb', line 148

def add_entities(str)
  str.to_s.gsub(/\"/n, '&quot;').gsub(/>/n, '&gt;').gsub(/</n, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/nmi, '&amp;')
end
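The non-double-escaping behavior can be seen with a standalone copy of the method (the `/n` encoding flags are dropped here, since they only matter on pre-1.9 Rubies): a bare `&` is escaped, while existing entities are left alone.

```ruby
# Standalone copy of add_entities, for illustration only.
def add_entities(str)
  str.to_s.gsub(/"/, '&quot;').gsub(/>/, '&gt;').gsub(/</, '&lt;')
     .gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
end

add_entities('a & b &amp; <c>')  # => "a &amp; b &amp; &lt;c&gt;"
```

The negative lookahead is what distinguishes this from CGI.escapeHTML: `&amp;` already matches `\w{2,8};`, so its leading `&` is skipped.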

.clean(str) ⇒ Object

Does this:

  • Unescape HTML

  • Parse HTML into tree

  • Find ‘body’ if present, and extract tree inside that tag, otherwise parse whole tree

  • Each tag:

    • remove tag if not whitelisted

    • escape HTML tag contents

    • remove all attributes not on whitelist

    • extra-scrub URI attrs; see dodgy_uri?

Extra (i.e. unmatched) ending tags and comments are removed.



# File 'lib/html-cleaner.rb', line 59

def clean(str)
  str = unescapeHTML(str)

  doc = Hpricot(str, :fixup_tags => true)
  doc = subtree(doc, :body)

  # get all the tags in the document
  # Somewhere near hpricot 0.4.92, "*" started returning all elements,
  # including text nodes, instead of just tagged elements.
  tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq

  # Remove tags that aren't whitelisted.
  remove_tags!(doc, tags - HTML_ELEMENTS)
  remaining_tags = tags & HTML_ELEMENTS

  # Remove attributes that aren't on the whitelist, or are suspicious URLs.
  (doc/remaining_tags.join(",")).each do |element|
    element.raw_attributes.reject! do |attr,val|
      !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
    end

    element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
  end unless remaining_tags.empty?

  doc.traverse_text {|t| t.set(add_entities(t.to_html))}

  # Return the tree, without comments. Ugly way of removing comments,
  # but can't see a way to do this in Hpricot yet.
  doc.to_s.gsub(/<\!--.*?-->/mi, '')
end
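The attribute pass in the middle of clean can be sketched in isolation. This is a simplified stand-in, not the gem's API: the whitelist here is abbreviated, and a plain `javascript:` prefix check substitutes for dodgy_uri?.

```ruby
# Abbreviated whitelists, for illustration only.
ALLOWED_ATTRS = %w(href class title)
URI_ATTRS     = %w(href)

# Drop attributes that are not whitelisted, or that are URI attributes
# carrying a suspicious value (here crudely approximated).
def scrub_attributes(attrs)
  attrs.reject do |attr, val|
    !ALLOWED_ATTRS.include?(attr) ||
      (URI_ATTRS.include?(attr) && val =~ /\Ajavascript:/i)
  end
end

scrub_attributes("href" => "javascript:alert(1)",
                 "class" => "story", "onclick" => "evil()")
# => {"class"=>"story"}
```

Both rejection reasons fire here: `onclick` is not on the whitelist, and `href` is whitelisted but carries a dangerous URI, so only `class` survives.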

.dodgy_uri?(uri) ⇒ Boolean

Returns true if the given string contains a suspicious URL, e.g. a javascript link.

This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.

Returns:

  • (Boolean)


# File 'lib/html-cleaner.rb', line 113

def dodgy_uri?(uri)
  uri = uri.to_s

  # special case for poorly-formed entities (missing ';')
  # if these occur *anywhere* within the string, then throw it out.
  return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi)

  # Try escaping as both HTML or URI encodings, and then trying
  # each scheme regexp on each
  [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
    DODGY_URI_SCHEMES.each do |scheme|

      regexp = "#{scheme}:".gsub(/./) do |char|
        "([\000-\037\177\s]*)#{char}"
      end

      # regexp looks something like
      # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
      return true if (unesc_uri =~ %r{\A#{regexp}}mi)
    end
  end

  nil
end
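The scheme-matching loop needs no Hpricot and can be run on its own. This sketch reproduces the regexp construction: each character of the scheme is interleaved with an optional run of control characters and whitespace, so obfuscations like `jav\tascript:` are still caught.

```ruby
DODGY_URI_SCHEMES = %w(javascript vbscript mocha livescript data)

# Standalone sketch of the scheme check inside dodgy_uri?.
def dodgy_scheme?(uri)
  DODGY_URI_SCHEMES.each do |scheme|
    # Interleave each character with optional control/whitespace padding.
    regexp = "#{scheme}:".gsub(/./) { |char| "([\000-\037\177\s]*)#{char}" }
    return true if uri =~ %r{\A#{regexp}}mi
  end
  false
end

dodgy_scheme?(" jav\tascript:alert(1)")  # => true
dodgy_scheme?("http://example.com/")     # => false
```

Note that the real method also unescapes the URI as both HTML and URI encodings before matching, and returns nil (not false) for clean input.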

.flatten(str) ⇒ Object

For all other feed elements:

  • Unescape HTML.

  • Parse HTML into tree (taking ‘body’ as root, if present)

  • Takes text out of each tag, and escapes HTML.

  • Returns all text concatenated.



# File 'lib/html-cleaner.rb', line 95

def flatten(str)
  str.gsub!("\n", " ")
  str = unescapeHTML(str)

  doc = Hpricot(str, :xhtml_strict => true)
  doc = subtree(doc, :body)

  out = ""
  doc.traverse_text {|t| out << add_entities(t.to_html)}

  return out
end
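Since Hpricot is long unmaintained, the overall effect of flatten can be approximated with the standard library alone. This is a rough sketch, not the gem's implementation: a crude tag-stripping regexp stands in for the parse-tree text traversal, and CGI.escapeHTML stands in for add_entities.

```ruby
require 'cgi'

# Stdlib-only approximation of flatten: newlines become spaces, HTML is
# unescaped, tags are stripped, and the remaining text is re-escaped.
def flatten_sketch(str)
  str = str.tr("\n", " ")
  str = CGI.unescapeHTML(str.gsub("&apos;", "&#39;"))
  CGI.escapeHTML(str.gsub(/<[^>]+>/, ""))
end

flatten_sketch("<p>Hello &amp;\n<b>world</b></p>")  # => "Hello &amp; world"
```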

.unescapeHTML(str, xml = true) ⇒ Object

Unescapes HTML. If xml is true, also converts XML-only named entities to HTML.



# File 'lib/html-cleaner.rb', line 139

def unescapeHTML(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end
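The gsub exists because `&apos;` is an XML named entity, not an HTML one, so older versions of CGI.unescapeHTML would leave it untouched; rewriting it to the numeric form `&#39;` first makes it decode reliably. A standalone copy shows the effect:

```ruby
require 'cgi'

# Standalone copy of unescapeHTML, for illustration only.
def unescape_html(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end

unescape_html("it&apos;s &lt;b&gt;")  # => "it's <b>"
```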