Method: Treat::Workers::Formatters::Readers::HTML.read

Defined in:
lib/treat/workers/formatters/readers/html.rb

.read(document, options = {}) ⇒ Object

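Read the HTML document and reduce it to its main content. As the source below shows, the result is still HTML: an <h1> holding the title, followed by the content extracted by Readability::Document.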

Options:

text when cleaning the document (default: false).
  • (Boolean) :remove_empty_nodes => remove <p> tags that have no text content; also removes <p> tags that contain only images

  • (String) :encoding => if the page is of a known encoding, you can specify it; if left unspecified, the encoding will be guessed (only in Ruby 1.9.x)

  • (String) :html_headers => in Ruby 1.9.x these will be passed to the guess_html_encoding gem to aid with guessing the HTML encoding.

  • (Array of String) :tags => the base whitelist of tags to sanitize; defaults to %w[div p]

  • (Array of String) :attributes => list of allowed HTML attributes

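  • (Array of String) :ignore_image_format => image formats to skip when collecting the document's images, for example ["gif", "png"].

  • (Numeric) :min_image_height => minimum height an image must have to be included in the collected images.

  • (Numeric) :min_image_width => minimum width an image must have to be included in the collected images.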
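
The options above are merged over the reader's defaults before use (the source calls DefaultOptions.merge(options)). The sketch below illustrates that merge with plain Ruby hashes. DEFAULT_OPTIONS and effective_options are hypothetical names, and the default values shown are assumptions for illustration, not the actual contents of DefaultOptions in html.rb.

```ruby
# Hypothetical defaults; the real DefaultOptions in html.rb may differ.
DEFAULT_OPTIONS = {
  remove_empty_nodes: true,
  tags: %w[div p]
}.freeze

# Hash#merge returns a new hash in which caller-supplied keys override
# the defaults, while keys the caller omits keep their default values.
def effective_options(options = {})
  DEFAULT_OPTIONS.merge(options)
end

opts = effective_options(tags: %w[div p img], min_image_width: 100)
```

Because merge does not mutate its receiver, the defaults stay unchanged between calls, so each call to the reader starts from the same baseline.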



# File 'lib/treat/workers/formatters/readers/html.rb', line 38

def self.read(document, options = {})

  # Merge the caller's options over the reader's defaults
  # (the encoding is guessed when :encoding is not given)
  options = DefaultOptions.merge(options)
  html = File.read(document.file)

  silence_warnings do
    # Strip comments
    html.gsub!(/<!--[^>]*-->/m, '')
    d = Readability::Document.new(html, options)
    document.value = "<h1>#{d.title}</h1>\n" + d.content
    document.set :format, 'html'
    document.set :images, d.images
  end

  document

end
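
The comment-stripping step in the listing uses the pattern /<!--[^>]*-->/m, which cannot match a comment whose body contains a '>' character (for example, commented-out markup). The sketch below compares it with a non-greedy pattern. NON_GREEDY and strip_comments are hypothetical names introduced here for illustration and are not part of the library.

```ruby
# Pattern from the listing: [^>]* cannot cross a '>' inside the comment body.
ORIGINAL = /<!--[^>]*-->/m

# Assumed alternative: match lazily up to the first closing '-->'.
NON_GREEDY = /<!--.*?-->/m

# Return a copy of the HTML with every match of the pattern removed.
def strip_comments(html, pattern)
  html.gsub(pattern, '')
end

simple = "<p>a</p><!-- note --><p>b</p>"
nested = "<p>a</p><!-- <b>old</b> --><p>b</p>"

simple_original   = strip_comments(simple, ORIGINAL)
nested_original   = strip_comments(nested, ORIGINAL)
nested_non_greedy = strip_comments(nested, NON_GREEDY)
```

With these inputs the original pattern removes the simple comment but leaves the one containing markup in place, while the non-greedy pattern removes both. Neither pattern is a full HTML parser; conditional comments and comment-like text inside script blocks would need separate handling.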