Method: Treat::Workers::Formatters::Readers::HTML.read

Defined in: lib/treat/workers/formatters/readers/html.rb

.read(document, options = {}) ⇒ `Object`

Read the HTML document and strip it of its markup.

Options:

text when cleaning the document (default: false).

(Boolean) :remove_empty_nodes => remove <p> tags that have no text content; also removes <p> tags that contain only images.
(String) :encoding => the page encoding, if known; if left unspecified, the encoding is guessed (only in Ruby 1.9.x).
(String) :html_headers => in Ruby 1.9.x, passed to the guess_html_encoding gem to help it guess the HTML encoding.
(Array of String) :tags => the base whitelist of tags to keep when sanitizing; defaults to %w[div p].
(Array of String) :attributes => the list of allowed attributes.
(Array of String) :ignore_image_format => image formats to ignore when collecting images.
(Numeric) :min_image_height => minimum height for collected images.
(Numeric) :min_image_width => minimum width for collected images.
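
The options above are combined with the reader's defaults by a plain Hash#merge, so any key the caller supplies wins over the default and unknown keys pass through untouched. The following is a minimal sketch of that behavior only; EXAMPLE_DEFAULTS and effective_options are hypothetical names, and the values are placeholders rather than the real contents of DefaultOptions.

```ruby
# Hypothetical defaults for illustration only; the real DefaultOptions
# constant is defined elsewhere in html.rb and is not shown on this page.
EXAMPLE_DEFAULTS = {
  :tags => %w[div p],
  :remove_empty_nodes => false
}.freeze

# Mirrors the merge performed at the top of .read: caller options win
# over defaults, and keys absent from the defaults are kept as given.
def effective_options(options = {})
  EXAMPLE_DEFAULTS.merge(options)
end

opts = effective_options(:tags => %w[div p h1], :min_image_width => 100)
puts opts.inspect
```

Because merge returns a new Hash, the defaults themselves are never modified by a call.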

# File 'lib/treat/workers/formatters/readers/html.rb', line 38

def self.read(document, options = {})

  # Merge the caller's options over the defaults
  options = DefaultOptions.merge(options)
  html = File.read(document.file)

  silence_warnings do
    # Strip comments
    html.gsub!(/<!--[^>]*-->/m, '')
    d = Readability::Document.new(html, options)
    document.value = "<h1>#{d.title}</h1>\n" + d.content
    document.set :format, 'html'
    document.set :images, d.images
  end

  document

end
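
The comment-stripping step in the listing uses the pattern /<!--[^>]*-->/m, which only matches comments whose body contains no ">" character. The sketch below compares it with a non-greedy alternative on a small sample. The alternative pattern, the constant names, and strip_comments are assumptions introduced for illustration; they are not part of the library.

```ruby
# Pattern taken from the listing: the comment body may not contain '>'.
SOURCE_PATTERN = /<!--[^>]*-->/m

# Alternative (an assumption, not the library's code): a non-greedy match
# up to the first '-->', which allows '>' inside the comment body.
LAZY_PATTERN = /<!--.*?-->/m

# Returns a copy of html with every match of pattern removed.
def strip_comments(html, pattern)
  html.gsub(pattern, '')
end

sample = "<p>a</p><!-- note --><p>b</p><!-- if x > 1 --><p>c</p>"

puts strip_comments(sample, SOURCE_PATTERN)
# prints: <p>a</p><p>b</p><!-- if x > 1 --><p>c</p>

puts strip_comments(sample, LAZY_PATTERN)
# prints: <p>a</p><p>b</p><p>c</p>
```

With the pattern from the listing, the second comment survives because of the ">" in its body. Whether that matters depends on the input documents.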
