Class: Distillery::Document

Inherits:
SimpleDelegator
  • Object
Defined in:
lib/distillery/document.rb

Overview

Wraps a Nokogiri document for the HTML page to be distilled and holds all methods to clean and distill the document down to just its content element.

Constant Summary

UNLIKELY_TAGS =

HTML elements unlikely to contain the content element.

%w[head script link meta]
UNLIKELY_IDENTIFIERS =

HTML ids and classes that are unlikely to contain the content element.

/combx|comment|community|disqus|extra|foot|header|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup/i
REMOVAL_WHITELIST =

Elements that are whitelisted from being removed as unlikely elements

%w[a body]
BLOCK_ELEMENTS =

“Block” elements whose presence signals that their parent is less likely to be the content element.

%w[a blockquote dl div img ol p pre table ul]
POSITIVE_IDENTIFIERS =

HTML ids and classes that are positive signals of the content element.

/article|body|content|entry|hentry|page|pagination|post|text/i
NEGATIVE_IDENTIFIERS =

HTML ids and classes that are negative signals of the content element.

/combx|comment|contact|foot|footer|footnote|link|media|promo|related|scroll|shoutbox|sponsor|tags|widget/i
UNRELATED_ELEMENTS =

HTML elements that are unrelated to the content in the content element.

%w[iframe form object]
POSSIBLE_UNRELATED_ELEMENTS =

HTML elements that are possibly unrelated to the content of the content element.

%w[table ul div a]
0.045
DOM_PRIORITIZATION =

The prioritization level given to elements higher in the DOM

25

Constructor Details

#initialize(page_string) ⇒ Document

Create a new Document

Parameters:

  • page_string (String)

    The HTML document to distill as a string.



# File 'lib/distillery/document.rb', line 51

def initialize(page_string)
  @scores = Hash.new(0)
  super(::Nokogiri::HTML(page_string))
end
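
Because the class inherits from SimpleDelegator, any method it does not define itself (for example `search`) is forwarded to the wrapped object. The following is a minimal sketch of that pattern using only the standard library; `FakeDocument` and `ScoredWrapper` are hypothetical stand-ins and are not part of Distillery or Nokogiri.

```ruby
require 'delegate'

# Hypothetical stand-in for a parsed document (not a Nokogiri class).
class FakeDocument
  def search(*selectors)
    selectors.map { |selector| "match for #{selector}" }
  end
end

# Same shape as the constructor above: set up local state, then hand the
# wrapped object to SimpleDelegator through super.
class ScoredWrapper < SimpleDelegator
  attr_reader :scores

  def initialize(wrapped)
    @scores = Hash.new(0) # missing keys read as 0, so += works on new keys
    super(wrapped)
  end
end

wrapper = ScoredWrapper.new(FakeDocument.new)
wrapper.search('div', 'p')          # forwarded to FakeDocument#search
wrapper.scores['/html/body'] += 2   # no need to initialize the key first
wrapper.__getobj__                  # returns the wrapped FakeDocument
```

The `Hash.new(0)` default is what allows scoring code to accumulate with `+=` on paths it has not seen before.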

Instance Attribute Details

#doc ⇒ Object (readonly)

The Nokogiri document



# File 'lib/distillery/document.rb', line 43

def doc
  @doc
end

#scores ⇒ Object (readonly)

Hash of xpath => content score of elements in this document



# File 'lib/distillery/document.rb', line 46

def scores
  @scores
end

Instance Method Details

#clean_top_scoring_elements!(options = {}) ⇒ Object

Attempts to strip non-content items, such as advertisements and widgets, from the top scoring elements.



# File 'lib/distillery/document.rb', line 118

def clean_top_scoring_elements!(options = {})
  keep_images = !!options[:images]

  top_scoring_elements.each do |element|

    element.search("*").each do |node|
      next if contains_content_image?(node) && keep_images
      node.remove if has_empty_text?(node)
    end

    element.search("*").each do |node|
      next if contains_content_image?(node) && keep_images
      if UNRELATED_ELEMENTS.include?(node.name) ||
        (node.text.count(',') < 2 && unlikely_to_be_content?(node))
        node.remove
      end
    end
  end
end
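
The removal test in the second pass can be read in isolation. The sketch below restates it on plain values; the module and method names are hypothetical, the `unlikely` argument stands in for the result of `unlikely_to_be_content?(node)` (whose implementation is not shown on this page), and the `keep_images` guard is left out for brevity.

```ruby
module CleanSketch
  UNRELATED_ELEMENTS = %w[iframe form object].freeze

  # name     - stands in for node.name
  # text     - stands in for node.text
  # unlikely - assumed result of unlikely_to_be_content?(node)
  def self.remove_in_second_pass?(name:, text:, unlikely:)
    UNRELATED_ELEMENTS.include?(name) || (text.count(',') < 2 && unlikely)
  end
end

# Unrelated tags are removed regardless of their text.
CleanSketch.remove_in_second_pass?(name: 'form', text: 'a, b, c', unlikely: false) # => true

# Fewer than two commas plus an "unlikely" verdict leads to removal.
CleanSketch.remove_in_second_pass?(name: 'div', text: 'Share this', unlikely: true) # => true

# Two or more commas protect a node even when it looks unlikely.
CleanSketch.remove_in_second_pass?(name: 'div', text: 'First, second, third', unlikely: true) # => false
```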

#distill!(options = {}) ⇒ Object

Distills the document down to just its content.

Parameters:

  • options (Hash) (defaults to: {})

    Distillation options

Options Hash (options):

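  • :clean (Boolean)

    Pass false to skip cleaning the content element HTML.

  • :images (Boolean)

    Passed through to #clean_top_scoring_elements!; when truthy, nodes that contain content images are kept.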



# File 'lib/distillery/document.rb', line 106

def distill!(options = {})
  remove_irrelevant_elements!
  remove_unlikely_elements!

  score!

  clean_top_scoring_elements!(options) unless options.delete(:clean) == false
  top_scoring_elements.map(&:inner_html).join("\n")
end
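
The guard `options.delete(:clean) == false` skips cleaning only when `:clean` is explicitly `false`. A small sketch of that behavior on plain hashes follows; the helper name `clean_requested?` is hypothetical and exists only for illustration.

```ruby
# Restates the guard from distill!: true means the cleaning step would run.
def clean_requested?(options = {})
  !(options.delete(:clean) == false)
end

clean_requested?({})                # => true  (key absent, delete returns nil)
clean_requested?({ clean: true })   # => true
clean_requested?({ clean: nil })    # => true  (nil is not false)
clean_requested?({ clean: false })  # => false

# delete also mutates the hash, so :clean is gone before the remaining
# options are handed to the cleaning step.
opts = { clean: true, images: true }
clean_requested?(opts)
opts.key?(:clean)  # => false
opts[:images]      # => true
```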

#mark_scorable_elements! ⇒ Object

Marks elements that are suitable for scoring with a special HTML attribute



# File 'lib/distillery/document.rb', line 75

def mark_scorable_elements!
  search('div', 'p').each do |element|
    if element.name == 'p' || scorable_div?(element)
      element['data-distillery'] = 'scorable'
    end
  end
end

#remove_irrelevant_elements!(tags = UNLIKELY_TAGS) ⇒ Object

Removes irrelevant elements from the document. These are usually things like <script>, <link>, and other page elements we don’t care about.



# File 'lib/distillery/document.rb', line 58

def remove_irrelevant_elements!(tags = UNLIKELY_TAGS)
  search(*tags).each(&:remove)
end

#remove_unlikely_elements! ⇒ Object

Removes unlikely elements from the document. These are elements whose class or id suggests they are comments, headers, footers, nav, etc.



# File 'lib/distillery/document.rb', line 64

def remove_unlikely_elements!
  search('*').each do |element|
    idclass = "#{element['class']}#{element['id']}"

    if idclass =~ UNLIKELY_IDENTIFIERS && !REMOVAL_WHITELIST.include?(element.name)
      element.remove
    end
  end
end
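
The matching rule can be tried without a parsed document. The sketch below copies the two constants from this page into a hypothetical `UnlikelySketch` module and applies the same test to plain strings; the keyword arguments stand in for `element.name`, `element['class']`, and `element['id']`.

```ruby
module UnlikelySketch
  UNLIKELY_IDENTIFIERS = /combx|comment|community|disqus|extra|foot|header|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup/i
  REMOVAL_WHITELIST = %w[a body].freeze

  def self.removable?(name:, klass: nil, id: nil)
    idclass = "#{klass}#{id}" # nil interpolates as an empty string
    idclass.match?(UNLIKELY_IDENTIFIERS) && !REMOVAL_WHITELIST.include?(name)
  end
end

UnlikelySketch.removable?(name: 'div', klass: 'sidebar')       # => true
UnlikelySketch.removable?(name: 'div', id: 'Footer')           # => true  (case-insensitive, "foot" matches)
UnlikelySketch.removable?(name: 'div', klass: 'page-header')   # => true  (substring match on "header")
UnlikelySketch.removable?(name: 'body', klass: 'has-sidebar')  # => false (whitelisted tag)
UnlikelySketch.removable?(name: 'div', klass: 'article-body')  # => false
UnlikelySketch.removable?(name: 'div')                         # => false (no class or id)
```

Since the pattern is unanchored, any class or id that merely contains one of the listed fragments is treated as unlikely.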

#score! ⇒ Object

Scores the document elements based on an algorithm to find elements which hold page content.



# File 'lib/distillery/document.rb', line 85

def score!
  mark_scorable_elements!

  scorable_elements.each do |element|
    points = 1
    points += element.text.split(',').length
    points += [element.text.length / 100, 3].min

    scores[element.path] = points
    scores[element.parent.path] += points
    scores[element.parent.parent.path] += points.to_f/2
  end

  augment_scores_by_link_weight!
  augment_scores_by_dom_depth!
end
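
To make the arithmetic concrete, the sketch below reproduces the per-element points and the propagation to parent and grandparent on plain strings and a plain hash. The `ScoreSketch` module, its method names, and the example paths are hypothetical; the link-weight and DOM-depth adjustments are not covered because their implementations are not shown on this page.

```ruby
module ScoreSketch
  # Base points for one scorable element, following the three steps in score!.
  def self.points_for(text)
    points = 1
    points += text.split(',').length      # number of comma-separated segments
    points += [text.length / 100, 3].min  # 1 point per 100 characters, capped at 3
    points
  end

  # paths is [element_path, parent_path, grandparent_path].
  def self.propagate(scores, paths, points)
    element, parent, grandparent = paths
    scores[element] = points
    scores[parent] += points
    scores[grandparent] += points.to_f / 2
    scores
  end
end

ScoreSketch.points_for('')                 # => 1 (empty text splits into no segments)
ScoreSketch.points_for('Hello')            # => 2 (1 + 1 segment + 0)
ScoreSketch.points_for('One, two, three')  # => 4 (1 + 3 segments + 0)
ScoreSketch.points_for('a' * 250)          # => 4 (1 + 1 segment + 2)

scores = Hash.new(0)
ScoreSketch.propagate(scores, ['/html/body/div/p[1]', '/html/body/div', '/html/body'], 4)
ScoreSketch.propagate(scores, ['/html/body/div/p[2]', '/html/body/div', '/html/body'], 2)
scores['/html/body/div']  # => 6
scores['/html/body']      # => 3.0
```

A container accumulates the points of all its scorable children, while the grandparent receives half of each, so a wrapper holding several text-heavy paragraphs ends up with the highest score.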