Class: Scrubyt::Constraint

Inherits:
Object
Defined in:
lib/scrubyt/core/scraping/constraint.rb

Overview

Rejecting result instances based on further rules

The two most basic problems with a set of rules are that they match either fewer or more instances than we would like. Constraints remedy the second problem: they filter out some result instances based on rules. A typical example:

  • ensure_presence_of_ancestor_pattern: consider this model:

    <book>
      <author>...</author>
      <title>...</title>
    </book>
    

If I attach the ensure_presence_of_ancestor_pattern constraint to the pattern ‘book’ with the values ‘author’ and ‘title’, only those books are matched that have an author and a title (i.e. the child patterns author and title must extract something). This is a way to say ‘a book MUST have an author and a title’.
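The filtering effect described above can be sketched in plain Ruby, independent of scRUBYt itself. The hash-based records and the `required` list are hypothetical stand-ins for extracted results, not the library's own data structures:

```ruby
# Hypothetical extracted results: each book maps a child pattern name to
# whatever that child pattern extracted (nil or "" means it extracted nothing).
books = [
  { 'author' => 'Herman Melville', 'title' => 'Moby-Dick' },
  { 'author' => nil,               'title' => 'Untitled' },
  { 'author' => 'Anonymous',       'title' => '' }
]

# Child patterns that must extract something for a book to be kept.
required = %w[author title]

# Keep only the books for which every required child extracted something.
kept = books.select do |book|
  required.all? { |name| !book[name].to_s.strip.empty? }
end

puts kept.length
```

Only the first record survives, because the other two are missing an author or a title.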

Constant Summary

Different constraint types:

CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_PATTERN = 0
CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_ATTRIBUTE = 1
CONSTRAINT_TYPE_ENSURE_ABSENCE_OF_ATTRIBUTE = 2
CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_ANCESTOR_NODE = 3
CONSTRAINT_TYPE_ENSURE_ABSENCE_OF_ANCESTOR_NODE = 4

Instance Attribute Details

#target ⇒ Object (readonly)

Returns the value of attribute target.



# File 'lib/scrubyt/core/scraping/constraint.rb', line 46

def target
  @target
end

#type ⇒ Object (readonly)

Returns the value of attribute type.



# File 'lib/scrubyt/core/scraping/constraint.rb', line 46

def type
  @type
end

Class Method Details

.add_ensure_absence_of_ancestor_node(node_name, attributes) ⇒ Object

If this type of constraint is added to a pattern, the HTML node extracted by the pattern must NOT have an HTML ancestor node called ‘node_name’ with the attribute set ‘attributes’.

“attributes” is an array of hashes, for example

[{… => ‘red’}, {… => ‘www.google.com’}]

If several values have to be checked for the same key (e.g. ‘class’ => ‘small’ and ‘class’ => ‘wide’), it has to be written as [{‘class’ => [‘small’, ‘wide’]}].

“attributes” can be empty; in this case only ‘node_name’ is checked.



# File 'lib/scrubyt/core/scraping/constraint.rb', line 89

def self.add_ensure_absence_of_ancestor_node(node_name, attributes)
  Constraint.new([node_name, attributes],
                 CONSTRAINT_TYPE_ENSURE_ABSENCE_OF_ANCESTOR_NODE)
end
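To make the shape of the “attributes” argument concrete, here is a minimal sketch of an ancestor check in plain Ruby. The `Node` struct and `ancestor_node_matches?` are hypothetical and are not part of scRUBYt. The sketch also assumes that an array of values for one key means "any of these values", which the text above does not confirm:

```ruby
# Hypothetical node representation: a name, an attribute hash and a parent.
Node = Struct.new(:name, :attributes, :parent)

# Returns true if some ancestor of +node+ is called +node_name+ and satisfies
# every hash in +attributes+. An empty +attributes+ array checks the name only.
# Assumption: an array value means "the attribute equals any of these values".
def ancestor_node_matches?(node, node_name, attributes)
  current = node.parent
  while current
    name_ok = current.name == node_name
    attrs_ok = attributes.all? do |pair|
      pair.all? { |key, expected| Array(expected).include?(current.attributes[key]) }
    end
    return true if name_ok && attrs_ok
    current = current.parent
  end
  false
end

table = Node.new('table', { 'class' => 'wide' }, nil)
row   = Node.new('tr', {}, table)
cell  = Node.new('td', {}, row)

# An absence constraint passes only when no matching ancestor exists.
absence_passes = !ancestor_node_matches?(cell, 'table', [{ 'class' => ['small', 'wide'] }])
puts absence_passes
```

Here the cell sits inside a table with class ‘wide’, so an absence constraint built from these arguments would reject it, while the corresponding presence constraint would accept it.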

.add_ensure_absence_of_attribute(attribute_hash) ⇒ Object

If this type of constraint is added to a pattern, the HTML node it targets must NOT have an attribute named “attribute_name” with the value “attribute_value”; the name and value are passed in attribute_hash.



# File 'lib/scrubyt/core/scraping/constraint.rb', line 64

def self.add_ensure_absence_of_attribute(attribute_hash)
  Constraint.new(attribute_hash,
                 CONSTRAINT_TYPE_ENSURE_ABSENCE_OF_ATTRIBUTE)
end

.add_ensure_presence_of_ancestor_node(node_name, attributes) ⇒ Object

If this type of constraint is added to a pattern, the HTML node extracted by the pattern must have an HTML ancestor node called ‘node_name’ with the attribute set ‘attributes’.

“attributes” is an array of hashes, for example

[{… => ‘red’}, {… => ‘www.google.com’}]

If several values have to be checked for the same key (e.g. ‘class’ => ‘small’ and ‘class’ => ‘wide’), it has to be written as [{‘class’ => [‘small’, ‘wide’]}].

“attributes” can be empty; in this case only ‘node_name’ is checked.



# File 'lib/scrubyt/core/scraping/constraint.rb', line 105

def self.add_ensure_presence_of_ancestor_node(node_name, attributes)
  Constraint.new([node_name, attributes],
                 CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_ANCESTOR_NODE)
end

.add_ensure_presence_of_attribute(attribute_hash) ⇒ Object

If this type of constraint is added to a pattern, the HTML node it targets must have an attribute named “attribute_name” with the value “attribute_value”; the name and value are passed in attribute_hash.



# File 'lib/scrubyt/core/scraping/constraint.rb', line 73

def self.add_ensure_presence_of_attribute(attribute_hash)
  Constraint.new(attribute_hash,
                 CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_ATTRIBUTE)
end
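The two attribute constraints differ only in whether a match is required or forbidden. A minimal sketch, assuming that attribute_hash maps an attribute name to its expected value and that a node's attributes can be read like a hash (`attribute_matches?` and `link_attributes` are hypothetical names, not library API):

```ruby
# Hypothetical stand-in for the attributes of one extracted HTML node.
link_attributes = { 'href' => 'www.google.com', 'class' => 'external' }

# True if every name/value pair in attribute_hash is found on the node.
def attribute_matches?(node_attributes, attribute_hash)
  attribute_hash.all? { |name, value| node_attributes[name] == value }
end

attribute_hash = { 'class' => 'external' }

# A presence constraint requires the match; an absence constraint forbids it.
presence_passes = attribute_matches?(link_attributes, attribute_hash)
absence_passes  = !attribute_matches?(link_attributes, attribute_hash)

puts presence_passes
puts absence_passes
```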

.add_ensure_presence_of_pattern(ancestor) ⇒ Object

If this type of constraint is added to a pattern, the pattern must have an ancestor pattern (a child pattern, or a child pattern of a child pattern, etc.) denoted by “ancestor”. ‘Has an ancestor pattern’ means that the ancestor pattern actually extracts something (judging by the wrapper model alone, the ancestor pattern is always present). Note that there is no ‘ensure_absence’ version of this constraint type, since I could not think of a use case for it.



# File 'lib/scrubyt/core/scraping/constraint.rb', line 56

def self.add_ensure_presence_of_pattern(ancestor)
  Constraint.new(ancestor, CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_PATTERN)
end

Instance Method Details

#check(result) ⇒ Object

Evaluate the constraint. If this function returns true, the constraint passed, i.e. its filter will be added to the extracted content of the pattern.



# File 'lib/scrubyt/core/scraping/constraint.rb', line 113

def check(result)
  case @type
    # This type is checked after evaluation, so always return true here
    when CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_PATTERN
      return true
    when CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_ATTRIBUTE
      attribute_present(result)
    when CONSTRAINT_TYPE_ENSURE_ABSENCE_OF_ATTRIBUTE
      !attribute_present(result)
    when CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_ANCESTOR_NODE
      ancestor_node_present(result)
    when CONSTRAINT_TYPE_ENSURE_ABSENCE_OF_ANCESTOR_NODE
      !ancestor_node_present(result)
  end
end
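The dispatch above can be exercised with a simplified stand-in. `SketchConstraint` below is hypothetical: it models only the two attribute types, treats results as plain attribute hashes, and supplies its own `attribute_present` helper, since the real helper is not shown on this page. How scRUBYt itself applies the return value of #check is likewise an assumption:

```ruby
# Simplified stand-in for Scrubyt::Constraint; only two constraint types are
# modelled and results are plain attribute hashes (both are assumptions).
class SketchConstraint
  PRESENCE_OF_ATTRIBUTE = 1
  ABSENCE_OF_ATTRIBUTE  = 2

  def initialize(target, type)
    @target = target
    @type = type
  end

  # Mirrors the shape of #check: the absence type negates the presence test.
  def check(result)
    case @type
    when PRESENCE_OF_ATTRIBUTE then attribute_present(result)
    when ABSENCE_OF_ATTRIBUTE  then !attribute_present(result)
    end
  end

  private

  # Hypothetical helper: every name/value pair of the target is on the result.
  def attribute_present(result)
    @target.all? { |name, value| result[name] == value }
  end
end

results = [{ 'class' => 'ad' }, { 'class' => 'story' }, {}]
constraints = [
  SketchConstraint.new({ 'class' => 'ad' }, SketchConstraint::ABSENCE_OF_ATTRIBUTE)
]

# Keep a result only if every constraint passes for it.
kept = results.select { |result| constraints.all? { |c| c.check(result) } }
puts kept.length
```

In this sketch the result with class ‘ad’ is filtered out, and the other two results are kept.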