Class: Scrubyt::BaseFilter

Inherits:
Object
Defined in:
lib/scrubyt/core/scraping/filters/base_filter.rb

Overview

Filter out relevant pieces from the parent pattern

A Scrubyt extractor is almost like a waterfall: water pours in at the top and flows down until it reaches the bottom. The biggest difference is that, instead of water, an HTML document travels through it.

Of course, Scrubyt would not make much sense if the document arriving at the bottom were the same one that was poured in at the top; in that case we might as well use an identity transformation (i.e. do nothing with the input).

This is where filters come in: as their name says, they filter the stuff pouring from above, keeping the interesting parts and discarding the rest. How a filter works is most easily explained with an example. Suppose we would like to extract information from a webshop; concretely, we are interested in the names of the items and the URLs pointing to the images of the items.

To accomplish this, we first select the items with the pattern item (a pattern is a logical grouping of filters; see the Pattern documentation). Our new context is then the result extracted by the ‘item’ pattern: for every ‘item’, we further extract the name and the image of the item, and finally we extract the href attribute of the image. Let’s see an illustration:

root             --> This pattern is called a 'root pattern'. It is invisible to you
|                    and basically it represents the document; it has no filters
+-- item         --> Filter what's coming from above (the whole document) to get
    |                relevant pieces of data (in this case webshop items)
    +-- name     --> Again, filter what's coming from above (a webshop item) and
    |                leave only item names after this operation
    +-- image    --> This time filter the image of the item
        |
        +-- href --> And finally, from the image elements, get the attribute 'href'
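
The waterfall in the illustration can be approximated in plain Ruby. This is only a sketch of the idea, not Scrubyt's implementation: the nested hash stands in for a parsed HTML document, the second product and all prices and paths are made-up sample data, and the constant and method names (WEBSHOP_PAGE, ITEM_FILTER, run_waterfall and so on) are hypothetical.

```ruby
# Hypothetical stand-in for a parsed webshop page (not Scrubyt's data model).
WEBSHOP_PAGE = {
  items: [
    { name: 'Canon EOS 300 D', image: { href: '/img/canon.jpg' }, price: '599' },
    { name: 'Nikon D70',       image: { href: '/img/nikon.jpg' }, price: '649' }
  ],
  footer: 'Copyright notice'
}.freeze

# Each lambda plays the role of one filter: it receives whatever the level
# above passed down and returns only the part the next level needs.
ITEM_FILTER  = ->(document) { document[:items] } # root  -> items
NAME_FILTER  = ->(item)     { item[:name] }      # item  -> name
IMAGE_FILTER = ->(item)     { item[:image] }     # item  -> image
HREF_FILTER  = ->(image)    { image[:href] }     # image -> href

# Pours the document through the levels in the order shown in the diagram.
def run_waterfall(document)
  ITEM_FILTER.call(document).map do |item|
    { name: NAME_FILTER.call(item),
      href: HREF_FILTER.call(IMAGE_FILTER.call(item)) }
  end
end

run_waterfall(WEBSHOP_PAGE).each { |row| puts "#{row[:name]}: #{row[:href]}" }
```

Everything that no filter selects (the price and the footer in this sample) never reaches the bottom, which is the point of the waterfall analogy.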

Constant Summary

EXAMPLE_TYPE_XPATH = 0

XPath example, like html/body/tr/td etc.

EXAMPLE_TYPE_STRING = 1

String from the document, for example ‘Canon EOS 300 D’.

EXAMPLE_TYPE_IMAGE = 2

Image example, like ‘

EXAMPLE_TYPE_CHILDREN = 3

No example - the actual XPath is determined from the children XPaths (their LCA)

EXAMPLE_TYPE_REGEXP = 4

Regexp example, like /d+@*d+/

EXAMPLE_TYPE_COMPOUND = 5

Compound example, like :contains => ‘goodies’
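
To make the example types more concrete, here is a sketch of how an example value could be mapped to one of the constants. The constant values are copied from the summary; the helper guess_example_type, the XPATH_LIKE pattern and the classification rules are assumptions made for illustration and are not taken from Scrubyt's source. The image type is left out because this page does not show how such an example is recognized.

```ruby
# Constant values as listed in the summary.
EXAMPLE_TYPE_XPATH    = 0
EXAMPLE_TYPE_STRING   = 1
EXAMPLE_TYPE_IMAGE    = 2
EXAMPLE_TYPE_CHILDREN = 3
EXAMPLE_TYPE_REGEXP   = 4
EXAMPLE_TYPE_COMPOUND = 5

# Assumed heuristic: a string made only of slash-separated element names
# (optionally indexed, e.g. tr[2]) is treated as an XPath.
XPATH_LIKE = %r{\A/{0,2}[\w*]+(\[\d+\])?(/[\w*]+(\[\d+\])?)+\z}

# Hypothetical helper: guesses an example type from the example's class
# and shape. Illustrative rules only, not Scrubyt's actual logic.
def guess_example_type(example)
  case example
  when nil    then EXAMPLE_TYPE_CHILDREN # no example given
  when Regexp then EXAMPLE_TYPE_REGEXP
  when Hash   then EXAMPLE_TYPE_COMPOUND # e.g. :contains => 'goodies'
  when String
    example.match?(XPATH_LIKE) ? EXAMPLE_TYPE_XPATH : EXAMPLE_TYPE_STRING
  else
    raise ArgumentError, "unsupported example: #{example.inspect}"
  end
end

puts guess_example_type('html/body/tr/td')
puts guess_example_type('Canon EOS 300 D')
```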

Dynamic Method Handling

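This class handles dynamic methods through the method_missing method. A call whose name matches /^ensure.+/ is forwarded to the Constraint class method named add_ followed by the called method name, and the returned constraint is appended to constraints; any other call falls through to throw_method_missing.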

#method_missing(method_name, *args, &block) ⇒ Object



# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 69

def method_missing(method_name, *args, &block)
  case method_name.to_s
  when /^ensure.+/
    constraints << Constraint.send("add_#{method_name.to_s}".to_sym, self, *args)
  else
    throw_method_missing(method_name, *args, &block)
  end
end
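
The dispatch pattern used by method_missing can be tried in isolation. The sketch below is self-contained and makes several assumptions: SketchConstraint and SketchFilter are hypothetical stand-ins for Constraint and BaseFilter, and the single add_ensure_presence_of_attribute factory exists only so that the example runs. Unlike the listing above, it falls back to super rather than an alias and also defines respond_to_missing?, which is the usual companion of method_missing in current Ruby.

```ruby
# Hypothetical stand-in for Scrubyt's Constraint class; only the
# add_ensure_* naming convention is taken from the listing above.
class SketchConstraint
  attr_reader :kind, :filter, :args

  def initialize(kind, filter, args)
    @kind = kind
    @filter = filter
    @args = args
  end

  def self.add_ensure_presence_of_attribute(filter, *args)
    new(:presence_of_attribute, filter, args)
  end
end

# Hypothetical stand-in for a filter that collects constraints.
class SketchFilter
  attr_reader :constraints

  def initialize
    @constraints = []
  end

  # Routes ensure_* calls to the matching SketchConstraint factory and
  # stores the result; everything else raises NoMethodError via super.
  def method_missing(method_name, *args, &block)
    if ensure_call?(method_name)
      constraints << SketchConstraint.public_send("add_#{method_name}", self, *args)
    else
      super
    end
  end

  # Keeps respond_to? consistent with what method_missing accepts.
  def respond_to_missing?(method_name, include_private = false)
    ensure_call?(method_name) || super
  end

  private

  def ensure_call?(method_name)
    method_name.to_s.start_with?('ensure_') &&
      SketchConstraint.respond_to?("add_#{method_name}")
  end
end

filter = SketchFilter.new
filter.ensure_presence_of_attribute('src')
puts filter.constraints.length
```

Checking that the factory exists before forwarding means a misspelled ensure_ call fails with NoMethodError on the filter itself instead of on the constraint class.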

Instance Attribute Details

#constraints ⇒ Object

Returns the value of attribute constraints.

# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 52

def constraints
  @constraints
end

#example ⇒ Object

Returns the value of attribute example.

# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 52

def example
  @example
end

#example_type ⇒ Object

Returns the value of attribute example_type.

# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 52

def example_type
  @example_type
end

#final_result ⇒ Object

Returns the value of attribute final_result.

# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 52

def final_result
  @final_result
end

#parent_pattern ⇒ Object

Returns the value of attribute parent_pattern.

# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 52

def parent_pattern
  @parent_pattern
end

#regexp ⇒ Object

Returns the value of attribute regexp.

# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 52

def regexp
  @regexp
end

#temp_sink ⇒ Object

Returns the value of attribute temp_sink.

# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 52

def temp_sink
  @temp_sink
end

#xpath ⇒ Object

Returns the value of attribute xpath.

# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 52

def xpath
  @xpath
end

Class Method Details

.create(parent_pattern, example = nil) ⇒ Object



# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 55

def self.create(parent_pattern, example=nil)
  filter_name = (parent_pattern.type.to_s.split("_").map!{|e| e.capitalize }.join) + 'Filter'
  if filter_name == 'RootFilter'
    BaseFilter.new(parent_pattern, example)
  else
    instance_eval("#{filter_name}.new(parent_pattern, example)")
  end
end
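
The name derivation in create turns a pattern type such as detail_page into a class name such as DetailPageFilter. The following sketch reproduces that step and looks the class up with const_get, a common alternative to evaluating a string. It is an illustration under assumptions rather than the library's code: the FilterSketch module, its two subclasses, the filter_name_for helper and the SketchPattern struct are all hypothetical.

```ruby
module FilterSketch
  # Hypothetical base class mirroring the constructor arguments of create.
  class BaseFilter
    attr_reader :parent_pattern, :example

    def initialize(parent_pattern, example = nil)
      @parent_pattern = parent_pattern
      @example = example
    end

    # :detail_page -> "DetailPageFilter", :tree -> "TreeFilter"
    def self.filter_name_for(type)
      type.to_s.split('_').map(&:capitalize).join + 'Filter'
    end

    # The root pattern gets a plain BaseFilter; every other type is resolved
    # to a class by name. An unknown type raises NameError.
    def self.create(parent_pattern, example = nil)
      name = filter_name_for(parent_pattern.type)
      return BaseFilter.new(parent_pattern, example) if name == 'RootFilter'

      FilterSketch.const_get(name).new(parent_pattern, example)
    end
  end

  # Stand-in subclasses so that the lookup has something to find.
  class TreeFilter < BaseFilter; end
  class DetailPageFilter < BaseFilter; end
end

# Minimal pattern object exposing only the type that create reads.
SketchPattern = Struct.new(:type)

puts FilterSketch::BaseFilter.create(SketchPattern.new(:detail_page)).class
```

Resolving the class with const_get limits the lookup to constants inside one module, whereas evaluating a string would run whatever code the type name happened to contain.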

Instance Method Details

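#throw_method_missing ⇒ Object

TODO still used?

Alias of the inherited method_missing, created on line 68 just before the override on line 69 is defined. The override shown under Dynamic Method Handling calls it for every method name that does not match /^ensure.+/.
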
# File 'lib/scrubyt/core/scraping/filters/base_filter.rb', line 68

alias_method :throw_method_missing, :method_missing
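
Because alias_method copies the method as it exists at the moment of the call, placing it before a redefinition preserves the earlier behaviour under the new name. A minimal sketch of that ordering effect, using a hypothetical AliasSketch class with made-up method names:

```ruby
class AliasSketch
  def greet
    'original'
  end

  # Captures the current body of greet under a second name.
  alias_method :original_greet, :greet

  # Redefining greet afterwards does not affect original_greet.
  def greet
    "overridden, then #{original_greet}"
  end
end

sketch = AliasSketch.new
puts sketch.greet
puts sketch.original_greet
```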