Class: Pollex::Scraper

Inherits:
Object
Includes:
Singleton
Defined in:
lib/pollex/scraper.rb

Overview

Singleton object for scraping Pollex, caching the results, and extracting data.
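
As a rough illustration of what including Singleton implies, the sketch below uses a hypothetical DemoScraper class rather than Pollex::Scraper itself. It assumes a writable verbose attribute purely for the demonstration; only the reader is documented on this page.

```ruby
require 'singleton'

# Hypothetical stand-in for Pollex::Scraper; only the Singleton wiring is shown.
class DemoScraper
  include Singleton

  # Assumption: a writer is added here so shared state can be demonstrated.
  attr_accessor :verbose

  def initialize
    @verbose = false
  end
end

a = DemoScraper.instance
b = DemoScraper.instance
a.verbose = true

puts a.equal?(b)  # both names refer to the same object
puts b.verbose    # state set through one reference is visible through the other
```

Because Singleton makes new private, callers always go through .instance, which is why the examples on this page call Scraper.instance.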


Constructor Details

#initialize ⇒ Scraper

Creates an LRU cache that holds up to 100 scraped pages and turns verbose output off.



# File 'lib/pollex/scraper.rb', line 9

def initialize()
  @cache = LRUCache.new(:max_size => 100, :default => nil)
  @verbose = false
end
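
To make the size limit concrete, here is a minimal sketch of least-recently-used eviction. TinyLRU is a hypothetical class written for this illustration; it is not the LRUCache class used above, whose internals are not shown on this page.

```ruby
# Hypothetical minimal LRU cache; NOT the implementation behind LRUCache.
class TinyLRU
  def initialize(max_size)
    @max_size = max_size
    @store = {} # Ruby hashes preserve insertion order, oldest first
  end

  # Reading a key re-inserts it, marking it as most recently used.
  def [](key)
    return nil unless @store.key?(key)
    value = @store.delete(key)
    @store[key] = value
  end

  # Writing a key may push the cache over its limit; if so, drop the oldest.
  def []=(key, value)
    @store.delete(key)
    @store[key] = value
    @store.delete(@store.keys.first) if @store.size > @max_size
  end

  def keys
    @store.keys
  end
end

cache = TinyLRU.new(2)
cache['/a'] = 'page a'
cache['/b'] = 'page b'
cache['/a']             # touch /a so /b becomes the least recently used
cache['/c'] = 'page c'  # exceeds the limit of 2, so /b is evicted
p cache.keys
```

With a limit of 100, the scraper would presumably behave the same way at a larger scale: the page that has gone unused the longest is the first to be dropped.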

Instance Attribute Details

#verbose ⇒ Object

Returns the value of attribute verbose.



# File 'lib/pollex/scraper.rb', line 6

def verbose
  @verbose
end

Instance Method Details

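#get(path, attr_infos) ⇒ Hash{Symbol => String}

Gets arbitrary data from the #content element of a page, with optional post-processing. Each entry of attr_infos is an array of the form [name, xpath, post_processor], where the post-processor may be omitted. The result is a hash that maps each name to the extracted (and, if requested, post-processed) string.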

Examples:

Return information about the level of a given reconstruction

Scraper.instance.get(@reconstruction_path, [
  [:level_token, "table[1]/tr[2]/td/a/text()", lambda {|x| x.split(':')[0]}],
  [:level_path, "table[1]/tr[2]/td/a/@href"]
])


# File 'lib/pollex/scraper.rb', line 47

def get(path, attr_infos)
  page = open_with_cache(path)
  contents = page.css('#content')

  attrs = {}
  attr_infos.each do |name, xpath, post_processor|
    attrs[name] = ''
    if xpath
      attrs[name] = contents.at_xpath(xpath).to_s.strip
    end
    if post_processor
      attrs[name] = post_processor.call(attrs[name])
    end
  end
  attrs
end
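
The attribute loop in the listing above can be tried without a network connection or Nokogiri. In this sketch a plain hash stands in for the parsed page; extract_attrs, SAMPLE_NODES, and the sample values are all hypothetical and exist only for illustration.

```ruby
# Hypothetical stand-in for the parsed page: maps each XPath to the raw text
# it would select. The values are invented for this example.
SAMPLE_NODES = {
  'table[1]/tr[2]/td/a/text()' => '  PN: Example level ',
  'table[1]/tr[2]/td/a/@href'  => '/level/pn/'
}.freeze

# Mirrors the loop in #get: look up the value, strip it, then post-process it.
def extract_attrs(nodes, attr_infos)
  attr_infos.each_with_object({}) do |(name, xpath, post_processor), attrs|
    value = xpath ? nodes[xpath].to_s.strip : ''
    value = post_processor.call(value) if post_processor
    attrs[name] = value
  end
end

result = extract_attrs(SAMPLE_NODES, [
  [:level_token, 'table[1]/tr[2]/td/a/text()', lambda { |x| x.split(':')[0] }],
  [:level_path,  'table[1]/tr[2]/td/a/@href']
])
p result
```

An entry with only two elements leaves post_processor as nil, so the stripped string is stored unchanged.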

#get_all(klass, path, attr_infos, table_num = 0) ⇒ Array<klass>, ...

Gets all elements from a table within a page, with optional post-processing. The header row of the table is skipped. The results are returned as an array of hashes or, if a klass is specified, as an array of klass objects built from those hashes. If more than one page of results is found, the first page of results is returned as a PaginatedArray that records the path of the next page.

Examples:

Return an array of all SemanticFields in Pollex

Scraper.instance.get_all(SemanticField, "/category/", [
  [:id, 'td[1]/a/text()'],
  [:path, 'td[1]/a/@href'],
  [:name, 'td[2]/a/text()'],
  [:count, 'td[3]/text()']
])


# File 'lib/pollex/scraper.rb', line 87

def get_all(klass, path, attr_infos, table_num = 0)
  page = open_with_cache(path)

  rows = page.css('table')[table_num].css('tr')
  objs = rows[1..-1].map do |row|
    attrs = {}
    attr_infos.each do |name, xpath, post_processor|
      attrs[name] = ''
      if xpath
        attrs[name] = row.at_xpath(xpath).to_s.strip
      end
      if post_processor
        attrs[name] = post_processor.call(attrs[name])
      end
    end
    attrs
  end

  # check if there is a "next" page
  last_link = page.css('.pagination a').last()
  if last_link and last_link.text()[0..3] == 'Next'
    last_link_path = last_link.attributes()['href']
    new_path = path.split('?')[0] + last_link_path

    results = PaginatedArray.new()
    results.query = {:klass => klass, :attr_infos => attr_infos, :table_num => table_num}
    results.next_page = new_path
    results.concat(objs.to_a) # merge rather than create new array
  else
    results = objs
  end

  if klass
    results.map! {|x| klass.new(x) }
  end

  results
end
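
The pagination check in the listing above boils down to a small string computation, sketched here in isolation. next_page_path is a hypothetical helper, and the "?page=N" link format is an assumption made for the example, not something this page confirms about Pollex.

```ruby
# Hypothetical helper mirroring the "next page" logic in get_all.
# link_text and link_href stand in for the text and href of the last
# pagination link on the page.
def next_page_path(path, link_text, link_href)
  return nil unless link_text && link_text[0..3] == 'Next'
  # Drop any existing query string, then append the link's own query string.
  path.split('?')[0] + link_href
end

p next_page_path('/category/?page=2', 'Next >', '?page=3')
p next_page_path('/category/', 'Next >', '?page=2')
p next_page_path('/category/?page=3', '2', '?page=2') # last link is not "Next"
```

When the helper returns nil there is no further page, which corresponds to the branch where get_all returns the plain array instead of a PaginatedArray.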

#open_with_cache(path) ⇒ Nokogiri::HTML::Document

Opens the given Pollex page, either by retrieving it from the cache or by requesting it from pollex.org.nz, parsing the response with Nokogiri, and storing the parsed document in the cache.



# File 'lib/pollex/scraper.rb', line 18

def open_with_cache(path)
  if @cache[path]
    if @verbose
      puts "Opening cached contents of http://pollex.org.nz#{path} ..."
    end
    @cache[path]
  else
    if @verbose
      puts "Connecting to http://pollex.org.nz#{path} ..."
    end
    page = Nokogiri::HTML(open("http://pollex.org.nz#{path}"))
    @cache[path] = page
    page
  end
end
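
The cache-or-fetch pattern above can be exercised offline by injecting the fetch step. CachedFetcher is a hypothetical class for illustration: it uses a plain Hash instead of LRUCache and a block instead of a real HTTP request.

```ruby
# Hypothetical sketch of the cache-or-fetch pattern used by open_with_cache.
# The fetcher block is injected so the example runs without network access.
class CachedFetcher
  attr_reader :requests

  def initialize(&fetcher)
    @cache = {}
    @fetcher = fetcher
    @requests = 0
  end

  def open_with_cache(path)
    return @cache[path] if @cache[path] # cache hit: no request is made
    @requests += 1
    @cache[path] = @fetcher.call(path)  # cache miss: fetch, store, return
  end
end

fetcher = CachedFetcher.new { |path| "<html>#{path}</html>" }
first  = fetcher.open_with_cache('/category/')
second = fetcher.open_with_cache('/category/')
puts fetcher.requests       # number of times the block was actually called
puts first.equal?(second)   # the second call returns the cached object
```

In the listing, open is called with a URL, which depends on open-uri; on Ruby 3.0 and later that call has to be written as URI.open, so a real fetcher block on a current Ruby would likely wrap URI.open in Nokogiri::HTML.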