Class: WebScraper

Inherits:
Object
Defined in:
lib/web_scraper.rb

Overview

WebScraper lets you describe the structure of an HTML page declaratively, extract the matching blocks, and work with them as Ruby objects.

Examples:

class Article < WebScraper
  resource 'http://hbswk.hbs.edu/topics/it.html'

  base css: '.tile-medium'

  property :title,       xpath: './/h4/a/text()'
  property :date,        xpath: './/li[1]/text()'
  property :category,    xpath: './/li[2]/a/text()'
  property :description, xpath: './/p/text()'

  key :title
end

puts "#{Article.count} articles were found"
puts

articles = Article.all

articles.each do |article|
  header = article.title
  puts header
  puts '=' * header.length
  puts

  subheader = "#{article.date} #{article.category}"
  puts subheader
  puts '-' * subheader.length
  puts

  puts article.description
  puts
end

article = Article.find('Tech Investment the Wise Way')

puts article.description

Defined Under Namespace

Classes: BaseDefentitionError, ConfigurationError, KeyDefentitionError, PropertyDefentitionError, ResourceDefentitionError

Constructor Details

#initialize(node) ⇒ WebScraper

Sets the Nokogiri node that the object wraps. This is a private method.



# File 'lib/web_scraper.rb', line 248

def initialize(node)
  @node = node
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(name, *args, &block) ⇒ Object

Returns the value of the named property, if that property is defined, converted to the property's declared type. Any other method name is passed on to super.

Examples:

puts article.description


# File 'lib/web_scraper.rb', line 271

def method_missing(name, *args, &block)
  if self.class.properties.key? name
    property = self.class.properties[name]

    type = property[:type]
    value = @node.send(*property[:selector])

    case type
    when :string  then value.text.strip
    when :integer then value.text.to_i
    when :float   then value.text.to_f
    when :node    then value
    end
  else
    super(name, *args, &block)
  end
end
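The dispatch above can be tried without Nokogiri. The following is a minimal sketch, not the library's code: FakeNode, FakeValue, Record, and the PROPERTIES table are hypothetical stand-ins introduced for illustration. It also adds respond_to_missing?, which the listing above does not define, so that respond_to? agrees with method_missing.

```ruby
# FakeValue and FakeNode are hypothetical stand-ins for Nokogiri objects.
FakeValue = Struct.new(:text)

class FakeNode
  def initialize(data)
    @data = data
  end

  # Mimics node.xpath(selector): returns an object that responds to #text.
  def xpath(selector)
    FakeValue.new(@data.fetch(selector))
  end
end

class Record
  # Assumed property table, in the same shape that .property builds.
  PROPERTIES = {
    title: { type: :string,  selector: [:xpath, './/h4/a/text()'] },
    views: { type: :integer, selector: [:xpath, './/h4/span/text()'] }
  }.freeze

  def initialize(node)
    @node = node
  end

  def method_missing(name, *args, &block)
    return super unless PROPERTIES.key?(name)

    property = PROPERTIES[name]
    # The selector array is splatted into send: node.xpath('...').
    value = @node.send(*property[:selector])

    case property[:type]
    when :string  then value.text.strip
    when :integer then value.text.to_i
    when :float   then value.text.to_f
    when :node    then value
    end
  end

  # Keeps respond_to? consistent with the dynamic properties.
  def respond_to_missing?(name, include_private = false)
    PROPERTIES.key?(name) || super
  end
end

node = FakeNode.new({ './/h4/a/text()' => "  Tech Investment the Wise Way \n",
                      './/h4/span/text()' => '42' })
record = Record.new(node)
record.title                # => "Tech Investment the Wise Way"
record.views                # => 42
record.respond_to?(:title)  # => true
```

Calling a name that is not in the table, such as record.author, still raises NoMethodError through super.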

Class Attribute Details

._base ⇒ Object (readonly)

Returns the value of attribute _base.



# File 'lib/web_scraper.rb', line 156

def _base
  @_base
end

._key ⇒ Object (readonly)

Returns the value of attribute _key.



# File 'lib/web_scraper.rb', line 217

def _key
  @_key
end

._resource ⇒ Object (readonly)

Returns the value of attribute _resource.



# File 'lib/web_scraper.rb', line 139

def _resource
  @_resource
end

.properties ⇒ Object (readonly)

Returns the value of attribute properties.



# File 'lib/web_scraper.rb', line 201

def properties
  @properties
end

Class Method Details

.all ⇒ Object

Loads the HTML page, selects the blocks that match the base selector, and wraps each one in an object. The result is cached.

Examples:

articles = Article.all

Raises:

  • (ConfigurationError) if the resource, base, or key has not been set

# File 'lib/web_scraper.rb', line 94

def all
  raise ConfigurationError unless valid?

  @all ||= Nokogiri::HTML(open(_resource))
           .send(*_base).map { |node| new(node) }
end

.base(_base) ⇒ Object

Defines the base: the selector that determines the blocks of content. Both CSS and XPath selectors are supported.

Examples:

class Article < WebScraper
  ...
  base css: '.tile-medium'
  ...
end

Raises:

  • (BaseDefentitionError) if the selector is not valid

# File 'lib/web_scraper.rb', line 150

def base(_base)
  raise BaseDefentitionError unless valid_selector? _base

  @_base = _base.to_a.flatten
end
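The to_a.flatten step turns the one-entry selector hash into a method name followed by its argument, which .all later splats into send. A minimal sketch of that round trip, where normalize_selector and FakeDocument are hypothetical names used only for illustration:

```ruby
# Turns { css: '.tile-medium' } into [:css, '.tile-medium'].
def normalize_selector(selector)
  selector.to_a.flatten
end

# Hypothetical stand-in for a parsed document; it only echoes the call.
class FakeDocument
  def css(query)
    "css:#{query}"
  end

  def xpath(query)
    "xpath:#{query}"
  end
end

base = normalize_selector({ css: '.tile-medium' })  # => [:css, ".tile-medium"]

# Splatting the array is the same as calling FakeDocument.new.css('.tile-medium').
FakeDocument.new.send(*base)                        # => "css:.tile-medium"
```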

.count ⇒ Object

Returns the number of objects found.

Examples:

puts "#{Article.count} articles were found"


# File 'lib/web_scraper.rb', line 105

def count
  all.size
end

.find(key) ⇒ Object

Finds the first object whose key property equals the given value.

Examples:

article = Article.find('Tech Investment the Wise Way')


# File 'lib/web_scraper.rb', line 121

def find(key)
  all.find { |e| e.send(_key) == key }
end
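The lookup is a linear scan that compares the key property of each object with the argument. A minimal sketch of the same idea on plain structs, where Entry and find_by_key are hypothetical names and not part of WebScraper:

```ruby
# Entry is a hypothetical stand-in for a scraped object.
Entry = Struct.new(:title, :category)

# Returns the first entry whose key property equals the wanted value,
# or nil when nothing matches (Enumerable#find behaviour).
def find_by_key(entries, key_name, wanted)
  entries.find { |entry| entry.send(key_name) == wanted }
end

entries = [
  Entry.new('Tech Investment the Wise Way', 'Research'),
  Entry.new('Another Article', 'Opinion')
]

find_by_key(entries, :title, 'Another Article').category  # => "Opinion"
find_by_key(entries, :title, 'No Such Article')           # => nil
```

Because the comparison uses ==, the argument has to match the converted property value exactly, including case.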

.key(_key) ⇒ Object

Defines the key: the property that the find method compares against. The property must be defined before the key is declared.

Examples:

class Article < WebScraper
  ...
  key :title
  ...
end

Raises:

  • (KeyDefentitionError) if no property with that name has been defined

# File 'lib/web_scraper.rb', line 211

def key(_key)
  raise KeyDefentitionError unless properties.keys.include? _key

  @_key = _key
end

.property(*args) ⇒ Object

Defines a property: its name, optionally its type, and its selector. Both CSS and XPath selectors are supported. The type determines the returned value. Available types are string (the default), integer, float, and node; node returns the Nokogiri object without conversion. An invalid definition raises PropertyDefentitionError.

Examples:

class Article < WebScraper
  ...
  property :title,           xpath: './/h4/a/text()'
  property  views: :integer, xpath: './/h4/span/text()'
  ...
end


# File 'lib/web_scraper.rb', line 171

def property(*args)
  @properties ||= {}

  exception = PropertyDefentitionError

  case args.length
  when 1
    params = args[0]

    raise exception unless params.is_a? Hash

    info = params.reject { |k| [:css, :xpath].include? k }
    selector = params.select { |k| [:css, :xpath].include? k }
  when 2
    name, selector = args
    info = { name => :string }
  else
    raise exception
  end

  raise exception unless valid_selector? selector
  raise exception unless valid_info? info

  name = info.keys.first
  type = info.values.first
  selector = selector.to_a.flatten

  @properties[name] = { type: type, selector: selector }
end
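The two accepted call forms differ only in how the name and type arrive: as a leading symbol (type defaults to string) or as a name-to-type pair inside the same hash as the selector. A minimal sketch of just that parsing step, where parse_property and SELECTOR_KEYS are hypothetical names; ArgumentError stands in for PropertyDefentitionError, and the valid_selector? and valid_info? checks are left out:

```ruby
SELECTOR_KEYS = %i[css xpath].freeze

# Splits the arguments of a property declaration into name, type, and selector.
def parse_property(*args)
  case args.length
  when 1
    params = args[0]
    raise ArgumentError, 'expected a Hash' unless params.is_a?(Hash)

    # Everything that is not a selector key is treated as name => type.
    info = params.reject { |key, _| SELECTOR_KEYS.include?(key) }
    selector = params.select { |key, _| SELECTOR_KEYS.include?(key) }
  when 2
    name, selector = args
    info = { name => :string } # the type defaults to :string
  else
    raise ArgumentError, 'expected one or two arguments'
  end

  { name: info.keys.first, type: info.values.first, selector: selector.to_a.flatten }
end

parse_property(:title, xpath: './/h4/a/text()')
# => {:name=>:title, :type=>:string, :selector=>[:xpath, ".//h4/a/text()"]}

parse_property(views: :integer, xpath: './/h4/span/text()')
# => {:name=>:views, :type=>:integer, :selector=>[:xpath, ".//h4/span/text()"]}
```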

.reset ⇒ Object

Resets the cache of the HTML data.

Examples:

Article.reset


# File 'lib/web_scraper.rb', line 113

def reset
  @all = nil
end
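The cache is a class-level instance variable filled with ||= in .all and cleared here. A minimal sketch of the pattern, where CachedSource, load_items, and the loads counter are hypothetical and stand in for fetching and parsing the page:

```ruby
class CachedSource
  class << self
    # Number of times the expensive load has run (for illustration only).
    attr_reader :loads

    # Loads once; later calls return the memoized array.
    def all
      @all ||= load_items
    end

    # Drops the cache so the next call to all loads again.
    def reset
      @all = nil
    end

    private

    # Hypothetical stand-in for the network request and parsing.
    def load_items
      @loads = (@loads || 0) + 1
      %w[first second]
    end
  end
end

CachedSource.all    # performs the load
CachedSource.all    # served from the cache
CachedSource.reset
CachedSource.all    # performs the load again
```

Since the variable lives on the class object, each subclass keeps its own cache. An empty array is truthy in Ruby, so a page with no matching blocks is cached as well.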

.resource(_resource) ⇒ Object

Defines the resource: the URL of the HTML page.

Examples:

class Article < WebScraper
  ...
  resource 'http://hbswk.hbs.edu/topics/it.html'
  ...
end

Raises:

  • (ResourceDefentitionError) if the resource is not a String

# File 'lib/web_scraper.rb', line 133

def resource(_resource)
  raise ResourceDefentitionError unless _resource.is_a? String

  @_resource = _resource
end

.valid? ⇒ Boolean

Checks whether the resource, base, and key have all been set.

Returns:

  • (Boolean)


# File 'lib/web_scraper.rb', line 221

def valid?
  _resource && _base && _key
end
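Note that a chain of && returns its last operand or the first falsy one, so the method above yields the key symbol or nil rather than a strict true or false. That is fine in a condition such as `unless valid?`. A minimal sketch of the difference, using hypothetical helper names that are not part of WebScraper:

```ruby
# Same shape as the listing above: returns the last operand or nil/false.
def configured?(resource, base, key)
  resource && base && key
end

# One way to get a strict boolean instead.
def strictly_configured?(resource, base, key)
  [resource, base, key].all?
end

configured?('http://example.com', [:css, '.tile'], :title)           # => :title
configured?('http://example.com', nil, :title)                       # => nil
strictly_configured?('http://example.com', [:css, '.tile'], :title)  # => true
strictly_configured?('http://example.com', nil, :title)              # => false
```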

.valid_info?(info) ⇒ Boolean

Checks whether the property information (i.e. name and type) was defined correctly.

Returns:

  • (Boolean)


# File 'lib/web_scraper.rb', line 236

def valid_info?(info)
  (info.is_a? Hash) &&
  (info.size == 1) &&
  (info.keys.first.is_a? Symbol) &&
  ([:string, :integer, :float, :node].include? info.values.first)
end

.valid_selector?(selector) ⇒ Boolean

Checks whether the selector was defined correctly.

Returns:

  • (Boolean)


# File 'lib/web_scraper.rb', line 227

def valid_selector?(selector)
  (selector.is_a? Hash) &&
  (selector.size == 1) &&
  ([:css, :xpath].include? selector.keys.first) &&
  (selector.values.first.is_a? String)
end
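To make the rule concrete, here is a standalone copy of the predicate together with a few inputs it accepts and rejects. The name selector_ok? is hypothetical and is used only so the sketch runs outside the class:

```ruby
# Accepts exactly one entry, keyed by :css or :xpath, with a String value.
def selector_ok?(selector)
  selector.is_a?(Hash) &&
    selector.size == 1 &&
    %i[css xpath].include?(selector.keys.first) &&
    selector.values.first.is_a?(String)
end

selector_ok?({ css: '.tile-medium' })        # => true
selector_ok?({ xpath: './/p/text()' })       # => true
selector_ok?({ css: '.a', xpath: './/p' })   # => false (two entries)
selector_ok?({ id: 'main' })                 # => false (unknown selector kind)
selector_ok?({ xpath: :title })              # => false (value is not a String)
selector_ok?('.tile-medium')                 # => false (not a Hash)
```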

Instance Method Details

#css(*args) ⇒ Object

Lets you call Nokogiri's css method directly on the object. The call is proxied to the underlying Nokogiri node.



# File 'lib/web_scraper.rb', line 255

def css(*args)
  @node.css(*args)
end

#xpath(*args) ⇒ Object

Lets you call Nokogiri's xpath method directly on the object. The call is proxied to the underlying Nokogiri node.



# File 'lib/web_scraper.rb', line 262

def xpath(*args)
  @node.xpath(*args)
end