Class: Wayfarer::Job

Inherits:
ActiveJob::Base
  • Object
Extended by:
Forwardable
Includes:
Hooks, Locals
Defined in:
lib/wayfarer/job.rb

Overview

A Job is a class that has a Routing::Router with many Routing::Rules, which are matched against a URI. Rules map URIs onto job instance methods. Under the hood, a Processor instantiates jobs in separate threads, and every instance gets its own thread. If a URI matches, its Page is retrieved and made available to instance methods via #page.

Jobs implement ActiveJob's Job API and are therefore compatible with a wide range of job queues. To run a job immediately, call ::perform_now. To enqueue a job, call ::perform_later.
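
The dispatch described above (match a URI against rules, retrieve the page, call the mapped instance method) can be sketched with the standard library alone. This is an illustration under stated assumptions, not Wayfarer's implementation: HostRule, MiniJob, dispatch, and the fetch callable are hypothetical names that do not appear in the library.

```ruby
require "uri"

# Hypothetical stand-in for a routing rule: pairs a host with an action name.
HostRule = Struct.new(:host, :action) do
  def matches?(uri)
    uri.host == host
  end
end

# Hypothetical stand-in for a job class with a single rule.
class MiniJob
  RULES = [HostRule.new("example.com", :index)].freeze

  attr_reader :page

  def initialize(page)
    @page = page
  end

  # Finds the first rule matching the URI and calls the method it names.
  # `fetch` is a hypothetical callable standing in for page retrieval.
  def self.dispatch(uri_string, fetch)
    uri = URI(uri_string)
    rule = RULES.find { |r| r.matches?(uri) }
    return nil unless rule

    new(fetch.call(uri)).public_send(rule.action)
  end

  def index
    "index saw #{page}"
  end
end

fetch = ->(uri) { "<html>#{uri.path}</html>" }
MiniJob.dispatch("http://example.com/a", fetch)  # "index saw <html>/a</html>"
MiniJob.dispatch("http://other.org/", fetch)     # nil
```

A URI that matches no rule is simply skipped in this sketch; how the library treats unmatched URIs is not stated here.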

Callbacks

Methods included from Locals

included, thread_safe_counterpart

Constructor Details

#initialize(*argv) ⇒ Job

Returns a new instance of Job.


# File 'lib/wayfarer/job.rb', line 119

def initialize(*argv)
  @halts = false
  @staged_uris = []
  super(*argv)
end

Class Attribute Details

.config {|Configuration| ... } ⇒ Configuration

A configuration based off the global Wayfarer.config.

Yields:

  • (Configuration)

Returns:

  • (Configuration)


# File 'lib/wayfarer/job.rb', line 83

def config
  @config ||= Wayfarer.config.clone
  yield(@config) if block_given?
  @config
end
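
The listing above memoizes a clone of the global configuration and optionally yields it, so per-job changes do not leak back into the global settings. A minimal sketch of that behavior, assuming a Struct-based configuration: Settings, GlobalSettings, ExampleJob, and the connection_count and http_adapter fields are hypothetical names chosen for illustration.

```ruby
# Hypothetical configuration object; the field names are illustrative only.
Settings = Struct.new(:connection_count, :http_adapter)

# Hypothetical stand-in for the global configuration holder.
module GlobalSettings
  def self.config
    @config ||= Settings.new(4, :net_http)
  end
end

class ExampleJob
  # Same shape as the documented method: memoized clone, optional block.
  def self.config
    @config ||= GlobalSettings.config.clone
    yield(@config) if block_given?
    @config
  end
end

ExampleJob.config { |c| c.connection_count = 16 }

ExampleJob.config.connection_count      # 16
GlobalSettings.config.connection_count  # 4
```

Because the clone is memoized, later calls return the same per-class object, so changes made in one block remain visible in the next.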

.router(&proc) ⇒ Routing::Router

Also known as: route, routes

A router. If a block is passed in, it is evaluated within the Router's instance.

Returns:

  • (Routing::Router)


# File 'lib/wayfarer/job.rb', line 92

def router(&proc)
  @router ||= Routing::Router.new
  @router.instance_eval(&proc) if block_given?
  @router
end
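
Evaluating the block with instance_eval means that bare method calls inside the block resolve against the router rather than the job class. The sketch below reproduces that pattern in isolation; SketchRouter, SketchJob, and the draw method are hypothetical and are not taken from Routing::Router.

```ruby
# Hypothetical stand-in for Routing::Router; `draw` is an illustrative name.
class SketchRouter
  attr_reader :rules

  def initialize
    @rules = []
  end

  def draw(action, pattern)
    @rules << [action, pattern]
  end
end

class SketchJob
  class << self
    # Same shape as the documented method: memoize, then instance_eval.
    def router(&proc)
      @router ||= SketchRouter.new
      @router.instance_eval(&proc) if block_given?
      @router
    end

    alias route router
    alias routes router
  end
end

# Inside the blocks, `draw` is sent to the router instance.
SketchJob.routes do
  draw :index, %r{\Ahttp://example\.com/}
end
SketchJob.route { draw :show, %r{/items/\d+\z} }

SketchJob.router.rules.map(&:first)  # [:index, :show]
```

Since the router is memoized, repeated calls with blocks accumulate rules on the same instance.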

Instance Attribute Details

#adapter ⇒ Object


# File 'lib/wayfarer/job.rb', line 114

def adapter
  @adapter
end

#page ⇒ Object (protected)

The Page representing the URI currently processed by an action. When using the Selenium adapter, Page#body gets refreshed on every call. Otherwise, subsequent DOM updates (e.g. JavaScript-induced) would be invisible.

Returns:

  • Page


# File 'lib/wayfarer/job.rb', line 111

attr_writer :page

#params ⇒ Object


# File 'lib/wayfarer/job.rb', line 117

def params
  @params
end

#staged_uris ⇒ Array<String>, Array<URI> (readonly)

Returns URIs to stage for the next cycle.

Returns:

  • (Array<String>, Array<URI>)

    URIs to stage for the next cycle.

See Also:


# File 'lib/wayfarer/job.rb', line 108

def staged_uris
  @staged_uris
end

Class Method Details

.after_crawl ⇒ Object

Callback that fires once after all pages have been retrieved and processing is done.


# File 'lib/wayfarer/job.rb', line 40

define_hook :after_crawl

.before_crawl ⇒ Object

Callback that fires once before any pages are retrieved.


# File 'lib/wayfarer/job.rb', line 34

define_hook :before_crawl
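
To illustrate how such class-level callbacks behave, here is a minimal hook mechanism written from scratch. It is a sketch only and not the implementation behind the Hooks module: MiniHooks, LoggingJob, and the run_before_crawl and run_after_crawl methods are hypothetical names.

```ruby
# Hypothetical, minimal hook mechanism; not the Hooks module's implementation.
module MiniHooks
  def define_hook(name)
    callbacks = []
    # Calling the hook name with a block registers the block.
    define_singleton_method(name) { |&block| callbacks << block }
    # A matching run_* method fires the registered blocks in order.
    define_singleton_method("run_#{name}") { callbacks.each(&:call) }
  end
end

class LoggingJob
  extend MiniHooks

  define_hook :before_crawl
  define_hook :after_crawl

  EVENTS = []

  before_crawl { EVENTS << :started }
  after_crawl  { EVENTS << :finished }
end

LoggingJob.run_before_crawl
# ... pages would be retrieved and processed here ...
LoggingJob.run_after_crawl

LoggingJob::EVENTS  # [:started, :finished]
```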

.prepare ⇒ Object

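Returns a copy of the class with its own router, locals, and config. Each local is replaced by its thread-safe counterpart and exposed on the copy as both an instance method and a class method.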


# File 'lib/wayfarer/job.rb', line 60

def prepare
  duplicate = dup
  duplicate.router = router.dup
  duplicate.locals = locals.deep_dup
  duplicate.config = config.dup

  duplicate.locals.each do |(key, val)|
    duplicate.locals[key] = Locals.thread_safe_counterpart(val)
  end

  duplicate.locals.each do |(key, _)|
    duplicate.send(:define_method, key) { duplicate.locals[key] }
    duplicate.send(:define_singleton_method, key) do
      duplicate.locals[key]
    end
  end

  duplicate
end
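
The effect of copying the class can be shown with a stripped-down version of the same steps. This sketch makes several assumptions: CounterJob and the :seen local are hypothetical, transform_values(&:dup) stands in for deep_dup, and the thread-safe wrapping step is left out entirely.

```ruby
class CounterJob
  class << self
    attr_writer :locals

    def locals
      @locals ||= {}
    end

    # Simplified variant of the documented method, without thread-safe wrapping.
    def prepare
      duplicate = dup
      # Copy each value so the duplicate does not share state with the original.
      duplicate.locals = locals.transform_values(&:dup)

      duplicate.locals.each_key do |key|
        duplicate.send(:define_method, key) { duplicate.locals[key] }
        duplicate.define_singleton_method(key) { duplicate.locals[key] }
      end

      duplicate
    end
  end
end

CounterJob.locals[:seen] = []

copy = CounterJob.prepare
copy.seen << :a

copy.new.seen             # [:a]
CounterJob.locals[:seen]  # []
```

Mutations made through the copy stay on the copy, which is what lets a fresh class copy be used per run without touching the original class.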

.setup_adapter {|[HTTPAdapters::NetHTTPAdapter, HTTPAdapters::SeleniumAdapter], [Selenium::WebDriver::Driver, nil], [Capybara::Selenium::Driver, nil]| ... } ⇒ Object

Callback that fires when HTTP adapters are instantiated.

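Yields:

  • ([HTTPAdapters::NetHTTPAdapter, HTTPAdapters::SeleniumAdapter], [Selenium::WebDriver::Driver, nil], [Capybara::Selenium::Driver, nil])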


# File 'lib/wayfarer/job.rb', line 46

define_hooks :setup_adapter

Instance Method Details

#browser ⇒ Object (protected)

A Capybara driver that wraps the #driver.


# File 'lib/wayfarer/job.rb', line 206

delegate browser: :adapter

#doc ⇒ Object (protected)

The parsed response body. When using the Selenium adapter, this parses the body again on every call. Otherwise, subsequent DOM updates (e.g. JavaScript-induced) would be invisible.

See Also:


# File 'lib/wayfarer/job.rb', line 195

delegate doc: :page

#driver ⇒ Object (protected)

The Selenium WebDriver.

See Also:


# File 'lib/wayfarer/job.rb', line 201

delegate driver: :adapter

#halt ⇒ Object (protected)

Sets a halting flag that signals the processor to stop its work.


# File 'lib/wayfarer/job.rb', line 142

def halt
  @halts = true
end

#halts? ⇒ Boolean

Whether this job will stop processing.

Returns:

  • (Boolean)

# File 'lib/wayfarer/job.rb', line 126

def halts?
  @halts
end
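
The interplay of #halt and #halts? can be sketched as a flag that an action sets and that the surrounding loop inspects after each job. The loop below is a hypothetical stand-in for the processor, whose actual behavior is not shown in this document; HaltingJob, visit, and process are illustrative names.

```ruby
class HaltingJob
  def initialize
    @halts = false
  end

  def halts?
    @halts
  end

  # Hypothetical action: halts once it sees the URI it was looking for.
  def visit(uri)
    halt if uri.end_with?("/target")
  end

  protected

  def halt
    @halts = true
  end
end

# Hypothetical stand-in for the processor loop: stops after a job halts.
def process(uris)
  visited = []
  uris.each do |uri|
    job = HaltingJob.new
    job.visit(uri)
    visited << uri
    break if job.halts?
  end
  visited
end

uris = %w[http://example.com/a http://example.com/target http://example.com/b]
process(uris)  # ["http://example.com/a", "http://example.com/target"]
```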

#logger ⇒ Object (protected)


# File 'lib/wayfarer/job.rb', line 209

delegate logger: :"self.class"

#perform(*uris) ⇒ Object

Note:

ActiveJob API

Performs this job.


# File 'lib/wayfarer/job.rb', line 133

def perform(*uris)
  Crawl.new(self.class, *uris).execute
end

#stage(*uris) ⇒ Object (protected)

Adds URIs to process in the next cycle. If a relative path is given, an absolute URI is constructed from the current #page's URI.

Parameters:

  • uris (String, URI, Array<String>, Array<URI>)

# File 'lib/wayfarer/job.rb', line 150

def stage(*uris)
  expanded = uris.flatten.map do |u|
    if (uri = URI(u)).absolute?
      uri
    else
      # URI#join would discard the path of page.uri.path
      current = page.uri.dup
      current.path = File.join(page.uri.path, uri.path)
      current
    end
  end

  # This method has effectively become the gatekeeper for invalid URIs that
  # would otherwise raise exceptions further down the line
  supported = expanded.select do |uri|
    HTTPAdapters::NetHTTPAdapter::RECOGNIZED_URI_TYPES.any? do |type|
      uri.is_a?(type)
    end
  end

  @staged_uris.push(*supported)
end
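
The comment about URI#join can be checked with the standard library: RFC 3986 resolution replaces the last path segment of the base, while File.join appends to it. The sketch below repeats the expansion and filtering steps outside the class. It assumes the recognized types are URI::HTTP and URI::HTTPS, since the contents of RECOGNIZED_URI_TYPES are not shown here; expand and RECOGNIZED are hypothetical names.

```ruby
require "uri"

# Assumption: the recognized types are HTTP and HTTPS URIs; the actual
# contents of RECOGNIZED_URI_TYPES are not shown in this document.
RECOGNIZED = [URI::HTTP, URI::HTTPS].freeze

# Hypothetical helper mirroring the expansion step of #stage.
def expand(page_uri, ref)
  uri = URI(ref)
  return uri if uri.absolute?

  # Append the relative path to the current page's path.
  current = page_uri.dup
  current.path = File.join(page_uri.path, uri.path)
  current
end

page_uri = URI("http://example.com/foo/bar")

URI.join(page_uri.to_s, "baz").to_s  # "http://example.com/foo/baz"
expand(page_uri, "baz").to_s         # "http://example.com/foo/bar/baz"

# Unsupported schemes such as mailto: are dropped by the type filter.
staged = ["baz", "https://example.org/", "mailto:a@example.com"]
  .map { |ref| expand(page_uri, ref) }
  .select { |uri| RECOGNIZED.any? { |type| uri.is_a?(type) } }

staged.map(&:to_s)  # ["http://example.com/foo/bar/baz", "https://example.org/"]
```

Note that this path-joining approach treats the current page's path as a directory, which differs from how a browser would resolve the same relative link.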