Class: DaimonSkycrawlers::Crawler::Base

Inherits:
Object
Includes:
DaimonSkycrawlers::Callbacks, DaimonSkycrawlers::ConfigMixin, DaimonSkycrawlers::Configurable, LoggerMixin
Defined in:
lib/daimon_skycrawlers/crawler/base.rb

Overview

The base class of crawlers.

A crawler implementation can inherit from this class and override #fetch.
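
For example, a minimal crawler can be defined like this (a sketch; the class name and the fetch logic are illustrative, not part of the library):

require "daimon_skycrawlers/crawler/base"

# Hypothetical crawler: fetches each URL with a plain GET request
# and returns the Faraday::Response.
class MyCrawler < DaimonSkycrawlers::Crawler::Base
  def fetch(path, message = {})
    get(path)
  end
end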

Direct Known Subclasses

Default

Instance Attribute Summary

Instance Method Summary

Methods included from DaimonSkycrawlers::Configurable

#configure

Methods included from DaimonSkycrawlers::Callbacks

#after_process, #before_process, #clear_after_process_callbacks, #clear_before_process_callbacks, #run_after_process_callbacks, #run_before_process_callbacks

Constructor Details

#initialize(base_url = nil, faraday_options: {}, options: {}) ⇒ Base

Returns a new instance of Base.

Parameters:

  • base_url (String) (defaults to: nil)

    Base URL for the crawler

  • faraday_options (Hash) (defaults to: {})

    options for Faraday

  • options (Hash) (defaults to: {})

    options for the crawler



# File 'lib/daimon_skycrawlers/crawler/base.rb', line 45

def initialize(base_url = nil, faraday_options: {}, options: {})
  super()
  @base_url = base_url
  @faraday_options = faraday_options
  @options = options
  @prepare = ->(connection) {}
  @skipped = false
  @n_processed_urls = 0

  setup_default_filters
  setup_default_post_processes
end
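
A usage sketch (the URL and header values are illustrative, and MyCrawler is the hypothetical subclass from the overview):

# Build a crawler rooted at an example site with a custom User-Agent.
crawler = MyCrawler.new(
  "https://example.com",
  faraday_options: { headers: { "User-Agent" => "my-crawler/1.0" } }
)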

Instance Attribute Details

#n_processed_urls ⇒ Object (readonly)

Returns the value of attribute n_processed_urls.



# File 'lib/daimon_skycrawlers/crawler/base.rb', line 38

def n_processed_urls
  @n_processed_urls
end

#storage ⇒ DaimonSkycrawlers::Storage::Base

Retrieve the storage instance. Defaults to a Storage::RDB instance.



# File 'lib/daimon_skycrawlers/crawler/base.rb', line 88

def storage
  @storage ||= Storage::RDB.new
end

Instance Method Details

#connection ⇒ Faraday

Returns:

  • (Faraday)


# File 'lib/daimon_skycrawlers/crawler/base.rb', line 102

def connection
  @connection ||= Faraday.new(@base_url, @faraday_options)
end

#fetch(path, message = {}) ⇒ Faraday::Response

Fetch a URL.

Override this method in a subclass.

Parameters:

  • path (String)

    URI or path

  • message (Hash) (defaults to: {})

    message can include anything

Returns:

  • (Faraday::Response)

    HTTP response

Raises:

  • (NotImplementedError)


# File 'lib/daimon_skycrawlers/crawler/base.rb', line 147

def fetch(path, message = {})
  raise NotImplementedError, "Must implement this method in subclass"
end
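
An override might delegate to #get and read additional information from the message, for example (a sketch; the :params key is an illustrative convention, not part of the library):

# Hypothetical override: take query parameters carried in the message
# and return the Faraday::Response from the GET request.
def fetch(path, message = {})
  params = message[:params] || {}
  get(path, params)
end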

#get(path, params = {}) ⇒ Faraday::Response

GET URL with params

Parameters:

  • path (String)

    URI or path

  • params (Hash) (defaults to: {})

    query parameters

Returns:

  • (Faraday::Response)

    HTTP response



# File 'lib/daimon_skycrawlers/crawler/base.rb', line 159

def get(path, params = {})
  @connection.get(path, params)
end
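
For example, assuming the connection has already been created (as #process does before calling #fetch), and with placeholder path and query values:

# GET <base_url>/search?q=ruby
response = crawler.get("/search", q: "ruby")
response.status # => HTTP status code
response.body   # => response body as a String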

#post(path, params = {}) ⇒ Faraday::Response

POST URL with params

Parameters:

  • path (String)

    URI or path

  • params (Hash) (defaults to: {})

    parameters sent as the request body

Returns:

  • (Faraday::Response)

    HTTP response



# File 'lib/daimon_skycrawlers/crawler/base.rb', line 171

def post(path, params = {})
  @connection.post(path, params)
end

#prepare {|connection| ... } ⇒ Object

Call this method before DaimonSkycrawlers.register_crawler. For example, you can log in before fetching URLs.

Yields:

  • (connection)



# File 'lib/daimon_skycrawlers/crawler/base.rb', line 79

def prepare(&block)
  @prepare = block
end
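
For example, a login request can be issued on the connection before URLs are fetched (a sketch; the login path and credentials are placeholders):

# Hypothetical login step executed on the Faraday connection before each fetch.
crawler.prepare do |connection|
  connection.post("/login", username: "user", password: "secret")
end

DaimonSkycrawlers.register_crawler(crawler)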

#process(message, &block) ⇒ Object

Process the crawler sequence:

  1. Run registered filters
  2. Prepare the connection
  3. Download (fetch) data from the given URL
  4. Run post-processes (store downloaded data in storage)

Parameters:

  • message (Hash)

    parameters for crawler



# File 'lib/daimon_skycrawlers/crawler/base.rb', line 116

def process(message, &block)
  @skipped = false
  @n_processed_urls += 1

  proceeding = run_before_process_callbacks(message)
  unless proceeding
    skip(message[:url])
    return
  end

  # url can be a path
  url = message.delete(:url)
  url = (URI(connection.url_prefix) + url).to_s

  @prepare.call(connection)
  response = fetch(url, message, &block)
  data = { url: url, message: message, response: response }
  run_after_process_callbacks(data)
  data
end
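
The before/after callbacks from DaimonSkycrawlers::Callbacks hook into this sequence. For example (a sketch assuming each callback takes a block receiving the message or data hash shown above; the filtering rule is illustrative):

# Skip messages whose URL does not match an illustrative pattern;
# a falsy result makes #process skip the message.
crawler.before_process do |message|
  message[:url].to_s.include?("/articles/")
end

# Inspect the downloaded data (:url, :message and :response keys).
crawler.after_process do |data|
  puts "#{data[:url]} -> #{data[:response].status}"
end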

#setup_connection(options = {}) {|faraday| ... } ⇒ Object

Set up connection

Parameters:

  • options (Hash) (defaults to: {})

    options for Faraday

Yields:

  • (faraday)

Yield Parameters:

  • faraday (Faraday)


# File 'lib/daimon_skycrawlers/crawler/base.rb', line 65

def setup_connection(options = {})
  merged_options = @faraday_options.merge(options)
  faraday_options = merged_options.empty? ? nil : merged_options
  @connection = Faraday.new(@base_url, faraday_options) do |faraday|
    yield faraday
  end
end
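
For example, Faraday middleware can be configured through the yielded builder (a sketch; the middleware choices are illustrative):

# Log requests/responses and use the default HTTP adapter.
crawler.setup_connection do |faraday|
  faraday.response :logger
  faraday.adapter Faraday.default_adapter
end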

#skipped? ⇒ true|false

Returns:

  • (true|false)


# File 'lib/daimon_skycrawlers/crawler/base.rb', line 95

def skipped?
  @skipped
end