Class: DaimonSkycrawlers::Processor::Spider
- Defined in: lib/daimon_skycrawlers/processor/spider.rb
Overview
Web spider class. By default, it extracts all links from a page and follows them.
Instance Attribute Summary
- #enqueue ⇒ Object
  If true, enqueue found links.
- #link_message ⇒ Object (writeonly)
  Specify a hash to propagate arbitrary data to the next crawler/processor.
- #link_rules ⇒ Object
  Takes the same arguments as Nokogiri::XML::DocumentFragment#search. In general, set an XPath expression or a CSS selector.
- #next_page_link_message ⇒ Object (writeonly)
  Sets the attribute next_page_link_message.
- #next_page_link_rules ⇒ Object
  Takes the same arguments as Nokogiri::XML::DocumentFragment#search. In general, set an XPath expression or a CSS selector.
Attributes inherited from Base
Instance Method Summary
- #append_link_filter(filter = nil) {|message| ... } ⇒ Object
  Append a filter to reduce the links found by link_rules.
- #call(message) ⇒ Object
- #extract_link {|element| ... } ⇒ Object
  Register a block to process each element found by DaimonSkycrawlers::Processor::Spider#link_rules.
- #extract_next_page_link {|element| ... } ⇒ Object
  Register a block to process each element found by DaimonSkycrawlers::Processor::Spider#next_page_link_rules.
- #initialize ⇒ Spider (constructor)
  A new instance of Spider.
Methods inherited from Base
Methods included from Configurable
Methods included from Callbacks
#after_process, #before_process, #clear_after_process_callbacks, #clear_before_process_callbacks, #run_after_process_callbacks, #run_before_process_callbacks
Constructor Details
#initialize ⇒ Spider
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 45
    def initialize
      super
      @link_filters = []
      @doc = nil
      @links = nil
      @enqueue = true
      @link_rules = ["a"]
      @extract_link = ->(element) { element["href"] }
      @link_message = {}
      @next_page_link_rules = nil
      @extract_next_page_link = ->(element) { element["href"] }
      @next_page_link_message = {}
    end
Instance Attribute Details
#enqueue ⇒ Object
If true, enqueue found links.
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 34
    def enqueue
      @enqueue
    end
#link_message=(value) ⇒ Object (writeonly)
Specify a hash to propagate arbitrary data to the next crawler/processor. It is intended for filtering a message before a crawler/processor processes it.
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 43
    def link_message=(value)
      @link_message = value
    end
#link_rules ⇒ Object
Takes the same arguments as Nokogiri::XML::DocumentFragment#search. In general, set an XPath expression or a CSS selector.
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 34
    attr_accessor :enqueue, :link_rules, :next_page_link_rules
#next_page_link_message=(value) ⇒ Object (writeonly)
Sets the attribute next_page_link_message
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 43
    attr_writer :link_message, :next_page_link_message
#next_page_link_rules ⇒ Object
Takes the same arguments as Nokogiri::XML::DocumentFragment#search. In general, set an XPath expression or a CSS selector.
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 34
    attr_accessor :enqueue, :link_rules, :next_page_link_rules
Instance Method Details
#append_link_filter(filter = nil) {|message| ... } ⇒ Object
Append a filter to reduce the links found by link_rules. The filter is either a block or any object that responds to #call; other arguments are ignored.
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 66
    def append_link_filter(filter = nil, &block)
      if block_given?
        @link_filters << block
      else
        @link_filters << filter if filter.respond_to?(:call)
      end
    end
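The listing above only shows how filters are registered. As a rough sketch of how such a chain might then be applied, the stand-alone example below mimics the registration logic and keeps a link only when every registered filter returns a truthy value. `FilterChain`, its `apply` and `size` methods, and the `{ url: ... }` message shape are hypothetical stand-ins for illustration, not part of DaimonSkycrawlers.

```ruby
# Hypothetical stand-in that mirrors Spider#append_link_filter; not the library class.
class FilterChain
  def initialize
    @link_filters = []
  end

  # Same registration rules as the listing: a block wins, otherwise any
  # object that responds to #call is accepted; everything else is ignored.
  def append_link_filter(filter = nil, &block)
    if block_given?
      @link_filters << block
    elsif filter.respond_to?(:call)
      @link_filters << filter
    end
  end

  # Assumption: each filter receives a message hash with a :url key, and a
  # link survives only if all filters return a truthy value.
  def apply(urls)
    urls.select do |url|
      @link_filters.all? { |f| f.call({ url: url }) }
    end
  end

  # Number of registered filters.
  def size
    @link_filters.size
  end
end

chain = FilterChain.new
chain.append_link_filter { |message| message[:url].start_with?("https://") }
chain.append_link_filter(->(message) { !message[:url].end_with?(".pdf") })
chain.append_link_filter("not callable") # ignored: a String does not respond to #call

kept = chain.apply([
  "https://example.com/a",
  "http://example.com/b",
  "https://example.com/c.pdf",
])
```

In this sketch, `kept` holds only the HTTPS link that is not a PDF, and the non-callable argument never enters the chain.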
#call(message) ⇒ Object
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 101
    # Note: the local variable names (new_message, link_message,
    # next_page_link_message) are reconstructed from context.
    def call(message)
      depth = Integer(message[:depth] || 2)
      return if depth <= 1
      page = storage.read(message)
      unless page
        log.warn("Could not read page: url=#{message[:url]}, key=#{message[:key]}")
        return
      end
      @doc = Nokogiri::HTML(page.body)
      new_message = {
        depth: depth - 1,
      }
      link_message = new_message.merge(@link_message)
      links.each do |url|
        enqueue_url(url, link_message)
      end
      next_page_url = next_page_link
      if next_page_url
        next_page_link_message = new_message.merge(@next_page_link_message)
        enqueue_url(next_page_url, next_page_link_message)
      end
    end
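The depth bookkeeping in #call can be tried in isolation. The sketch below is a minimal stand-alone illustration, assuming only what the listing shows: a missing depth defaults to 2, processing stops at depth 1 or below, and the outgoing message is the decremented depth merged with the user-supplied hash. The helper name `next_message` and its `extra` parameter are hypothetical and do not exist in the library.

```ruby
# Hypothetical helper mirroring the depth handling in Spider#call.
# Returns nil when crawling should stop, otherwise the message to enqueue.
def next_message(message, extra = {})
  depth = Integer(message[:depth] || 2) # a missing depth defaults to 2
  return nil if depth <= 1              # depth 1 or below: do not follow links
  { depth: depth - 1 }.merge(extra)     # keys in extra are added to (or override) the new message
end

first = next_message({ url: "https://example.com/" }, { category: "news" })
second = next_message(first)          # first carries depth 1, so this stops
third = next_message({ depth: "3" })  # Integer() also accepts numeric strings
```

With the default depth of 2, links found on the starting page are enqueued with depth 1, and processing one of those pages returns early, so only one level of links is followed unless the incoming message carries a larger depth.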
#extract_link {|element| ... } ⇒ Object
Register a block to process each element found by DaimonSkycrawlers::Processor::Spider#link_rules. The block replaces the default extractor, which returns the element's href attribute.
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 82
    def extract_link(&block)
      @extract_link = block
    end
#extract_next_page_link {|element| ... } ⇒ Object
Register a block to process each element found by DaimonSkycrawlers::Processor::Spider#next_page_link_rules. The block replaces the default extractor, which returns the element's href attribute.
    # File 'lib/daimon_skycrawlers/processor/spider.rb', line 94
    def extract_next_page_link(&block)
      @extract_next_page_link = block
    end
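To illustrate how a registered extractor block replaces the default one, here is a small stand-alone sketch. `ExtractorDemo` and `urls_from` are hypothetical names, and plain hashes stand in for the Nokogiri elements the real processor would yield; how the library actually invokes the block is an assumption here.

```ruby
# Hypothetical stand-in for the extractor registration; elements are plain
# hashes here, whereas the real processor works on Nokogiri elements.
class ExtractorDemo
  def initialize
    # Same default as the constructor listing: read the href attribute.
    @extract_link = ->(element) { element["href"] }
  end

  # Same shape as Spider#extract_link: store the given block.
  def extract_link(&block)
    @extract_link = block
  end

  # Assumption: the registered block is called once per matched element.
  def urls_from(elements)
    elements.map { |element| @extract_link.call(element) }
  end
end

elements = [
  { "href" => "/a", "data-url" => "https://example.com/a" },
  { "href" => "/b", "data-url" => "https://example.com/b" },
]

demo = ExtractorDemo.new
default_urls = demo.urls_from(elements)             # uses the href default
demo.extract_link { |element| element["data-url"] } # swap in a custom extractor
custom_urls = demo.urls_from(elements)
```

The same pattern would apply to #extract_next_page_link, which stores its block in a separate instance variable.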