Class: RubyCrawl::Browser

Inherits:
Object
Defined in:
lib/rubycrawl/browser.rb,
lib/rubycrawl/browser/extraction.rb

Overview

Wraps Ferrum to provide a simple crawl interface. Each crawl runs in its own isolated page, created in a fresh browser context with its own cookies and storage. The Chrome browser is launched lazily on first use and then reused across crawls.

Defined Under Namespace

Modules: Extraction

Constant Summary

BLOCKED_RESOURCE_TYPES =
%w[image media font stylesheet].freeze
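
A minimal sketch of how a constant like this could drive request filtering. The `blocked_resource?` helper is hypothetical and not part of RubyCrawl. It assumes the resource type arrives as a string, possibly capitalized the way the Chrome DevTools Protocol reports it ("Image", "Stylesheet").

```ruby
# Same value as the constant above, redeclared so the sketch runs standalone.
BLOCKED_RESOURCE_TYPES = %w[image media font stylesheet].freeze

# Hypothetical helper: true when a request of this type should be aborted.
# Downcasing lets it accept both "image" and the CDP-style "Image".
def blocked_resource?(resource_type)
  BLOCKED_RESOURCE_TYPES.include?(resource_type.to_s.downcase)
end

# Example request types as a browser might report them.
requested = %w[Document Image Script Stylesheet Font XHR]
allowed   = requested.reject { |type| blocked_resource?(type) }

puts allowed.inspect
```

Documents, scripts, and XHR requests pass through, so the page can still render its text content while the heavier assets are skipped.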

Instance Method Summary

Constructor Details

#initialize(timeout: 30, headless: true, browser_options: {}) ⇒ Browser

Returns a new instance of Browser.



# File 'lib/rubycrawl/browser.rb', line 15

def initialize(timeout: 30, headless: true, browser_options: {})
  @timeout         = timeout
  @headless        = headless
  @browser_options = browser_options
  @browser         = nil
  @mutex           = Mutex.new
end
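
The constructor only stores settings; `@browser` starts as `nil` and a `Mutex` is created alongside it. The sketch below shows one way such a pair could support a thread-safe lazy launch. `FakeBrowser` and `LazyHolder` are hypothetical stand-ins so the example runs without Chrome, and the body of `lazy_browser` is an assumption, not the library's actual implementation.

```ruby
# Stand-in for the real browser so the sketch runs without Chrome.
class FakeBrowser
  @launches = 0

  class << self
    attr_reader :launches

    # Counts how many times a "browser" has been started.
    def launch
      @launches += 1
      new
    end
  end
end

class LazyHolder
  def initialize
    @browser = nil
    @mutex   = Mutex.new
  end

  # Assumed shape of a lazy accessor: the mutex ensures that concurrent
  # first calls launch only one browser, and later calls reuse it.
  def lazy_browser
    @mutex.synchronize { @browser ||= FakeBrowser.launch }
  end
end

holder   = LazyHolder.new
browsers = Array.new(8) { Thread.new { holder.lazy_browser } }.map(&:value)

puts FakeBrowser.launches
puts browsers.uniq.size
```

Without the mutex, two threads could both see `@browser` as `nil` and each start a browser process, leaving one of them orphaned.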

Instance Method Details

#crawl(url, wait_until: nil, block_resources: true) ⇒ RubyCrawl::Result

Crawl a URL and return a RubyCrawl::Result.

Parameters:

  • url (String)
  • wait_until (String, nil) (defaults to: nil)

    navigation event to wait for: “load”, “domcontentloaded”, “networkidle”, or “commit”

  • block_resources (Boolean) (defaults to: true)

    block images/fonts/CSS/media for speed

Returns:

  • (RubyCrawl::Result)



# File 'lib/rubycrawl/browser.rb', line 29

def crawl(url, wait_until: nil, block_resources: true)
  page = lazy_browser.create_page(new_context: true)

  begin
    setup_resource_blocking(page) if block_resources
    navigate(page, url, wait_until.to_s)
    extract(page)
  rescue ::Ferrum::TimeoutError => e
    raise RubyCrawl::TimeoutError, "Navigation timed out: #{e.message}"
  rescue ::Ferrum::StatusError => e
    raise RubyCrawl::NavigationError, "Navigation failed: #{e.message}"
  rescue ::Ferrum::Error => e
    raise RubyCrawl::ServiceError, "Browser error: #{e.message}"
  ensure
    begin
      page&.close
    rescue StandardError
      nil
    end
  end
end
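
The rescue clauses in #crawl list the specific errors before the general one. The sketch below illustrates why that order matters, using hypothetical stand-in classes (`FakeFerrum`, `Sketch`) so it runs without Ferrum installed. It assumes the timeout and status errors subclass the base error, as the rescue order above suggests.

```ruby
# Stand-ins mirroring the assumed hierarchy: specific errors subclass a base.
module FakeFerrum
  class Error        < StandardError; end
  class TimeoutError < Error; end
  class StatusError  < Error; end
end

module Sketch
  class TimeoutError    < StandardError; end
  class NavigationError < StandardError; end
  class ServiceError    < StandardError; end

  # Runs the block and translates low-level errors. Ruby checks rescue
  # clauses top to bottom, so the base class has to come last; otherwise
  # it would swallow the more specific errors.
  def self.translate
    yield
  rescue FakeFerrum::TimeoutError => e
    raise TimeoutError, "Navigation timed out: #{e.message}"
  rescue FakeFerrum::StatusError => e
    raise NavigationError, "Navigation failed: #{e.message}"
  rescue FakeFerrum::Error => e
    raise ServiceError, "Browser error: #{e.message}"
  end
end

# Helper for the demo: returns the name of the translated error class.
def outcome(error_class)
  Sketch.translate { raise error_class, "boom" }
rescue StandardError => e
  e.class.name
end

puts outcome(FakeFerrum::TimeoutError)
puts outcome(FakeFerrum::StatusError)
puts outcome(FakeFerrum::Error)
```

Callers of #crawl can therefore rescue RubyCrawl::TimeoutError, RubyCrawl::NavigationError, and RubyCrawl::ServiceError separately without depending on Ferrum's own error classes.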