Class: RubyCrawl::Browser

Inherits:

Object

Defined in: lib/rubycrawl/browser.rb,
lib/rubycrawl/browser/extraction.rb

Overview

Wraps Ferrum to provide a simple crawl interface. Each crawl runs in its own isolated page, created in a new browser context with its own cookies and storage. The browser (Chrome) is launched once, lazily, and then reused across crawls.

Defined Under Namespace

Modules: Extraction

Constant Summary

BLOCKED_RESOURCE_TYPES =

%w[image media font stylesheet].freeze
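A minimal sketch of how a list like this might drive request blocking. The page does not show setup_resource_blocking, so everything here is an assumption: FakeRequest and decision_for are hypothetical stand-ins, and the downcasing step assumes the browser may report types capitalized (for example "Image").

```ruby
# Hypothetical sketch: not the library's actual setup_resource_blocking.
BLOCKED_RESOURCE_TYPES = %w[image media font stylesheet].freeze

# Stand-in for an intercepted request; only the resource type matters here.
FakeRequest = Struct.new(:url, :resource_type)

# Returns :abort for blocked resource types and :continue for everything else.
# Downcasing is an assumption, in case types arrive as "Image" or "Font".
def decision_for(request)
  type = request.resource_type.to_s.downcase
  BLOCKED_RESOURCE_TYPES.include?(type) ? :abort : :continue
end

requests = [
  FakeRequest.new("https://example.com/", "Document"),
  FakeRequest.new("https://example.com/logo.png", "Image"),
  FakeRequest.new("https://example.com/app.js", "Script"),
  FakeRequest.new("https://example.com/site.css", "Stylesheet")
]

# Print the decision next to each URL.
requests.each do |request|
  puts "#{decision_for(request).to_s.ljust(8)} #{request.url}"
end
```

Scripts and documents pass through in this sketch, so pages that render content with JavaScript would still load; only the four listed types are dropped.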

Instance Method Summary

#crawl(url, wait_until: nil, block_resources: true) ⇒ RubyCrawl::Result

Crawl a URL and return a RubyCrawl::Result.

#initialize(timeout: 30, headless: true, browser_options: {}) ⇒ Browser (constructor)

A new instance of Browser.

Constructor Details

#initialize(timeout: 30, headless: true, browser_options: {}) ⇒ `Browser`

Returns a new instance of Browser.

# File 'lib/rubycrawl/browser.rb', line 15

def initialize(timeout: 30, headless: true, browser_options: {})
  @timeout         = timeout
  @headless        = headless
  @browser_options = browser_options
  @browser         = nil
  @mutex           = Mutex.new
end
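The constructor only stores its options; @browser starts as nil and a Mutex is created alongside it. A minimal sketch of the lazy, mutex-guarded launch this suggests is below. The body of lazy_browser is not shown on this page, so this is an assumption about the pattern, and FakeChrome and LazyLauncher are hypothetical stand-ins rather than RubyCrawl or Ferrum classes.

```ruby
# Hypothetical stand-in for an expensive browser process.
class FakeChrome
  @launches = 0

  class << self
    attr_accessor :launches
  end

  def initialize
    # Only ever called inside the launcher's mutex, so the increment is safe.
    self.class.launches += 1
  end
end

# Hypothetical holder that mirrors the @browser / @mutex pair above.
class LazyLauncher
  def initialize
    @browser = nil
    @mutex   = Mutex.new
  end

  # Launches on the first call; later calls return the same instance.
  def lazy_browser
    @mutex.synchronize { @browser ||= FakeChrome.new }
  end
end

launcher = LazyLauncher.new
threads  = Array.new(8) { Thread.new { launcher.lazy_browser } }
browsers = threads.map(&:value)

puts "launches: #{FakeChrome.launches}"
puts "distinct instances: #{browsers.uniq.size}"
```

Holding the mutex around the check-and-assign keeps two threads that call crawl at the same moment from each starting their own browser.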

Instance Method Details

#crawl(url, wait_until: nil, block_resources: true) ⇒ `RubyCrawl::Result`

Crawl a URL and return a RubyCrawl::Result.

Parameters:

url (String)
wait_until (String, nil) (defaults to: nil) — one of "load", "domcontentloaded", "networkidle", "commit"
block_resources (Boolean) (defaults to: true) — block images/fonts/CSS/media for speed

Returns:

(RubyCrawl::Result)

# File 'lib/rubycrawl/browser.rb', line 29

def crawl(url, wait_until: nil, block_resources: true)
  page = lazy_browser.create_page(new_context: true)

  begin
    setup_resource_blocking(page) if block_resources
    navigate(page, url, wait_until.to_s)
    extract(page)
  rescue ::Ferrum::TimeoutError => e
    raise RubyCrawl::TimeoutError, "Navigation timed out: #{e.message}"
  rescue ::Ferrum::StatusError => e
    raise RubyCrawl::NavigationError, "Navigation failed: #{e.message}"
  rescue ::Ferrum::Error => e
    raise RubyCrawl::ServiceError, "Browser error: #{e.message}"
  ensure
    begin
      page&.close
    rescue StandardError
      nil
    end
  end
end
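The method above translates low-level errors into RubyCrawl errors and always closes the page. A minimal, self-contained sketch of that translate-and-clean-up pattern follows. All names in the Sketch module are hypothetical stand-ins (no Ferrum or RubyCrawl classes are loaded), and the error hierarchy is modeled on the assumption that the specific errors subclass a common base, which is why the specific rescue comes first.

```ruby
module Sketch
  # Stand-ins for the low-level (driver) errors.
  class LowLevelError < StandardError; end
  class LowLevelTimeout < LowLevelError; end

  # Stand-ins for the errors exposed to callers.
  class CrawlTimeout < StandardError; end
  class CrawlServiceError < StandardError; end

  # Stand-in page that records whether it was closed.
  class FakePage
    attr_reader :closed

    def initialize
      @closed = false
    end

    def close
      @closed = true
    end
  end

  # Yields the page, translates driver errors, and always closes the page.
  def self.with_page(page)
    yield page
  rescue LowLevelTimeout => e
    # Specific subclass first; the generic clause below would also match it.
    raise CrawlTimeout, "Navigation timed out: #{e.message}"
  rescue LowLevelError => e
    raise CrawlServiceError, "Browser error: #{e.message}"
  ensure
    begin
      page&.close
    rescue StandardError
      nil # a failed close must not mask the original error
    end
  end
end

page = Sketch::FakePage.new
begin
  Sketch.with_page(page) { raise Sketch::LowLevelTimeout, "no response" }
rescue Sketch::CrawlTimeout => e
  puts e.message
end
puts "page closed: #{page.closed}"
```

Callers of the real method would presumably rescue RubyCrawl::TimeoutError, RubyCrawl::NavigationError, and RubyCrawl::ServiceError in the same way, without needing to know about Ferrum.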