Class: RubyCrawl::Browser
- Inherits: Object
  - Object
  - RubyCrawl::Browser
- Defined in: lib/rubycrawl/browser.rb, lib/rubycrawl/browser/extraction.rb
Overview
Wraps Ferrum to provide a simple crawl interface. Each crawl gets its own isolated page, created in a new browser context with its own cookies and storage. The browser (Chrome) is launched lazily on first use and reused across crawls.
Defined Under Namespace
Modules: Extraction
Constant Summary
- BLOCKED_RESOURCE_TYPES = %w[image media font stylesheet].freeze
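As a rough illustration of how a list like BLOCKED_RESOURCE_TYPES could be applied, the sketch below filters requests by resource type. `FakeRequest` and `blocked?` are hypothetical names invented for this example; the body of `setup_resource_blocking` is not shown on this page, so this is an assumption about the general idea, not the library's implementation. Resource types are assumed to be plain strings.

```ruby
# Hypothetical sketch: not RubyCrawl's actual setup_resource_blocking.
BLOCKED_RESOURCE_TYPES = %w[image media font stylesheet].freeze

# Stand-in for an intercepted request; only the resource type matters here.
FakeRequest = Struct.new(:url, :resource_type)

# Returns true when the request's resource type is on the block list.
def blocked?(request)
  BLOCKED_RESOURCE_TYPES.include?(request.resource_type.to_s.downcase)
end

requests = [
  FakeRequest.new("https://example.com/", "document"),
  FakeRequest.new("https://example.com/logo.png", "image"),
  FakeRequest.new("https://example.com/app.css", "stylesheet")
]

# Keep only the requests that would be allowed through.
allowed = requests.reject { |r| blocked?(r) }.map(&:url)
puts allowed.inspect
```

Skipping images, media, fonts, and stylesheets usually leaves the document and its scripts, which is often enough for text extraction. Passing `block_resources: false` to `#crawl` turns the blocking off.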
Instance Method Summary
- #crawl(url, wait_until: nil, block_resources: true) ⇒ RubyCrawl::Result
  Crawl a URL and return a RubyCrawl::Result.
- #initialize(timeout: 30, headless: true, browser_options: {}) ⇒ Browser (constructor)
  A new instance of Browser.
Constructor Details
#initialize(timeout: 30, headless: true, browser_options: {}) ⇒ Browser
Returns a new instance of Browser.
# File 'lib/rubycrawl/browser.rb', line 15

def initialize(timeout: 30, headless: true, browser_options: {})
  @timeout = timeout
  @headless = headless
   = 
  @browser = nil
  @mutex = Mutex.new
end
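The constructor sets `@browser` to nil and creates a `Mutex`, which suggests the browser is created on first use under a lock. The `lazy_browser` method called by `#crawl` is not shown on this page, so the following is only a minimal sketch of that pattern. `FakeBrowser` and `LazyHolder` are hypothetical stand-ins (so the example runs without Ferrum or Chrome), not classes from RubyCrawl.

```ruby
# Hypothetical sketch of lazy, thread-safe browser creation.
# FakeBrowser stands in for a real browser driver and counts its launches.
class FakeBrowser
  @launches = 0
  class << self
    attr_accessor :launches
  end

  def initialize(**options)
    self.class.launches += 1
    @options = options
  end
end

class LazyHolder
  def initialize(timeout: 30, headless: true)
    @timeout = timeout
    @headless = headless
    @browser = nil
    @mutex = Mutex.new
  end

  # Creates the browser on the first call and returns the same instance afterwards.
  # The mutex keeps concurrent first calls from launching two browsers.
  def lazy_browser
    @mutex.synchronize do
      @browser ||= FakeBrowser.new(timeout: @timeout, headless: @headless)
    end
  end
end

holder = LazyHolder.new
browsers = Array.new(4) { Thread.new { holder.lazy_browser } }.map(&:value)
puts FakeBrowser.launches
puts browsers.uniq.size
```

Constructing the holder launches nothing. The launch cost is paid on the first call to `lazy_browser`, and every later call reuses that instance.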
Instance Method Details
#crawl(url, wait_until: nil, block_resources: true) ⇒ RubyCrawl::Result
Crawl a URL and return a RubyCrawl::Result.
# File 'lib/rubycrawl/browser.rb', line 29

def crawl(url, wait_until: nil, block_resources: true)
  page = lazy_browser.create_page(new_context: true)
  begin
    setup_resource_blocking(page) if block_resources
    navigate(page, url, wait_until.to_s)
    extract(page)
  rescue ::Ferrum::TimeoutError => e
    raise RubyCrawl::TimeoutError, "Navigation timed out: #{e.message}"
  rescue ::Ferrum::StatusError => e
    raise RubyCrawl::, "Navigation failed: #{e.message}"
  rescue ::Ferrum::Error => e
    raise RubyCrawl::ServiceError, "Browser error: #{e.message}"
  ensure
    begin
      page&.close
    rescue StandardError
      nil
    end
  end
end
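The method above follows a common shape: translate driver-level errors into the library's own hierarchy, and always close the page in `ensure`. The sketch below reproduces that shape with stand-in classes so it runs without Ferrum. `Driver`, `Crawl`, `FakePage`, and `crawl_with` are hypothetical names for illustration only and do not exist in RubyCrawl or Ferrum.

```ruby
# Hypothetical stand-ins for the driver's and the library's error hierarchies.
module Driver
  class Error < StandardError; end
  class TimeoutError < Error; end
end

module Crawl
  class Error < StandardError; end
  class TimeoutError < Error; end
  class ServiceError < Error; end
end

# Minimal page double: optionally raises on visit and records whether it was closed.
class FakePage
  attr_reader :closed

  def initialize(failure = nil)
    @failure = failure
    @closed = false
  end

  def visit
    raise @failure if @failure
    "<html></html>"
  end

  def close
    @closed = true
  end
end

# Rescue the most specific driver error first, then the general one.
# The ensure clause closes the page on success and on failure alike.
def crawl_with(page)
  page.visit
rescue Driver::TimeoutError => e
  raise Crawl::TimeoutError, "Navigation timed out: #{e.message}"
rescue Driver::Error => e
  raise Crawl::ServiceError, "Browser error: #{e.message}"
ensure
  begin
    page&.close
  rescue StandardError
    nil
  end
end

ok = FakePage.new
html = crawl_with(ok)

slow = FakePage.new(Driver::TimeoutError.new("30s elapsed"))
message = begin
  crawl_with(slow)
rescue Crawl::TimeoutError => e
  e.message
end

puts html
puts message
puts slow.closed
```

The order of the `rescue` clauses matters: if the general `Driver::Error` clause came first, it would also catch timeouts, and callers could no longer tell a timeout from other browser failures.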