Class: RubyCrawl

Inherits: Object
Includes:
Helpers
Defined in:
lib/rubycrawl.rb,
lib/rubycrawl/errors.rb,
lib/rubycrawl/result.rb,
lib/rubycrawl/browser.rb,
lib/rubycrawl/helpers.rb,
lib/rubycrawl/railtie.rb,
lib/rubycrawl/version.rb,
lib/rubycrawl/site_crawler.rb,
lib/rubycrawl/robots_parser.rb,
lib/rubycrawl/url_normalizer.rb,
lib/rubycrawl/browser/extraction.rb,
lib/rubycrawl/markdown_converter.rb

Overview

RubyCrawl — pure Ruby web crawler with full JavaScript rendering via Ferrum.

Defined Under Namespace

Modules: Helpers, MarkdownConverter, UrlNormalizer

Classes: Browser, ConfigurationError, Error, NavigationError, Railtie, Result, RobotsParser, ServiceError, SiteCrawler, TimeoutError

Constant Summary

VERSION = '0.4.0'

Constants included from Helpers

Helpers::VALID_WAIT_UNTIL

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(**options) ⇒ RubyCrawl

Returns a new instance of RubyCrawl.



54
55
56
57
58
59
60
61
# File 'lib/rubycrawl.rb', line 54

# Build a new crawler. Options are resolved by #load_options into the
# instance variables consumed below, then a Ferrum-backed Browser is
# constructed with the browser-related subset.
def initialize(**options)
  load_options(options)

  # Gather the settings #load_options resolved and pass them on as
  # keywords to the Browser wrapper.
  browser_config = {
    timeout:         @timeout,
    headless:        @headless,
    browser_options: @browser_options
  }
  @browser = Browser.new(**browser_config)
end

Class Method Details

.clientObject



18
19
20
# File 'lib/rubycrawl.rb', line 18

# Shared, lazily-built default instance backing the class-level
# convenience API (.crawl / .crawl_site).
def client
  @client || (@client = new)
end

.configure(**options) ⇒ Object



49
50
51
# File 'lib/rubycrawl.rb', line 49

# Replace the shared client with a freshly configured instance.
# Subsequent class-level calls use these options.
def configure(**options)
  fresh_client = new(**options)
  @client = fresh_client
end

.crawl(url, **options) ⇒ RubyCrawl::Result

Crawl a single URL and return a Result.

Parameters:

  • url (String)
  • options (Hash)

    wait_until:, block_resources:, max_attempts:

Returns:



26
27
28
# File 'lib/rubycrawl.rb', line 26

# Class-level convenience wrapper: delegates a single-URL crawl to the
# memoized client (see the instance method of the same name).
def crawl(url, **options) = client.crawl(url, **options)

.crawl_site(url) {|page| ... } ⇒ Integer

Crawl multiple pages starting from a URL, following links. Yields each page result to the block as it is crawled.

Examples:

RubyCrawl.crawl_site("https://example.com", max_pages: 100) do |page|
  Page.create!(url: page.url, content: page.clean_text, depth: page.depth)
end

Parameters:

  • url (String)

    The starting URL

  • max_pages (Integer)

    Maximum number of pages to crawl (default: 50)

  • max_depth (Integer)

    Maximum link depth from start URL (default: 3)

  • same_host_only (Boolean)

    Only follow links on the same host (default: true)

Yields:

  • (page)

    Yields each page result as it is crawled

Yield Parameters:

Returns:

  • (Integer)

    Number of pages crawled



45
46
47
# File 'lib/rubycrawl.rb', line 45

# Class-level convenience wrapper: forwards every positional, keyword,
# and block argument to the shared client's #crawl_site.
def crawl_site(url, *args, **opts, &block)
  client.crawl_site(url, *args, **opts, &block)
end

Instance Method Details

#crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts) ⇒ Object



63
64
65
66
67
68
69
# File 'lib/rubycrawl.rb', line 63

# Fetch a single URL through the managed browser, retrying transient
# failures up to +max_attempts+ times via #with_retries.
#
# @param url [String] the page to fetch; validated before any network work
# @param wait_until [Object] page-load condition, checked against the
#   values accepted by #validate_wait_until!
# @param block_resources [Boolean] forwarded to the browser as-is
# @param max_attempts [Integer] retry budget handed to #with_retries
# @return the value produced by Browser#crawl — presumably a Result;
#   TODO(review): confirm against lib/rubycrawl/browser.rb
def crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts)
  # Fail fast on bad input before touching the browser.
  validate_url!(url)
  validate_wait_until!(wait_until)

  with_retries(max_attempts) { @browser.crawl(url, wait_until: wait_until, block_resources: block_resources) }
end

#crawl_site(url, **options, &block) ⇒ Object



71
72
73
74
# File 'lib/rubycrawl.rb', line 71

# Crawl multiple pages starting from +url+, yielding each page result
# to +block+ as it is produced. Options are normalized by
# #build_crawler_options before the SiteCrawler is constructed.
def crawl_site(url, **options, &block)
  SiteCrawler.new(self, build_crawler_options(options)).crawl(url, &block)
end