Class: RubyCrawl

Inherits: Object

Includes: Helpers

Defined in: lib/rubycrawl.rb,
lib/rubycrawl/errors.rb,
lib/rubycrawl/result.rb,
lib/rubycrawl/helpers.rb,
lib/rubycrawl/railtie.rb,
lib/rubycrawl/version.rb,
lib/rubycrawl/site_crawler.rb,
lib/rubycrawl/service_client.rb,
lib/rubycrawl/url_normalizer.rb,
lib/rubycrawl/markdown_converter.rb

Overview

RubyCrawl provides a simple interface for crawling pages via a local Playwright service.

Defined Under Namespace

Modules: Helpers, MarkdownConverter, UrlNormalizer

Classes: ConfigurationError, Error, NavigationError, Railtie, Result, ServiceClient, ServiceError, SiteCrawler, TimeoutError

Constant Summary

DEFAULT_HOST = '127.0.0.1'

DEFAULT_PORT =

VERSION = '0.1.4'

Class Method Summary

.client ⇒ Object
.configure(**options) ⇒ Object
.crawl(url, **options) ⇒ Object
.crawl_site(url) {|page| ... } ⇒ Integer

Crawl multiple pages starting from a URL, following links.
.create_session ⇒ String

Create a session for reusing browser context across multiple crawls.
.destroy_session(session_id) ⇒ Object

Destroy a session and close its browser context.

Instance Method Summary

#crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts, session_id: nil) ⇒ Object
#crawl_site(url, **options, &block) ⇒ Object

Crawl multiple pages starting from a URL, following links.
#create_session ⇒ String

Create a session for reusing browser context.
#destroy_session(session_id) ⇒ Object

Destroy a session.
#initialize(**options) ⇒ RubyCrawl constructor

A new instance of RubyCrawl.

Constructor Details

#initialize(**options) ⇒ `RubyCrawl`

Returns a new instance of RubyCrawl.

# File 'lib/rubycrawl.rb', line 65

def initialize(**options)
  load_options(options)
  build_service_client
end

Class Method Details

.client ⇒ `Object`



# File 'lib/rubycrawl.rb', line 21

def client
  @client ||= new
end

.configure(**options) ⇒ `Object`



# File 'lib/rubycrawl.rb', line 60

def configure(**options)
  @client = new(**options)
end
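
The two class methods above work together: `.client` lazily builds a default instance, and `.configure` replaces it with one built from new options. The following is a minimal sketch of that pattern only. `MiniCrawl` is a hypothetical stand-in class (not part of RubyCrawl) so the example runs without the gem or the Playwright service, and `max_attempts: 5` is an illustrative option value.

```ruby
# MiniCrawl is a hypothetical stand-in that mirrors only the
# client/configure pattern shown above; it does no crawling.
class MiniCrawl
  attr_reader :options

  def initialize(**options)
    @options = options
  end

  class << self
    # Lazily build and memoize a default instance.
    def client
      @client ||= new
    end

    # Replace the memoized instance with one built from new options.
    def configure(**options)
      @client = new(**options)
    end
  end
end

first = MiniCrawl.client
same = MiniCrawl.client              # memoized: the same object as `first`
MiniCrawl.configure(max_attempts: 5)
replaced = MiniCrawl.client          # now the newly configured instance
```

One consequence of this design is that any reference to the old client taken before `.configure` keeps pointing at the previous instance, so configuration is probably best done once at startup.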

.crawl(url, **options) ⇒ `Object`



# File 'lib/rubycrawl.rb', line 25

def crawl(url, **options)
  client.crawl(url, **options)
end

.crawl_site(url) {|page| ... } ⇒ `Integer`

Crawl multiple pages starting from a URL, following links. Yields each page result to the block as it is crawled.

Examples:

Save pages to database

RubyCrawl.crawl_site("https://example.com", max_pages: 100) do |page|
  Page.create!(url: page.url, html: page.html, depth: page.depth)
end

Parameters:

url (String): The starting URL
max_pages (Integer): Maximum number of pages to crawl (default: 50)
max_depth (Integer): Maximum link depth from start URL (default: 3)
same_host_only (Boolean): Only follow links on the same host (default: true)

Yields:

(page): Each page result as it is crawled

Yield Parameters:

page (SiteCrawler::PageResult): The crawled page result

Returns:

(Integer): Number of pages crawled



# File 'lib/rubycrawl.rb', line 44

def crawl_site(url, ...)
  client.crawl_site(url, ...)
end
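
To illustrate how `max_pages`, `max_depth`, and `same_host_only` might interact, here is a minimal breadth-first sketch over an in-memory link graph. It is not SiteCrawler's implementation: `LINKS`, `PageStub`, and `sketch_crawl_site` are hypothetical names, and treating the start URL as depth 0 is an assumption. Only the parameter names and defaults are taken from the documentation above.

```ruby
require 'uri'

# Hypothetical in-memory "site": each URL maps to the links found on it.
LINKS = {
  'https://example.com/'  => ['https://example.com/a', 'https://other.test/x'],
  'https://example.com/a' => ['https://example.com/b'],
  'https://example.com/b' => ['https://example.com/c'],
  'https://example.com/c' => []
}.freeze

PageStub = Struct.new(:url, :depth)

# Breadth-first walk that yields each page and returns the page count.
def sketch_crawl_site(start_url, max_pages: 50, max_depth: 3, same_host_only: true)
  start_host = URI(start_url).host
  queue = [[start_url, 0]]
  seen = { start_url => true }
  count = 0

  until queue.empty? || count >= max_pages
    url, depth = queue.shift
    yield PageStub.new(url, depth) if block_given?
    count += 1
    next if depth >= max_depth # do not follow links past the depth limit

    LINKS.fetch(url, []).each do |link|
      next if seen[link]
      next if same_host_only && URI(link).host != start_host

      seen[link] = true
      queue << [link, depth + 1]
    end
  end

  count
end

visited = []
total = sketch_crawl_site('https://example.com/', max_depth: 2) do |page|
  visited << page.url
end
```

In this sketch the `seen` hash prevents a page from being queued twice, and the off-host link is skipped unless `same_host_only: false` is passed.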

.create_session ⇒ `String`

Create a session for reusing browser context across multiple crawls.

Returns:

(String): session_id



# File 'lib/rubycrawl.rb', line 50

def create_session
  client.create_session
end

.destroy_session(session_id) ⇒ `Object`

Destroy a session and close its browser context.

Parameters:

session_id (String)



# File 'lib/rubycrawl.rb', line 56

def destroy_session(session_id)
  client.destroy_session(session_id)
end
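
Because sessions are created and destroyed explicitly, a caller may want to guarantee cleanup even when a crawl raises. The sketch below shows one way to do that with `ensure`. `with_crawl_session` and `FakeCrawler` are hypothetical names introduced for illustration; the helper only assumes an object that responds to `create_session` and `destroy_session(session_id)`, as documented above.

```ruby
# Hypothetical helper (not part of RubyCrawl): yields a session id and
# always destroys the session afterwards, even if the block raises.
def with_crawl_session(crawler)
  session_id = crawler.create_session
  begin
    yield session_id
  ensure
    crawler.destroy_session(session_id)
  end
end

# Stand-in object so the sketch runs without the Playwright service.
class FakeCrawler
  attr_reader :open_sessions

  def initialize
    @open_sessions = []
    @next_id = 0
  end

  def create_session
    id = "session-#{@next_id += 1}"
    @open_sessions << id
    id
  end

  def destroy_session(session_id)
    @open_sessions.delete(session_id)
  end
end

crawler = FakeCrawler.new
with_crawl_session(crawler) do |session_id|
  # With the real library, calls that pass session_id: session_id
  # to crawl would go here.
  crawler.open_sessions.include?(session_id)
end
```

Since the helper relies only on those two methods, it should in principle accept either the `RubyCrawl` class itself or an instance, though that has not been verified against the gem here.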

Instance Method Details

#crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts, session_id: nil) ⇒ `Object`

# File 'lib/rubycrawl.rb', line 70

def crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts, session_id: nil)
  validate_url!(url)
  @service_client.ensure_running
  with_retries(max_attempts) do
    payload = build_payload(url, wait_until, block_resources, session_id)
    response = @service_client.post_json('/crawl', payload)
    raise_node_error!(response)
    build_result(response)
  end
end
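
The method body above wraps the request in `with_retries(max_attempts)`, whose source is not shown on this page. As a rough illustration of what such a helper could look like, here is a hypothetical `sketch_with_retries`. The `retry_on` and `base_delay` parameters and the exponential backoff are assumptions, and RubyCrawl's own helper may differ, for example in which errors it retries.

```ruby
# Hypothetical retry helper; RubyCrawl's own with_retries is not shown
# on this page and may behave differently.
def sketch_with_retries(max_attempts, retry_on: [StandardError], base_delay: 0)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *retry_on
    # Give up and re-raise the last error once attempts are exhausted.
    raise if attempt >= max_attempts

    # Optional exponential backoff: base_delay, 2x, 4x, ...
    sleep(base_delay * (2**(attempt - 1))) if base_delay.positive?
    retry
  end
end

calls = 0
outcome = sketch_with_retries(3) do |attempt|
  calls += 1
  raise IOError, 'transient failure' if attempt < 3

  "ok after #{attempt} attempts"
end
```

Note that in `#crawl` the payload is rebuilt inside the retried block, so each attempt sends a fresh request.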

#crawl_site(url, **options, &block) ⇒ `Object`

Crawl multiple pages starting from a URL, following links.

#create_session ⇒ `String`

Create a session for reusing browser context.

Returns:

(String): session_id

# File 'lib/rubycrawl.rb', line 83

def create_session
  @service_client.ensure_running
  @service_client.create_session
end

#destroy_session(session_id) ⇒ `Object`

Destroy a session.

Parameters:

session_id (String)



# File 'lib/rubycrawl.rb', line 90

def destroy_session(session_id)
  @service_client.destroy_session(session_id)
end
