Class: RubyCrawl
- Inherits: Object
- Includes: Helpers
- Defined in:
  lib/rubycrawl.rb,
  lib/rubycrawl/errors.rb,
  lib/rubycrawl/result.rb,
  lib/rubycrawl/helpers.rb,
  lib/rubycrawl/railtie.rb,
  lib/rubycrawl/version.rb,
  lib/rubycrawl/site_crawler.rb,
  lib/rubycrawl/service_client.rb,
  lib/rubycrawl/url_normalizer.rb,
  lib/rubycrawl/markdown_converter.rb
Overview
RubyCrawl provides a simple interface for crawling pages via a local Playwright service.
Defined Under Namespace
Modules: Helpers, MarkdownConverter, UrlNormalizer
Classes: ConfigurationError, Error, NavigationError, Railtie, Result, ServiceClient, ServiceError, SiteCrawler, TimeoutError
Constant Summary
- DEFAULT_HOST = '127.0.0.1'
- DEFAULT_PORT = 3344
- VERSION = '0.1.4'
Class Method Summary
- .client ⇒ Object
- .configure(**options) ⇒ Object
- .crawl(url, **options) ⇒ Object
- .crawl_site(url) {|page| ... } ⇒ Integer
  Crawl multiple pages starting from a URL, following links.
- .create_session ⇒ String
  Create a session for reusing browser context across multiple crawls.
- .destroy_session(session_id) ⇒ Object
  Destroy a session and close its browser context.
Instance Method Summary
- #crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts, session_id: nil) ⇒ Object
- #crawl_site(url, **options, &block) ⇒ Object
  Crawl multiple pages starting from a URL, following links.
- #create_session ⇒ String
  Create a session for reusing browser context.
- #destroy_session(session_id) ⇒ Object
  Destroy a session.
- #initialize(**options) ⇒ RubyCrawl (constructor)
  A new instance of RubyCrawl.
Constructor Details
#initialize(**options) ⇒ RubyCrawl
Returns a new instance of RubyCrawl.
# File 'lib/rubycrawl.rb', line 65
def initialize(**)
  ()
  build_service_client
end
Class Method Details
.client ⇒ Object
# File 'lib/rubycrawl.rb', line 21
def client
  @client ||= new
end
.configure(**options) ⇒ Object
# File 'lib/rubycrawl.rb', line 60
def configure(**)
  @client = new(**)
end
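The two methods above share one memoized instance: `.client` builds it lazily and `.configure` replaces it. A minimal sketch of that pattern follows, using a stand-in class rather than the real library. `MiniCrawl` and its `host:` and `port:` keywords are assumptions made for illustration; the options RubyCrawl actually accepts are not listed on this page.

```ruby
# Stand-in class (hypothetical, not the real RubyCrawl) showing the
# memoize-and-replace pattern behind .client and .configure.
class MiniCrawl
  DEFAULT_HOST = '127.0.0.1'
  DEFAULT_PORT = 3344

  attr_reader :host, :port

  class << self
    # Lazily build one shared instance and reuse it afterwards.
    def client
      @client ||= new
    end

    # Replace the shared instance with a freshly configured one.
    def configure(**options)
      @client = new(**options)
    end
  end

  # The host: and port: keywords are assumed for illustration only.
  def initialize(host: DEFAULT_HOST, port: DEFAULT_PORT)
    @host = host
    @port = port
  end
end

first = MiniCrawl.client
puts first.equal?(MiniCrawl.client)   # same object on repeated calls
MiniCrawl.configure(port: 4000)
puts first.equal?(MiniCrawl.client)   # a new object after configure
puts MiniCrawl.client.port
```

One consequence of this design is that any reference obtained from `.client` before a call to `.configure` keeps pointing at the old instance.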
.crawl(url, **options) ⇒ Object
# File 'lib/rubycrawl.rb', line 25
def crawl(url, **)
  client.crawl(url, **)
end
.crawl_site(url) {|page| ... } ⇒ Integer
Crawl multiple pages starting from a URL, following links. Yields each page result to the block as it is crawled.
# File 'lib/rubycrawl.rb', line 44
def crawl_site(url, ...)
  client.crawl_site(url, ...)
end
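The contract described above (follow links, yield each page as it is crawled, return an Integer) can be illustrated without a browser. The sketch below walks a hypothetical in-memory link graph breadth-first. `SITE`, `crawl_site_sketch`, and the hash yielded to the block are all illustrative names, and reading the Integer as the number of pages yielded is an assumption; this is not how `SiteCrawler` is known to be implemented.

```ruby
require 'set'

# Hypothetical in-memory "site": each URL maps to the links found on it.
SITE = {
  'https://example.com/'  => ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a' => ['https://example.com/b', 'https://example.com/'],
  'https://example.com/b' => ['https://example.com/c'],
  'https://example.com/c' => []
}.freeze

# Breadth-first crawl that yields each page once and returns how many
# pages were yielded. The Integer-as-page-count reading is an assumption.
def crawl_site_sketch(start_url, site)
  visited = Set.new([start_url])
  queue = [start_url]
  count = 0
  until queue.empty?
    url = queue.shift
    links = site.fetch(url, [])
    yield({ url: url, links: links })
    count += 1
    # Set#add? returns nil for URLs already seen, so each is queued once.
    links.each { |link| queue << link if visited.add?(link) }
  end
  count
end

seen = []
total = crawl_site_sketch('https://example.com/', SITE) { |page| seen << page[:url] }
puts total
puts seen
```

Yielding pages as they arrive lets the caller process or store each result without holding the whole crawl in memory.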
.create_session ⇒ String
Create a session for reusing browser context across multiple crawls.
# File 'lib/rubycrawl.rb', line 50
def create_session
  client.create_session
end
.destroy_session(session_id) ⇒ Object
Destroy a session and close its browser context.
# File 'lib/rubycrawl.rb', line 56
def destroy_session(session_id)
  client.destroy_session(session_id)
end
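Because a session holds a browser context open until it is destroyed, callers will probably want to pair `create_session` with `destroy_session` in an `ensure` block. The sketch below shows one way to do that. `FakeClient` is a hypothetical in-memory stand-in that only mirrors the three documented method names, and `with_session` is an illustrative helper that is not part of the library.

```ruby
# Hypothetical in-memory stand-in exposing the documented method names;
# the real client talks to the Playwright service instead.
class FakeClient
  attr_reader :open_sessions, :log

  def initialize
    @open_sessions = []
    @log = []
    @counter = 0
  end

  def create_session
    @counter += 1
    id = "session-#{@counter}"
    @open_sessions << id
    id
  end

  def crawl(url, session_id: nil)
    @log << [url, session_id]
    url
  end

  def destroy_session(session_id)
    @open_sessions.delete(session_id)
  end
end

# Hypothetical helper: yields a session id and always destroys it,
# even when the block raises.
def with_session(client)
  session_id = client.create_session
  yield session_id
ensure
  client.destroy_session(session_id) if session_id
end

client = FakeClient.new
with_session(client) do |sid|
  client.crawl('https://example.com/login', session_id: sid)
  client.crawl('https://example.com/account', session_id: sid)
end
puts client.open_sessions.empty?
```

With the real library the same shape should apply by passing the returned id as `session_id:` to `crawl`, which is the keyword shown in the instance method signature.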
Instance Method Details
#crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts, session_id: nil) ⇒ Object
# File 'lib/rubycrawl.rb', line 70
def crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts, session_id: nil)
  validate_url!(url)
  @service_client.ensure_running
  with_retries(max_attempts) do
    payload = build_payload(url, wait_until, block_resources, session_id)
    response = @service_client.post_json('/crawl', payload)
    raise_node_error!(response)
    build_result(response)
  end
end
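The body of `with_retries` is not shown on this page. As a rough illustration of what a `max_attempts` wrapper of this shape could look like, here is a minimal sketch. `TransientError` is a hypothetical error class, and the choice to retry only that class with no delay between attempts is an assumption, not the library's documented behaviour.

```ruby
# Hypothetical error class standing in for a retryable failure.
class TransientError < StandardError; end

# Run the block up to max_attempts times, re-raising the last error
# once the attempts are exhausted. No backoff is applied here.
def with_retries(max_attempts)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue TransientError
    retry if attempts < max_attempts
    raise
  end
end

result = with_retries(3) do |attempt|
  # Fail on the first two attempts, succeed on the third.
  raise TransientError, 'service not ready' if attempt < 3
  "ok after #{attempt} attempts"
end
puts result
```

Errors other than the retryable class pass straight through, so a failure that cannot succeed on a second try (an invalid URL, for example) would not be repeated.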
#crawl_site(url, **options, &block) ⇒ Object
Crawl multiple pages starting from a URL, following links.
# File 'lib/rubycrawl.rb', line 96
def crawl_site(url, **, &block)
  @service_client.ensure_running
   = ()
  crawler = SiteCrawler.new(self, )
  crawler.crawl(url, &block)
end
#create_session ⇒ String
Create a session for reusing browser context.
# File 'lib/rubycrawl.rb', line 83
def create_session
  @service_client.ensure_running
  @service_client.create_session
end
#destroy_session(session_id) ⇒ Object
Destroy a session.
# File 'lib/rubycrawl.rb', line 90
def destroy_session(session_id)
  @service_client.destroy_session(session_id)
end