Class: RubyCrawl
- Inherits: Object
- Includes: Helpers
- Defined in:
- lib/rubycrawl.rb,
lib/rubycrawl/errors.rb,
lib/rubycrawl/result.rb,
lib/rubycrawl/browser.rb,
lib/rubycrawl/helpers.rb,
lib/rubycrawl/railtie.rb,
lib/rubycrawl/version.rb,
lib/rubycrawl/site_crawler.rb,
lib/rubycrawl/robots_parser.rb,
lib/rubycrawl/url_normalizer.rb,
lib/rubycrawl/browser/extraction.rb,
lib/rubycrawl/markdown_converter.rb
Overview
RubyCrawl — pure Ruby web crawler with full JavaScript rendering via Ferrum.
Defined Under Namespace
Modules: Helpers, MarkdownConverter, UrlNormalizer
Classes: Browser, ConfigurationError, Error, NavigationError, Railtie, Result, RobotsParser, ServiceError, SiteCrawler, TimeoutError
Constant Summary collapse
- VERSION = '0.4.0'
Constants included from Helpers
Class Method Summary collapse
- .client ⇒ Object
- .configure(**options) ⇒ Object
- .crawl(url, **options) ⇒ RubyCrawl::Result
  Crawl a single URL and return a Result.
- .crawl_site(url) {|page| ... } ⇒ Integer
  Crawl multiple pages starting from a URL, following links.
Instance Method Summary collapse
- #crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts) ⇒ Object
- #crawl_site(url, **options, &block) ⇒ Object
- #initialize(**options) ⇒ RubyCrawl (constructor)
  A new instance of RubyCrawl.
Constructor Details
Class Method Details
.client ⇒ Object
# File 'lib/rubycrawl.rb', line 18

def client
  @client ||= new
end
.configure(**options) ⇒ Object
# File 'lib/rubycrawl.rb', line 49

def configure(**)
  @client = new(**)
end
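`configure(**)` relies on anonymous keyword-argument forwarding (Ruby 3.2+): every keyword passed to `configure` is handed to `new` unchanged, and the resulting instance replaces the memoized client used by `.client`. A minimal self-contained sketch of the pattern; the `Service` class and its `timeout` option are illustrative, not part of RubyCrawl:

```ruby
class Service
  def initialize(**options)
    # Illustrative option; RubyCrawl's actual options are not shown here.
    @timeout = options.fetch(:timeout, 30)
  end

  attr_reader :timeout

  class << self
    # Anonymous keyword forwarding (Ruby 3.2+): all keywords
    # pass straight through to the constructor.
    def configure(**)
      @client = new(**)
    end

    # Lazily build a default client when none was configured.
    def client
      @client ||= new
    end
  end
end

Service.configure(timeout: 5)
Service.client.timeout # => 5
```

Calling `configure` again simply swaps in a freshly built client, so later `.client` calls see the new options.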
.crawl(url, **options) ⇒ RubyCrawl::Result
Crawl a single URL and return a Result.
# File 'lib/rubycrawl.rb', line 26

def crawl(url, **)
  client.crawl(url, **)
end
.crawl_site(url) {|page| ... } ⇒ Integer
Crawl multiple pages starting from a URL, following links. Yields each page result to the block as it is crawled.
# File 'lib/rubycrawl.rb', line 45

def crawl_site(url, ...)
  client.crawl_site(url, ...)
end
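The `(url, ...)` signature uses Ruby 3.0+ triple-dot argument forwarding with a leading argument: everything after `url`, including keywords and the block, is relayed to the instance method. A self-contained sketch of that delegation pattern; the `Crawler` class, its `max_pages:` option, and the page URLs are illustrative stand-ins, not RubyCrawl's real behavior:

```ruby
class Crawler
  # Stand-in instance method: yields fabricated page URLs to the
  # block and returns how many pages were "crawled".
  def crawl_site(url, max_pages: 10, &block)
    max_pages.times { |i| block.call("#{url}/page#{i}") }
    max_pages
  end

  class << self
    def client
      @client ||= new
    end

    # `...` forwards keywords and the block along with any
    # remaining positional arguments to the instance method.
    def crawl_site(url, ...)
      client.crawl_site(url, ...)
    end
  end
end

pages = []
count = Crawler.crawl_site("https://example.com", max_pages: 3) { |p| pages << p }
# count == 3; pages holds page0 through page2
```

This keeps the class-level API a thin facade: the class method never needs updating when the instance method grows new options.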
Instance Method Details
#crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts) ⇒ Object
# File 'lib/rubycrawl.rb', line 63

def crawl(url, wait_until: @wait_until, block_resources: @block_resources, max_attempts: @max_attempts)
  validate_url!(url)
  validate_wait_until!(wait_until)

  with_retries(max_attempts) do
    @browser.crawl(url, wait_until: wait_until, block_resources: block_resources)
  end
end
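`with_retries` comes from the included `Helpers` module, whose source is not shown on this page. A plausible sketch is a loop that re-runs the block up to `max_attempts` times and re-raises the last error once attempts are exhausted; the rescued exception class and re-raise behavior here are assumptions, not RubyCrawl's confirmed implementation:

```ruby
# Hypothetical reimplementation of a retry helper in the style of
# Helpers#with_retries; the real RubyCrawl code may differ.
def with_retries(max_attempts)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    retry if attempts < max_attempts
    raise # out of attempts: surface the last error
  end
end

calls = 0
result = with_retries(3) do
  calls += 1
  raise "flaky" if calls < 3
  :ok
end
# result == :ok after two failed attempts
```

Under this reading, `#crawl` treats transient browser failures (timeouts, dropped connections) as retryable up to `max_attempts`, and only the final failure propagates to the caller.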
#crawl_site(url, **options, &block) ⇒ Object
# File 'lib/rubycrawl.rb', line 71

def crawl_site(url, **options, &block)
  SiteCrawler.new(self, options).crawl(url, &block)
end