rubycrawl
Playwright-based web crawler for Ruby — inspired by crawl4ai (Python), with an idiomatic Ruby API and production-ready features.
RubyCrawl provides accurate, JavaScript-enabled web scraping using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs and dynamic websites.
Features
- Playwright-powered: Real browser automation for JavaScript-heavy sites
- Production-ready: Designed for Rails apps and production environments
- Simple API: Clean, minimal Ruby interface — zero Playwright knowledge required
- Resource optimization: Built-in resource blocking for faster crawls
- Auto-managed browsers: Browser process reuse and automatic lifecycle management
- Content extraction: HTML, links, and Markdown conversion
- Multi-page crawling: BFS crawler with depth limits and deduplication
- Rails integration: First-class Rails support with generators and initializers
Table of Contents
- Installation
- Quick Start
- Usage
- Rails Integration
- Production Deployment
- Architecture
- Performance
- Development
- Contributing
- License
Installation
Requirements
- Ruby >= 3.0
- Node.js LTS (v18+ recommended) — required for the bundled Playwright service
Add to Gemfile
gem "rubycrawl"
Then install:
bundle install
Install Playwright browsers
After bundling, install the Playwright browsers:
bundle exec rake rubycrawl:install
This command:
- Installs Node.js dependencies in the bundled node/ directory
- Downloads Playwright browsers (Chromium, Firefox, WebKit)
- Creates a Rails initializer (if using Rails)
Quick Start
require "rubycrawl"
# Simple crawl
result = RubyCrawl.crawl("https://example.com")
# Access extracted content
puts result.html # Raw HTML content
puts result.markdown # Converted to Markdown
puts result.links # Extracted links from the page
puts result.metadata # Status code, final URL, etc.
Usage
Basic Crawling
The simplest way to crawl a URL:
result = RubyCrawl.crawl("https://example.com")
# Access the results
result.html # => "<html>...</html>"
result.markdown # => "# Example Domain\n\nThis domain is..." (lazy-loaded)
result.links # => [{ "url" => "https://...", "text" => "More info" }, ...]
result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
result.text # => "" (coming soon)
Multi-Page Crawling
Crawl an entire site following links with BFS (breadth-first search):
# Crawl up to 100 pages, max 3 links deep
RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
# Each page is yielded as it's crawled (streaming)
puts "Crawled: #{page.url} (depth: #{page.depth})"
# Save to database
Page.create!(
url: page.url,
html: page.html,
markdown: page.markdown,
depth: page.depth
)
end
Multi-Page Options
| Option | Default | Description |
|---|---|---|
| max_pages | 50 | Maximum number of pages to crawl |
| max_depth | 3 | Maximum link depth from start URL |
| same_host_only | true | Only follow links on the same domain |
| wait_until | inherited | Page load strategy |
| block_resources | inherited | Block images/fonts/CSS |
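A sketch combining these options (the values are arbitrary, but every option name comes from the table above):
RubyCrawl.crawl_site(
  "https://example.com",
  max_pages: 200,
  max_depth: 2,
  same_host_only: true,
  wait_until: "domcontentloaded",
  block_resources: true
) do |page|
  puts "depth #{page.depth}: #{page.url}"
end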
Page Result Object
The block receives a PageResult with:
page.url # String: Final URL after redirects
page.html # String: Full HTML content
page.markdown # String: Lazy-converted Markdown
page.links # Array: URLs extracted from page
page.metadata # Hash: HTTP status, final URL, etc.
page.depth # Integer: Link depth from start URL
Configuration
Global Configuration
Set default options that apply to all crawls:
RubyCrawl.configure(
wait_until: "networkidle", # Wait until network is idle
block_resources: true # Block images, fonts, CSS for speed
)
# All subsequent crawls use these defaults
result = RubyCrawl.crawl("https://example.com")
Per-Request Options
Override defaults for specific requests:
# Use global defaults
result = RubyCrawl.crawl("https://example.com")
# Override for this request only
result = RubyCrawl.crawl(
"https://example.com",
wait_until: "domcontentloaded",
block_resources: false
)
Configuration Options
| Option | Values | Default | Description |
|---|---|---|---|
| wait_until | "load", "domcontentloaded", "networkidle" | "load" | When to consider the page loaded |
| block_resources | true, false | true | Block images, fonts, CSS, media for faster crawls |
Wait strategies explained:
- load — Wait for the load event (the default; good for static sites)
- domcontentloaded — Wait for DOM ready (fastest; does not wait for images or stylesheets)
- networkidle — Wait until there are no network requests for 500ms (slowest; best for SPAs)
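For example, you might pick the strategy per request depending on the kind of site (a small sketch; the URLs are placeholders):
# Static marketing page: DOM readiness is enough
pricing = RubyCrawl.crawl("https://example.com/pricing", wait_until: "domcontentloaded")
# Client-rendered SPA: wait for the network to go quiet before extracting
dashboard = RubyCrawl.crawl("https://app.example.com/", wait_until: "networkidle")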
Result Object
The crawl result is a RubyCrawl::Result object with these attributes:
result = RubyCrawl.crawl("https://example.com")
result.html # String: Raw HTML content from page
result.markdown # String: Markdown conversion (lazy-loaded on first access)
result.links # Array: Extracted links with url and text
result.text # String: Plain text (coming soon)
result.metadata # Hash: Comprehensive metadata (see below)
Links Format
result.links
# => [
# { "url" => "https://example.com/about", "text" => "About Us" },
# { "url" => "https://example.com/contact", "text" => "Contact" },
# ...
# ]
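Since links is a plain array of hashes, ordinary Enumerable methods work on it. For example, keeping only links that point back to the crawled host (a sketch; metadata keys are strings, as documented below):
require "uri"
# Host of the page we just crawled
base_host = URI(result.metadata["final_url"]).host
# Keep only links pointing back to the same host; skip anything that fails to parse
internal_links = result.links.select do |link|
  URI(link["url"]).host == base_host rescue false
end
internal_links.each { |link| puts "#{link["text"]} -> #{link["url"]}" }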
Markdown Conversion
Markdown is lazy-loaded — conversion only happens when you access .markdown:
result = RubyCrawl.crawl(url)
result.html # ✅ No overhead
result.markdown # ⬅️ Conversion happens here (first call only)
result.markdown # ✅ Cached, instant
Uses reverse_markdown with GitHub-flavored output.
Metadata Fields
The metadata hash includes HTTP and HTML metadata:
result.metadata
# => {
# "status" => 200, # HTTP status code
# "final_url" => "https://...", # Final URL after redirects
# "title" => "Page Title", # <title> tag
# "description" => "...", # Meta description
# "keywords" => "ruby, web", # Meta keywords
# "author" => "Author Name", # Meta author
# "og_title" => "...", # Open Graph title
# "og_description" => "...", # Open Graph description
# "og_image" => "https://...", # Open Graph image
# "og_url" => "https://...", # Open Graph URL
# "og_type" => "website", # Open Graph type
# "twitter_card" => "summary", # Twitter card type
# "twitter_title" => "...", # Twitter title
# "twitter_description" => "...", # Twitter description
# "twitter_image" => "https://...",# Twitter image
# "canonical" => "https://...", # Canonical URL
# "lang" => "en", # Page language
# "charset" => "UTF-8" # Character encoding
# }
Note: Any of the HTML metadata fields may be nil if the corresponding tag is not present on the page.
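Because of that, it is worth being defensive when reading these fields. A small sketch that prefers Open Graph values and falls back to the plain HTML tags:
meta = result.metadata
# Prefer Open Graph data, fall back to standard tags, then to the URL itself
title       = meta["og_title"]       || meta["title"]       || meta["final_url"]
description = meta["og_description"] || meta["description"] || ""
image       = meta["og_image"]       || meta["twitter_image"]
puts "#{title} (HTTP #{meta["status"]})"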
Error Handling
RubyCrawl provides specific exception classes for different error scenarios:
begin
result = RubyCrawl.crawl(url)
rescue RubyCrawl::ConfigurationError => e
# Invalid URL or configuration
puts "Configuration error: #{e.message}"
rescue RubyCrawl::TimeoutError => e
# Page load timeout or network timeout
puts "Timeout: #{e.message}"
rescue RubyCrawl::NavigationError => e
# Page navigation failed (404, DNS error, SSL error, etc.)
puts "Navigation failed: #{e.message}"
rescue RubyCrawl::ServiceError => e
# Node service unavailable or crashed
puts "Service error: #{e.message}"
rescue RubyCrawl::Error => e
# Catch-all for any RubyCrawl error
puts "Crawl error: #{e.message}"
end
Exception Hierarchy:
- RubyCrawl::Error (base class)
  - RubyCrawl::ConfigurationError - Invalid URL or configuration
  - RubyCrawl::TimeoutError - Timeout during crawl
  - RubyCrawl::NavigationError - Page navigation failed
  - RubyCrawl::ServiceError - Node service issues
Automatic Retry: RubyCrawl automatically retries transient failures (service errors, timeouts) up to 3 times with exponential backoff (2s, 4s, 8s). Configure with:
RubyCrawl.configure(max_retries: 5)
# or per-request
RubyCrawl.crawl(url, retries: 1) # Disable retry
Rails Integration
Installation
Run the installer in your Rails app:
bundle exec rake rubycrawl:install
This creates config/initializers/rubycrawl.rb:
# frozen_string_literal: true
# rubycrawl default configuration
RubyCrawl.configure(
wait_until: "load",
block_resources: true
)
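If you want different defaults per environment, the initializer is a natural place for that (a sketch; only the two documented options are used):
# config/initializers/rubycrawl.rb
RubyCrawl.configure(
  # SPAs in production often need the slower networkidle strategy
  wait_until: Rails.env.production? ? "networkidle" : "load",
  block_resources: true
)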
Usage in Rails
# In a controller, service, or background job
class ContentScraperJob < ApplicationJob
def perform(url)
result = RubyCrawl.crawl(url)
# Save to database
ScrapedContent.create!(
url: url,
html: result.html,
status: result.metadata["status"]
)
end
end
Production Deployment
Pre-deployment Checklist
- Install Node.js on your production servers (LTS version recommended)
- Run the installer during deployment:
  bundle exec rake rubycrawl:install
- Set environment variables (optional):
  export RUBYCRAWL_NODE_BIN=/usr/bin/node          # Custom Node.js path
  export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log # Service logs
Docker Example
FROM ruby:3.2
# Install Node.js LTS
RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
&& apt-get install -y nodejs
# Install system dependencies for Playwright
RUN npx playwright install-deps
WORKDIR /app
COPY Gemfile* ./
RUN bundle install
# Install Playwright browsers
RUN bundle exec rake rubycrawl:install
COPY . .
CMD ["rails", "server"]
Heroku Deployment
Add the Node.js buildpack:
heroku buildpacks:add heroku/nodejs
heroku buildpacks:add heroku/ruby
Add to package.json in your Rails root:
{
"engines": {
"node": "18.x"
}
}
Performance Tips
- Reuse instances: Use the class-level RubyCrawl.crawl method (recommended) rather than creating new instances
- Resource blocking: Keep block_resources: true for 2-3x faster crawls when you don't need images/CSS
- Concurrency: Use background jobs (Sidekiq, etc.) for parallel crawling
- Browser reuse: The first crawl is slower due to browser launch; subsequent crawls reuse the process
Architecture
RubyCrawl uses a dual-process architecture: your Ruby process talks to a long-running, bundled Node.js service, which drives Playwright and the browsers.
Why this architecture?
- Separation of concerns: Ruby handles orchestration, Node handles browsers
- Stability: Playwright's official Node.js bindings are most reliable
- Performance: Long-running browser process, reused across requests
- Simplicity: No C extensions, pure Ruby + bundled Node service
See .github/copilot-instructions.md for detailed architecture documentation.
Performance
Benchmarks
Typical crawl times (M1 Mac, fast network):
| Page Type | First Crawl | Subsequent | Config |
|---|---|---|---|
| Static HTML | ~2s | ~500ms | block_resources: true |
| SPA (React) | ~3s | ~1.2s | wait_until: "networkidle" |
| Heavy site | ~4s | ~2s | block_resources: false |
Note: First crawl includes browser launch time (~1.5s). Subsequent crawls reuse the browser.
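Numbers like these depend heavily on hardware, network, and the target site. To reproduce the first-versus-subsequent effect on your own setup, Ruby's Benchmark module is enough (a sketch; example.com is a stand-in):
require "benchmark"
require "rubycrawl"
3.times do |i|
  elapsed = Benchmark.realtime { RubyCrawl.crawl("https://example.com") }
  puts "crawl #{i + 1}: #{(elapsed * 1000).round} ms"
end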
Optimization Tips
- Enable resource blocking for content-only extraction:
  RubyCrawl.configure(block_resources: true)
- Use the appropriate wait strategy:
  - Static sites: wait_until: "load"
  - SPAs: wait_until: "networkidle"
- Batch processing: Use background jobs for concurrent crawling (a job sketch follows):
  urls.each { |url| CrawlJob.perform_later(url) }
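CrawlJob here is not provided by the gem; it is whatever job class your app defines. A minimal ActiveJob sketch that pairs with that loop (ScrapedContent is the same hypothetical model used in the Rails example above):
class CrawlJob < ApplicationJob
  queue_as :default

  def perform(url)
    result = RubyCrawl.crawl(url)

    ScrapedContent.create!(
      url: url,
      html: result.html,
      markdown: result.markdown
    )
  rescue RubyCrawl::Error => e
    # Let failures surface in logs; RubyCrawl already retried transient errors internally
    Rails.logger.warn("Crawl failed for #{url}: #{e.message}")
  end
end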
Development
Setup
git clone [email protected]:craft-wise/rubycrawl.git
cd rubycrawl
bin/setup # Installs dependencies and sets up Node service
Running Tests
bundle exec rspec
Manual Testing
# Terminal 1: Start Node service manually (optional)
cd node
npm start
# Terminal 2: Ruby console
bin/console
> result = RubyCrawl.crawl("https://example.com")
> puts result.html
Project Structure
rubycrawl/
Roadmap
Current (v0.1.0)
- [x] HTML extraction
- [x] Link extraction
- [x] Markdown conversion (lazy-loaded)
- [x] Multi-page crawling with BFS
- [x] URL normalization and deduplication
- [x] Basic metadata (status, final URL)
- [x] Resource blocking
- [x] Rails integration
Coming Soon
- [ ] Plain text extraction
- [ ] Screenshot capture
- [ ] Custom JavaScript execution
- [ ] Session/cookie support
- [ ] Proxy support
- [ ] Robots.txt support
Contributing
Contributions are welcome! Please read our contribution guidelines first.
Development Philosophy
- Simplicity over cleverness: Prefer clear, explicit code
- Stability over speed: Correctness first, optimization second
- Ruby-first: Hide Node.js/Playwright complexity from users
- No vendor lock-in: Pure open source, no SaaS dependencies
Comparison with crawl4ai
| Feature | crawl4ai (Python) | rubycrawl (Ruby) |
|---|---|---|
| Browser automation | Playwright | Playwright |
| Language | Python | Ruby |
| LLM extraction | ✅ | Planned |
| Markdown extraction | ✅ | ✅ |
| Link extraction | ✅ | ✅ |
| Multi-page crawling | ✅ | ✅ |
| Rails integration | N/A | ✅ |
| Resource blocking | ✅ | ✅ |
| Session management | ✅ | Planned |
RubyCrawl aims to bring the same level of accuracy and reliability to the Ruby ecosystem.
License
The gem is available as open source under the terms of the MIT License.
Credits
Inspired by crawl4ai by @unclecode.
Built with Playwright by Microsoft.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]