RubyCrawl 🎭
Production-ready web crawler for Ruby powered by Ferrum — Full JavaScript rendering via Chrome DevTools Protocol, with first-class Rails support and no Node.js dependency.
RubyCrawl provides accurate, JavaScript-enabled web scraping using a pure Ruby browser automation stack. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
Why RubyCrawl?
- ✅ Real browser — Handles JavaScript, AJAX, and SPAs correctly
- ✅ Pure Ruby — No Node.js, no npm, no external processes to manage
- ✅ Zero config — Works out of the box, no Ferrum knowledge needed
- ✅ Production-ready — Auto-retry, error handling, resource optimization
- ✅ Multi-page crawling — BFS algorithm with smart URL deduplication
- ✅ Rails-friendly — Generators, initializers, and ActiveJob integration
- ✅ Readability-powered — Mozilla Readability.js for article-quality extraction, heuristic fallback for all other pages
# One line to crawl any JavaScript-heavy site
result = RubyCrawl.crawl("https://docs.example.com")
result.html # Full HTML with JS rendered
result.clean_text # Noise-stripped plain text (no nav/footer/ads)
result.clean_markdown # Markdown ready for RAG pipelines
result.links # All links with url, text, title, rel
result. # Title, description, OG tags, etc.
Features
- Pure Ruby: Ferrum drives Chromium directly via CDP — no Node.js or npm required
- Production-ready: Designed for Rails apps with auto-retry and exponential backoff
- Simple API: Clean Ruby interface — zero Ferrum or CDP knowledge required
- Resource optimization: Built-in resource blocking for 2-3x faster crawls
- Auto-managed browsers: Lazy Chrome singleton, isolated page per crawl
- Content extraction: Mozilla Readability.js (primary) + link-density heuristic (fallback) — article-quality
clean_html,clean_text,clean_markdown, links, metadata - Multi-page crawling: BFS crawler with configurable depth limits and URL deduplication
- Smart URL handling: Automatic normalization, tracking parameter removal, same-host filtering
- Rails integration: First-class Rails support with generators and initializers
Table of Contents
- Installation
- Quick Start
- Use Cases
- Usage
- Rails Integration
- Production Deployment
- Architecture
- Performance
- Development
- Contributing
- License
Installation
Requirements
- Ruby >= 3.0
- Chrome or Chromium — managed automatically by Ferrum (downloaded on first use)
Add to Gemfile
gem "rubycrawl"
Then install:
bundle install
Install Chrome
Ferrum manages Chrome automatically. Run the install task to verify Chrome is available and generate a Rails initializer:
bundle exec rake rubycrawl:install
This command:
- ✅ Checks for Chrome/Chromium in your PATH
- ✅ Creates a Rails initializer (if using Rails)
Note: If Chrome is not in your PATH, install it via your system package manager or download from google.com/chrome.
Quick Start
require "rubycrawl"
# Simple crawl
result = RubyCrawl.crawl("https://example.com")
# Access extracted content
result.final_url # Final URL after redirects
result.clean_text # Noise-stripped plain text (no nav/footer/ads)
result.clean_html # Noise-stripped HTML (same noise removed as clean_text)
result.raw_text # Full body.innerText (unfiltered)
result.html # Full raw HTML content
result.links # Extracted links with url, text, title, rel
result. # Title, description, OG tags, etc.
result.['extractor'] # "readability" or "heuristic" — which extractor ran
result.clean_markdown # Markdown converted from clean_html (lazy — first access only)
Use Cases
RubyCrawl is perfect for:
- RAG applications: Build knowledge bases for LLM/AI applications by crawling documentation sites
- Data aggregation: Crawl product catalogs, job listings, or news articles
- SEO analysis: Extract metadata, links, and content structure
- Content migration: Convert existing sites to Markdown for static site generators
- Documentation scraping: Create local copies of documentation with preserved links
Usage
Basic Crawling
result = RubyCrawl.crawl("https://example.com")
result.html # => "<html>...</html>"
result.clean_text # => "Example Domain\n\nThis domain is..." (no nav/ads)
result.raw_text # => "Example Domain\nThis domain is..." (full body text)
result. # => { "final_url" => "https://example.com", "title" => "..." }
Multi-Page Crawling
Crawl an entire site following links with BFS (breadth-first search):
# Crawl up to 100 pages, max 3 links deep
RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
# Each page is yielded as it's crawled (streaming)
puts "Crawled: #{page.url} (depth: #{page.depth})"
# Save to database
Page.create!(
url: page.url,
html: page.html,
markdown: page.clean_markdown,
depth: page.depth
)
end
Real-world example: Building a RAG knowledge base
require "rubycrawl"
RubyCrawl.configure(
wait_until: "networkidle", # Ensure JS content loads
block_resources: true # Skip images/fonts for speed
)
pages_crawled = RubyCrawl.crawl_site(
"https://docs.example.com",
max_pages: 500,
max_depth: 5,
same_host_only: true
) do |page|
VectorDB.upsert(
id: Digest::SHA256.hexdigest(page.url),
content: page.clean_markdown,
metadata: {
url: page.url,
title: page.["title"],
depth: page.depth
}
)
end
puts "Indexed #{pages_crawled} pages"
Multi-Page Options
| Option | Default | Description |
|---|---|---|
max_pages |
50 | Maximum number of pages to crawl |
max_depth |
3 | Maximum link depth from start URL |
same_host_only |
true | Only follow links on the same domain |
wait_until |
inherited | Page load strategy |
block_resources |
inherited | Block images/fonts/CSS |
respect_robots_txt |
false | Honour robots.txt rules and auto-sleep Crawl-delay |
robots.txt Support
When respect_robots_txt: true, RubyCrawl fetches robots.txt once at the start of the crawl and:
- Skips any URL disallowed for
User-agent: * - Automatically sleeps the
Crawl-delayspecified in robots.txt between pages
RubyCrawl.crawl_site("https://example.com",
respect_robots_txt: true,
max_pages: 100
) do |page|
puts page.url
end
Or enable globally:
RubyCrawl.configure(respect_robots_txt: true)
If robots.txt is unreachable or missing, crawling proceeds normally (fail open).
Page Result Object
The block receives a PageResult with:
page.url # String: Final URL after redirects
page.html # String: Full raw HTML content
page.clean_html # String: Noise-stripped HTML (no nav/header/footer/ads)
page.clean_text # String: Noise-stripped plain text (derived from clean_html)
page.raw_text # String: Full body.innerText (unfiltered)
page.clean_markdown # String: Lazy-converted Markdown from clean_html
page.links # Array: URLs extracted from page
page. # Hash: final_url, title, OG tags, etc.
page.depth # Integer: Link depth from start URL
Configuration
Global Configuration
RubyCrawl.configure(
wait_until: "networkidle",
block_resources: true,
timeout: 60,
headless: true
)
# All subsequent crawls use these defaults
result = RubyCrawl.crawl("https://example.com")
Per-Request Options
# Use global defaults
result = RubyCrawl.crawl("https://example.com")
# Override for this request only
result = RubyCrawl.crawl(
"https://example.com",
wait_until: "domcontentloaded",
block_resources: false
)
Configuration Options
| Option | Values | Default | Description |
|---|---|---|---|
wait_until |
"load", "domcontentloaded", "networkidle", "commit" |
nil |
When to consider page loaded (nil = Ferrum default) |
block_resources |
true, false |
nil |
Block images, fonts, CSS, media for faster crawls |
max_attempts |
Integer | 3 |
Total number of attempts (including the first) |
timeout |
Integer (seconds) | 30 |
Browser navigation timeout |
headless |
true, false |
true |
Run Chrome headlessly |
respect_robots_txt |
true, false |
false |
Honour robots.txt rules and auto-sleep Crawl-delay |
Wait strategies explained:
load— Wait for the load event (good for static sites)domcontentloaded— Wait for DOM ready (faster)networkidle— Wait until no network requests for 500ms (best for SPAs)commit— Wait until the first response bytes are received (fastest)
Result Object
result = RubyCrawl.crawl("https://example.com")
result.html # String: Full raw HTML
result.clean_html # String: Noise-stripped HTML (nav/header/footer/ads removed)
result.clean_text # String: Plain text derived from clean_html — ideal for RAG
result.raw_text # String: Full body.innerText (unfiltered)
result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
result.links # Array: Extracted links with url/text/title/rel
result. # Hash: See below
result.final_url # String: Shortcut for metadata['final_url']
Links Format
result.links
# => [
# { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
# { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
# ]
URLs are automatically resolved to absolute form by the browser.
Markdown Conversion
Markdown is lazy — conversion only happens on first access of .clean_markdown:
result.clean_html # ✅ Already available, no overhead
result.clean_markdown # Converts clean_html → Markdown here (first call only)
result.clean_markdown # ✅ Cached, instant on subsequent calls
Uses reverse_markdown with GitHub-flavored output.
Metadata Fields
result.
# => {
# "final_url" => "https://example.com",
# "title" => "Page Title",
# "description" => "...",
# "keywords" => "ruby, web",
# "author" => "Author Name",
# "og_title" => "...",
# "og_description" => "...",
# "og_image" => "https://...",
# "og_url" => "https://...",
# "og_type" => "website",
# "twitter_card" => "summary",
# "twitter_title" => "...",
# "twitter_description" => "...",
# "twitter_image" => "https://...",
# "canonical" => "https://...",
# "lang" => "en",
# "charset" => "UTF-8",
# "extractor" => "readability" # or "heuristic"
# }
Error Handling
begin
result = RubyCrawl.crawl(url)
rescue RubyCrawl::ConfigurationError => e
# Invalid URL or option value
rescue RubyCrawl::TimeoutError => e
# Page load timed out
rescue RubyCrawl::NavigationError => e
# Navigation failed (404, DNS error, SSL error)
rescue RubyCrawl::ServiceError => e
# Browser failed to start or crashed
rescue RubyCrawl::Error => e
# Catch-all for any RubyCrawl error
end
Exception Hierarchy:
RubyCrawl::Error
├── ConfigurationError — invalid URL or option value
├── TimeoutError — page load timed out
├── NavigationError — failed (HTTP error, DNS, SSL)
└── ServiceError — browser failed to start or crashed
Automatic Retry: ServiceError and TimeoutError are retried with exponential backoff. NavigationError and ConfigurationError are not retried (they won't succeed on retry).
RubyCrawl.configure(max_attempts: 5) # 5 total attempts
RubyCrawl.crawl(url, max_attempts: 1) # Disable retries
Rails Integration
Installation
bundle exec rake rubycrawl:install
This creates config/initializers/rubycrawl.rb:
RubyCrawl.configure(
wait_until: "load",
block_resources: true
)
Usage in Rails
Background Jobs with ActiveJob
class CrawlPageJob < ApplicationJob
queue_as :crawlers
retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
discard_on RubyCrawl::ConfigurationError
def perform(url)
result = RubyCrawl.crawl(url)
Page.create!(
url: result.final_url,
title: result.['title'],
content: result.clean_text,
markdown: result.clean_markdown,
crawled_at: Time.current
)
end
end
Multi-page RAG knowledge base:
class BuildKnowledgeBaseJob < ApplicationJob
queue_as :crawlers
def perform(documentation_url)
RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
= OpenAI.(page.clean_markdown)
Document.create!(
url: page.url,
title: page.['title'],
content: page.clean_markdown,
embedding: ,
depth: page.depth
)
end
end
end
Best Practices
- Use background jobs to avoid blocking web requests
- Configure retry logic based on error type
- Store
clean_markdownfor RAG applications (preserves heading structure for chunking) - Rate limit external crawling to be respectful
Production Deployment
Pre-deployment Checklist
- Ensure Chrome is installed on your production servers
- Run installer during deployment:
bash bundle exec rake rubycrawl:install
Docker Example
FROM ruby:3.2
# Install Chrome
RUN apt-get update && apt-get install -y \
chromium \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY Gemfile* ./
RUN bundle install
COPY . .
CMD ["rails", "server"]
Ferrum will detect chromium automatically. To specify a custom path:
RubyCrawl.configure(
browser_options: { "browser-path": "/usr/bin/chromium" }
)
Architecture
RubyCrawl uses a single-process architecture:
RubyCrawl (public API)
↓
Browser (lib/rubycrawl/browser.rb) ← Ferrum wrapper
↓
Ferrum::Browser ← Chrome DevTools Protocol (pure Ruby)
↓
Chromium ← headless browser
↓
Readability.js → heuristic fallback ← content extraction (inside browser)
- Chrome launches once lazily and is reused across all crawls
- Each crawl gets an isolated page context (own cookies/storage)
- Content extraction runs inside the browser via
page.evaluate():- Primary: Mozilla Readability.js — article-quality extraction for blogs, docs, news
- Fallback: link-density heuristic — covers marketing pages, homepages, SPAs
result.metadata['extractor']tells you which path was used ("readability"or"heuristic")- No separate processes, no HTTP boundary, no Node.js
Performance
- Resource blocking: Keep
block_resources: true(default: nil) to skip images/fonts/CSS for 2-3x faster crawls - Wait strategy: Use
wait_until: "load"for static sites,"networkidle"for SPAs - Browser reuse: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)
Parallelism
RubyCrawl does not support parallel page loading within a single process — Ferrum uses one Chrome instance and concurrent access is not thread-safe.
The recommended pattern is job-level parallelism: each background job gets its own RubyCrawl instance and Chrome process, with natural rate limiting via your job queue's concurrency setting:
# Enqueue independent crawls — each job runs its own Chrome
urls.each { |url| CrawlJob.perform_later(url) }
# Control concurrency via your queue worker config (Sidekiq, GoodJob, etc.)
# e.g. Sidekiq concurrency: 3 → 3 Chrome processes crawling in parallel
This also works naturally with respect_robots_txt: true — each job respects Crawl-delay independently.
Development
git clone [email protected]:craft-wise/rubycrawl.git
cd rubycrawl
bin/setup
# Run all tests (Chrome required — installed as a gem dependency)
bundle exec rspec
# Manual testing
bin/console
> RubyCrawl.crawl("https://example.com")
> RubyCrawl.crawl("https://example.com").clean_text
> RubyCrawl.crawl("https://example.com").clean_markdown
Contributing
Contributions are welcome! Please read our contribution guidelines first.
- Simplicity over cleverness: Prefer clear, explicit code
- Stability over speed: Correctness first, optimization second
- Hide complexity: Users should never need to know Ferrum exists
License
The gem is available as open source under the terms of the MIT License.
Credits
Built with Ferrum — pure Ruby Chrome DevTools Protocol client.
Content extraction powered by Mozilla Readability.js — the algorithm behind Firefox Reader View.
Markdown conversion powered by reverse_markdown for GitHub-flavored output.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]