RubyCrawl 🎭

Production-ready web crawler for Ruby, powered by Ferrum. It offers full JavaScript rendering via the Chrome DevTools Protocol, first-class Rails support, and no Node.js dependency.

RubyCrawl provides accurate, JavaScript-enabled web scraping using a pure Ruby browser automation stack. It is well suited to extracting content from modern SPAs and dynamic websites, and to building RAG knowledge bases.

Why RubyCrawl?

✅ Real browser — Handles JavaScript, AJAX, and SPAs correctly
✅ Pure Ruby — No Node.js, no npm, no external processes to manage
✅ Zero config — Works out of the box, no Ferrum knowledge needed
✅ Production-ready — Auto-retry, error handling, resource optimization
✅ Multi-page crawling — BFS algorithm with smart URL deduplication
✅ Rails-friendly — Generators, initializers, and ActiveJob integration
✅ Readability-powered — Mozilla Readability.js for article-quality extraction, heuristic fallback for all other pages

# One line to crawl any JavaScript-heavy site
result = RubyCrawl.crawl("https://docs.example.com")

result.html           # Full HTML with JS rendered
result.clean_text     # Noise-stripped plain text (no nav/footer/ads)
result.clean_markdown # Markdown ready for RAG pipelines
result.links          # All links with url, text, title, rel
result.metadata       # Title, description, OG tags, etc.

Features

Pure Ruby: Ferrum drives Chromium directly via CDP — no Node.js or npm required
Production-ready: Designed for Rails apps with auto-retry and exponential backoff
Simple API: Clean Ruby interface — zero Ferrum or CDP knowledge required
Resource optimization: Built-in resource blocking for 2-3x faster crawls
Auto-managed browsers: Lazy Chrome singleton, isolated page per crawl
Content extraction: Mozilla Readability.js (primary) + link-density heuristic (fallback) — article-quality clean_html, clean_text, clean_markdown, links, metadata
Multi-page crawling: BFS crawler with configurable depth limits and URL deduplication
Smart URL handling: Automatic normalization, tracking parameter removal, same-host filtering
Rails integration: First-class Rails support with generators and initializers

Table of Contents

Installation
Quick Start
Use Cases
Usage
Rails Integration
Production Deployment
Architecture
Performance
Development
Contributing
License

Installation

Requirements

Ruby >= 3.0
Chrome or Chromium installed on the machine (Ferrum detects and launches it automatically)

Add to Gemfile

gem "rubycrawl"

Then install:

bundle install

Install Chrome

Ferrum launches and manages the Chrome process for you, but the browser itself must be installed. Run the install task to verify that Chrome is available and to generate a Rails initializer:

bundle exec rake rubycrawl:install

This command:

✅ Checks for Chrome/Chromium in your PATH
✅ Creates a Rails initializer (if using Rails)

Note: If Chrome is not in your PATH, install it via your system package manager or download from google.com/chrome.

Quick Start

require "rubycrawl"

# Simple crawl
result = RubyCrawl.crawl("https://example.com")

# Access extracted content
result.final_url                   # Final URL after redirects
result.clean_text                  # Noise-stripped plain text (no nav/footer/ads)
result.clean_html                  # Noise-stripped HTML (same noise removed as clean_text)
result.raw_text                    # Full body.innerText (unfiltered)
result.html                        # Full raw HTML content
result.links                       # Extracted links with url, text, title, rel
result.metadata                    # Title, description, OG tags, etc.
result.metadata['extractor']       # "readability" or "heuristic" — which extractor ran
result.clean_markdown              # Markdown converted from clean_html (lazy — first access only)

Use Cases

RubyCrawl is perfect for:

RAG applications: Build knowledge bases for LLM/AI applications by crawling documentation sites
Data aggregation: Crawl product catalogs, job listings, or news articles
SEO analysis: Extract metadata, links, and content structure
Content migration: Convert existing sites to Markdown for static site generators
Documentation scraping: Create local copies of documentation with preserved links

Usage

Basic Crawling

result = RubyCrawl.crawl("https://example.com")

result.html           # => "<html>...</html>"
result.clean_text     # => "Example Domain\n\nThis domain is..." (no nav/ads)
result.raw_text       # => "Example Domain\nThis domain is..." (full body text)
result.metadata       # => { "final_url" => "https://example.com", "title" => "..." }

Multi-Page Crawling

Crawl an entire site following links with BFS (breadth-first search):

# Crawl up to 100 pages, max 3 links deep
RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
  # Each page is yielded as it's crawled (streaming)
  puts "Crawled: #{page.url} (depth: #{page.depth})"

  # Save to database
  Page.create!(
    url:      page.url,
    html:     page.html,
    markdown: page.clean_markdown,
    depth:    page.depth
  )
end

Real-world example: Building a RAG knowledge base

require "rubycrawl"

RubyCrawl.configure(
  wait_until: "networkidle",  # Ensure JS content loads
  block_resources: true       # Skip images/fonts for speed
)

pages_crawled = RubyCrawl.crawl_site(
  "https://docs.example.com",
  max_pages: 500,
  max_depth: 5,
  same_host_only: true
) do |page|
  VectorDB.upsert(
    id:       Digest::SHA256.hexdigest(page.url),
    content:  page.clean_markdown,
    metadata: {
      url:   page.url,
      title: page.metadata["title"],
      depth: page.depth
    }
  )
end

puts "Indexed #{pages_crawled} pages"

Multi-Page Options

Option	Default	Description
`max_pages`	50	Maximum number of pages to crawl
`max_depth`	3	Maximum link depth from start URL
`same_host_only`	true	Only follow links on the same domain
`wait_until`	inherited	Page load strategy
`block_resources`	inherited	Block images/fonts/CSS
`respect_robots_txt`	false	Honour robots.txt rules and auto-sleep `Crawl-delay`
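
To illustrate how max_pages, max_depth, and same_host_only interact during a breadth-first crawl, here is a minimal sketch over an in-memory link graph. It is not RubyCrawl's internal code: bfs_crawl, the graph argument, and the sample URLs are hypothetical, and no network access takes place.

```ruby
require "set"
require "uri"

# Breadth-first traversal with a visited set, a depth cap, and a page cap.
# `graph` maps each URL to the links found on that page (hypothetical data).
# Returns [url, depth] pairs in the order the pages were visited.
def bfs_crawl(graph, start_url, max_pages: 50, max_depth: 3, same_host_only: true)
  start_host = URI(start_url).host
  seen    = Set[start_url]
  queue   = [[start_url, 0]]
  visited = []

  until queue.empty? || visited.size >= max_pages
    url, depth = queue.shift
    visited << [url, depth]
    next if depth >= max_depth # do not expand links beyond the depth limit

    graph.fetch(url, []).each do |link|
      next if same_host_only && URI(link).host != start_host
      queue << [link, depth + 1] if seen.add?(link) # add? returns nil for duplicates
    end
  end

  visited
end

site = {
  "https://example.com/"  => ["https://example.com/a", "https://example.com/b", "https://other.test/"],
  "https://example.com/a" => ["https://example.com/b", "https://example.com/c"],
  "https://example.com/b" => ["https://example.com/"],
  "https://example.com/c" => ["https://example.com/d"]
}

bfs_crawl(site, "https://example.com/", max_depth: 2).each do |url, depth|
  puts "#{depth}: #{url}"
end
```

Because the queue is first-in, first-out, every page at depth N is visited before any page at depth N + 1, so a page cap trims the deepest pages first.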

robots.txt Support

When respect_robots_txt: true, RubyCrawl fetches robots.txt once at the start of the crawl and:

Skips any URL disallowed for User-agent: *
Automatically sleeps the Crawl-delay specified in robots.txt between pages

RubyCrawl.crawl_site("https://example.com",
  respect_robots_txt: true,
  max_pages: 100
) do |page|
  puts page.url
end

Or enable globally:

RubyCrawl.configure(respect_robots_txt: true)

If robots.txt is unreachable or missing, crawling proceeds normally (fail open).
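
The behaviour described above can be sketched in plain Ruby. This is an illustration under simplifying assumptions, not RubyCrawl's parser: parse_robots and allowed? are hypothetical names, only the User-agent: * group is read, and Allow rules and wildcard patterns are ignored. Passing nil (robots.txt missing) yields no rules, which mirrors the fail-open behaviour.

```ruby
# Collect Disallow prefixes and Crawl-delay for the "User-agent: *" group.
def parse_robots(body)
  rules = { disallow: [], crawl_delay: nil }
  in_star_group = false
  prev_agent    = false

  body.to_s.each_line do |line|
    line = line.sub(/#.*/, "").strip # drop comments and surrounding whitespace
    next if line.empty?

    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?

    case field.downcase
    when "user-agent"
      in_star_group = false unless prev_agent # a new group starts here
      in_star_group ||= (value == "*")
      prev_agent = true
      next
    when "disallow"
      rules[:disallow] << value if in_star_group && !value.empty?
    when "crawl-delay"
      rules[:crawl_delay] = value.to_f if in_star_group
    end
    prev_agent = false
  end

  rules
end

# A path is allowed unless it starts with one of the Disallow prefixes.
def allowed?(rules, path)
  rules[:disallow].none? { |prefix| path.start_with?(prefix) }
end

rules = parse_robots("User-agent: *\nDisallow: /private\nCrawl-delay: 2\n")
puts allowed?(rules, "/docs")      # path outside the disallowed prefix
puts allowed?(rules, "/private/x") # path under the disallowed prefix
```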

Page Result Object

The block receives a PageResult with:

page.url            # String: Final URL after redirects
page.html           # String: Full raw HTML content
page.clean_html     # String: Noise-stripped HTML (no nav/header/footer/ads)
page.clean_text     # String: Noise-stripped plain text (derived from clean_html)
page.raw_text       # String: Full body.innerText (unfiltered)
page.clean_markdown # String: Lazy-converted Markdown from clean_html
page.links          # Array: Extracted links (url, text, title, rel)
page.metadata       # Hash: final_url, title, OG tags, etc.
page.depth          # Integer: Link depth from start URL

Configuration

Global Configuration

RubyCrawl.configure(
  wait_until:      "networkidle",
  block_resources: true,
  timeout:         60,
  headless:        true
)

# All subsequent crawls use these defaults
result = RubyCrawl.crawl("https://example.com")

Per-Request Options

# Use global defaults
result = RubyCrawl.crawl("https://example.com")

# Override for this request only
result = RubyCrawl.crawl(
  "https://example.com",
  wait_until:      "domcontentloaded",
  block_resources: false
)

Configuration Options

Option	Values	Default	Description
`wait_until`	`"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"`	`nil`	When to consider page loaded (nil = Ferrum default)
`block_resources`	`true`, `false`	`nil`	Block images, fonts, CSS, media for faster crawls
`max_attempts`	Integer	`3`	Total number of attempts (including the first)
`timeout`	Integer (seconds)	`30`	Browser navigation timeout
`headless`	`true`, `false`	`true`	Run Chrome headlessly
`respect_robots_txt`	`true`, `false`	`false`	Honour robots.txt rules and auto-sleep Crawl-delay

Wait strategies explained:

load — Wait for the load event (good for static sites)
domcontentloaded — Wait for DOM ready (faster)
networkidle — Wait until no network requests for 500ms (best for SPAs)
commit — Wait until the first response bytes are received (fastest)
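
As a rough model of how per-request options layer over global configuration, the sketch below merges three hashes and validates wait_until. The default values come from the table above; resolve_options, DEFAULTS, and the local ConfigurationError class are hypothetical stand-ins written for this example, not RubyCrawl internals.

```ruby
# Stand-in for RubyCrawl::ConfigurationError so the sketch runs on its own.
class ConfigurationError < StandardError; end

WAIT_STRATEGIES = %w[load domcontentloaded networkidle commit].freeze
DEFAULTS = { wait_until: nil, block_resources: nil, max_attempts: 3,
             timeout: 30, headless: true, respect_robots_txt: false }.freeze

# Later hashes win: built-in defaults < global config < per-request overrides.
def resolve_options(global, overrides = {})
  options = DEFAULTS.merge(global).merge(overrides)

  unknown = options.keys - DEFAULTS.keys
  raise ConfigurationError, "unknown option(s): #{unknown.join(', ')}" unless unknown.empty?

  wait = options[:wait_until]
  unless wait.nil? || WAIT_STRATEGIES.include?(wait)
    raise ConfigurationError, "wait_until must be one of #{WAIT_STRATEGIES.join(', ')}"
  end

  options
end

global = { wait_until: "networkidle", timeout: 60 }
p resolve_options(global, { wait_until: "domcontentloaded" })
```

The override changes wait_until for that one call only; timeout keeps the global value and every other option falls back to its default.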

Result Object

result = RubyCrawl.crawl("https://example.com")

result.html           # String: Full raw HTML
result.clean_html     # String: Noise-stripped HTML (nav/header/footer/ads removed)
result.clean_text     # String: Plain text derived from clean_html — ideal for RAG
result.raw_text       # String: Full body.innerText (unfiltered)
result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
result.links          # Array: Extracted links with url/text/title/rel
result.metadata       # Hash: See below
result.final_url      # String: Shortcut for metadata['final_url']

Links Format

result.links
# => [
#   { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
#   { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
# ]

URLs are automatically resolved to absolute form by the browser.
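
Since each link is a plain hash, the usual Enumerable methods apply. The sketch below uses sample data in the shape shown above to keep only same-host links that are not marked nofollow; followable is a hypothetical helper written for this example, not part of the gem.

```ruby
require "uri"

# Sample data in the documented shape (the third entry is made up).
sample_links = [
  { "url" => "https://example.com/about",   "text" => "About",   "title" => nil, "rel" => nil },
  { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
  { "url" => "https://other.test/page",     "text" => "Partner", "title" => nil, "rel" => nil }
]

# Keep links on the given host whose rel attribute does not include "nofollow".
# rel may hold several space-separated tokens, so it is split before checking.
def followable(links, host)
  links.select do |link|
    rel_tokens = link["rel"].to_s.downcase.split
    URI(link["url"]).host == host && !rel_tokens.include?("nofollow")
  end
end

followable(sample_links, "example.com").each { |link| puts link["url"] }
```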

Markdown Conversion

Markdown is lazy — conversion only happens on first access of .clean_markdown:

result.clean_html     # ✅ Already available, no overhead
result.clean_markdown # Converts clean_html → Markdown here (first call only)
result.clean_markdown # ✅ Cached, instant on subsequent calls

Uses reverse_markdown with GitHub-flavored output.
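
The lazy, cached behaviour corresponds to ordinary Ruby memoization. The sketch below shows the pattern with a hypothetical LazyResult class and an injected converter lambda standing in for the real Markdown conversion, so that it runs without any gems; it is not the gem's actual result class.

```ruby
# Hypothetical result object: Markdown is computed on first access and cached.
class LazyResult
  attr_reader :clean_html, :conversions

  # `converter` is any callable that turns HTML into Markdown.
  def initialize(clean_html, converter)
    @clean_html  = clean_html
    @converter   = converter
    @conversions = 0 # counts how often the conversion actually ran
  end

  def clean_markdown
    @clean_markdown ||= begin
      @conversions += 1
      @converter.call(@clean_html)
    end
  end
end

# Toy converter for the demo: strips tags and prefixes a heading marker.
to_markdown = ->(html) { "# " + html.gsub(/<[^>]+>/, "") }

result = LazyResult.new("<h1>Hello</h1>", to_markdown)
result.clean_markdown # conversion runs here
result.clean_markdown # served from the cached instance variable
puts result.conversions
```

If a crawl only stores clean_text, the conversion cost is never paid.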

Metadata Fields

result.metadata
# => {
#   "final_url"           => "https://example.com",
#   "title"               => "Page Title",
#   "description"         => "...",
#   "keywords"            => "ruby, web",
#   "author"              => "Author Name",
#   "og_title"            => "...",
#   "og_description"      => "...",
#   "og_image"            => "https://...",
#   "og_url"              => "https://...",
#   "og_type"             => "website",
#   "twitter_card"        => "summary",
#   "twitter_title"       => "...",
#   "twitter_description" => "...",
#   "twitter_image"       => "https://...",
#   "canonical"           => "https://...",
#   "lang"                => "en",
#   "charset"             => "UTF-8",
#   "extractor"           => "readability"  # or "heuristic"
# }

Error Handling

begin
  result = RubyCrawl.crawl(url)
rescue RubyCrawl::ConfigurationError => e
  # Invalid URL or option value
rescue RubyCrawl::TimeoutError => e
  # Page load timed out
rescue RubyCrawl::NavigationError => e
  # Navigation failed (404, DNS error, SSL error)
rescue RubyCrawl::ServiceError => e
  # Browser failed to start or crashed
rescue RubyCrawl::Error => e
  # Catch-all for any RubyCrawl error
end

Exception Hierarchy:

RubyCrawl::Error
  ├── ConfigurationError  — invalid URL or option value
  ├── TimeoutError        — page load timed out
  ├── NavigationError     — navigation failed (HTTP error, DNS, SSL)
  └── ServiceError        — browser failed to start or crashed

Automatic Retry: ServiceError and TimeoutError are retried with exponential backoff. NavigationError and ConfigurationError are not retried (they won't succeed on retry).

RubyCrawl.configure(max_attempts: 5)     # 5 total attempts
RubyCrawl.crawl(url, max_attempts: 1)    # Disable retries
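
For readers curious about what retrying with exponential backoff looks like, here is a self-contained sketch. The CrawlSketch module, its error classes, the 0.5 second base delay, and the doubling factor are all assumptions made for illustration; the actual delays RubyCrawl uses are not documented here.

```ruby
module CrawlSketch
  # Stand-ins for the RubyCrawl error classes so the sketch runs on its own.
  class Error < StandardError; end
  class ServiceError < Error; end
  class TimeoutError < Error; end
  class NavigationError < Error; end

  RETRYABLE = [ServiceError, TimeoutError].freeze

  # Runs the block up to max_attempts times. A retryable error triggers a pause
  # of base_delay * 2**(attempt - 1) seconds; any other error propagates at once.
  # `sleeper` is injectable so the pauses can be observed or skipped in tests.
  def self.with_retries(max_attempts: 3, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
    attempt = 0
    begin
      attempt += 1
      yield attempt
    rescue *RETRYABLE
      raise if attempt >= max_attempts # out of attempts: re-raise the last error
      sleeper.call(base_delay * (2**(attempt - 1)))
      retry
    end
  end
end

# Simulated crawl that times out twice and then succeeds.
outcome = CrawlSketch.with_retries(base_delay: 0.01) do |attempt|
  raise CrawlSketch::TimeoutError, "page load timed out" if attempt < 3
  "crawled on attempt #{attempt}"
end
puts outcome
```

With max_attempts: 1 the first failure is re-raised immediately, which matches the "disable retries" usage shown above.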

Rails Integration

Installation

bundle exec rake rubycrawl:install

This creates config/initializers/rubycrawl.rb:

RubyCrawl.configure(
  wait_until:      "load",
  block_resources: true
)

Usage in Rails

Background Jobs with ActiveJob

class CrawlPageJob < ApplicationJob
  queue_as :crawlers

  retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
  retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
  discard_on RubyCrawl::ConfigurationError

  def perform(url)
    result = RubyCrawl.crawl(url)

    Page.create!(
      url:        result.final_url,
      title:      result.metadata['title'],
      content:    result.clean_text,
      markdown:   result.clean_markdown,
      crawled_at: Time.current
    )
  end
end

Multi-page RAG knowledge base:

class BuildKnowledgeBaseJob < ApplicationJob
  queue_as :crawlers

  def perform(documentation_url)
    RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
      embedding = OpenAI.embed(page.clean_markdown)

      Document.create!(
        url:       page.url,
        title:     page.metadata['title'],
        content:   page.clean_markdown,
        embedding: embedding,
        depth:     page.depth
      )
    end
  end
end

Best Practices

Use background jobs to avoid blocking web requests
Configure retry logic based on error type
Store clean_markdown for RAG applications (preserves heading structure for chunking)
Rate limit external crawling to be respectful
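
To show why heading structure helps with chunking, the sketch below splits Markdown into one chunk per heading. chunk_by_heading is a hypothetical helper, not part of RubyCrawl, and it is deliberately naive: it recognises only "#"-style headings and does not skip fenced code blocks that happen to contain lines starting with "#".

```ruby
# Split Markdown into chunks, starting a new chunk at every "#"-style heading.
# Text before the first heading becomes a chunk with a nil heading.
def chunk_by_heading(markdown)
  chunks = []

  markdown.each_line do |line|
    heading = line[/\A#+[ \t]+(.+)/, 1]
    if heading
      chunks << { heading: heading.strip, body: +"" }
    else
      chunks << { heading: nil, body: +"" } if chunks.empty?
      chunks.last[:body] << line
    end
  end

  chunks.each { |chunk| chunk[:body] = chunk[:body].strip }
  chunks.reject { |chunk| chunk[:heading].nil? && chunk[:body].empty? }
end

markdown = "Intro text\n\n# Install\nRun bundle.\n\n## Chrome\nInstall Chrome.\n"
chunk_by_heading(markdown).each do |chunk|
  puts "#{chunk[:heading].inspect}: #{chunk[:body]}"
end
```

Each chunk can then be embedded separately, with the heading stored alongside it as retrieval context.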

Production Deployment

Pre-deployment Checklist

Ensure Chrome is installed on your production servers
Run installer during deployment: bundle exec rake rubycrawl:install

Docker Example

FROM ruby:3.2

# Install Chrome
RUN apt-get update && apt-get install -y \
    chromium \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY Gemfile* ./
RUN bundle install

COPY . .
CMD ["rails", "server"]

Ferrum will detect chromium automatically. To specify a custom path:

RubyCrawl.configure(
  browser_options: { "browser-path": "/usr/bin/chromium" }
)

Architecture

RubyCrawl uses a single-process architecture:

RubyCrawl (public API)
  ↓
Browser (lib/rubycrawl/browser.rb)       ← Ferrum wrapper
  ↓
Ferrum::Browser                          ← Chrome DevTools Protocol (pure Ruby)
  ↓
Chromium                                 ← headless browser
  ↓
Readability.js → heuristic fallback      ← content extraction (inside browser)

Chrome launches once lazily and is reused across all crawls
Each crawl gets an isolated page context (own cookies/storage)
Content extraction runs inside the browser via page.evaluate():
- Primary: Mozilla Readability.js — article-quality extraction for blogs, docs, news
- Fallback: link-density heuristic — covers marketing pages, homepages, SPAs
result.metadata['extractor'] tells you which path was used ("readability" or "heuristic")
No separate processes, no HTTP boundary, no Node.js
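
The general idea behind a link-density heuristic can be shown in a few lines: blocks whose text is mostly link text (navigation, footers) score high, while prose scores low. The sketch below is an illustration in Ruby only. RubyCrawl's fallback runs as JavaScript inside the browser and its exact scoring is not described here; link_density is a hypothetical helper, and regex-based tag stripping is a simplification that a real implementation would replace with DOM access.

```ruby
# Fraction of a fragment's visible text that sits inside <a> elements.
def link_density(html_fragment)
  strip_tags = ->(s) { s.gsub(/<[^>]*>/, "").gsub(/\s+/, " ").strip }

  total = strip_tags.call(html_fragment).length
  return 0.0 if total.zero?

  link_chars = html_fragment
               .scan(%r{<a\b[^>]*>(.*?)</a>}mi)
               .sum { |match| strip_tags.call(match.first).length }
  link_chars.to_f / total
end

nav     = '<ul><li><a href="/a">Home</a></li><li><a href="/b">Docs</a></li></ul>'
article = '<p>RubyCrawl renders pages in a real browser. See the <a href="/docs">docs</a>.</p>'

puts link_density(nav)     # every visible character is link text
puts link_density(article) # only a small share of the text is link text
```

An extractor built on this idea would drop blocks above some threshold; the threshold itself is a tuning choice and is not specified by this README.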

Performance

Resource blocking: Set block_resources: true (it is not enabled by default; the default is nil) to skip images/fonts/CSS for 2-3x faster crawls
Wait strategy: Use wait_until: "load" for static sites, "networkidle" for SPAs
Browser reuse: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)

Parallelism

RubyCrawl does not support parallel page loading within a single process — Ferrum uses one Chrome instance and concurrent access is not thread-safe.

The recommended pattern is job-level parallelism: each background job gets its own RubyCrawl instance and Chrome process, with natural rate limiting via your job queue's concurrency setting:

# Enqueue independent crawls — each job runs its own Chrome
urls.each { |url| CrawlJob.perform_later(url) }

# Control concurrency via your queue worker config (Sidekiq, GoodJob, etc.)
# e.g. Sidekiq concurrency: 3 → 3 Chrome processes crawling in parallel

This also works naturally with respect_robots_txt: true — each job respects Crawl-delay independently.

Development

git clone git@github.com:craft-wise/rubycrawl.git
cd rubycrawl
bin/setup

# Run all tests (requires a local Chrome/Chromium; Ferrum is installed as a gem dependency)
bundle exec rspec

# Manual testing
bin/console
> RubyCrawl.crawl("https://example.com")
> RubyCrawl.crawl("https://example.com").clean_text
> RubyCrawl.crawl("https://example.com").clean_markdown

Contributing

Contributions are welcome! Please read our contribution guidelines first. The project follows these principles:

Simplicity over cleverness: Prefer clear, explicit code
Stability over speed: Correctness first, optimization second
Hide complexity: Users should never need to know Ferrum exists

License

The gem is available as open source under the terms of the MIT License.

Credits

Built with Ferrum — pure Ruby Chrome DevTools Protocol client.

Content extraction powered by Mozilla Readability.js — the algorithm behind Firefox Reader View.

Markdown conversion powered by reverse_markdown for GitHub-flavored output.

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: [email protected]
