RubyDoc.info: File: README – Documentation for parsekit (0.1.2)

parsekit

Native Ruby bindings, built with Magnus, for the parser-core Rust crate. The gem wraps parser-core to provide high-performance text extraction from PDFs, Office documents (DOCX, XLSX), images (via OCR), and more. Part of the ruby-nlp ecosystem.

Features

📄 Document Parsing: Extract text from PDFs and Office documents (DOCX, XLSX)
🖼️ OCR Support: Extract text from images using Tesseract OCR
🚀 High Performance: Native Rust performance with Ruby convenience
🔧 Unified API: Single interface for multiple document formats
📦 Cross-Platform: Works on Linux, macOS, and Windows
🧪 Well Tested: Comprehensive test suite with RSpec

Installation

Add this line to your application's Gemfile:

gem 'parsekit'

And then execute:

bundle install

Or install it yourself as:

gem install parsekit

Requirements

Ruby >= 3.0.0
Rust toolchain (stable)
C compiler (for linking)

That's it! ParseKit bundles all the native libraries it needs, including Tesseract for OCR, so beyond the build tools listed above there are no system dependencies to install.

Usage

Basic Usage

require 'parsekit'

# Parse a PDF file
text = ParseKit.parse_file("document.pdf")
puts text  # Extracted text from the PDF

# Parse an Excel file
text = ParseKit.parse_file("spreadsheet.xlsx")
puts text  # Extracted text from all sheets

# Parse binary data directly
file_data = File.binread("document.pdf")
text = ParseKit.parse_bytes(file_data)
puts text

# Parse with a Parser instance
parser = ParseKit::Parser.new
text = parser.parse_file("report.docx")
puts text
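
When parsing many files, it can help to keep one bad document from aborting the whole run. The following is a minimal sketch, not part of ParseKit: `extract_all` is a hypothetical helper that accepts any object responding to `parse_file(path)`, such as a `ParseKit::Parser` instance. The exception classes ParseKit raises are not listed in this README, so the sketch rescues `StandardError` broadly.

```ruby
# Hypothetical helper (not part of ParseKit): extract text from several files,
# collecting failures instead of stopping at the first one.
# `parser` is any object that responds to parse_file(path).
def extract_all(paths, parser)
  results = {}
  errors = {}
  paths.each do |path|
    begin
      results[path] = parser.parse_file(path)
    rescue StandardError => e
      # Assumption: ParseKit's specific error classes are unknown here,
      # so every StandardError is recorded by message.
      errors[path] = e.message
    end
  end
  [results, errors]
end
```

With the gem installed, a call such as `results, errors = extract_all(Dir.glob("docs/*.pdf"), ParseKit::Parser.new)` would return the extracted text keyed by path, plus a separate hash of error messages for the files that failed.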

Module-Level Convenience Methods

# Parse files directly
content = ParseKit.parse_file('document.pdf')

# Parse bytes
data = File.read('document.pdf', mode: 'rb')
content = ParseKit.parse_bytes(data.bytes)

# Check supported formats
formats = ParseKit.supported_formats
# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp"]

# Check if a file is supported
ParseKit.supports_file?('document.pdf')  # => true
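
The format list can also be used to sort a batch of paths before parsing. The sketch below is an illustration only: `partition_by_format` is a hypothetical helper, and it assumes the list has the shape shown above (lowercase extensions without a leading dot). Whether `supports_file?` itself ignores case is not stated here, so the helper normalizes case on its own.

```ruby
# Hypothetical helper (not part of ParseKit): split paths into those whose
# extension appears in `formats` and those that do not.
# `formats` is expected to look like ParseKit.supported_formats,
# e.g. ["pdf", "docx", "png"].
def partition_by_format(paths, formats)
  known = formats.map(&:downcase)
  paths.partition do |path|
    ext = File.extname(path).delete_prefix(".").downcase
    known.include?(ext)
  end
end
```

For example, `supported, skipped = partition_by_format(Dir.glob("inbox/*"), ParseKit.supported_formats)` would separate the files worth passing to `parse_file` from the rest.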

Configuration Options

# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict
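
A size limit can also be checked before a file is read at all. This is a rough sketch under stated assumptions: `within_size_limit?` and `oversized_files` are hypothetical helpers, not ParseKit methods, and the sketch assumes a file exactly at the limit is acceptable, which may differ from how `max_size` is enforced by the gem.

```ruby
# Hypothetical pre-check (not part of ParseKit): compare the on-disk size
# with a limit before handing the file to a parser.
DEFAULT_MAX_SIZE = 50 * 1024 * 1024 # mirrors the 50MB limit configured above

def within_size_limit?(path, max_size = DEFAULT_MAX_SIZE)
  # Assumption: a file exactly at the limit is allowed.
  File.size(path) <= max_size
end

# Return the subset of paths that exceed the limit.
def oversized_files(paths, max_size = DEFAULT_MAX_SIZE)
  paths.reject { |path| within_size_limit?(path, max_size) }
end
```

Checking `File.size` first avoids loading a very large file into memory only to have it rejected.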

Format-Specific Parsing

parser = ParseKit::Parser.new

# Direct access to format-specific parsers
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

image_data = File.read('image.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

excel_data = File.read('data.xlsx', mode: 'rb').bytes
excel_text = parser.parse_xlsx(excel_data)

Supported Formats

Format	Extensions	Method	Notes
PDF	.pdf	`parse_pdf`	Text extraction via MuPDF
Word	.docx	`parse_docx`	Office Open XML format
Excel	.xlsx, .xls	`parse_xlsx`	Both modern and legacy formats
PowerPoint	.pptx	`parse_pptx`	Text extraction from slides and notes
Images	.png, .jpg, .jpeg, .tiff, .bmp	`ocr_image`	OCR via bundled Tesseract
JSON	.json	`parse_json`	Pretty-printed output
XML/HTML	.xml, .html	`parse_xml`	Extracts text content
Text	.txt, .csv, .md	`parse_text`	With encoding detection
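
The table above can be turned into a small dispatch map when you want to choose the format-specific method yourself. This is a sketch only: `FORMAT_METHODS`, `format_method_for`, and `parse_with_format_method` are hypothetical names, the mapping is transcribed from the table, and the byte-array argument follows the Format-Specific Parsing examples.

```ruby
# Extension-to-method mapping transcribed from the table above.
FORMAT_METHODS = {
  "pdf"  => :parse_pdf,
  "docx" => :parse_docx,
  "xlsx" => :parse_xlsx, "xls" => :parse_xlsx,
  "pptx" => :parse_pptx,
  "png"  => :ocr_image, "jpg" => :ocr_image, "jpeg" => :ocr_image,
  "tiff" => :ocr_image, "bmp" => :ocr_image,
  "json" => :parse_json,
  "xml"  => :parse_xml, "html" => :parse_xml,
  "txt"  => :parse_text, "csv" => :parse_text, "md" => :parse_text
}.freeze

# Look up the method name for a path, based only on its extension.
def format_method_for(path)
  ext = File.extname(path).delete_prefix(".").downcase
  FORMAT_METHODS.fetch(ext) do
    raise ArgumentError, "unsupported extension: #{ext.inspect}"
  end
end

# Read the file as a byte array and call the matching method on `parser`,
# which is any object exposing the methods named in the table.
def parse_with_format_method(parser, path)
  data = File.binread(path).bytes
  parser.public_send(format_method_for(path), data)
end
```

In most cases `parse_file` already handles this selection; a map like this is mainly useful when the extension is known but the data comes from somewhere other than a file path.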

Performance

ParseKit is built with performance in mind:

Native Rust implementation for speed
Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations
Efficient memory usage with streaming where possible
Configurable size limits to prevent memory issues
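
To see how these points play out on your own documents, you can time a parse call with Ruby's standard `Benchmark` module. The helper below is a minimal sketch: `best_time` is a hypothetical name, and it reports the fastest of several wall-clock runs, which is a simple way to reduce noise rather than a rigorous benchmark.

```ruby
require "benchmark"

# Hypothetical helper: run the block `runs` times and return the fastest
# wall-clock duration in seconds.
def best_time(runs = 3, &block)
  Array.new(runs) { Benchmark.realtime(&block) }.min
end
```

With the gem installed, `best_time { ParseKit.parse_file("document.pdf") }` would give a rough per-document timing for that file.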

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests.

To compile the Rust extension:

rake compile

To run tests with coverage:

rake dev:coverage

OCR Mode Configuration

By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:

Using system Tesseract during installation:

gem install parsekit -- --no-default-features

For development with system Tesseract:

rake compile CARGO_FEATURES=""  # Disables bundled-tesseract feature

System Tesseract requirements:

macOS: brew install tesseract
Ubuntu/Debian: sudo apt-get install libtesseract-dev
Fedora/RHEL: sudo dnf install tesseract-devel

The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.

Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

Ruby Layer: Provides convenient API and format detection
Rust Layer: Implements high-performance parsing using:
- MuPDF for PDF text extraction (statically linked)
- tesseract-rs for OCR (with bundled Tesseract by default)
- Pure Rust libraries for DOCX/XLSX parsing
- Magnus for Ruby-Rust FFI bindings
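
How ParseKit's Ruby layer performs format detection is not described here. As a general illustration of the idea, the sketch below sniffs a coarse format family from a file's leading bytes using well-known signatures. `MAGIC_SIGNATURES` and `sniff_format` are hypothetical names and do not reflect the gem's internals.

```ruby
# Illustrative only: well-known leading byte signatures for a few formats.
MAGIC_SIGNATURES = {
  "%PDF".b         => :pdf,
  "PK\x03\x04".b   => :zip_container, # DOCX, XLSX and PPTX are ZIP archives
  "\x89PNG".b      => :png,
  "\xFF\xD8\xFF".b => :jpeg
}.freeze

# Return a coarse format symbol for a binary string, or :unknown.
def sniff_format(bytes)
  head = bytes.byteslice(0, 8).to_s.b # compare as raw bytes
  MAGIC_SIGNATURES.each do |magic, format|
    return format if head.start_with?(magic)
  end
  :unknown
end
```

A signature check like this only narrows things down: telling DOCX from XLSX, for instance, requires looking inside the ZIP archive or falling back to the file extension.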

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/parsekit.

License

The gem is available as open source under the terms of the MIT License.

Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.