Class: Tahweel::Converter

Inherits:
Object
Defined in:
lib/tahweel/converter.rb

Overview

Orchestrates the full conversion process:

  1. Splits a PDF into images.

  2. Performs OCR on each image concurrently.

  3. Returns the aggregated text.

  4. Cleans up temporary files.

Constant Summary

DEFAULT_CONCURRENCY = 12

Maximum number of concurrent OCR operations, capped to avoid hitting API rate limits.
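The cap on concurrent OCR operations can be pictured as a small worker pool. The sketch below is an illustration only: `bounded_map` is a hypothetical helper, not part of Tahweel, and the source does not show how the gem actually schedules its OCR calls. The block stands in for an OCR request.

```ruby
# Hypothetical helper (not part of Tahweel): maps over items using at most
# `concurrency` worker threads and returns the results in input order.
def bounded_map(items, concurrency: 12)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  queue.close # once drained, Queue#pop returns nil and the workers stop

  results = Array.new(items.size)
  worker_count = [concurrency, items.size].min
  workers = Array.new(worker_count) do
    Thread.new do
      while (job = queue.pop)
        item, index = job
        results[index] = yield(item)
      end
    end
  end
  workers.each(&:join) # re-raises any exception raised inside a worker
  results
end

# Stand-in for an OCR call; a real processor would contact a remote API here.
texts = bounded_map(%w[page-1.png page-2.png page-3.png], concurrency: 2) do |path|
  "text of #{path}"
end
```

Writing each result to its original index keeps the page order stable even when pages finish out of order.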

Constructor Details

#initialize(pdf_path, dpi: PdfSplitter::DEFAULT_DPI, processor: :google_drive, concurrency: DEFAULT_CONCURRENCY) ⇒ Converter

Initializes the Converter.

Parameters:

  • pdf_path (String)

    Path to the PDF file.

  • dpi (Integer) (defaults to: PdfSplitter::DEFAULT_DPI)

    DPI for PDF to image conversion.

  • processor (Symbol) (defaults to: :google_drive)

    OCR processor to use.

  • concurrency (Integer) (defaults to: DEFAULT_CONCURRENCY)

    Max concurrent OCR operations.



# File 'lib/tahweel/converter.rb', line 45

def initialize(pdf_path, dpi: PdfSplitter::DEFAULT_DPI, processor: :google_drive, concurrency: DEFAULT_CONCURRENCY)
  @pdf_path = pdf_path
  @dpi = dpi
  @processor_type = processor
  @concurrency = concurrency
end

Class Method Details

.convert(pdf_path, dpi: PdfSplitter::DEFAULT_DPI, processor: :google_drive, concurrency: DEFAULT_CONCURRENCY) {|Hash| ... } ⇒ Array<String>

Convenience method that builds a Converter and runs #convert in a single call, converting a PDF file to text.

Parameters:

  • pdf_path (String)

    Path to the PDF file.

  • dpi (Integer) (defaults to: PdfSplitter::DEFAULT_DPI)

    DPI for PDF to image conversion (default: 150).

  • processor (Symbol) (defaults to: :google_drive)

    OCR processor to use (default: :google_drive).

  • concurrency (Integer) (defaults to: DEFAULT_CONCURRENCY)

    Max concurrent OCR operations (default: 12).

  • &block (Proc)

    A block that will be yielded with progress info.

Yields:

  • (Hash)

    Progress info: { stage: :splitting or :ocr, current_page: Integer, percentage: Float, remaining_pages: Integer }

Returns:

  • (Array<String>)

    An array containing the text of each page.
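As a rough illustration of consuming the progress hash, the sketch below formats one update into a status line. `format_progress` is a hypothetical helper, and the sample values are made up. It assumes `percentage` is on a 0 to 100 scale, which the documentation does not state.

```ruby
# Hypothetical formatter for the documented progress hash.
# Assumption: :percentage is expressed on a 0..100 scale.
def format_progress(progress)
  format(
    "[%<stage>s] page %<current_page>d, %<percentage>.1f%% done, %<remaining_pages>d remaining",
    stage: progress[:stage],
    current_page: progress[:current_page],
    percentage: progress[:percentage],
    remaining_pages: progress[:remaining_pages]
  )
end

# Made-up sample update, shaped like the hash described under "Yields".
sample = { stage: :ocr, current_page: 3, percentage: 25.0, remaining_pages: 9 }
line = format_progress(sample)

# With the gem loaded, the same formatter could be used inside the block:
#   Tahweel::Converter.convert("document.pdf") { |progress| puts format_progress(progress) }
```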



# File 'lib/tahweel/converter.rb', line 31

def self.convert(
  pdf_path,
  dpi: PdfSplitter::DEFAULT_DPI,
  processor: :google_drive,
  concurrency: DEFAULT_CONCURRENCY,
  &
) = new(pdf_path, dpi:, processor:, concurrency:).convert(&)

Instance Method Details

#convert {|Hash| ... } ⇒ Array<String>

Executes the conversion process.

Parameters:

  • &block (Proc)

    A block that will be yielded with progress info.

Yields:

  • (Hash)

    Progress info: { stage: :splitting or :ocr, current_page: Integer, percentage: Float, remaining_pages: Integer }

Returns:

  • (Array<String>)

    An array containing the text of each page.



# File 'lib/tahweel/converter.rb', line 62

def convert(&)
  image_paths, temp_dir = PdfSplitter.split(@pdf_path, dpi: @dpi, &).values_at(:image_paths, :folder_path)

  begin
    process_images(image_paths, Ocr.new(processor: @processor_type), &)
  ensure
    FileUtils.rm_rf(temp_dir)
  end
end
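
The `begin`/`ensure` pair above means the temporary directory is removed even if OCR raises. The standalone sketch below shows the same pattern in isolation. `with_scratch_dir` and the simulated failure are hypothetical and do not come from the Tahweel source.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: creates a scratch directory, yields it, and removes it
# afterwards whether or not the block raises.
def with_scratch_dir
  dir = Dir.mktmpdir("tahweel-sketch")
  begin
    yield dir
  ensure
    FileUtils.rm_rf(dir)
  end
end

seen_dir = nil
failure = nil
begin
  with_scratch_dir do |dir|
    seen_dir = dir
    File.write(File.join(dir, "page-1.png"), "fake image bytes")
    raise "simulated OCR failure"
  end
rescue RuntimeError => e
  failure = e.message # the error still propagates to the caller
end

# True when the scratch directory was removed despite the failure.
cleaned_up = !Dir.exist?(seen_dir)
```

The exception is not swallowed: `ensure` runs the cleanup and then lets the error continue to the caller, which is the behavior callers of `#convert` should expect as well.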