Module: Bidi2pdf::TestHelpers::PDFReaderUtils

Included in:
Images::Extractor
Defined in:
lib/bidi2pdf/test_helpers/pdf_reader_utils.rb

Defined Under Namespace

Modules: InstanceMethods

Class Method Summary

Class Method Details

.convert_data_to_io(pdf_data) ⇒ IO

Converts various input formats into an IO object for PDF::Reader.

Parameters:

  • pdf_data (String, StringIO, File)

    The PDF data to be converted.

Returns:

  • (IO)

    An IO object containing the PDF data.



# File 'lib/bidi2pdf/test_helpers/pdf_reader_utils.rb', line 51

def convert_data_to_io(pdf_data)
  # rubocop:disable Lint/DuplicateBranch
  if pdf_data.is_a?(String) && pdf_data.start_with?("JVBER")
    # Base64-encoded PDF: the encoding of "%PDF-" always starts with "JVBER"
    StringIO.new(Base64.decode64(pdf_data))
  elsif pdf_data.is_a?(String) && pdf_data.start_with?("%PDF-")
    StringIO.new(pdf_data)
  elsif pdf_data.is_a?(StringIO) || pdf_data.is_a?(File)
    # Already IO-like; hand it to PDF::Reader unchanged
    pdf_data
  elsif pdf_data.is_a?(String) && File.exist?(pdf_data)
    File.open(pdf_data, "rb")
  else
    StringIO.new(pdf_data)
  end
  # rubocop:enable Lint/DuplicateBranch
end
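
The prefix checks in convert_data_to_io rely on the fact that the Base64 encoding of "%PDF-" always begins with "JVBER". The following is a minimal sketch of that detection logic, not the library's implementation: classify_pdf_input is a hypothetical helper that only names the branch an input would take, and it uses String#pack and String#unpack1 in place of the Base64 module.

```ruby
require "stringio"

# Hypothetical helper, not part of Bidi2pdf: reports which kind of input
# a value looks like, loosely mirroring the checks in convert_data_to_io.
def classify_pdf_input(pdf_data)
  case pdf_data
  when StringIO, File
    :io
  when String
    if pdf_data.start_with?("JVBER")
      :base64
    elsif pdf_data.start_with?("%PDF-")
      :raw
    elsif File.exist?(pdf_data)
      :path
    else
      :unknown_string
    end
  else
    :unsupported
  end
end

raw = "%PDF-1.7\n"
encoded = [raw].pack("m0") # strict Base64, no line breaks

puts encoded.start_with?("JVBER")          # the encoded header carries the prefix
puts classify_pdf_input(encoded)           # base64
puts classify_pdf_input(raw)               # raw
puts classify_pdf_input(StringIO.new(raw)) # io
puts encoded.unpack1("m") == raw           # decoding restores the raw bytes
```

Checking the type before calling start_with? matters here, because StringIO and File do not respond to that method.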

.included(base) ⇒ Object



# File 'lib/bidi2pdf/test_helpers/pdf_reader_utils.rb', line 84

def self.included(base)
  base.include(InstanceMethods)
end
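
The included hook is what makes the instance-level helpers available to any class that includes the outer module. A small sketch of the pattern follows; ReaderUtilsSketch, ExtractorSketch, and page_count_label are hypothetical names chosen for illustration and do not exist in Bidi2pdf.

```ruby
# Hypothetical module, used only to illustrate the hook.
module ReaderUtilsSketch
  module InstanceMethods
    # Hypothetical instance helper.
    def page_count_label(count)
      "#{count} page(s)"
    end
  end

  # Ruby calls this hook each time the module is included in a class.
  def self.included(base)
    base.include(InstanceMethods)
  end
end

class ExtractorSketch
  include ReaderUtilsSketch
end

puts ExtractorSketch.include?(ReaderUtilsSketch::InstanceMethods)
puts ExtractorSketch.new.page_count_label(3)
```

Including the outer module is enough; the nested InstanceMethods module is mixed in automatically by the hook.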

.pdf_reader_for(pdf_data) ⇒ PDF::Reader

Converts the input PDF data into an IO object and initializes a PDF::Reader.

Parameters:

  • pdf_data (String, StringIO, File)

    The PDF data to be read.

Returns:

  • (PDF::Reader)

    A PDF::Reader instance for the given data.

Raises:

  • (PDF::Reader::MalformedPDFError)

    If the PDF data is invalid.



# File 'lib/bidi2pdf/test_helpers/pdf_reader_utils.rb', line 41

def pdf_reader_for(pdf_data)
  io = convert_data_to_io(pdf_data)
  PDF::Reader.new(io)
end

.pdf_text(pdf_data) ⇒ Array<String>, Object

Extracts text content from a PDF document.

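This method accepts various PDF input formats and attempts to extract text content from all pages. Input that is not a String, StringIO, or File is returned unchanged. If extraction fails due to malformed PDF data, the original input is returned wrapped in a single-element array.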

Examples:

Extract text from a PDF file

text_content = pdf_text('path/to/document.pdf')

Extract text from Base64-encoded string

text_content = pdf_text(base64_encoded_pdf_data)

Parameters:

  • pdf_data (String, StringIO, File)

    The PDF data in one of the following formats:

    • Base64-encoded PDF string

    • Raw PDF data beginning with “%PDF-”

    • StringIO object containing PDF data

    • Path to a PDF file as String

    • Raw PDF data as String

Returns:

  • (Array<String>)

    An array of strings, with each string representing the text content of a page

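  • (Object)

    The original input: returned unchanged if it is not a String, StringIO, or File, or wrapped in a single-element array if PDF::Reader raises MalformedPDFError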



# File 'lib/bidi2pdf/test_helpers/pdf_reader_utils.rb', line 25

def pdf_text(pdf_data)
  return pdf_data unless pdf_data.is_a?(String) || pdf_data.is_a?(StringIO) || pdf_data.is_a?(File)

  begin
    reader = pdf_reader_for pdf_data
    reader.pages.map(&:text)
  rescue PDF::Reader::MalformedPDFError
    [pdf_data]
  end
end
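
pdf_text has three possible return shapes, which callers such as test matchers may need to tell apart. The sketch below reproduces only that control flow so it can run without the pdf-reader gem. SketchMalformedPDFError, sketch_extract_pages, and sketch_pdf_text are hypothetical stand-ins, and the "one line per page" extraction is an assumption made purely for illustration.

```ruby
require "stringio"

# Hypothetical stand-in for PDF::Reader::MalformedPDFError.
class SketchMalformedPDFError < StandardError; end

# Hypothetical stand-in for building a reader and mapping pages to text:
# treats every line after the header as one "page".
def sketch_extract_pages(data)
  text = data.is_a?(StringIO) ? data.string : data
  raise SketchMalformedPDFError, "missing PDF header" unless text.start_with?("%PDF-")

  text.lines.drop(1).map(&:chomp)
end

# Same guard and rescue structure as pdf_text.
def sketch_pdf_text(pdf_data)
  return pdf_data unless [String, StringIO, File].any? { |klass| pdf_data.is_a?(klass) }

  begin
    sketch_extract_pages(pdf_data)
  rescue SketchMalformedPDFError
    [pdf_data]
  end
end

p sketch_pdf_text("%PDF-1.7\npage one\npage two") # one string per page
p sketch_pdf_text("not a pdf")                    # input wrapped in an array
p sketch_pdf_text(42)                             # unsupported input, unchanged
```

Because the malformed case also yields an array of strings when the input is a String, a caller cannot distinguish it from a one-page result by type alone.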