Module: ParseKit

Defined in:
lib/parsekit.rb,
lib/parsekit/error.rb,
lib/parsekit/parser.rb,
lib/parsekit/version.rb

Overview

ParseKit is a Ruby document parsing toolkit with PDF and OCR support.

Defined Under Namespace

Classes: Parser

Constant Summary

SUPPORTED_FORMATS =

Supported file formats and their extensions

{
  pdf: ['.pdf'],
  docx: ['.docx'],
  xlsx: ['.xlsx'],
  xls: ['.xls'],
  pptx: ['.pptx'],
  png: ['.png'],
  jpeg: ['.jpg', '.jpeg'],
  tiff: ['.tiff', '.tif'],
  bmp: ['.bmp'],
  json: ['.json'],
  xml: ['.xml', '.html'],
  text: ['.txt', '.md', '.csv']
}.freeze
VERSION =
"0.1.2"

Class Method Summary

Class Method Details

.detect_format(filename) ⇒ Symbol

Detect file format from filename/extension

Parameters:

  • filename (String, nil)

    The filename to check

Returns:

  • (Symbol)

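    The detected format, or :unknown if the filename is nil, empty, has no extension, or has an extension not listed in SUPPORTED_FORMATS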



# File 'lib/parsekit.rb', line 72

def detect_format(filename)
  return :unknown if filename.nil? || filename.empty?
  
  ext = File.extname(filename).downcase
  return :unknown if ext.empty?
  
  SUPPORTED_FORMATS.each do |format, extensions|
    return format if extensions.include?(ext)
  end
  
  :unknown
end
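
The lookup above can be tried without the gem installed. The following is a minimal standalone sketch: it copies the SUPPORTED_FORMATS table and the method body from this page into a hypothetical FormatDemo module (a name introduced here for illustration only, not part of ParseKit) to show how a few filenames resolve.

```ruby
# Standalone sketch; FormatDemo is a hypothetical stand-in, not part of ParseKit.
module FormatDemo
  # Copied from the constant documented on this page.
  SUPPORTED_FORMATS = {
    pdf: ['.pdf'],
    docx: ['.docx'],
    xlsx: ['.xlsx'],
    xls: ['.xls'],
    pptx: ['.pptx'],
    png: ['.png'],
    jpeg: ['.jpg', '.jpeg'],
    tiff: ['.tiff', '.tif'],
    bmp: ['.bmp'],
    json: ['.json'],
    xml: ['.xml', '.html'],
    text: ['.txt', '.md', '.csv']
  }.freeze

  # Same logic as the listing above, defined as a module method.
  def self.detect_format(filename)
    return :unknown if filename.nil? || filename.empty?

    # Extensions are compared case-insensitively.
    ext = File.extname(filename).downcase
    return :unknown if ext.empty?

    SUPPORTED_FORMATS.each do |format, extensions|
      return format if extensions.include?(ext)
    end

    :unknown
  end
end

samples = ['report.PDF', 'index.html', 'data.csv', 'README', 'archive.zip', nil]
results = samples.map { |name| FormatDemo.detect_format(name) }
p results
# => [:pdf, :xml, :text, :unknown, :unknown, :unknown]
```

Note that, per the table, '.html' resolves to :xml and '.csv' and '.md' resolve to :text, so the returned symbol names a format family rather than the literal extension.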

.native_version ⇒ String

Get the native library version

Returns:

  • (String)

    Version of the native library, or "unknown" if retrieving it raises a StandardError



# File 'lib/parsekit.rb', line 87

def native_version
  version
rescue StandardError
  "unknown"
end
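
Because of the rescue clause, callers receive a String even when the underlying version call fails. A small sketch of that fallback follows. NativeStub is a hypothetical module used only here, and its version method raises on purpose to simulate a failure; the real version method is presumably supplied by the native extension, which this page does not show.

```ruby
# Hypothetical stand-in for illustration; not part of ParseKit.
module NativeStub
  # Simulates a native call that fails at runtime.
  def self.version
    raise "native library unavailable" # RuntimeError, a StandardError subclass
  end

  # Same shape as the listing above: method-level rescue with a fallback value.
  def self.native_version
    version
  rescue StandardError
    "unknown"
  end
end

p NativeStub.native_version
# => "unknown"
```

Exceptions that do not descend from StandardError, such as LoadError or NoMemoryError, are not caught by this clause and would still propagate.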

.parse(input, options = {}) ⇒ String

Convenience method to parse input directly (for text)

Parameters:

  • input (String)

    The input string to parse

  • options (Hash) (defaults to: {})

    Optional configuration options

Options Hash (options):

  • :encoding (String)

    Input encoding (default: UTF-8)

Returns:

  • (String)

    The parsed result



# File 'lib/parsekit.rb', line 42

def parse(input, options = {})
  Parser.new(options).parse(input)
end

.parse_bytes(data, options = {}) ⇒ String

Parse binary data

Parameters:

  • data (String, Array)

    Binary data to parse, given as a String or as an Array of byte values

  • options (Hash) (defaults to: {})

    Optional configuration options

Returns:

  • (String)

    The extracted text



# File 'lib/parsekit.rb', line 50

def parse_bytes(data, options = {})
  # Convert string to bytes if needed
  byte_data = data.is_a?(String) ? data.bytes : data
  Parser.new(options).parse_bytes(byte_data)
end
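
The conversion step relies on Ruby's String#bytes, which returns an Array of Integers in the range 0..255. The sketch below shows that normalization in isolation; to_byte_array is a hypothetical helper name used only here, and nothing in it calls ParseKit itself.

```ruby
# Hypothetical helper mirroring the String-to-Array normalization shown above.
def to_byte_array(data)
  data.is_a?(String) ? data.bytes : data
end

from_string = to_byte_array("%PDF")            # String input is split into byte values
from_array  = to_byte_array([37, 80, 68, 70])  # Array input is passed through unchanged

p from_string
# => [37, 80, 68, 70]
p from_string == from_array
# => true
```

Either input form therefore reaches Parser#parse_bytes as an Array of byte values. When the data comes from disk, File.binread is one way to obtain a binary String suitable for the String form.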

.supported_formats ⇒ Array<String>

Get supported file formats

Returns:

  • (Array<String>)

    List of supported file extensions



# File 'lib/parsekit.rb', line 58

def supported_formats
  Parser.supported_formats
end

.supports_file?(path) ⇒ Boolean

Check if a file format is supported

Parameters:

  • path (String)

    File path to check

Returns:

  • (Boolean)

    True if the file format is supported



# File 'lib/parsekit.rb', line 65

def supports_file?(path)
  Parser.new.supports_file?(path)
end