Module: Iguvium

Defined in:
lib/iguvium.rb,
lib/iguvium/cv.rb,
lib/iguvium/row.rb,
lib/iguvium/page.rb,
lib/iguvium/image.rb,
lib/iguvium/table.rb,
lib/iguvium/labeler.rb,
lib/iguvium/version.rb

Overview

PDF table extractor. For more details, see Iguvium.read and Page#extract_tables!

Examples:

Get all the tables in 2D text array format

pages = Iguvium.read('filename.pdf') #=> [Array<Iguvium::Page>]
tables = pages.flat_map { |page| page.extract_tables! } #=> [Array<Iguvium::Table>]
tables.map(&:to_a)

Get the first table from page 8

pages = Iguvium.read('filename.pdf')
tables = pages[7].extract_tables!
tables.first.to_a

Defined Under Namespace

Classes: CV, Image, Labeler, Page, Row, Table

Constant Summary

VERSION =
'0.9.0'

Class Method Details

.logger ⇒ Logger

Creates and gives access to a Ruby Logger. The default level is Logger::ERROR.

To set another level, call `Iguvium.logger.level = Logger::INFO` or use any other standard logger level.

Iguvium's logger can also be replaced, for example with a global one: `Iguvium.logger = Rails.logger`

Returns:

  • (Logger)


# File 'lib/iguvium.rb', line 87

def logger
  return @logger if @logger

  @logger = Logger.new(STDOUT)
  @logger.formatter = proc do |severity, _, _, msg|
    "#{severity}: #{msg}\n"
  end
  @logger.level = Logger::ERROR
  @logger
end
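
The accessor pair can be tried without the gem installed. The following is a minimal sketch, assuming a hypothetical stand-in module named LoggerDemo that copies only the memoized reader shown above plus a plain writer; it is not Iguvium itself. It shows the default level, a level change, and a wholesale logger replacement.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for the Iguvium module (assumption for illustration):
# it reproduces only the logger reader/writer pair so the sketch runs alone.
module LoggerDemo
  class << self
    attr_writer :logger

    def logger
      return @logger if @logger

      @logger = Logger.new($stdout)
      @logger.formatter = proc { |severity, _, _, msg| "#{severity}: #{msg}\n" }
      @logger.level = Logger::ERROR
      @logger
    end
  end
end

# The memoized logger starts at ERROR until the caller changes it.
default_level = LoggerDemo.logger.level

# Same call shape as `Iguvium.logger.level = Logger::INFO`.
LoggerDemo.logger.level = Logger::INFO

# Replace the logger wholesale, as one might with Rails.logger.
# Here the replacement writes to an in-memory buffer instead.
buffer = StringIO.new
LoggerDemo.logger = Logger.new(buffer)
LoggerDemo.logger.formatter = proc { |severity, _, _, msg| "#{severity}: #{msg}\n" }
LoggerDemo.logger.warn('grid not found')
captured = buffer.string
```

Because the reader memoizes, a replacement logger carries its own level and formatter; nothing from the default logger is inherited.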

.logger=(new_logger) ⇒ Object



# File 'lib/iguvium.rb', line 97

def logger=(new_logger)
  @logger = new_logger
end

.read(path, **opts) ⇒ Array<Iguvium::Page>

This is the main method. Usually this is where you start.

It returns an array of Page objects.

Tables on those pages are neither extracted nor detected yet; all the heavy lifting is done in the Iguvium::Page#extract_tables! method.

The :images option (see Options Hash) typically makes sense only in the rare case when the table grid in your PDF is filled with a rasterized texture or is actually a background picture. Usually you don't want to use it.

Examples:

prepare pages, consider images meaningful

pages = Iguvium.read('filename.pdf', images: true)

set nonstandard gs path, get pages starting with the one which contains keyword

pages = Iguvium.read('nixon.pdf', gspath: '/usr/bin/gs')
pages = pages.drop_while { |page| !page.text.match?(/Watergate/) }
# Iguvium::Page#text does not require an optical page scan and is thus relatively cheap.
# It uses the underlying PDF::Reader::Page#text, which is fast but not entirely free.
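
The keyword example above relies on Array#drop_while, which keeps everything from the first matching page onward, including later pages that do not match. A minimal sketch of that behavior, assuming a hypothetical FakePage struct that stands in for Iguvium::Page and exposes only a text reader:

```ruby
# Hypothetical stand-in for Iguvium::Page (assumption for illustration);
# only the #text reader is modeled.
FakePage = Struct.new(:number, :text)

pages = [
  FakePage.new(1, 'Table of contents'),
  FakePage.new(2, 'Background'),
  FakePage.new(3, 'The Watergate hearings'),
  FakePage.new(4, 'Appendix')
]

# drop_while skips pages until the block first returns false,
# then keeps that page and everything after it.
from_keyword = pages.drop_while { |page| !page.text.match?(/Watergate/) }
from_keyword.map(&:number) #=> [3, 4]

# select, by contrast, keeps only the pages that mention the keyword.
only_matching = pages.select { |page| page.text.match?(/Watergate/) }
only_matching.map(&:number) #=> [3]
```

Which of the two fits depends on whether the tables of interest can continue onto pages that no longer mention the keyword.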

Parameters:

  • path (String)

    path to PDF file to be read

  • opts (Hash)

    a customizable set of options

Options Hash (**opts):

  • :gspath (String) — default: nil

    explicit path to the Ghostscript executable. Use it when gs is installed in a non-standard location. If not specified, the gem tries standard locations such as `C:\Program Files\gs\gs*\bin\gswin??c.exe` on Windows, or just `gs` on Mac and Linux

  • :loglevel (Logger::Level)

    level like Logger::INFO, default is Logger::ERROR

  • :images (Boolean) — default: false

    consider pictures in PDF as possible table separators.
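
A caller who wants to check for Ghostscript before calling Iguvium.read could build the options hash along these lines. This is a sketch only: locate_executable and read_options are hypothetical helpers written for this example, not part of Iguvium, and the gem's own lookup may work differently.

```ruby
# Hypothetical helper, not part of Iguvium: returns the first executable
# called `name` in a PATH-style string, or nil when none is found.
def locate_executable(name, search_path = ENV.fetch('PATH', ''))
  search_path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Hypothetical helper: builds an options hash for Iguvium.read.
# :gspath is added only when a path was given explicitly or discovered,
# so the gem's own default lookup still applies otherwise.
def read_options(explicit_gspath: nil, images: false)
  opts = { images: images }
  gspath = explicit_gspath || locate_executable('gs')
  opts[:gspath] = gspath if gspath
  opts
end

# Intended use (requires the gem and a real PDF, so it is left commented out):
# pages = Iguvium.read('filename.pdf', **read_options)
```

Leaving :gspath out when nothing is found keeps the decision with the gem rather than passing it an empty value.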

Returns:

  • (Array<Iguvium::Page>)

# File 'lib/iguvium.rb', line 60

def read(path, **opts)
  if windows?
    unless opts[:gspath]
      gspath = Dir.glob('C:/Program Files/gs/gs*/bin/gswin??c.exe').first.tr('/', '\\')
      opts[:gspath] = "\"#{gspath}\""
    end

    if opts[:gspath].empty?
      puts "There's no gs utility in your $PATH.
  Please install GhostScript: https://www.ghostscript.com/download/gsdnld.html"
      exit
    end
  else
    opts[:gspath] ||= gs_nix?
  end

  PDF::Reader.new(path, opts).pages.map { |page| Page.new(page, path, opts) }
end