Class: Iguvium::Page

Inherits:

Object

Object
Iguvium::Page

Defined in: lib/iguvium/page.rb

Overview

A document page from which tables can be extracted. To extract them, use #extract_tables!.

The #text method is handy for checking in advance whether you need this page.

Examples:

pages = Iguvium.read('nixon.pdf', gspath: '/usr/bin/gs')
pages = pages.select { |page| page.text.match?(/[Tt]able.+\d+/) }
tables = pages.map(&:extract_tables!)

Instance Method Summary

#extract_tables!(images: @opts[:images]) ⇒ Array<Iguvium::Table>

This method does all the heavy lifting, which includes optical recognition of table borders.
#initialize(page, path, **opts) ⇒ Page constructor

Typically you don’t need it; prefer creating Page objects with Iguvium.read.
#text ⇒ String

It takes ~150 ms to run, so it’s handy for picking out pages before trying to extract tables, which is an expensive operation.

Constructor Details

#initialize(page, path, **opts) ⇒ `Page`

Typically you don’t need it; prefer creating Iguvium::Page objects with Iguvium.read.

Parameters:

page (PDF::Reader::Page)
path (String) —

path to PDF file to be read
opts (Hash) —

a customizable set of options

# File 'lib/iguvium/page.rb', line 17

def initialize(page, path, **opts)
  @opts = opts
  @reader_page = page
  @path = path
end

Instance Method Details

#extract_tables!(images: @opts[:images]) ⇒ `Array<Iguvium::Table>`

TODO:

Further speed improvements should be made; at least a 30% speedup is expected on multicore systems.

This method does all the heavy lifting, which includes optical recognition of table borders. It returns an array of Table objects, or an empty array if it fails to recognize any. To get structured data from a parsed Table, just call Table#to_a.
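As an illustration of working with the output of Table#to_a, here is a minimal sketch. It assumes that to_a returns an array of rows, each row an array of cell values, and that the first row holds column headers; neither assumption is guaranteed by this page. The helper name rows_to_records is hypothetical and not part of Iguvium.

```ruby
# Hypothetical helper, not part of Iguvium.
# Assumes `rows` is an array of rows (each an array of cells), such as
# Table#to_a might return, and that the first row contains column headers.
def rows_to_records(rows)
  header, *body = rows
  return [] if header.nil?

  # Pair every cell with the header of its column
  body.map { |row| header.zip(row).to_h }
end

# Possible usage with a real page (requires the iguvium gem and a PDF):
#   records = page.extract_tables!.flat_map { |table| rows_to_records(table.to_a) }
```

Whether the first row is really a header depends on the document, so check a sample table before relying on this.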

Because a PDF document is generally a collection of independent pages, #extract_tables! is well suited to parallel processing. Concurrent processing, on the other hand (think fork for parallel versus threads for concurrent), is not a great idea, because this is a CPU-intensive task.

On some older CPUs it takes up to 2 seconds per page (up to 1 second on more modern ones), so use it wisely.
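The fork-based approach mentioned above could look roughly like the sketch below. The helper fork_map is hypothetical and not part of Iguvium; it relies only on Ruby's standard fork, IO.pipe and Marshal, so it works on Unix-like systems only. Results travel back to the parent through Marshal, so the block should return plain data. The usage comment assumes that the output of Table#to_a is plain arrays that Marshal can serialize.

```ruby
# Hypothetical helper, not part of Iguvium.
# Splits `items` into up to `workers` slices, processes each slice in a
# forked child process, and returns the block results in the original order.
def fork_map(items, workers: 4)
  return [] if items.empty?

  slice_size = (items.size.to_f / workers).ceil
  jobs = items.each_slice(slice_size).map do |slice|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.binmode
      # Serialize this slice's results and send them to the parent
      writer.write(Marshal.dump(slice.map { |item| yield item }))
      writer.close
      exit!(0) # skip at_exit handlers inherited from the parent
    end
    writer.close
    [pid, reader]
  end

  # Collect the results slice by slice, preserving the input order
  jobs.flat_map do |pid, reader|
    reader.binmode
    data = reader.read
    reader.close
    Process.wait(pid)
    Marshal.load(data)
  end
end

# Possible usage with real pages (requires the iguvium gem and a PDF):
#   pages = Iguvium.read('nixon.pdf')
#   rows_per_page = fork_map(pages) { |page| page.extract_tables!.map(&:to_a) }
```

Calling to_a inside the child keeps the data sent over the pipe simple, since Table objects themselves may not be serializable.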

Examples:

Extract tables using pictures as possible borders:

tables = page.extract_tables! images: true #=> [Array<Iguvium::Table>]

Returns:

(Array<Iguvium::Table>)

# File 'lib/iguvium/page.rb', line 45

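def extract_tables!(images: @opts[:images])
  # Reuse the result of an earlier call on this page, if there is one
  return @tables if @tables

  # Store the images option, then run recognition
  @opts[:images] = images
  recognize!
  @tables
end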

#text ⇒ `String`

It takes ~150 ms to run, so it’s handy for picking out pages before trying to extract tables, which is an expensive operation.