Class: Iguvium::Page

Inherits:
Object
Defined in:
lib/iguvium/page.rb

Overview

A single page of the document, from which tables can be extracted. To do so, use #extract_tables!.

The #text method is handy for checking in advance whether a page is worth processing.

Examples:

pages = Iguvium.read('nixon.pdf', gspath: '/usr/bin/gs')
pages = pages.select { |page| page.text.match?(/[Tt]able.+\d+/) }
tables = pages.map(&:extract_tables!)
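
To see what the filter in the example above lets through, here is a minimal sketch that runs the same regular expression against stand-in pages. FakePage, SAMPLE_PAGES, TABLE_HINT and pages_worth_parsing are hypothetical names made up for illustration and are not part of Iguvium; only the pattern itself is taken from the example.

```ruby
# Hypothetical stand-in for Iguvium::Page: filtering only needs #text.
FakePage = Struct.new(:number, :text)

# The pattern from the example above: "table" or "Table", then at least
# one character, then digits, all on one line ("." does not match "\n").
TABLE_HINT = /[Tt]able.+\d+/

# Made-up sample text, for illustration only.
SAMPLE_PAGES = [
  FakePage.new(1, "Introduction\nNo tabular data on this page."),
  FakePage.new(2, "Table 3. Budget outlays"),
  FakePage.new(3, "See the table of contents on page 2"),
  FakePage.new(4, "Tables\n12 rows follow")
].freeze

# Keeps only the pages whose text matches the pattern.
def pages_worth_parsing(pages, pattern = TABLE_HINT)
  pages.select { |page| page.text.match?(pattern) }
end

puts pages_worth_parsing(SAMPLE_PAGES).map(&:number).inspect
# prints [2, 3]
```

Page 3 is a false positive and page 4 is skipped because the digits sit on the next line, so a pattern like this is best treated as a cheap first pass rather than a guarantee.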

Constructor Details

#initialize(page, path, **opts) ⇒ Page

Typically you don’t need to call this directly; prefer obtaining Iguvium::Page instances from Iguvium.read.

Parameters:

  • page (PDF::Reader::Page)
  • path (String)

    path to PDF file to be read

  • opts (Hash)

    a customizable set of options



# File 'lib/iguvium/page.rb', line 17

def initialize(page, path, **opts)
  @opts = opts
  @reader_page = page
  @path = path
end

Instance Method Details

#extract_tables!(images: @opts[:images]) ⇒ Array<Iguvium::Table>

TODO:

Further speed improvements are needed; a speedup of at least 30% is expected on multicore systems.

This method does all the heavy lifting, which includes optical recognition of table borders. It returns an array of Table objects, or an empty array if it fails to recognize any. To get structured data from a parsed Table, call Table#to_a.

Because a PDF document is generally a collection of independent pages, #extract_tables! is well suited to parallel processing. Concurrent processing, on the other hand (think fork for parallel vs. threads for concurrent), is not a great idea, because extraction is a CPU-intensive task.

On some older CPUs it takes up to 2 seconds per page (up to 1 second on more modern ones), so use it wisely.

Examples:

Extract tables, treating images as possible borders:

tables = page.extract_tables! images: true #=> [Array<Iguvium::Table>]
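
The parallel-processing advice above can be sketched with the standard library alone: fork one child process per page, run the expensive block there, and send the result back through a pipe. The helpers spawn_job, collect_job and parallel_map are hypothetical names, not part of Iguvium, and the sketch assumes the block returns plain data that Marshal can serialize. It also relies on fork, which is not available on Windows.

```ruby
require 'etc'

# Forks a child for one item, runs the block there, and returns the child's
# pid plus the read end of a pipe that will carry the marshaled result.
def spawn_job(item)
  reader, writer = IO.pipe
  reader.binmode
  writer.binmode
  pid = fork do
    reader.close
    writer.write(Marshal.dump(yield(item)))
    writer.close
    exit!(0) # leave without running at_exit hooks inherited from the parent
  end
  writer.close
  [pid, reader]
end

# Reads the child's result to end of file, then reaps the child process.
def collect_job(pid, reader)
  payload = reader.read
  reader.close
  Process.wait(pid)
  Marshal.load(payload)
end

# Maps the block over items, running at most `workers` children at a time.
# Results come back in input order.
def parallel_map(items, workers: Etc.nprocessors, &block)
  items.each_slice(workers).flat_map do |batch|
    jobs = batch.map { |item| spawn_job(item, &block) }
    jobs.map { |pid, reader| collect_job(pid, reader) }
  end
end

# Stand-in workload so the sketch runs without a PDF.
squares = parallel_map([1, 2, 3, 4], workers: 2) { |n| n * n }
puts squares.inspect

# With real pages this might look like the following, assuming that
# Table#to_a returns plain nested arrays:
#   data = parallel_map(pages) { |page| page.extract_tables!.map(&:to_a) }
```

The sketch does no error handling: if the block raises in a child, the parent receives an empty payload and Marshal.load fails. A production version would want to report such failures explicitly.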

Returns:

  • (Array<Iguvium::Table>)

    tables recognized on the page, or an empty array if none were found



# File 'lib/iguvium/page.rb', line 45

def extract_tables!(images: @opts[:images])
  return @tables if @tables

  @opts[:images] = images
  recognize!
  @tables
end
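
One consequence of the early return in the listing above is that the first result is cached, so a later call with a different images: value returns the earlier tables without recognizing anything again. The sketch below reproduces only that caching logic. CachedPage is a hypothetical stand-in, and its recognize! is a dummy that presumably mirrors the real one only in that it assigns @tables.

```ruby
# Hypothetical stand-in that copies the caching logic of extract_tables!.
class CachedPage
  attr_reader :recognitions

  def initialize(**opts)
    @opts = opts
    @recognitions = 0
  end

  # Same shape as the method in the listing above.
  def extract_tables!(images: @opts[:images])
    return @tables if @tables

    @opts[:images] = images
    recognize!
    @tables
  end

  private

  # Dummy recognition: counts calls and records the option it was run with.
  def recognize!
    @recognitions += 1
    @tables = ["recognized with images: #{@opts[:images].inspect}"]
  end
end

page = CachedPage.new
puts page.extract_tables!.inspect               # recognition runs here
puts page.extract_tables!(images: true).inspect # cached result, option ignored
puts page.recognitions                          # prints 1
```

If both variants are needed, one option is to call #extract_tables! on a fresh set of pages, for example from a second Iguvium.read call.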

#text ⇒ String

This call takes about 150 ms, so it is handy for selecting pages before attempting table extraction, which is an expensive operation.

Returns:

  • (String)

    rendered page text, result of underlying PDF::Reader::Page#text call



# File 'lib/iguvium/page.rb', line 56

def text
  @text ||= @reader_page.text
end
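
As a rough illustration of why pre-filtering with #text pays off, the sketch below plugs the timings quoted on this page (about 150 ms per #text call, up to 1 or 2 seconds per #extract_tables! call) into a simple estimate. The constant and method names are hypothetical, the page counts are made up, and the quoted timings are upper bounds rather than measurements.

```ruby
TEXT_MS = 150            # approximate cost of one #text call, from this page
EXTRACT_MS_MODERN = 1_000 # "up to 1 second" on more modern CPUs
EXTRACT_MS_OLDER = 2_000  # "up to 2 seconds" on some older CPUs

# Estimated total time in milliseconds with and without a #text pre-filter.
def estimate_ms(total_pages:, pages_with_tables:, extract_ms:)
  {
    unfiltered: total_pages * extract_ms,
    filtered: total_pages * TEXT_MS + pages_with_tables * extract_ms
  }
end

# A made-up 100-page document in which 10 pages pass the filter.
puts estimate_ms(total_pages: 100, pages_with_tables: 10,
                 extract_ms: EXTRACT_MS_MODERN).inspect
puts estimate_ms(total_pages: 100, pages_with_tables: 10,
                 extract_ms: EXTRACT_MS_OLDER).inspect
```

Because #text memoizes its result, reading the text again after filtering costs nothing extra.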