PDF Table Data Extractor

A simple tool for extracting table data from PDF files.

Presentation

This library detects table-like structures in PDF files and extracts:

  • Table Headers
  • Table Rows
  • Sub-Table Names (Partial tables)

A set of filters is also included to keep the output clean: free of false positives and of unusable or garbage data.

Installation

Gemfile

gem 'pdftdx'

Terminal

gem install -V pdftdx

Usage example

Reading a PDF file:

require 'pdftdx'
tables = PDFTDX::extract_data 'path to your PDF file'
puts tables.inspect

Output:

=> [{
     head: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'],
     data: [{
       name: 'System',
       data: [
         ['Machine OS', 'Win32', 'Linux', 'MacOS'],
         ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']
       ]
     }]
   }]
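
Working with the result:

The following is a minimal sketch of how the returned structure might be regrouped by column header. It assumes the result always has the shape shown in the output above: each table has a :head array and a :data array of sub-tables, and each row starts with a label followed by one value per header. The helper name columns_by_header is hypothetical and not part of pdftdx. The sample data is copied from the output above so the snippet runs without a PDF file.

```ruby
# Sample data copied from the README output; in real use this would be the
# return value of PDFTDX::extract_data.
tables = [{
  head: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'],
  data: [{
    name: 'System',
    data: [
      ['Machine OS', 'Win32', 'Linux', 'MacOS'],
      ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']
    ]
  }]
}]

# Hypothetical helper (not provided by pdftdx): regroups one table so that
# values can be looked up as result[header][sub_table_name][row_label].
# Assumes the first cell of each row is its label and the remaining cells
# line up with the entries of :head.
def columns_by_header(table)
  result = table[:head].to_h { |header| [header, {}] }
  table[:data].each do |sub_table|
    sub_table[:data].each do |label, *values|
      table[:head].zip(values).each do |header, value|
        (result[header][sub_table[:name]] ||= {})[label] = value
      end
    end
  end
  result
end

by_host = columns_by_header(tables.first)
puts by_host['durjaya.dooba.io']['System']['Machine OS']
```

If a row has fewer values than there are headers, zip fills the missing cells with nil, so callers may want to check for nil before using a value.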

License

The gem is available as open source under the terms of the MIT License.