Module: PDFTDX::Parser

Defined in: lib/pdftdx/parser.rb

Overview

Parser Module

Constant Summary

LINE_REGEX = Line Regex

/^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/
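As a rough illustration of what LINE_REGEX captures, the sketch below applies it to a single line shaped like pdftohtml output. The sample line and the `parse_line` helper are assumptions made for this example; they are not part of the module.

```ruby
# Pattern copied from the constant summary above.
LINE_REGEX = /^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/

# Hypothetical input line, shaped like an absolutely positioned pdftohtml paragraph.
SAMPLE_LINE = '<p style="position:absolute;top:10px;left:100px;white-space:nowrap" class="ft00">Machine OS</p>'

# Hypothetical helper: returns the three captures as a hash, or nil when the line does not match.
def parse_line(line)
  m = LINE_REGEX.match(line)
  return nil unless m

  { top: m[1].to_i, left: m[2].to_i, data: m[3] }
end

chunk = parse_line(SAMPLE_LINE)
puts "top=#{chunk[:top]} left=#{chunk[:left]} data=#{chunk[:data]}"
```

Lines that are not positioned paragraphs, such as a bare heading tag, do not match and yield nil in this sketch.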

MAX_CELL_LEN = Maximum Cell Length (to be considered usable data)

PAGE_OFF = Page Offset

PAGE_MAX_TOP = Maximum Allowed Offset from Page Top

TITLE_CELL_REGEX = Title Cell Regex

/<b>/

Class Method Summary

.build_table(data) ⇒ Hash

Build Data Table Produces an organized Table (in the form of a 2-level nested hash) from an array of HTML chunks.
.collect_data(data) ⇒ Array

Collect Data Extracts table-like chunks of HTML data from a hash of HTML pages.
.contains_unusable?(row_data) ⇒ Boolean

Contains Unusable Data (Empty / Long Strings) Determines whether a row contains unusable data.
.filter_rows(data) ⇒ Array

Filter Table Rows Filters out rows considered unusable, empty, oversize, footers, etc…
.hfilter(s) ⇒ String

HTML Filter Replaces HTML newlines with UNIX-style newlines.
.htable_length(table, headers, h, i) ⇒ Fixnum

Determine Headered Table Length Computes the number of rows to be included in a given headered table.
.is_all_same?(row_data) ⇒ Boolean

Is All Same Data Determines whether a row's cells all contain the same data.
.process(page_data) ⇒ Array

Process Transforms a hash of page data (as produced by pdftohtml) into a usable information table tree structure.
.sort_row(r) ⇒ Hash

Sort Row Sorts Cells according to their x-offset.
.sub_tab_len(table, stables, t, i) ⇒ Fixnum

Sub Table Length Computes the number of rows to be included in a given sub-table.
.sub_tablize(htable_data) ⇒ Array

Sub-Tablize Splits a table into multiple named tables.
.touch_up(table) ⇒ Array

Touch up Table Splits Table into multiple headered tables.

Class Method Details

.build_table(data) ⇒ `Hash`

Build Data Table Produces an organized Table (in the form of a 2-level nested hash) from an array of HTML chunks.

Parameters:

data (Array) —

An array of document chunks, each represented as a hash containing the position and body of the chunk. Example: [{ top: 10, left: 100, data: 'Machine OS' }, { top: 10, left: 220, data: 'Win32' }, { top: 10, left: 340, data: 'Linux' }, { top: 10, left: 460, data: 'MacOS' }]

Returns:

(Hash) —

A hash of table rows, mapped by their offset from the top, where each row is represented as a hash of table cells, mapped by their offset from the left. Example: { 10 => { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, 35 => { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' } }

# File 'lib/pdftdx/parser.rb', line 80

def self.build_table data
  table = {}
  data.each { |d| table[d[:top]] ||= {}; table[d[:top]][d[:left]] = d[:data] }
  table
end
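A minimal usage sketch, assuming a standalone copy of the method so that it runs without the gem installed. The `CHUNKS` sample data is made up for illustration.

```ruby
# Standalone copy of the logic in the listing above.
def build_table(data)
  table = {}
  data.each { |d| (table[d[:top]] ||= {})[d[:left]] = d[:data] }
  table
end

# Hypothetical chunks: two rows (top 10 and top 35), two columns each.
CHUNKS = [
  { top: 10, left: 100, data: 'Machine OS' },
  { top: 10, left: 220, data: 'Win32' },
  { top: 35, left: 100, data: 'IP Address' },
  { top: 35, left: 220, data: '10.0.232.48' }
].freeze

TABLE = build_table(CHUNKS)
TABLE.each { |top, row| puts "#{top}: #{row.values.join(' | ')}" }
```

Because cells are keyed by their exact offsets, two chunks sharing the same top and left values collapse into one cell, with the later chunk winning.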

.collect_data(data) ⇒ `Array`

Collect Data Extracts table-like chunks of HTML data from a hash of HTML pages.

Parameters:

data (Hash) —

A hash of document pages, mapped by their page index. Each page is an array of chomped lines of HTML data. Example: { 1 => ['<h1>Hello World!</h1>', 'This is page one.'], 2 => ['Wow, another page of data !', 'Important stuff', "That's it for page 2 !"] }

Returns:

(Array) —

An array of HTML chunks, each represented as a hash containing the chunk position and data. Example: [{ top: 10, left: 100, data: 'Machine OS' }, { top: 10, left: 220, data: 'Win32' }, { top: 10, left: 340, data: 'Linux' }, { top: 10, left: 460, data: 'MacOS' }]

# File 'lib/pdftdx/parser.rb', line 60

def self.collect_data data

  # Build HTML Entity Decoder
  coder = HTMLEntities.new

  # Collect File Data
  off = 0
  data.collect do |_idx, page|
    off = off + PAGE_OFF
    page
      .select { |l| LINE_REGEX =~ l }  # Collect Table-like data
      .collect { |l| LINE_REGEX.match l }  # Extract Table Element Metadata (Position)
      .collect { |d| { top: off + d[1].to_i, left: d[2].to_i, data: hfilter(coder.decode(d[3])) } }  # Produce Hash of Raw Table Data
  end.flatten
end
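The page offset is what keeps rows from different pages apart: each page's top values are shifted by a growing multiple of PAGE_OFF. The sketch below illustrates this under stated assumptions: the value of `PAGE_OFF` is a placeholder (the real value is not shown here), `collect_chunks` and `sample_line` are hypothetical names, and HTML entity decoding is left out so the example needs no external gem.

```ruby
LINE_REGEX = /^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/
PAGE_OFF = 10_000 # placeholder value, chosen only for this example

# Simplified variant of collect_data: no entity decoding, same offset logic.
def collect_chunks(pages)
  off = 0
  pages.flat_map do |_idx, page|
    off += PAGE_OFF # incremented before use, so the first page starts at PAGE_OFF
    page.map { |l| LINE_REGEX.match(l) }.compact.map do |m|
      { top: off + m[1].to_i, left: m[2].to_i, data: m[3].gsub('<br/>', "\n") }
    end
  end
end

# Hypothetical helper that builds a line shaped like pdftohtml output.
def sample_line(top, left, text)
  %(<p style="position:absolute;top:#{top}px;left:#{left}px;white-space:nowrap" class="ft00">#{text}</p>)
end

PAGES = {
  1 => [sample_line(10, 100, 'Machine OS'), '<h1>ignored</h1>'],
  2 => [sample_line(10, 100, 'IP<br/>Address')]
}.freeze

CHUNKS = collect_chunks(PAGES)
CHUNKS.each { |c| puts "top=#{c[:top]} left=#{c[:left]}" }
```

Both sample lines sit at top 10 on their own page, yet end up with distinct top values once the per-page offset is added.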

.contains_unusable?(row_data) ⇒ `Boolean`

Contains Unusable Data (Empty / Long Strings) Determines whether a row contains unusable data.

Parameters:

row_data (Hash) —

A hash of table cells, mapped by their offset from the left. Example: { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }

Returns:

(Boolean) —

True if at least one cell is unusable (empty, oversize), False otherwise



# File 'lib/pdftdx/parser.rb', line 44

def self.contains_unusable? row_data
  row_data.inject(false) { |b, e| b || (e[1].length == 0) || (e[1].length > MAX_CELL_LEN) }
end
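The same check can be sketched with `any?`, which some readers may find easier to scan than the `inject` form. The value of `MAX_CELL_LEN` below is a placeholder chosen for the example; the real value is not shown in this reference.

```ruby
MAX_CELL_LEN = 150 # placeholder value, chosen only for this example

# Equivalent formulation: a row is unusable if any cell is empty or too long.
def contains_unusable?(row_data)
  row_data.values.any? { |cell| cell.empty? || cell.length > MAX_CELL_LEN }
end

puts contains_unusable?({ 100 => 'Machine OS', 220 => 'Win32' })
puts contains_unusable?({ 100 => 'Machine OS', 220 => '' })
```

A cell of exactly MAX_CELL_LEN characters is still usable, since the comparison is strict.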

.filter_rows(data) ⇒ `Array`

Filter Table Rows Filters out rows considered unusable, empty, oversize, footers, etc… Also, strips Top Offset info from Table Rows.

Parameters:

data (Hash) —

A hash of table rows, mapped by their offset from the top, where each row is represented as a hash of table cells, mapped by their offset from the left. Example: { 10 => { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, 35 => { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' } }

Returns:

(Array) —

An array of table rows, each represented as a hash of table cells, mapped by their offset from the left. Example: [{ 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' }]

# File 'lib/pdftdx/parser.rb', line 91

def self.filter_rows data
  data
    .reject { |top, row| row.size < 2 || (top % PAGE_OFF) >= PAGE_MAX_TOP || is_all_same?(row) || contains_unusable?(row) }  # Drop Single-Element Rows, Footer Data, Useless Rows (all cells identical) & Unusable Rows (Empty / Oversize Cells)
    .collect { |_top, r| r }.reject { |r| r.size < 2 }  # Remove 'top offset' information and re-drop single-element rows
end
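To make the four rejection rules concrete, here is a self-contained sketch with one sample row per rule. All three constant values are placeholders picked for the example, and `droppable?` is a hypothetical helper that folds the checks into one predicate.

```ruby
# Placeholder values, chosen only for this example.
PAGE_OFF = 10_000
PAGE_MAX_TOP = 1_000
MAX_CELL_LEN = 150

# Hypothetical helper combining the four rules applied by filter_rows.
def droppable?(top, row)
  cells = row.values
  row.size < 2 ||                                          # single-cell row
    (top % PAGE_OFF) >= PAGE_MAX_TOP ||                    # too far down the page (footer area)
    cells.uniq.size == 1 ||                                # all cells identical
    cells.any? { |c| c.empty? || c.length > MAX_CELL_LEN } # empty or oversize cell
end

def filter_rows(data)
  data.reject { |top, row| droppable?(top, row) }.values
end

TABLE = {
  10_010 => { 100 => 'Machine OS', 220 => 'Win32' },    # kept
  10_035 => { 100 => 'Page 1' },                        # dropped: single cell
  10_060 => { 100 => 'n/a', 220 => 'n/a' },             # dropped: all cells identical
  10_085 => { 100 => 'IP Address', 220 => '' },         # dropped: empty cell
  11_200 => { 100 => 'Footer', 220 => 'Confidential' }  # dropped: 1200 >= PAGE_MAX_TOP
}.freeze

ROWS = filter_rows(TABLE)
puts ROWS.length
```

The modulo recovers a row's position within its own page, which is why the footer test works on every page and not only on the first one.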

.hfilter(s) ⇒ `String`

HTML Filter Replaces HTML newlines with UNIX-style newlines.

Parameters:

s (String) —

A string of HTML data

Returns:

(String) —

The same string of HTML data, with all newlines (<br/> tags) converted to UNIX newlines.



# File 'lib/pdftdx/parser.rb', line 52

def self.hfilter s
  s.gsub '<br/>', "\n"
end
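A short sketch of the behavior, using a standalone copy of the method. Note that the substitution targets the literal string `<br/>` only, so other spellings of the tag pass through unchanged.

```ruby
# Standalone copy of the one-line method in the listing above.
def hfilter(s)
  s.gsub('<br/>', "\n")
end

puts hfilter('IP<br/>Address') # prints the two words on separate lines
puts hfilter('IP<br>Address')  # no self-closing slash, so the tag is left in place
```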

.htable_length(table, headers, h, i) ⇒ `Fixnum`

Determine Headered Table Length Computes the number of rows to be included in a given headered table.

Parameters:

table (Array) —

An array of table rows, each represented as a hash of table cells, mapped by their offset from the left. Example: [{ 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' }]
headers (Array) —

An array of header rows, each represented as a hash containing the header row's index within the table array, and the actual row data. Example: [{ idx: 0, row: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'] }]
h (Hash) —

The current header row (determine htable length from this)
i (Fixnum) —

The current header’s index within the headers array

Returns:

(Fixnum) —

The number of rows



# File 'lib/pdftdx/parser.rb', line 104

def self.htable_length table, headers, h, i
  (headers[i + 1] ? headers[i + 1][:idx] : table.length) - h[:idx]
end
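A worked example of the arithmetic, assuming a five-row table with header rows at indices 0 and 3. The sample `TABLE` and `HEADERS` values are invented for illustration.

```ruby
# Standalone copy of the method in the listing above.
def htable_length(table, headers, h, i)
  (headers[i + 1] ? headers[i + 1][:idx] : table.length) - h[:idx]
end

# Hypothetical data: five rows, with headers found at rows 0 and 3.
TABLE = Array.new(5) { |n| { 100 => "row #{n}" } }
HEADERS = [{ idx: 0, row: [] }, { idx: 3, row: [] }].freeze

# First header spans up to the next header (3 - 0); the last one spans to the table end (5 - 3).
LENGTHS = HEADERS.each_with_index.map { |h, i| htable_length(TABLE, HEADERS, h, i) }
puts LENGTHS.join(', ')
```

The count includes the header row itself, which is why touch_up starts its slice one row later and subtracts 1 from this length.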

.is_all_same?(row_data) ⇒ `Boolean`

Is All Same Data Determines whether a row's cells all contain the same data.

Parameters:

row_data (Hash) —

A hash of table cells, mapped by their offset from the left. Example: { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }

Returns:

(Boolean) —

True if all cells contain the same data, False otherwise.

# File 'lib/pdftdx/parser.rb', line 35

def self.is_all_same? row_data
  n = row_data[row_data.keys[0]]
  row_data.inject(true) { |b, e| b && (e[1] == n) }
end
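An equivalent formulation using `all?`, offered as a sketch rather than a replacement. It compares every cell against the first one, as the listing does.

```ruby
# Equivalent formulation of the check in the listing above.
def is_all_same?(row_data)
  first = row_data.values.first
  row_data.values.all? { |cell| cell == first }
end

puts is_all_same?({ 100 => 'n/a', 220 => 'n/a', 340 => 'n/a' })
puts is_all_same?({ 100 => 'Machine OS', 220 => 'Win32' })
```

As with the `inject(true)` original, an empty row counts as "all same"; in the pipeline such rows are already dropped by the size check in filter_rows.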

.process(page_data) ⇒ `Array`

Process Transforms a hash of page data (as produced by pdftohtml) into a usable information table tree structure.

Parameters:

page_data (Hash) —

A hash of document pages, mapped by their page index. Each page is an array of chomped lines of HTML data. Example: { 1 => ['<h1>Hello World!</h1>', 'This is page one.'], 2 => ['Wow, another page of data !', 'Important stuff', "That's it for page 2 !"] }

Returns:

(Array) —

An array of tables, as returned by touch_up. Each table is represented as a hash containing a header and table data; the table data holds named sub-tables (each with a name and an array of rows) and, where present, an array of unnamed rows. Table rows are represented as an array of table cells. Example: [{ head: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'], data: [{ name: 'System', data: [['Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] }] }]

# File 'lib/pdftdx/parser.rb', line 204

def self.process page_data

  # Collect Data
  data = collect_data page_data

  # Build Data Table
  table = build_table data

  # Filter Rows
  table = filter_rows table

  # Filter Table Cells & Touch up
  touch_up table
end

.sort_row(r) ⇒ `Hash`

Sort Row Sorts Cells according to their x-offset.

Parameters:

r (Hash) —

A row of data in the form { xoffset => cell } (Example: { 120 => 'cell 0', 200 => 'cell 1', 280 => 'cell 2' })

Returns:

(Hash) —

The same row of data, but sorted according to x-offset



# File 'lib/pdftdx/parser.rb', line 151

def self.sort_row r
  Hash[*(r.to_a.sort { |a, b| ((a[0] == b[0]) ? 0 : (a[0] > b[0] ? 1 : -1)) }.flatten)]
end
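A more compact way to get the same ordering, sketched with `sort_by` and `to_h`. This is an alternative formulation, not the module's code; Ruby hashes preserve insertion order, so rebuilding the hash from sorted pairs is enough.

```ruby
# Alternative formulation: sort the [offset, cell] pairs by offset and rebuild the hash.
def sort_row(r)
  r.sort_by { |offset, _cell| offset }.to_h
end

ROW = sort_row({ 280 => 'cell 2', 120 => 'cell 0', 200 => 'cell 1' })
puts ROW.keys.join(', ')
```

One difference worth noting: `to_h` leaves cell values untouched, whereas the `Hash[*pairs.flatten]` form in the listing would also flatten any cell that happened to be an array. With string cells, as produced by the pipeline, the two behave the same.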

.sub_tab_len(table, stables, t, i) ⇒ `Fixnum`

Sub Table Length Computes the number of rows to be included in a given sub-table.

Parameters:

table (Array) —

An array of table rows, each represented as an array of table cells. Example: [['System', 'Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']]
stables (Array) —

An array of named tables, each represented as a hash containing the name and its starting index within the table array. Example: [{ title: 'System Info', idx: 0 }]
t (Hash) —

The current sub-table title row (determine sub-table length from this)
i (Fixnum) —

The current sub-table title's index within the stables array

Returns:

(Fixnum) —

The number of rows



# File 'lib/pdftdx/parser.rb', line 115

def self.sub_tab_len table, stables, t, i
  (stables[i + 1] ? stables[i + 1][:idx] : table.length) - t[:idx]
end
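The computed length is meant to be fed to `slice` together with the title's index. A small sketch with invented data, showing which rows each sub-table would cover:

```ruby
# Standalone copy of the method in the listing above.
def sub_tab_len(table, stables, t, i)
  (stables[i + 1] ? stables[i + 1][:idx] : table.length) - t[:idx]
end

# Hypothetical table with title rows at indices 0 and 2.
TABLE = [
  ['<b>System</b>', 'Machine OS', 'Win32'],
  ['IP Address', '10.0.232.48'],
  ['<b>Costs</b>', '32.40 $']
].freeze
TITLES = [{ title: '<b>System</b>', idx: 0 }, { title: '<b>Costs</b>', idx: 2 }].freeze

# Each slice starts at the title row and stops just before the next title (or at the table end).
SLICES = TITLES.each_with_index.map { |t, i| TABLE.slice(t[:idx], sub_tab_len(TABLE, TITLES, t, i)) }
SLICES.each { |s| puts s.length }
```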

.sub_tablize(htable_data) ⇒ `Array`

Sub-Tablize Splits a table into multiple named tables.

Parameters:

htable_data (Array) —

An array of table rows, each represented as an array of table cells. Example: [['System', 'Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']]

Returns:

(Array) —

An array of named tables, each represented as a hash containing the name and the table itself. May also contain a single array, containing all remaining table data (unnamed). Example: [{ name: 'System', data: [['Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] }, [['32.40 $', '34.00 $', '88.40 $'], ['21.40 km', '12.00 km', '99.10 km']]]

# File 'lib/pdftdx/parser.rb', line 123

def self.sub_tablize htable_data

  # Collect Sub-table Title Rows
  subtab_titles = htable_data.collect.with_index { |r, i| { idx: i, row: r } }.select { |e| TITLE_CELL_REGEX =~ e[:row][0] }.collect { |e| { title: e[:row][0], idx: e[:idx] } }

  # Pull up Sub-tables
  stables = subtab_titles.collect.with_index do |t, i|
    {
      name: t[:title].gsub(/<\/?b>/, ''),                                                             # Extract Sub-Table Name
      data: htable_data                                                                               # Extract Sub-Table Data
        .slice(t[:idx], sub_tab_len(htable_data, subtab_titles, t, i))                              # Slice Table Data until next Sub-Table
        .collect { |e| e.reject.with_index { |c, ii| ii == 0 && TITLE_CELL_REGEX =~ c } }           # Reject Table Headers
    }
  end

  # Data until first sub-table index is considered 'unsorted'
  unsorted_end = subtab_titles.empty? ? htable_data.length : subtab_titles[0][:idx]

  # Insert last part (Unsorted)
  stables << htable_data.slice(0, unsorted_end) if unsorted_end > 0

  stables
end
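The following sketch condenses the listing into a standalone form and runs it on a small, invented table. Per the code, a row starts a sub-table only when its first cell matches TITLE_CELL_REGEX, so the sample title cell carries its `<b>` tags. The `ROWS` data is an assumption made for this example.

```ruby
TITLE_CELL_REGEX = /<b>/

def sub_tab_len(table, stables, t, i)
  (stables[i + 1] ? stables[i + 1][:idx] : table.length) - t[:idx]
end

# Condensed copy of the logic in the listing above.
def sub_tablize(rows)
  # Title rows are those whose first cell contains a <b> tag.
  titles = rows.each_with_index
               .select { |row, _i| TITLE_CELL_REGEX =~ row[0] }
               .map { |row, i| { title: row[0], idx: i } }

  stables = titles.each_with_index.map do |t, i|
    {
      name: t[:title].gsub(/<\/?b>/, ''),
      data: rows.slice(t[:idx], sub_tab_len(rows, titles, t, i))
                .map { |r| r.reject.with_index { |c, ci| ci.zero? && TITLE_CELL_REGEX =~ c } }
    }
  end

  # Rows before the first title are appended last, as one unnamed array.
  unsorted_end = titles.empty? ? rows.length : titles[0][:idx]
  stables << rows.slice(0, unsorted_end) if unsorted_end > 0
  stables
end

ROWS = [
  ['32.40 $', '34.00 $'],
  ['<b>System</b>', 'Machine OS', 'Win32'],
  ['IP Address', '10.0.232.48']
].freeze

RESULT = sub_tablize(ROWS)
puts RESULT.first[:name]
```

The title cell is stripped from the first row of its sub-table, and the unnamed leading rows land at the end of the result even though they came first in the input.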

.touch_up(table) ⇒ `Array`

Touch up Table Splits Table into multiple headered tables. Also, strips Left Offset info from Table Cells.

Parameters:

table (Array) —

An array of table rows, each represented as a hash of table cells, mapped by their offset from the left. Example: [{ 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' }]

Returns:

(Array) —

An array of tables, each represented as either a single array of rows, or a hash containing a header and table data, in the form of either one single array of rows, or a hash of sub-tables (arrays of rows) mapped by name. Table rows are represented as an array of table cells. Example: [{ head: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'], data: [{ name: 'System', data: [['Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] }] }]

# File 'lib/pdftdx/parser.rb', line 160

def self.touch_up table

  # Split Table into multiple Headered Tables
  headers = table
    .collect.with_index { |r, i| { idx: i, row: r } }
    .select { |e| e[:row].inject(true) { |b, c| b && (TITLE_CELL_REGEX =~ c[1]) } }
    .collect { |r| { idx: r[:idx], row: r[:row].collect { |o, v| { o => v.gsub(/<\/?b>/, '') } } } }

  # Pull up Headered Tables
  htables = headers.collect.with_index { |h, i| { head: h[:row], data: table.slice(h[:idx] + 1, htable_length(table, headers, h, i) - 1) } }

  # Fix Rows
  nh = htables.collect do |t|

    # Acquire Column Offsets
    cols = t[:head].collect { |o| o.first[0] }.sort

    # Compute Row Base (Default Columns)
    row_base = Hash[*(cols.collect { |c| [c, ''] }.flatten)]

    # Tables
    { head: t[:head], data: t[:data].collect { |r| sort_row row_base.merge(Hash[*(r.collect { |o, c| [(cols.reverse.find { |co| co <= o }) || o, c] }.flatten)]) } }
  end

  # Drop Offsets
  htables = nh.collect { |t| { head: t[:head].collect { |h| h.first[1] }, data: t[:data].collect { |r| r.collect { |_o, c| c } } } }
  ntable = table.collect { |r| r.collect { |_o, c| c } }

  # Split Headered Tables into multiple Named Sub-Tables
  htables.collect! { |ht| { head: ht[:head], data: sub_tablize(ht[:data]) } }

  # Data until first Header index is considered 'unsorted'
  unsorted_end = headers.empty? ? ntable.length : headers[0][:idx]

  # Insert last part (Unsorted)
  htables << sub_tablize(ntable.slice(0, unsorted_end)) if unsorted_end > 0

  htables
end
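The "Fix Rows" step is the least obvious part of the listing: each cell is assigned to the nearest header column at or to the left of its own offset, and columns with no cell are filled with empty strings. The sketch below isolates that step; `snap_to_columns` is a hypothetical helper name and the offsets are invented.

```ruby
# Hypothetical helper isolating the column-alignment step of touch_up.
# cols must be sorted in ascending order, as in the listing.
def snap_to_columns(cols, row)
  base = cols.map { |c| [c, ''] }.to_h # every header column defaults to an empty cell
  snapped = row.map do |off, cell|
    # Rightmost column starting at or before the cell; keep the raw offset if there is none.
    [cols.reverse.find { |co| co <= off } || off, cell]
  end.to_h
  base.merge(snapped).sort.to_h
end

COLS = [100, 220, 340].freeze
ALIGNED = snap_to_columns(COLS, { 103 => 'Machine OS', 221 => 'Win32', 90 => 'stray' })
ALIGNED.each { |off, cell| puts "#{off}: #{cell.inspect}" }
```

In this sketch, as in the listing, two cells that snap to the same column overwrite one another, and a cell lying left of every header column keeps its own offset and therefore sorts ahead of the header-aligned cells.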
