Module: PDFTDX::Parser

Defined in:
lib/pdftdx/parser.rb

Overview

Parser Module

Constant Summary

LINE_REGEX =

  Line Regex

  /^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/

MAX_CELL_LEN =

  Maximum Cell Length (to be considered usable data)

  100

PAGE_OFF =

  Page Offset

  10000

PAGE_MAX_TOP =

  Maximum Allowed Offset from Page Top

  1100

TITLE_CELL_REGEX =

  Title Cell Regex

  /<b>/

Class Method Summary

Class Method Details

.build_table(data) ⇒ Hash

Build Data Table. Produces an organized Table (in the form of a 2-level nested hash) from an array of HTML chunks.

Parameters:

  • data (Array)

    An array of document chunks, each represented as a hash containing the position and body of the chunk. Example: [{ top: 10, left: 100, data: 'Machine OS' }, { top: 10, left: 220, data: 'Win32' }, { top: 10, left: 340, data: 'Linux' }, { top: 10, left: 460, data: 'MacOS' }]

Returns:

  • (Hash)

    A hash of table rows, mapped by their offset from the top, where each row is represented as a hash of table cells, mapped by their offset from the left. Example: { 10 => { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, 35 => { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' } }



# File 'lib/pdftdx/parser.rb', line 80

def self.build_table data
	table = {}
	data.each { |d| table[d[:top]] ||= {}; table[d[:top]][d[:left]] = d[:data] }
	table
end
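
A minimal usage sketch (assuming the pdftdx gem is loaded); the chunks reuse the example data above:

chunks = [
  { top: 10, left: 100, data: 'Machine OS' },
  { top: 10, left: 220, data: 'Win32' },
  { top: 35, left: 100, data: 'IP Address' },
  { top: 35, left: 220, data: '10.0.232.48' }
]
PDFTDX::Parser.build_table chunks
# => { 10 => { 100 => 'Machine OS', 220 => 'Win32' },
#      35 => { 100 => 'IP Address', 220 => '10.0.232.48' } }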

.collect_data(data) ⇒ Array

Collect Data Extracts table-like chunks of HTML data from a hash of HTML pages.

Parameters:

  • data (Hash)

    A hash of document pages, mapped by their page index. Each page is an array of chomp'd lines of HTML data. Example: { 1 => ['<h1>Hello World!</h1>', 'This is page one.'], 2 => ['Wow, another page of data!', 'Important stuff', "That's it for page 2!"] }

Returns:

  • (Array)

    An array of HTML chunks, each represented as a hash containing the chunk position and data. Example: [{ top: 10, left: 100, data: 'Machine OS' }, { top: 10, left: 220, data: 'Win32' }, { top: 10, left: 340, data: 'Linux' }, { top: 10, left: 460, data: 'MacOS' }]



# File 'lib/pdftdx/parser.rb', line 60

def self.collect_data data

	# Build HTML Entity Decoder
	coder = HTMLEntities.new

	# Collect File Data
	off = 0
	data.collect do |_idx, page|
		off = off + PAGE_OFF
		page
			.select { |l| LINE_REGEX =~ l }                                                                                             # Collect Table-like data
			.collect { |l| LINE_REGEX.match l }                                                                                         # Extract Table Element Metadata (Position)
			.collect { |d| { top: off + d[1].to_i, left: d[2].to_i, data: hfilter(coder.decode(d[3])) } }                               # Produce Hash of Raw Table Data
	end.flatten
end
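
A minimal usage sketch. The <p> style attributes below only approximate pdftohtml output (an assumption); note that top offsets are shifted by PAGE_OFF per page, so page 1 rows land at 10000 + top:

pages = {
  1 => ['<p style="top:10px;left:100px;height:14px">Machine OS</p>',
        '<p style="top:10px;left:220px;height:14px">Win32</p>']
}
PDFTDX::Parser.collect_data pages
# => [{ top: 10010, left: 100, data: 'Machine OS' },
#     { top: 10010, left: 220, data: 'Win32' }]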

.contains_unusable?(row_data) ⇒ Boolean

Contains Unusable Data (Empty / Long Strings) Determines whether a row contains unusable data.

Parameters:

  • row_data (Hash)

    A hash of table cells, mapped by their offset from the left. Example: { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }

Returns:

  • (Boolean)

    True if at least one cell is unusable (empty or oversize), False otherwise



# File 'lib/pdftdx/parser.rb', line 44

def self.contains_unusable? row_data
	row_data.inject(false) { |b, e| b || (e[1].length == 0) || (e[1].length > MAX_CELL_LEN) }
end
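
For example (a sketch, assuming the gem is loaded):

PDFTDX::Parser.contains_unusable?({ 100 => 'Machine OS', 220 => 'Win32' })  # => false
PDFTDX::Parser.contains_unusable?({ 100 => '', 220 => 'Win32' })            # => true (empty cell)
PDFTDX::Parser.contains_unusable?({ 100 => 'x' * 150 })                     # => true (exceeds MAX_CELL_LEN)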

.filter_rows(data) ⇒ Array

Filter Table Rows. Filters out rows considered unusable: empty, oversize, footers, etc. Also strips Top Offset info from Table Rows.

Parameters:

  • data (Hash)

    A hash of table rows, mapped by their offset from the top, where each row is represented as a hash of table cells, mapped by their offset from the left. Example: { 10 => { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, 35 => { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' } }

Returns:

  • (Array)

    An array of table rows, each represented as a hash of table cells, mapped by their offset from the left. Example: [{ 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' }]



# File 'lib/pdftdx/parser.rb', line 91

def self.filter_rows data
	data
		.reject { |top, row| row.size < 2 || (top % PAGE_OFF) >= PAGE_MAX_TOP || is_all_same?(row) || contains_unusable?(row) }         # Drop Single-Element Rows, Footer Data, Useless Rows (all cells identical) & Unusable Rows (Empty / Oversize Cells)
		.collect { |_top, r| r }.reject { |r| r.size < 2 }                                                                              # Remove 'top offset' information and re-drop single-element rows
end
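
A usage sketch with the table from the example above, plus a hypothetical footer-like row that gets dropped:

table = {
  10   => { 100 => 'Machine OS', 220 => 'Win32' },
  35   => { 100 => 'IP Address', 220 => '10.0.232.48' },
  1200 => { 100 => 'Page 1 of 3' }                        # single-cell row past PAGE_MAX_TOP, dropped
}
PDFTDX::Parser.filter_rows table
# => [{ 100 => 'Machine OS', 220 => 'Win32' },
#     { 100 => 'IP Address', 220 => '10.0.232.48' }]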

.fix_dupes(r) ⇒ Array

Fix Dupes. Shifts duplicate cells (cells which share their x-offset with others) to the right so they don't get overwritten.

Parameters:

  • r (Array)

    A row of data in the form [[xoffset, cell]] (Example: [[120, 'cell 0'], [200, 'cell 1'], [280, 'cell 2']])

Returns:

  • (Array)

    The same row of data, but with duplicate cells shifted so that no x-offset collisions occur



# File 'lib/pdftdx/parser.rb', line 159

def self.fix_dupes r

	# Deep-Duplicate Row
	nr = r.collect { |e| e.clone }

	# Run through Cells
	nr.length.times do |i|

		# Acquire Duplicate Length
		dupes = nr.slice(i + 1, nr.length).inject(0) { |a, c| a + (c[0] == nr[i][0] ? 1 : 0) }

		# Fix Dupes
		dupes.times { |j| nr[i + j + 1][0] = nr[i + j + 1][0] + 1 }
	end

	nr
end
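
For example, with two cells colliding at x-offset 200 (a sketch):

PDFTDX::Parser.fix_dupes [[120, 'cell 0'], [200, 'cell 1'], [200, 'cell 2']]
# => [[120, 'cell 0'], [200, 'cell 1'], [201, 'cell 2']]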

.hfilter(s) ⇒ String

HTML Filter. Replaces HTML newlines (<br/> tags) with UNIX-style newlines.

Parameters:

  • s (String)

    A string of HTML data

Returns:

  • (String)

    The same string of HTML data, with all newlines (<br/> tags) converted to UNIX newlines.



# File 'lib/pdftdx/parser.rb', line 52

def self.hfilter s
	s.gsub '<br/>', "\n"
end
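
For example:

PDFTDX::Parser.hfilter 'Machine OS<br/>Win32'
# => "Machine OS\nWin32"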

.htable_length(table, headers, h, i) ⇒ Fixnum

Determine Headered Table Length Computes the number of rows to be included in a given headered table.

Parameters:

  • table (Array)

    An array of table rows, each represented as a hash of table cells, mapped by their offset from the left. Example: [{ 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' }]

  • headers (Array)

    An array of header rows, each represented as a hash containing the header row's index within the table array, and the actual row data. Example: [{ idx: 0, row: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'] }]

  • h (Hash)

    The current header row (determine htable length from this)

  • i (Fixnum)

    The current header's index within the headers array

Returns:

  • (Fixnum)

    The number of rows



# File 'lib/pdftdx/parser.rb', line 104

def self.htable_length table, headers, h, i
	(headers[i + 1] ? headers[i + 1][:idx] : table.length) - h[:idx]
end
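
A sketch: the length runs from the current header row to the next header row, or to the end of the table for the last header. The host name, placeholder cells, and 'other.example' are illustrative only:

table   = [%w[a], %w[b], %w[c], %w[d]]
headers = [{ idx: 0, row: ['trauma.eresse.net'] }, { idx: 2, row: ['other.example'] }]
PDFTDX::Parser.htable_length table, headers, headers[0], 0   # => 2 (rows 0..1)
PDFTDX::Parser.htable_length table, headers, headers[1], 1   # => 2 (rows 2..3)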

.is_all_same?(row_data) ⇒ Boolean

Is All Same Data. Determines whether a row's cells all contain the same data.

Parameters:

  • row_data (Hash)

    A hash of table cells, mapped by their offset from the left. Example: { 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }

Returns:

  • (Boolean)

    True if all cells contain the same data, False otherwise.



# File 'lib/pdftdx/parser.rb', line 35

def self.is_all_same? row_data
	n = row_data[row_data.keys[0]]
	row_data.inject(true) { |b, e| b && (e[1] == n) }
end
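
For example (a sketch):

PDFTDX::Parser.is_all_same?({ 100 => 'N/A', 220 => 'N/A', 340 => 'N/A' })    # => true
PDFTDX::Parser.is_all_same?({ 100 => 'Machine OS', 220 => 'Win32' })         # => false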

.process(page_data) ⇒ Array

Process Transforms a hash of page data (as produced by pdftohtml) into a usable information table tree structure.

Parameters:

  • page_data (Hash)

    A hash of document pages, mapped by their page index. Each page is an array of chomp'd lines of HTML data. Example: { 1 => ['<h1>Hello World!</h1>', 'This is page one.'], 2 => ['Wow, another page of data!', 'Important stuff', "That's it for page 2!"] }

Returns:

  • (Array)

    An array of tables, each represented as a hash containing an optional header and table data, in the form of either one single array of rows, or a hash of sub-tables (arrays of rows) mapped by name. Table rows are represented as an array of table cells. Example: [{ head: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'], data: { 'System' => [['Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] } }]



# File 'lib/pdftdx/parser.rb', line 226

def self.process page_data

	# Collect Data
	data = collect_data page_data

	# Build Data Table
	table = build_table data

	# Filter Rows
	table = filter_rows table

	# Filter Table Cells & Touch up
	touch_up table
end
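
End-to-end usage sketch, assuming the gem is loaded (e.g. via require 'pdftdx') and using simplified <p> markup in place of real pdftohtml output:

page_data = {
  1 => ['<p style="top:10px;left:100px">Machine OS</p>',
        '<p style="top:10px;left:220px">Win32</p>',
        '<p style="top:35px;left:100px">IP Address</p>',
        '<p style="top:35px;left:220px">10.0.232.48</p>']
}
PDFTDX::Parser.process page_data
# => [[[['Machine OS', 'Win32'], ['IP Address', '10.0.232.48']]]]
#    (no <b> header rows in the input, so the result is a single
#     headerless table holding one unnamed group of two rows)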

.sort_row(r) ⇒ Hash

Sort Row Sorts Cells according to their x-offset

Parameters:

  • r (Hash)

    A row of data in the form { xoffset => cell } (Example: { 120 => 'cell 0', 200 => 'cell 1', 280 => 'cell 2' })

Returns:

  • (Hash)

    The same row of data, but sorted according to x-offset



# File 'lib/pdftdx/parser.rb', line 151

def self.sort_row r
	Hash[*(r.to_a.sort { |a, b| ((a[0] == b[0]) ? 0 : (a[0] > b[0] ? 1 : -1)) }.flatten)]
end
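
For example:

PDFTDX::Parser.sort_row({ 280 => 'cell 2', 120 => 'cell 0', 200 => 'cell 1' })
# => { 120 => 'cell 0', 200 => 'cell 1', 280 => 'cell 2' }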

.sub_tab_len(table, stables, t, i) ⇒ Fixnum

Sub Table Length Computes the number of rows to be included in a given sub-table.

Parameters:

  • table (Array)

    An array of table rows, each represented as an array of table cells. Example: [['System', 'Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']]

  • stables (Array)

    An array of named tables, each represented as a hash containing the name and its starting index within the table array. Example: [{ title: 'System Info', idx: 0 }]

  • t (Hash)

    The current sub-table title row (determine sub-table length from this)

  • i (Fixnum)

    The current sub-table title's index within the stables array

Returns:

  • (Fixnum)

    The number of rows



# File 'lib/pdftdx/parser.rb', line 115

def self.sub_tab_len table, stables, t, i
	(stables[i + 1] ? stables[i + 1][:idx] : table.length) - t[:idx]
end
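
A sketch, analogous to htable_length: the length runs from the current title row to the next title row, or to the end of the table for the last one. The placeholder cells and the 'Network' title are illustrative only:

table   = [%w[a], %w[b], %w[c], %w[d], %w[e]]
stables = [{ title: '<b>System</b>', idx: 0 }, { title: '<b>Network</b>', idx: 3 }]
PDFTDX::Parser.sub_tab_len table, stables, stables[0], 0   # => 3 (rows 0..2)
PDFTDX::Parser.sub_tab_len table, stables, stables[1], 1   # => 2 (rows 3..4)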

.sub_tablize(htable_data) ⇒ Array

Sub-Tablize Splits a table into multiple named tables.

Parameters:

  • htable_data (Array)

    An array of table rows, each represented as an array of table cells. Example: [['System', 'Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']]

Returns:

  • (Array)

    An array of named tables, each represented as a hash containing the name and the table itself. May also contain a single array, containing all remaining table data (unnamed). Example: [{ name: 'System', data: [['Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] }, [['32.40 $', '34.00 $', '88.40 $'], ['21.40 km', '12.00 km', '99.10 km']]]



# File 'lib/pdftdx/parser.rb', line 123

def self.sub_tablize htable_data

	# Collect Sub-table Title Rows
	subtab_titles = htable_data.collect.with_index { |r, i| { idx: i, row: r } }.select { |e| TITLE_CELL_REGEX =~ e[:row][0] }.collect { |e| { title: e[:row][0], idx: e[:idx] } }

	# Pull up Sub-tables
	stables = subtab_titles.collect.with_index do |t, i|
		{
			name: t[:title].gsub(/<\/?b>/, ''),                                                             # Extract Sub-Table Name
			data: htable_data                                                                               # Extract Sub-Table Data
				.slice(t[:idx], sub_tab_len(htable_data, subtab_titles, t, i))                              # Slice Table Data until next Sub-Table
				.collect { |e| e.reject.with_index { |c, ii| ii == 0 && TITLE_CELL_REGEX =~ c } }           # Reject Table Headers
		}
	end

	# Data until first sub-table index is considered 'unsorted'
	unsorted_end = subtab_titles.empty? ? htable_data.length : subtab_titles[0][:idx]

	# Insert last part (Unsorted)
	stables << htable_data.slice(0, unsorted_end) if unsorted_end > 0

	stables
end
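
A usage sketch; at this stage the title cell still carries its <b> markup (so TITLE_CELL_REGEX can spot it), and the tag is stripped from the resulting name:

rows = [
  ['<b>System</b>', 'Machine OS', 'Win32', 'Linux', 'MacOS'],
  ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']
]
PDFTDX::Parser.sub_tablize rows
# => [{ name: 'System',
#       data: [['Machine OS', 'Win32', 'Linux', 'MacOS'],
#              ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] }]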

.touch_up(table) ⇒ Array

Touch up Table. Splits the Table into multiple headered tables. Also strips Left Offset info from Table Cells.

Parameters:

  • table (Array)

    An array of table rows, each represented as a hash of table cells, mapped by their offset from the left. Example: [{ 100 => 'Machine OS', 220 => 'Win32', 340 => 'Linux', 460 => 'MacOS' }, { 100 => 'IP Address', 220 => '10.0.232.48', 340 => '10.0.232.134', 460 => '10.0.232.108' }]

Returns:

  • (Array)

    An array of tables, each represented as either a single array of rows, or a hash containing a header and table data, in the form of either one single array of rows, or a hash of sub-tables (arrays of rows) mapped by name. Table rows are represented as an array of table cells. Example: [{ head: ['trauma.eresse.net', 'durjaya.dooba.io', 'suessmost.eresse.net'], data: [{ name: 'System', data: [['Machine OS', 'Win32', 'Linux', 'MacOS'], ['IP Address', '10.0.232.48', '10.0.232.134', '10.0.232.108']] }] }]



# File 'lib/pdftdx/parser.rb', line 182

def self.touch_up table

	# Split Table into multiple Headered Tables
	headers = table
		.collect.with_index { |r, i| { idx: i, row: r } }
		.select { |e| e[:row].inject(true) { |b, c| b && (TITLE_CELL_REGEX =~ c[1]) } }
		.collect { |r| { idx: r[:idx], row: r[:row].collect { |o, v| { o => v.gsub(/<\/?b>/, '') } } } }

	# Pull up Headered Tables
	htables = headers.collect.with_index { |h, i| { head: h[:row], data: table.slice(h[:idx] + 1, htable_length(table, headers, h, i) - 1) } }

	# Fix Rows
	nh = htables.collect do |t|

		# Acquire Column Offsets
		cols = t[:head].collect { |o| o.first[0] }.sort

		# Compute Row Base (Default Columns)
		row_base = Hash[*(cols.collect { |c| [c, ''] }.flatten)]

		# Re-Build Table
		{ head: t[:head], data: t[:data].collect { |r| sort_row row_base.merge(Hash[*((fix_dupes r.collect { |o, c| [(cols.reverse.find { |co| co <= o }) || o, c] }).flatten)]) } }
	end

	# Drop Offsets
	htables = nh.collect { |t| { head: t[:head].collect { |h| h.first[1] }, data: t[:data].collect { |r| r.collect { |_o, c| c } } } }
	ntable = table.collect { |r| r.collect { |_o, c| c } }

	# Split Headered Tables into multiple Named Sub-Tables
	htables.collect! { |ht| { head: ht[:head], data: sub_tablize(ht[:data]) } }

	# Data until first Header index is considered 'unsorted'
	unsorted_end = headers.empty? ? ntable.length : headers[0][:idx]

	# Insert last part (Unsorted)
	htables << sub_tablize(ntable.slice(0, unsorted_end)) if unsorted_end > 0

	htables
end
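
A usage sketch; header cells are still wrapped in <b> tags when touch_up runs, and the input rows (as produced by filter_rows) are keyed by left offset, which is dropped from the output:

rows = [
  { 100 => '<b>trauma.eresse.net</b>', 220 => '<b>durjaya.dooba.io</b>' },
  { 100 => 'Machine OS', 220 => 'Win32' }
]
PDFTDX::Parser.touch_up rows
# => [{ head: ['trauma.eresse.net', 'durjaya.dooba.io'],
#       data: [[['Machine OS', 'Win32']]] }]
#    (the data row carries no <b> title cell, so it ends up in a single
#     unnamed group inside the headered table)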