Module: PDFTDX::Parser
- Defined in:
- lib/pdftdx/parser.rb
Overview
Parser Module
Constant Summary
- LINE_REGEX =
Line Regex
/^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/
- MAX_CELL_LEN =
Maximum Cell Length (to be considered usable data)
100
- PAGE_OFF =
Page Offset
10000
- PAGE_MAX_TOP =
Maximum Allowed Offset from Page Top
1100
- TITLE_CELL_REGEX =
Title Cell Regex
/<b>/
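A minimal sketch of what the two regexes capture. The sample line is a hypothetical example of the kind of `<p>` element pdftohtml emits; the exact attributes in real output may differ.

```ruby
# Constants copied from the summary above.
LINE_REGEX = /^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/
TITLE_CELL_REGEX = /<b>/

# Hypothetical pdftohtml-style line (assumed shape, not taken from real output).
sample = '<p style="position:absolute;top:120px;left:45px;white-space:nowrap" class="ft00"><b>Name</b></p>'

m = LINE_REGEX.match(sample)
top   = m[1].to_i   # vertical position in pixels
left  = m[2].to_i   # horizontal position in pixels
text  = m[3]        # raw cell content, still containing markup
title = !(TITLE_CELL_REGEX =~ text).nil?   # bold content marks a title cell
```

Lines that do not match LINE_REGEX (for example, anything that is not a positioned `<p>` element) are ignored by the parser.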
Class Method Summary
- .build_table(data) ⇒ Hash
  Build Data Table. Produces an organized table (in the form of a 2-level nested hash) from an array of HTML chunks.
- .collect_data(data) ⇒ Array
  Collect Data. Extracts table-like chunks of HTML data from a hash of HTML pages.
- .contains_unusable?(row_data) ⇒ Boolean
  Contains Unusable Data (Empty / Long Strings). Determines whether a row contains unusable data.
- .filter_rows(data) ⇒ Array
  Filter Table Rows. Filters out rows considered unusable: empty, oversize, footers, etc.
- .hfilter(s) ⇒ String
  HTML Filter. Replaces HTML line breaks (<br/>) with UNIX-style newlines.
- .htable_length(table, headers, h, i) ⇒ Fixnum
  Determine Headered Table Length. Computes the number of rows to be included in a given headered table.
- .is_all_same?(row_data) ⇒ Boolean
  Is All Same Data. Determines whether a row's cells all contain the same data.
- .process(page_data) ⇒ Array
  Process. Transforms a hash of page data (as produced by pdftohtml) into a usable information table tree structure.
- .sub_tab_len(table, stables, t, i) ⇒ Fixnum
  Sub Table Length. Computes the number of rows to be included in a given sub-table.
- .sub_tablize(htable_data) ⇒ Array
  Sub-Tablize. Splits a table into multiple named tables.
- .touch_up(table) ⇒ Array
  Touch up Table. Splits the table into multiple headered tables.
Class Method Details
.build_table(data) ⇒ Hash
Build Data Table. Produces an organized table (in the form of a 2-level nested hash) from an array of HTML chunks.
# File 'lib/pdftdx/parser.rb', line 80
def self.build_table data
  table = {}
  data.each { |d| table[d[:top]] ||= {}; table[d[:top]][d[:left]] = d[:data] }
  table
end
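To illustrate the 2-level nesting, here is a standalone sketch that mirrors the method above outside the module. The input chunks are made-up values in the `{ top:, left:, data: }` shape that collect_data produces.

```ruby
# Standalone copy of the grouping logic, for illustration only.
def build_table(data)
  table = {}
  data.each { |d| (table[d[:top]] ||= {})[d[:left]] = d[:data] }
  table
end

# Hypothetical chunks: two cells on one row, one cell on the next.
chunks = [
  { top: 10120, left: 45,  data: 'Name' },
  { top: 10120, left: 300, data: 'Qty' },
  { top: 10140, left: 45,  data: 'Widget' }
]

table = build_table(chunks)
# Outer keys are top offsets (rows); inner keys are left offsets (columns).
```

Cells sharing the same `top` value end up in the same row hash, so row membership depends on exact pixel equality of the vertical position.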
.collect_data(data) ⇒ Array
Collect Data. Extracts table-like chunks of HTML data from a hash of HTML pages.
# File 'lib/pdftdx/parser.rb', line 60
def self.collect_data data

  # Build HTML Entity Decoder
  coder = HTMLEntities.new

  # Collect File Data
  off = 0
  data.collect do |_idx, page|
    off = off + PAGE_OFF
    page
      .select { |l| LINE_REGEX =~ l }       # Collect Table-like data
      .collect { |l| LINE_REGEX.match l }   # Extract Table Element Metadata (Position)
      .collect { |d| { top: off + d[1].to_i, left: d[2].to_i, data: hfilter(coder.decode(d[3])) } }   # Produce Hash of Raw Table Data
  end.flatten
end
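The per-page offset is the key idea here: each page adds PAGE_OFF to every `top` value, so rows from different pages never collide. The sketch below reproduces that arithmetic under two assumptions: the input is a hash of page index to an array of HTML lines, and the standard library's `CGI.unescapeHTML` stands in for the HTMLEntities decoder used by the real method. `collect_data_sketch` is a hypothetical name.

```ruby
require 'cgi'

LINE_REGEX = /^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/
PAGE_OFF = 10000

# Illustrative re-implementation; CGI.unescapeHTML replaces HTMLEntities here.
def collect_data_sketch(pages)
  off = 0
  pages.flat_map do |_idx, lines|
    off += PAGE_OFF   # first page starts at 10000, second at 20000, ...
    lines.map { |l| LINE_REGEX.match(l) }.compact.map do |m|
      { top: off + m[1].to_i, left: m[2].to_i,
        data: CGI.unescapeHTML(m[3]).gsub('<br/>', "\n") }
    end
  end
end

# Hypothetical two-page input with one non-matching line.
pages = {
  1 => ['<p style="position:absolute;top:120px;left:45px;white-space:nowrap" class="ft00">A &amp; B<br/>C</p>',
        '<div>not a table line</div>'],
  2 => ['<p style="position:absolute;top:120px;left:45px;white-space:nowrap" class="ft00">D</p>']
}

cells = collect_data_sketch(pages)
```

Both cells sit at `top:120px` on their own page, yet they receive distinct `top` keys (10120 and 20120), and `top % PAGE_OFF` recovers the in-page position later.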
.contains_unusable?(row_data) ⇒ Boolean
Contains Unusable Data (Empty / Long Strings). Determines whether a row contains unusable data.
# File 'lib/pdftdx/parser.rb', line 44
def self.contains_unusable? row_data
  row_data.inject(false) { |b, e| b || (e[1].length == 0) || (e[1].length > MAX_CELL_LEN) }
end
.filter_rows(data) ⇒ Array
Filter Table Rows. Filters out rows considered unusable: empty, oversize, footers, etc. Also strips top offset info from table rows.
# File 'lib/pdftdx/parser.rb', line 91
def self.filter_rows data
  data
    .reject { |top, row| row.size < 2 || (top % PAGE_OFF) >= PAGE_MAX_TOP || is_all_same?(row) || contains_unusable?(row) }   # Drop Single-Element Rows, Footer Data, Useless Rows (all cells identical) & Unusable Rows (Empty / Oversize Cells)
    .collect { |_top, r| r }.reject { |r| r.size < 2 }   # Remove 'top offset' information and re-drop single-element rows
end
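The four rejection rules can be seen side by side in a small standalone sketch. The helper names (`all_same?`, `unusable?`, `filter_rows_sketch`) are hypothetical and written in a more compact style than the originals, but they are intended to apply the same conditions; the table contents are invented.

```ruby
PAGE_OFF = 10000
PAGE_MAX_TOP = 1100
MAX_CELL_LEN = 100

# Every cell holds the same value (an empty row also counts, as in the original).
def all_same?(row)
  row.values.uniq.size <= 1
end

# Any cell is empty or longer than MAX_CELL_LEN.
def unusable?(row)
  row.values.any? { |c| c.empty? || c.length > MAX_CELL_LEN }
end

def filter_rows_sketch(table)
  table.reject do |top, row|
    row.size < 2 || (top % PAGE_OFF) >= PAGE_MAX_TOP || all_same?(row) || unusable?(row)
  end.values   # drop the top offsets, keep the row hashes
end

table = {
  10120 => { 45 => 'Name', 300 => 'Qty' },            # kept
  10140 => { 45 => 'Widget' },                        # single cell
  10160 => { 45 => 'x', 300 => 'x' },                 # all cells identical
  10180 => { 45 => 'Widget', 300 => '' },             # empty cell
  11150 => { 45 => 'Page 1', 300 => 'Confidential' }, # footer: 1150 >= 1100
  20120 => { 45 => 'Gadget', 300 => '3' }             # kept
}

rows = filter_rows_sketch(table)
```

The footer test relies on the page offset added during collection: `11150 % 10000` gives 1150, which exceeds PAGE_MAX_TOP.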
.hfilter(s) ⇒ String
HTML Filter. Replaces HTML line breaks (<br/>) with UNIX-style newlines.
# File 'lib/pdftdx/parser.rb', line 52
def self.hfilter s
  s.gsub '<br/>', "\n"
end
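A short sketch of the substitution, using a standalone copy of the method. Note that only the literal string `<br/>` is replaced; other spellings of the tag pass through unchanged.

```ruby
# Standalone copy for illustration.
def hfilter(s)
  s.gsub('<br/>', "\n")
end

converted = hfilter('Line 1<br/>Line 2')   # the break becomes a newline
untouched = hfilter('Line 1<br>Line 2')    # '<br>' is not matched
```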
.htable_length(table, headers, h, i) ⇒ Fixnum
Determine Headered Table Length. Computes the number of rows to be included in a given headered table.
# File 'lib/pdftdx/parser.rb', line 104
def self.htable_length table, headers, h, i
  (headers[i + 1] ? headers[i + 1][:idx] : table.length) - h[:idx]
end
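As a worked example of the arithmetic: a headered table runs from its header's index up to the next header's index, or to the end of the table for the last header. The sketch below uses a standalone copy of the method and an invented 9-row table with headers at rows 2 and 5.

```ruby
# Standalone copy for illustration.
def htable_length(table, headers, h, i)
  (headers[i + 1] ? headers[i + 1][:idx] : table.length) - h[:idx]
end

table   = Array.new(9) { |n| ["row #{n}"] }   # 9 placeholder rows
headers = [{ idx: 2 }, { idx: 5 }]            # header rows at indexes 2 and 5

lengths = headers.each_with_index.map { |h, i| htable_length(table, headers, h, i) }
# First table spans rows 2..4 (5 - 2 = 3), second spans rows 5..8 (9 - 5 = 4).
```

The count includes the header row itself, which is why touch_up slices from `h[:idx] + 1` with a length one less than this value.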
.is_all_same?(row_data) ⇒ Boolean
Is All Same Data. Determines whether a row's cells all contain the same data.
# File 'lib/pdftdx/parser.rb', line 35
def self.is_all_same? row_data
  n = row_data[row_data.keys[0]]
  row_data.inject(true) { |b, e| b && (e[1] == n) }
end
.process(page_data) ⇒ Array
Process. Transforms a hash of page data (as produced by pdftohtml) into a usable information table tree structure.
# File 'lib/pdftdx/parser.rb', line 166
def self.process page_data

  # Collect Data
  data = collect_data page_data

  # Build Data Table
  table = build_table data

  # Filter Rows
  table = filter_rows table

  # Filter Table Cells & Touch up
  touch_up table
end
.sub_tab_len(table, stables, t, i) ⇒ Fixnum
Sub Table Length. Computes the number of rows to be included in a given sub-table.
# File 'lib/pdftdx/parser.rb', line 115
def self.sub_tab_len table, stables, t, i
  (stables[i + 1] ? stables[i + 1][:idx] : table.length) - t[:idx]
end
.sub_tablize(htable_data) ⇒ Array
Sub-Tablize. Splits a table into multiple named tables.
# File 'lib/pdftdx/parser.rb', line 123
def self.sub_tablize htable_data

  # Collect Sub-table Title Rows
  subtab_titles = htable_data.collect.with_index { |r, i| { idx: i, row: r } }.select { |e| TITLE_CELL_REGEX =~ e[:row][0] }.collect { |e| { title: e[:row][0], idx: e[:idx] } }

  # Pull up Sub-tables
  stables = subtab_titles.collect.with_index { |t, i| { name: t[:title].gsub(/<\/?b>/, ''), data: htable_data.slice(t[:idx], sub_tab_len(htable_data, subtab_titles, t, i)).collect { |e| e.reject.with_index { |c, ii| ii == 0 && TITLE_CELL_REGEX =~ c } } } }

  # Data until first sub-table index is considered 'unsorted'
  unsorted_end = subtab_titles.empty? ? htable_data.length : subtab_titles[0][:idx]
  stables << htable_data.slice(0, unsorted_end)
end
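The shape of the return value is easier to see on a small input. The sketch below is an illustrative rewrite (`sub_tablize_sketch` is a hypothetical name) intended to follow the same steps as the method above; the rows are invented.

```ruby
TITLE_CELL_REGEX = /<b>/

def sub_tablize_sketch(rows)
  # A row starts a sub-table when its first cell is bold.
  titles = rows.each_with_index
               .select { |row, _i| TITLE_CELL_REGEX =~ row[0] }
               .map { |row, i| { title: row[0], idx: i } }

  stables = titles.each_with_index.map do |t, n|
    stop = titles[n + 1] ? titles[n + 1][:idx] : rows.length
    # Keep the title row's data cells, but drop the bold title cell itself.
    data = rows[t[:idx]...stop].map do |row|
      row.reject.with_index { |c, ci| ci.zero? && TITLE_CELL_REGEX =~ c }
    end
    { name: t[:title].gsub(/<\/?b>/, ''), data: data }
  end

  # Rows before the first title are appended last, as a plain array.
  unsorted_end = titles.empty? ? rows.length : titles[0][:idx]
  stables << rows.slice(0, unsorted_end)
end

rows = [
  ['loose', 'data'],
  ['<b>Fruits</b>', 'Apple', '3'],
  ['Pear', '5'],
  ['<b>Tools</b>', 'Hammer', '1']
]

result = sub_tablize_sketch(rows)
```

The result mixes two element types: one `{ name:, data: }` hash per named sub-table, followed by a final array holding the unsorted rows (empty if the first row is already a title).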
.touch_up(table) ⇒ Array
Touch up Table. Splits the table into multiple headered tables. Also strips left offset info from table cells.
# File 'lib/pdftdx/parser.rb', line 142
def self.touch_up table

  # Remove Column Offsets
  table.collect! { |r| r.collect { |_left, cell| cell } }

  # Split Table into multiple Headered Tables
  headers = table.collect.with_index { |r, i| { idx: i, row: r } }.select { |e| e[:row].inject(true) { |b, c| b && (TITLE_CELL_REGEX =~ c) } }.collect { |r| { idx: r[:idx], row: r[:row].collect { |v| v.gsub /<\/?b>/, '' } } }

  # Pull up Headered Tables
  htables = headers.collect.with_index { |h, i| { head: h[:row], data: table.slice(h[:idx] + 1, htable_length(table, headers, h, i) - 1) } }

  # Split Headered Tables into multiple Named Sub-Tables
  htables.collect! { |ht| { head: ht[:head], data: sub_tablize(ht[:data]) } }

  # Data until first Header index is considered 'unsorted'
  unsorted_end = headers.empty? ? table.length : headers[0][:idx]
  htables << sub_tablize(table.slice(0, unsorted_end))
end
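The header detection step is worth isolating: a row counts as a header only when every cell is bold, whereas a row with just a bold first cell is treated as a sub-table title by sub_tablize. The following sketch covers only that detection step, with invented rows and `all?` standing in for the original `inject`.

```ruby
TITLE_CELL_REGEX = /<b>/

rows = [
  ['intro', 'text'],
  ['<b>Name</b>', '<b>Qty</b>'],          # every cell bold: header row
  ['Widget', '2'],
  ['<b>Spare parts</b>', 'Bolt', '10']    # only first cell bold: sub-table title
]

headers = rows.each_with_index
              .select { |row, _i| row.all? { |c| TITLE_CELL_REGEX =~ c } }
              .map { |row, i| { idx: i, row: row.map { |c| c.gsub(/<\/?b>/, '') } } }
```

Only the row at index 1 is detected, and its cells are stored with the bold tags stripped.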