Module: PDFTDX::Parser
Defined in: lib/pdftdx/parser.rb
Overview
Parser Module
Constant Summary
- LINE_REGEX = /^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/
  Line Regex
- MAX_CELL_LEN = 100
  Maximum Cell Length (to be considered usable data)
- PAGE_OFF = 10000
  Page Offset
- PAGE_MAX_TOP = 1100
  Maximum Allowed Offset from Page Top
- TITLE_CELL_REGEX = /<b>/
  Title Cell Regex
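As an illustration of what LINE_REGEX captures, the sketch below applies it to a single line. The sample line is hypothetical (shaped like the absolutely positioned paragraphs the regex expects); real pdftohtml output may differ in its attributes.

```ruby
# Copied from the constant summary above.
LINE_REGEX = /^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/

# Hypothetical sample line, not taken from a real document.
sample = '<p style="position:absolute;top:120px;left:45px;white-space:nowrap" class="ft00">Total<br/>amount</p>'

m = LINE_REGEX.match(sample)
top  = m[1].to_i  # vertical offset in pixels
left = m[2].to_i  # horizontal offset in pixels
text = m[3]       # inner HTML of the paragraph, still containing <br/>
```

The three capture groups are the top offset, the left offset, and the cell content, in that order.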
Class Method Summary
- .build_table(data) ⇒ Hash
  Build Data Table: produces an organized table (in the form of a 2-level nested hash) from an array of HTML chunks.
- .collect_data(data) ⇒ Array
  Collect Data: extracts table-like chunks of HTML data from a hash of HTML pages.
- .contains_unusable?(row_data) ⇒ Boolean
  Contains Unusable Data (Empty / Long Strings): determines whether a row contains unusable data.
- .filter_rows(data) ⇒ Array
  Filter Table Rows: filters out rows considered unusable (empty, oversize, footers, etc.).
- .hfilter(s) ⇒ String
  HTML Filter: replaces HTML line breaks (<br/>) with UNIX-style newlines.
- .htable_length(table, headers, h, i) ⇒ Fixnum
  Determine Headered Table Length: computes the number of rows to be included in a given headered table.
- .is_all_same?(row_data) ⇒ Boolean
  Is All Same Data: determines whether a row's cells all contain the same data.
- .process(page_data) ⇒ Array
  Process: transforms a hash of page data (as produced by pdftohtml) into a usable information table tree structure.
- .sort_row(r) ⇒ Hash
  Sort Row: sorts cells according to their x-offset.
- .sub_tab_len(table, stables, t, i) ⇒ Fixnum
  Sub Table Length: computes the number of rows to be included in a given sub-table.
- .sub_tablize(htable_data) ⇒ Array
  Sub-Tablize: splits a table into multiple named tables.
- .touch_up(table) ⇒ Array
  Touch up Table: splits a table into multiple headered tables.
Class Method Details
.build_table(data) ⇒ Hash
Build Data Table: produces an organized table (in the form of a 2-level nested hash, keyed by top offset and then by left offset) from an array of HTML chunks.

# File 'lib/pdftdx/parser.rb', line 80
def self.build_table data
  table = {}
  data.each { |d| table[d[:top]] ||= {}; table[d[:top]][d[:left]] = d[:data] }
  table
end
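To make the shape of the nested hash concrete, here is a minimal standalone sketch. It restates the method as a top-level function (an assumption made so that it runs without the gem) and feeds it hypothetical chunks.

```ruby
# Standalone restatement of build_table for illustration only.
def build_table(data)
  table = {}
  data.each { |d| (table[d[:top]] ||= {})[d[:left]] = d[:data] }
  table
end

# Hypothetical chunks, in the { top:, left:, data: } form produced by collect_data.
chunks = [
  { top: 10120, left: 45,  data: 'Name' },
  { top: 10120, left: 300, data: 'Price' },
  { top: 10140, left: 45,  data: 'Widget' }
]

table = build_table(chunks)
# table == { 10120 => { 45 => 'Name', 300 => 'Price' }, 10140 => { 45 => 'Widget' } }
```

Cells that share a top offset end up in the same row; the inner keys are the left offsets.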
.collect_data(data) ⇒ Array
Collect Data: extracts table-like chunks of HTML data from a hash of HTML pages.

# File 'lib/pdftdx/parser.rb', line 60
def self.collect_data data

  # Build HTML Entity Decoder
  coder = HTMLEntities.new

  # Collect File Data
  off = 0
  data.collect do |_idx, page|
    off = off + PAGE_OFF
    page
      .select { |l| LINE_REGEX =~ l }                                                                             # Collect Table-like data
      .collect { |l| LINE_REGEX.match l }                                                                         # Extract Table Element Metadata (Position)
      .collect { |d| { top: off + d[1].to_i, left: d[2].to_i, data: hfilter(coder.decode(d[3])) } }               # Produce Hash of Raw Table Data
  end.flatten
end
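The page offset arithmetic can be sketched without the HTMLEntities dependency. The function below, collect_positions, is a hypothetical simplification: it skips entity decoding and the hfilter step, and the sample pages are invented.

```ruby
LINE_REGEX = /^<p style[^>]+top:([0-9]+)px[^>]+left:([0-9]+)px[^>]+>(.*)<\/p>/
PAGE_OFF = 10000

# Hypothetical, simplified stand-in for collect_data (no entity decoding, no hfilter).
def collect_positions(pages)
  off = 0
  pages.flat_map do |_idx, lines|
    off += PAGE_OFF # each page shifts its rows by another PAGE_OFF
    lines.map do |l|
      m = LINE_REGEX.match(l)
      m && { top: off + m[1].to_i, left: m[2].to_i, data: m[3] }
    end.compact     # lines that do not match LINE_REGEX are dropped
  end
end

# Hypothetical input: page index => array of HTML lines.
pages = {
  1 => ['<p style="position:absolute;top:100px;left:50px;white-space:nowrap" class="ft00">A</p>',
        '<div>ignored</div>'],
  2 => ['<p style="position:absolute;top:100px;left:50px;white-space:nowrap" class="ft00">B</p>']
}

cells = collect_positions(pages)
# cells == [{ top: 10100, left: 50, data: 'A' }, { top: 20100, left: 50, data: 'B' }]
```

Two cells at the same position on different pages receive distinct top keys (10100 and 20100), so they never merge into one row.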
.contains_unusable?(row_data) ⇒ Boolean
Contains Unusable Data (Empty / Long Strings): determines whether a row contains unusable data, meaning at least one cell that is empty or longer than MAX_CELL_LEN.

# File 'lib/pdftdx/parser.rb', line 44
def self.contains_unusable? row_data
  row_data.inject(false) { |b, e| b || (e[1].length == 0) || (e[1].length > MAX_CELL_LEN) }
end
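The same predicate can be expressed with any?, which some readers may find easier to follow. This is an equivalent restatement for illustration, not the library's code; the sample rows are hypothetical.

```ruby
MAX_CELL_LEN = 100

# Restatement of contains_unusable? using any? instead of inject.
def contains_unusable?(row)
  row.any? { |_left, cell| cell.empty? || cell.length > MAX_CELL_LEN }
end

empty_cell = contains_unusable?({ 45 => 'a', 300 => '' })         # true
too_long   = contains_unusable?({ 45 => 'a', 300 => 'x' * 101 })  # true
at_limit   = contains_unusable?({ 45 => 'a', 300 => 'x' * 100 })  # false
```

A cell of exactly MAX_CELL_LEN characters is still usable, since the comparison is strict.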
.filter_rows(data) ⇒ Array
Filter Table Rows: filters out rows considered unusable (single-cell rows, footers, rows whose cells are all identical, and rows with empty or oversize cells). Also strips top offset info from table rows.

# File 'lib/pdftdx/parser.rb', line 91
def self.filter_rows data
  data
    .reject { |top, row| row.size < 2 || (top % PAGE_OFF) >= PAGE_MAX_TOP || is_all_same?(row) || contains_unusable?(row) }       # Drop Single-Element Rows, Footer Data, Useless Rows (all cells identical) & Unusable Rows (Empty / Oversize Cells)
    .collect { |_top, r| r }.reject { |r| r.size < 2 }                                                                            # Remove 'top offset' information and re-drop single-element rows
end
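The footer test relies on the modulo recovering a row's per-page offset from its global key. A small sketch, assuming a hypothetical helper footer_row? and invented rows, and leaving out the identical-cell and unusable-cell checks:

```ruby
PAGE_OFF = 10000
PAGE_MAX_TOP = 1100

# Hypothetical helper isolating the footer condition used in filter_rows.
def footer_row?(top)
  (top % PAGE_OFF) >= PAGE_MAX_TOP
end

# Hypothetical table keyed by global top offset.
table = {
  10120 => { 45 => 'Name', 300 => 'Price' },         # 120 px from page top: kept
  11150 => { 45 => 'Page 1', 300 => 'Confidential' }, # 1150 px from page top: footer
  20300 => { 45 => 'Lonely cell' }                    # single cell: dropped
}

rows = table.reject { |top, row| row.size < 2 || footer_row?(top) }.values
# rows == [{ 45 => 'Name', 300 => 'Price' }]
```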
.hfilter(s) ⇒ String
HTML Filter: replaces HTML line breaks (<br/>) with UNIX-style newlines.

# File 'lib/pdftdx/parser.rb', line 52
def self.hfilter s
  s.gsub '<br/>', "\n"
end
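A short standalone sketch of the substitution, restated as a top-level function so it runs without the gem. The inputs are hypothetical.

```ruby
# Standalone restatement of hfilter for illustration only.
def hfilter(s)
  s.gsub('<br/>', "\n")
end

converted = hfilter('Line one<br/>Line two') # "Line one\nLine two"
untouched = hfilter('a<br>b')                # "a<br>b"
```

Because the pattern is the literal string `<br/>`, other spellings such as `<br>` pass through unchanged.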
.htable_length(table, headers, h, i) ⇒ Fixnum
Determine Headered Table Length: computes the number of rows to be included in a given headered table.

# File 'lib/pdftdx/parser.rb', line 104
def self.htable_length table, headers, h, i
  (headers[i + 1] ? headers[i + 1][:idx] : table.length) - h[:idx]
end
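A worked example of the length computation, using a hypothetical five-row table with header rows at indices 0 and 3. The function is restated at top level so the sketch is self-contained.

```ruby
# Standalone restatement of htable_length for illustration only.
def htable_length(table, headers, h, i)
  (headers[i + 1] ? headers[i + 1][:idx] : table.length) - h[:idx]
end

# Hypothetical table: only the row count and header positions matter here.
table   = %w[header0 row1 row2 header3 row4]
headers = [{ idx: 0 }, { idx: 3 }]

lengths = headers.each_with_index.map { |h, i| htable_length(table, headers, h, i) }
# lengths == [3, 2]
```

The count includes the header row itself, which is why touch_up slices from `h[:idx] + 1` and subtracts 1 from the length.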
.is_all_same?(row_data) ⇒ Boolean
Is All Same Data: determines whether a row's cells all contain the same data.

# File 'lib/pdftdx/parser.rb', line 35
def self.is_all_same? row_data
  n = row_data[row_data.keys[0]]
  row_data.inject(true) { |b, e| b && (e[1] == n) }
end
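An equivalent way to state the check is that the row has at most one distinct value. The helper name all_same? and the rows below are hypothetical; this is a restatement, not the library's code.

```ruby
# Hypothetical restatement of is_all_same? using uniq.
def all_same?(row)
  row.values.uniq.size <= 1
end

repeated = all_same?({ 45 => 'Total', 300 => 'Total' }) # true
mixed    = all_same?({ 45 => 'Total', 300 => '42' })    # false
empty    = all_same?({})                                # true
```

Like the inject-based original, this treats an empty row as "all the same".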
.process(page_data) ⇒ Array
Process: transforms a hash of page data (as produced by pdftohtml) into a usable information table tree structure.

# File 'lib/pdftdx/parser.rb', line 204
def self.process page_data

  # Collect Data
  data = collect_data page_data

  # Build Data Table
  table = build_table data

  # Filter Rows
  table = filter_rows table

  # Filter Table Cells & Touch up
  touch_up table
end
.sort_row(r) ⇒ Hash
Sort Row: sorts cells according to their x-offset.

# File 'lib/pdftdx/parser.rb', line 151
def self.sort_row r
  Hash[*(r.to_a.sort { |a, b| ((a[0] == b[0]) ? 0 : (a[0] > b[0] ? 1 : -1)) }.flatten)]
end
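The effect on a row can be sketched with sort_by, which gives the same ordering for integer offsets. This is a restatement under that assumption, with a hypothetical row.

```ruby
# Restatement of sort_row using sort_by; keys are assumed to be integer left offsets.
def sort_row(r)
  r.sort_by { |left, _cell| left }.to_h
end

sorted = sort_row({ 300 => 'Price', 45 => 'Name' })
# sorted.to_a == [[45, 'Name'], [300, 'Price']]
```

Ruby hashes preserve insertion order, so rebuilding the hash from sorted pairs is what makes the cells come out left to right.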
.sub_tab_len(table, stables, t, i) ⇒ Fixnum
Sub Table Length: computes the number of rows to be included in a given sub-table.

# File 'lib/pdftdx/parser.rb', line 115
def self.sub_tab_len table, stables, t, i
  (stables[i + 1] ? stables[i + 1][:idx] : table.length) - t[:idx]
end
.sub_tablize(htable_data) ⇒ Array
Sub-Tablize: splits a table into multiple named tables.

# File 'lib/pdftdx/parser.rb', line 123
def self.sub_tablize htable_data

  # Collect Sub-table Title Rows
  subtab_titles = htable_data.collect.with_index { |r, i| { idx: i, row: r } }.select { |e| TITLE_CELL_REGEX =~ e[:row][0] }.collect { |e| { title: e[:row][0], idx: e[:idx] } }

  # Pull up Sub-tables
  stables = subtab_titles.collect.with_index do |t, i|
    {
      name: t[:title].gsub(/<\/?b>/, ''),                                                       # Extract Sub-Table Name
      data: htable_data                                                                         # Extract Sub-Table Data
        .slice(t[:idx], sub_tab_len(htable_data, subtab_titles, t, i))                          # Slice Table Data until next Sub-Table
        .collect { |e| e.reject.with_index { |c, ii| ii == 0 && TITLE_CELL_REGEX =~ c } }       # Reject Table Headers
    }
  end

  # Data until first sub-table index is considered 'unsorted'
  unsorted_end = subtab_titles.empty? ? htable_data.length : subtab_titles[0][:idx]

  # Insert last part (Unsorted)
  stables << htable_data.slice(0, unsorted_end) if unsorted_end > 0

  stables
end
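To show the shape of the result, the sketch below restates the splitting logic as a standalone function and runs it on hypothetical rows. It is a simplified rewrite for illustration (the slice length is computed inline rather than through sub_tab_len), not the library's code.

```ruby
TITLE_CELL_REGEX = /<b>/

# Simplified standalone restatement of sub_tablize.
def sub_tablize(rows)
  # Rows whose first cell is bold start a new sub-table.
  titles = rows.each_with_index
               .select { |row, _i| TITLE_CELL_REGEX =~ row[0] }
               .map { |row, i| { title: row[0], idx: i } }

  stables = titles.each_with_index.map do |t, n|
    stop = titles[n + 1] ? titles[n + 1][:idx] : rows.length
    {
      name: t[:title].gsub(/<\/?b>/, ''),
      # Drop the bold title cell from the row that carries it.
      data: rows[t[:idx]...stop].map { |r| TITLE_CELL_REGEX =~ r[0] ? r.drop(1) : r }
    }
  end

  # Rows before the first title are appended as a plain array ('unsorted').
  unsorted_end = titles.empty? ? rows.length : titles[0][:idx]
  stables << rows.slice(0, unsorted_end) if unsorted_end > 0
  stables
end

# Hypothetical rows, already stripped of offsets.
rows = [
  ['Misc', 'x'],
  ['<b>Fruits</b>', 'Apple', '1'],
  ['', 'Pear', '2'],
  ['<b>Tools</b>', 'Saw', '3']
]

result = sub_tablize(rows)
```

Note that the returned array is heterogeneous: named sub-tables are hashes with `:name` and `:data`, while the trailing unsorted part, when present, is a bare array of rows.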
.touch_up(table) ⇒ Array
Touch up Table: splits a table into multiple headered tables. Also strips left offset info from table cells.

# File 'lib/pdftdx/parser.rb', line 160
def self.touch_up table

  # Split Table into multiple Headered Tables
  headers = table
    .collect.with_index { |r, i| { idx: i, row: r } }
    .select { |e| e[:row].inject(true) { |b, c| b && (TITLE_CELL_REGEX =~ c[1]) } }
    .collect { |r| { idx: r[:idx], row: r[:row].collect { |o, v| { o => v.gsub(/<\/?b>/, '') } } } }

  # Pull up Headered Tables
  htables = headers.collect.with_index { |h, i| { head: h[:row], data: table.slice(h[:idx] + 1, htable_length(table, headers, h, i) - 1) } }

  # Fix Rows
  nh = htables.collect do |t|

    # Acquire Column Offsets
    cols = t[:head].collect { |o| o.first[0] }.sort

    # Compute Row Base (Default Columns)
    row_base = Hash[*(cols.collect { |c| [c, ''] }.flatten)]

    # Tables
    {
      head: t[:head],
      data: t[:data].collect { |r| sort_row row_base.merge(Hash[*(r.collect { |o, c| [(cols.reverse.find { |co| co <= o }) || o, c] }.flatten)]) }
    }
  end

  # Drop Offsets
  htables = nh.collect { |t| { head: t[:head].collect { |h| h.first[1] }, data: t[:data].collect { |r| r.collect { |_o, c| c } } } }
  ntable = table.collect { |r| r.collect { |_o, c| c } }

  # Split Headered Tables into multiple Named Sub-Tables
  htables.collect! { |ht| { head: ht[:head], data: sub_tablize(ht[:data]) } }

  # Data until first Header index is considered 'unsorted'
  unsorted_end = headers.empty? ? ntable.length : headers[0][:idx]

  # Insert last part (Unsorted)
  htables << sub_tablize(ntable.slice(0, unsorted_end)) if unsorted_end > 0

  htables
end
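The "Fix Rows" step is the densest part of this method: each cell is snapped to the nearest header column at or to the left of it, and missing columns are filled with empty strings. The sketch below isolates that expression in a hypothetical helper, snap_to_column, with invented offsets.

```ruby
# Hypothetical helper isolating the column-snapping expression from touch_up.
def snap_to_column(cols, left)
  cols.reverse.find { |co| co <= left } || left
end

cols = [45, 300, 520] # sorted left offsets of the header cells (hypothetical)

# Default row: one empty cell per header column.
row_base = cols.map { |c| [c, ''] }.to_h

# A data row whose cells are slightly misaligned and which lacks the middle column.
row = { 47 => 'Widget', 525 => '9.99' }

snapped = row.map { |left, cell| [snap_to_column(cols, left), cell] }.to_h
full    = row_base.merge(snapped)
# full == { 45 => 'Widget', 300 => '', 520 => '9.99' }

orphan = snap_to_column(cols, 10) # no column at or left of 10, so the offset is kept
```

A cell left of every header column keeps its own offset and becomes an extra key after the merge, which is presumably why the original passes the merged row through sort_row.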