Module: Pdftocsv

Defined in:
lib/pdftocsv.rb,
lib/pdftocsv/version.rb

Overview

Parses PDF files into CSV-like data: arrays of rows, one array per page.

Defined Under Namespace

Classes: Error

Constant Summary

VERSION = "0.1.0"

Class Method Details

.parse(file_path) ⇒ Object

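Parses a PDF file into CSV-like data. The result is an array of pages; each page is an array of rows, and each row is an array of cell strings.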

Example:

>> Pdftocsv.parse("example.pdf")
=> [[['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']], [['A1', 'B1', 'C1'], ['A2', 'B2', 'C2']]]

Arguments:

file_path: (String) path to the PDF file to parse


# File 'lib/pdftocsv.rb', line 22

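def self.parse(file_path)
  @pages = []
  # Open in binary mode and hand the IO object to PDF::Reader.
  File.open(file_path, "rb") do |io|
    reader = PDF::Reader.new(io)
    # Convert each page to rows of cells, keeping page order.
    reader.pages.each { |page| @pages << to_page_csv(page) }
  end
  @pages
end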
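
The nested array returned by .parse can be turned into real CSV text with Ruby's standard csv library. The following is a minimal sketch, not part of Pdftocsv: the helper name pages_to_csv_strings is hypothetical, and the sample data simply copies the shape shown in the example above instead of reading a PDF.

```ruby
require "csv"

# Hypothetical helper (not provided by Pdftocsv): builds one CSV string
# per page from the pages -> rows -> cells structure.
def pages_to_csv_strings(pages)
  pages.map do |rows|
    CSV.generate do |csv|
      rows.each { |row| csv << row }
    end
  end
end

# Sample data in the documented return shape; with the gem installed this
# would come from Pdftocsv.parse("example.pdf") instead.
pages = [
  [["a1", "b1", "c1"], ["a2", "b2", "c2"]],
  [["A1", "B1", "C1"], ["A2", "B2", "C2"]]
]

csv_strings = pages_to_csv_strings(pages)
puts csv_strings.first
```

Each string can then be written to its own file, for example one file per page.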

.to_page_csv(page) ⇒ Object

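Splits the text of a whole page into lines and converts each line into a row of cells. Lines that yield no cells are skipped.

Arguments:

page: (PDF::Reader::Page) a page object that responds to #text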


# File 'lib/pdftocsv.rb', line 36

def to_page_csv(page)
  page_csv = []
  # One entry per line of extracted page text.
  text_lines = page.text.split("\n")
  text_lines.each do |text_line|
    text_list = to_text_list(text_line)
    # Keep only lines that produced at least one cell.
    page_csv << text_list if text_list.any?
  end
  page_csv
end
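
To see the page-level behavior without a PDF, the logic above can be exercised against a stand-in object. This is a sketch under assumptions: FakePage, split_cells, and page_rows are hypothetical names, and split_cells is a simplified local equivalent of the line-splitting step, included only so the snippet runs on its own.

```ruby
# Stand-in for PDF::Reader::Page; only #text is needed here.
FakePage = Struct.new(:text)

# Simplified local version of the line-splitting step (assumption:
# cells are separated by two or more spaces).
def split_cells(text_line)
  text_line.split("  ").reject(&:empty?).map(&:strip)
end

# Mirrors the page-level logic: split by line, convert each line to
# cells, and drop lines that produced no cells.
def page_rows(page)
  page.text.split("\n").map { |line| split_cells(line) }.select(&:any?)
end

page = FakePage.new("a1  b1  c1\n\na2    b2  c2\n")
rows = page_rows(page)
p rows
```

The blank line in the sample text produces no row, which matches the `if text_list.any?` guard in the listing.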

.to_text_list(text_line) ⇒ Object

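Splits a line into cells wherever two consecutive spaces occur, then strips surrounding whitespace from each cell. Single spaces inside a cell are preserved.

Arguments:

text_line: (String) a single line of page text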


# File 'lib/pdftocsv.rb', line 50

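def to_text_list(text_line)
  # "\s" in a double-quoted string is a space, so this splits on two spaces.
  text_list = text_line.split("\s\s")
  # Longer runs of spaces leave empty fragments; drop them.
  text_list.delete_if { |text| text.nil? || text.empty? }
  # Strip each cell in place; #each returns the array itself.
  text_list.each(&:strip!)
end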
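
The effect of using two spaces as the delimiter is easiest to see on a sample line. The snippet below is an illustrative sketch: it uses a local copy of the method body from the listing so that it runs without the gem, and the input string is made up.

```ruby
# Local copy of the method shown in the listing, defined at top level
# so the sketch is self-contained.
def to_text_list(text_line)
  text_list = text_line.split("\s\s")
  text_list.delete_if { |text| text.nil? || text.empty? }
  text_list.each(&:strip!)
end

# "John Smith" contains a single space, so it stays in one cell;
# the wider gaps mark cell boundaries.
cells = to_text_list("John Smith  42   Berlin")
p cells
```

Because the split relies on spacing in the extracted text, columns separated by only one space in the PDF output would end up merged into a single cell.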