Class: UploadConvert
- Inherits:
-
Object
- Object
- UploadConvert
- Defined in:
- lib/uploadconvert.rb
Instance Method Summary collapse
-
#cleanPDF(text) ⇒ Object
Removes numbers from edges of legal documents.
-
#detectPDFType ⇒ Object
Use embedded fonts to detect the type of PDF.
-
#embedPDF ⇒ Object
Extract text from embedded text PDFs.
-
#extractMetadataPDF ⇒ Object
Extract PDF metadata.
-
#handleDoc ⇒ Object
Sends the document to the appropriate method.
-
#initialize(input) ⇒ UploadConvert
constructor
A new instance of UploadConvert.
-
#ocrPDF ⇒ Object
OCR PDFs and turn that text into a JSON.
-
#pdfTojson ⇒ Object
Convert PDFs to JSON.
-
#xmlTojson(xmlin) ⇒ Object
Convert XML files to JSONs.
Constructor Details
#initialize(input) ⇒ UploadConvert
Returns a new instance of UploadConvert.
8 9 10 11 12 |
# File 'lib/uploadconvert.rb', line 8 def initialize(input) @input = input @output = "" @text = "" end |
Instance Method Details
#cleanPDF(text) ⇒ Object
Removes numbers from edges of legal documents
94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 |
# File 'lib/uploadconvert.rb', line 94 def cleanPDF(text) text.gsub!(/\r?\n/, "\n") text.each_line do |l| lflag = 0 (1..28).each do |i| if l == i.to_s+"\n" lflag = 1 end end if lflag != 1 && l @text += l end end return @text end |
#detectPDFType ⇒ Object
Use embedded fonts to detect the type of PDF
49 50 51 52 53 54 55 56 |
# File 'lib/uploadconvert.rb', line 49 def detectPDFType out = `pdffonts #{@input}`.split("\n") if out.length > 4 return else # return ocrPDF end end |
#embedPDF ⇒ Object
Extract text from embedded text PDFs
59 60 61 62 63 64 65 66 67 68 69 70 71 |
# File 'lib/uploadconvert.rb', line 59 def begin Docsplit.extract_text(@input, :ocr => false, :output => "public/uploads") outfile = @input.split(".pdf") path = "public/uploads/" + outfile[0] text = File.read(path+".txt") # Clean up text and delete file File.delete(path+".txt") cleanPDF(text) rescue end end |
#extractMetadataPDF ⇒ Object
Extract PDF metadata
113 114 115 116 117 118 119 120 121 122 123 124 |
# File 'lib/uploadconvert.rb', line 113 def extractMetadataPDF = Hash.new [:author] = Docsplit.(@input) [:creator] = Docsplit.extract_creator(@input) [:producer] = Docsplit.extract_producer(@input) [:title] = Docsplit.extract_title(@input) [:subject] = Docsplit.extract_subject(@input) [:date] = Docsplit.extract_date(@input) [:keywords] = Docsplit.extract_keywords(@input) [:length] = Docsplit.extract_length(@input) return end |
#handleDoc ⇒ Object
Sends the document to the appropriate method
15 16 17 18 19 20 21 22 23 24 25 26 |
# File 'lib/uploadconvert.rb', line 15 def handleDoc if @input.include? "http" `wget #{@input}` path = @input.split("/") @input = path[path.length-1].chomp.strip handleDoc elsif @input.include? ".pdf" pdfTojson elsif @input.include? ".xml" xmlTojson(File.read(@input)) end end |
#ocrPDF ⇒ Object
OCR PDFs and turn that text into a JSON
74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 |
# File 'lib/uploadconvert.rb', line 74 def ocrPDF # Extract individual pages Docsplit.extract_images(@input) # OCR docs = Dir["*.png"] Docsplit.extract_text(@input, :ocr => true, :output => 'text') outfile = @input.split(".") text = File.read("text/" + outfile[0] + ".txt") # Clean up text and files File.delete("text/" + outfile[0]+".txt") Dir.delete("text") docs.each do |d| File.delete(d) end cleanPDF(text) end |
#pdfTojson ⇒ Object
Convert PDFs to JSON
35 36 37 38 39 40 41 42 43 44 45 46 |
# File 'lib/uploadconvert.rb', line 35 def pdfTojson # Extract and clean text @text = detectPDFType # Extract metadata and generate output extractMetadataPDF outhash = Hash.new .each{|k, v| outhash[k] = v} outhash[:text] = @text outhash[:input] = @input @output = JSON.pretty_generate(outhash) end |
#xmlTojson(xmlin) ⇒ Object
Convert XML files to JSONs
29 30 31 32 |
# File 'lib/uploadconvert.rb', line 29 def xmlTojson(xmlin) xml = Crack::XML.parse(xmlin) JSON.pretty_generate(xml) end |