Tabula was written for those cases where you can’t easily copy-and-paste tables from a PDF to a spreadsheet. Surprisingly, Tabula sometimes fails where copy-and-pasting succeeds. This project is for those cases when copy-and-pasting is all you need (and where nothing else works).
This gem only works on OS X.
PDF to CSV
gem install --no-wrappers copy_paste_pdf
If you omit the --no-wrappers switch, the AppleScript will not install properly.
You may run the script with:
copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv
- The script will open the PDF in Preview and copy the contents of the PDF
- The script will open Microsoft Excel, paste the contents and save as CSV
If you want the script to quit Preview and Excel once it's done, pass a third argument, like:
copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv true
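To convert a whole folder of PDFs, one option is to drive the script from Ruby. This is only a sketch: the helper names conversion_command and convert_all are made up for illustration and are not part of the gem. It assumes the script is on your PATH and takes the arguments shown above. Conversions run one at a time because the script relies on the clipboard.

```ruby
# Hypothetical helper (not part of the gem): builds the argument list for
# one conversion. Only the script name and its positional arguments come
# from the usage shown above.
def conversion_command(pdf_path, quit_apps: false)
  # Assumption: the CSV is written next to the PDF, with a .csv extension.
  csv_path = pdf_path.sub(/\.pdf\z/i, '.csv')
  command = ['copy-paste-pdf.applescript', pdf_path, csv_path]
  command << 'true' if quit_apps
  command
end

# Hypothetical helper: converts every PDF in a directory, sequentially.
def convert_all(directory, quit_apps: false)
  Dir.glob(File.join(directory, '*.pdf')).sort.each do |pdf_path|
    command = conversion_command(pdf_path, quit_apps: quit_apps)
    # Passing separate arguments to `system` bypasses the shell, so paths
    # containing spaces need no quoting.
    system(*command) or warn "Conversion failed: #{pdf_path}"
  end
end
```

Calling convert_all('/path/to/pdfs', quit_apps: true) would then attempt each file in turn and print a warning for any conversion that fails.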
The script may pinwheel while copying the contents of the PDF and while pasting the contents to the spreadsheet. If it looks like nothing is happening, wait a few seconds.
You can work in other applications while the script is running - just don't use the clipboard as it may interfere with the script.
This method is admittedly not very efficient. Running time averages under 2 seconds per page but varies considerably depending on your system's load.
The Ruby gem defines helper methods for cleaning the CSV. In most cases, the PDF to CSV conversion will create many empty rows. You can easily remove those rows with:
require 'csv'
require 'copy_paste_pdf'
table = ::.new(CSV.read('/path/to/output.csv'))
table.remove_empty_rows!
CSV.open('/path/to/clean.csv', 'w') do |csv|
  table.each do |row|
    csv << row
  end
end
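If you would rather not depend on the gem for this step, the same idea can be sketched in plain Ruby on the array of rows that CSV.read returns. This is not the gem's implementation: the method names are hypothetical, and it assumes a row counts as empty when every cell is nil or whitespace.

```ruby
# Assumption: a row is "empty" when every cell is nil or only whitespace.
# An empty array (a blank line in the CSV) also counts as empty.
def blank_row?(row)
  row.all? { |cell| cell.nil? || cell.strip.empty? }
end

# Returns a new array of rows with the empty ones dropped; the input is
# left untouched.
def without_empty_rows(rows)
  rows.reject { |row| blank_row?(row) }
end

# Invented sample data, shaped like the output of CSV.read.
rows = [
  ['Name', 'Amount'],
  [nil, nil, nil],
  ['  ', nil],
  ['Alice', '10']
]
cleaned = without_empty_rows(rows)
```

Here cleaned keeps only the header row and the Alice row.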
If the table in the PDF contained vertically-merged cells, then, in the CSV, the first of the merged cells will have a value and the others will be empty. To copy the value of the first cell to the others, use the copy_into_cell_below method, which accepts the indices of columns containing merged cells:
table.copy_into_cell_below(0, 3, 4)
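To make the effect concrete, here is a rough stand-in for that fill-down behaviour, written against a plain array of rows. The name fill_down and the sample data are invented for illustration, and the gem's own method may differ in its details.

```ruby
# Hypothetical stand-in for the fill-down behaviour described above.
# For each hinted column, an empty cell takes the most recent non-empty
# value seen higher up in the same column.
def fill_down(rows, *column_indices)
  previous = {}
  rows.map do |row|
    filled = row.dup
    column_indices.each do |index|
      cell = filled[index]
      if cell.nil? || cell.strip.empty?
        filled[index] = previous[index]
      else
        previous[index] = cell
      end
    end
    filled
  end
end

# Invented sample: "East" and "West" were vertically merged in the PDF.
rows = [
  ['Region', 'City', 'Total'],
  ['East', 'Ottawa', '3'],
  [nil, 'Toronto', '5'],
  ['West', 'Calgary', '2'],
  ['', 'Vancouver', '4']
]
filled = fill_down(rows, 0)
```

After this, the Toronto row carries East in its first column and the Vancouver row carries West. Columns that are not listed are left alone.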
Sometimes, if a cell contains multiple lines of text, the PDF to CSV conversion will incorrectly break the cell into multiple rows. To remove the spurious row and merge its values into the row above, use the merge_into_cell_above method, which accepts the indices of columns in which this error occurs.
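The repair can be pictured with a small sketch on a plain array of rows. Everything here is an assumption made for illustration rather than the gem's actual logic: the name merge_continuation_rows, the rule that a spurious row has text only in the hinted columns, and the choice to join the pieces with a newline.

```ruby
def blank?(cell)
  cell.nil? || cell.strip.empty?
end

# Hypothetical stand-in for the merge described above. A row is treated as
# spurious when it has text in at least one hinted column and nothing
# anywhere else; its text is appended to the row above and it is dropped.
def merge_continuation_rows(rows, *column_indices)
  rows.each_with_object([]) do |row, merged|
    others_blank = row.each_index
                      .reject { |i| column_indices.include?(i) }
                      .all? { |i| blank?(row[i]) }
    hinted_text = column_indices.any? { |i| !blank?(row[i]) }
    if merged.any? && others_blank && hinted_text
      above = merged.last
      column_indices.each do |i|
        next if blank?(row[i])
        above[i] = [above[i], row[i]].reject { |cell| blank?(cell) }.join("\n")
      end
    else
      merged << row.dup
    end
  end
end

# Invented sample: the description cell was split across two rows.
rows = [
  ['1', 'First line of a long', '10'],
  [nil, 'description', nil],
  ['2', 'Short', '20']
]
merged = merge_continuation_rows(rows, 1)
```

In this sample the second row disappears and its text ends up on a second line inside the description cell of the first row.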
With additional time and effort, these two methods can be made to operate without needing columns as hints.
If you see warnings on the command-line like:
2013-10-09 14:30:03.704 osascript[2056:707] Error loading /Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types: dlopen(/Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types, 262): no suitable image found. Did find:
/Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types: no matching architecture in universal wrapper
osascript: OpenScripting.framework - scripting addition "/Library/ScriptingAdditions/Adobe Unit Types.osax" declares no loadable handlers.
If, like me, you almost never write AppleScript, you can access much of AppleScript's documentation through Apple's AppleScript Editor. See, for example, how to access the entries about Microsoft Excel.
Most of the PDFs I work with contain no tables. In those cases I either:
- Use pdftotext filename.pdf to convert the PDF to text, and write a script using regular expressions to parse the output.
- Use pdftotext -layout filename.pdf to convert the PDF to text while preserving the text layout, which is very useful when working with two-column layouts.
- Use commercial software like Adobe Acrobat Pro to save the PDF to another format, usually Excel.
- I recently learned that Apple's Automator has an Extract PDF Text action which performs well.
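As an illustration of the pdftotext route, a parsing script might look something like the sketch below. The sample text, the column names and the regular expression are all invented; real output will need its own pattern. It assumes the -layout option was used, so that columns are separated by runs of two or more spaces.

```ruby
# Invented sample standing in for the output of pdftotext -layout.
text = <<~TEXT
  Item            Quantity    Price
  Widget                 3     4.50
  Large gadget          12    19.99
TEXT

# Splitting on two or more spaces keeps multi-word cells such as
# "Large gadget" in one piece.
rows = text.each_line.map { |line| line.strip.split(/\s{2,}/) }

# A stricter alternative: a regular expression that only matches data
# lines, so the header line is skipped.
ROW = /\A(?<item>.+?)\s{2,}(?<quantity>\d+)\s{2,}(?<price>\d+\.\d{2})\z/

records = text.each_line.filter_map do |line|
  match = ROW.match(line.strip)
  next unless match
  { item: match[:item], quantity: match[:quantity].to_i, price: match[:price].to_f }
end
```

The split is the quicker of the two, but it breaks down when a cell is empty or when columns are separated by a single space. The regular expression takes more effort to write and is easier to trust on messy pages.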
For PDFs containing tables, I discovered that copy-pasting from Apple's Preview to Microsoft Excel worked better than all alternatives tested, for the PDFs I was interested in.
This project's main repository is on GitHub: http://github.com/opennorth/copy_paste_pdf, where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
Copyright (c) 2013 Open North Inc., released under the MIT license