DocRipper
Grab the text from common document formats with one command. DocRipper is an extremely lightweight Ruby wrapper that parses text content from common file formats (currently .doc, .docx, .pdf, .txt and .sketch) without heavyweight dependencies such as an OCR library or OpenOffice/LibreOffice.
For simple parsing, you'll likely see a large performance improvement with DocRipper over solutions that rely on OpenOffice/LibreOffice for .doc/.docx conversion.
Need OCR support or in-image text parsing? Take a look at Docsplit.
Supported File Formats
.doc
.docx
.pdf
.txt
.sketch
| File format | Supported? | Dependencies | 
|---|---|---|
| .doc | x | Antiword | 
| .docx | x | |
| .pdf | x | Poppler-utils |
| .txt | x | |
| .sketch | x | Sqlite3 | 
Quickstart
  gem install doc_ripper
Pass the path of the file you want to read:
  require 'doc_ripper'
  DocRipper::rip('/path/to/file')
If the file cannot be read, nil will be returned.
  DocRipper::rip('/path/to/missing/file')
  => nil
Want to raise an exception? Use #rip!
#rip! raises an exception when #rip would return nil or when the file type isn't supported.
  # invalid file type
  DocRipper::rip!('/path/to/invalid/file.type')
  => DocRipper::UnsupportedFileType
  # missing file
  DocRipper::rip!('/path/to/missing/file.doc')
  => DocRipper::FileNotFound
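When a failure should be reported rather than abort the program, the exception from #rip! can be rescued. The following is a minimal sketch, not part of DocRipper: `try_rip` and `DEFAULT_RIPPER` are hypothetical names, and the sketch assumes that DocRipper's exceptions descend from StandardError.

```ruby
# Hypothetical wrapper; only DocRipper::rip! comes from the gem.
# The gem is loaded lazily so the helper can also be exercised with a stand-in.
DEFAULT_RIPPER = lambda do |path|
  require 'doc_ripper'
  DocRipper::rip!(path)
end

# Returns [text, nil] on success or [nil, error_class_name] on failure.
# Assumption: DocRipper's exceptions are StandardError subclasses.
def try_rip(path, ripper: DEFAULT_RIPPER)
  [ripper.call(path), nil]
rescue StandardError => e
  [nil, e.class.name]
end

# Usage (requires the doc_ripper gem to be installed):
#   text, error = try_rip('/path/to/missing/file.doc')
#   warn "Could not read document: #{error}" if error
```

Returning the error class name keeps the caller free of rescue blocks while still distinguishing an unsupported file type from a missing file.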
Dependencies
- Ruby version >= 1.9.2
- Poppler-utils, which provides pdftotext (.pdf)
- Antiword (.doc); more info: http://linux.die.net/man/1/antiword
- Sketch support requires sqlite3 and the sqlite3 gem
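Because .doc and .pdf parsing rely on external command-line tools, it can help to confirm they are installed before ripping files. The check below is a rough sketch and not part of DocRipper: `REQUIRED_TOOLS`, `tool_available?` and `missing_tools` are hypothetical names, and it assumes a Unix-like system where the tools are installed as `antiword` and `pdftotext` on the PATH. It does not cover the sqlite3 requirement for .sketch files.

```ruby
# Hypothetical pre-flight check; not part of DocRipper's API.
# Maps each file extension to the command-line tool listed for it above.
REQUIRED_TOOLS = {
  '.doc' => 'antiword',
  '.pdf' => 'pdftotext'
}.freeze

# Returns true if an executable file with the given name is found on PATH.
def tool_available?(name)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Returns a Hash of extension => tool for every tool missing on this machine.
def missing_tools
  REQUIRED_TOOLS.reject { |_ext, tool| tool_available?(tool) }
end

# Print a warning for each missing tool.
missing_tools.each do |ext, tool|
  warn "#{tool} not found on PATH; #{ext} files may not parse"
end
```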