textminer

textminer helps you text mine through Crossref's TDM (Text & Data Mining) services.

Changes

For changes see the CHANGELOG

gem API

  • Textminer.search - Search by DOI, query string, filters, etc. to get Crossref metadata, which you can use downstream to get full-text links. This method wraps Serrano.works() but exposes only a subset of its parameters; the interface may change depending on feedback.
  • Textminer.fetch - Fetch full text given a URL; supports Crossref's Text and Data Mining service.
  • Textminer.extract - Extract text from a PDF.

Install

Release version

gem install textminer

Development version

git clone git@github.com:sckott/textminer.git
cd textminer
rake install

Examples

Within Ruby

Search by DOI

require 'textminer'
# link to full text available
Textminer.search(doi: '10.7554/elife.06430')
# no link to full text available
Textminer.search(doi: "10.1371/journal.pone.0000308")

Many DOIs at once

require 'serrano'
dois = Serrano.random_dois(sample: 6)
Textminer.search(doi: dois)

Search with filters

Textminer.search(filter: {has_full_text: true})

The object returned from Textminer.search has methods for pulling out all links, XML links only, PDF links only, or plain-text links only:

x = Textminer.search(filter: {has_full_text: true})
x.links_xml
x.links_pdf
x.links_plain
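
To illustrate the idea behind these methods, here is a minimal sketch of filtering full-text links by content type. It assumes each Crossref work record carries a "link" array whose entries have "URL" and "content-type" keys; the record, the placeholder URLs, and the links_by_type helper are hypothetical and are not part of textminer.

```ruby
# Hypothetical Crossref-style record; the DOI and URLs are placeholders,
# not real articles.
record = {
  "DOI" => "10.5555/example",
  "link" => [
    { "URL" => "https://example.org/article.xml", "content-type" => "application/xml" },
    { "URL" => "https://example.org/article.pdf", "content-type" => "application/pdf" },
    { "URL" => "https://example.org/article.txt", "content-type" => "text/plain" }
  ]
}

# Hypothetical helper (an assumption, not textminer's implementation):
# return the URLs of all links whose content type is one of the given types.
# A record without a "link" key yields an empty array.
def links_by_type(record, *types)
  Array(record["link"])
    .select { |link| types.include?(link["content-type"]) }
    .map { |link| link["URL"] }
end

xml_links   = links_by_type(record, "application/xml", "text/xml")
pdf_links   = links_by_type(record, "application/pdf")
plain_links = links_by_type(record, "text/plain")
```

The real methods may differ in which content types they accept and in the shape of what they return.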

Fetch full text

Textminer.fetch() gets the full text for a given URL. How the content is downloaded and parsed depends on its content type.

# get some metadata
res = Textminer.search(member: 2258, filter: {has_full_text: true});
# get links
links = res.links_xml(true);
# Get full text for an article
res = Textminer.fetch(url: links[0]);
# url
res.url
# file path
res.path
# content type
res.type
# parse content
res.parse
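
Dispatching on content type might look roughly like the sketch below. It is not textminer's code: the handler_for helper and the symbols it returns are assumptions made for illustration, and the set of recognised media types is a guess based on the XML, PDF, and plain-text links mentioned above.

```ruby
# Hypothetical helper (an assumption, not textminer's implementation):
# map a Content-Type header value to a symbol naming how the body
# might be handled.
def handler_for(content_type)
  # Drop parameters such as "; charset=utf-8" and normalise case.
  media_type = content_type.to_s.split(";").first.to_s.strip.downcase

  case media_type
  when "application/xml", "text/xml" then :xml
  when "application/pdf"             then :pdf
  when "text/plain"                  then :plain
  else                                    :unknown
  end
end

handler_for("text/xml; charset=UTF-8")
handler_for("application/pdf")
```

A caller could then branch on the returned symbol, for example parsing XML in memory but writing a PDF to disk before extracting text from it.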

Extract text from PDF

Textminer.extract() extracts text from a PDF, given the path to the file.

res = Textminer.search(member: 2258, filter: {has_full_text: true});
links = res.links_pdf(true);
res = Textminer.fetch(url: links[0]);
Textminer.extract(res.path)

On the CLI

Coming soon...

To do

  • CLI executable
  • better test suite
  • better documentation