rbbt

Rbbt stands for Ruby Bio-Text. It started as a text mining API developed for SENT, but its functionality has since been used in other applications as well, such as MARQ.

Important Note

Some unexpected gem dependencies may appear.

Rbbt covers several functionalities. Some work right away; others require installing dependencies or downloading and processing data from the internet. Since not all users are likely to need all the functionalities, this gem's dependencies include only the very basic requirements. Dependencies may appear unexpectedly when using new parts of the API.

Functionality

Data sources interface

PubMed: Making queries and retrieving articles.
BioMart: Making queries to BioMart programmatically. It can divide a large query into smaller ones and merge the results.
Entrez: Retrieving gene entries, associated articles, and gene synonyms and aliases.
Biocreative: Using the competition test and training data to train and evaluate Named Entity Extraction models and Gene Mention Normalization.
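
The idea of dividing a large query into smaller ones and merging the results can be sketched in plain Ruby. This is only an illustration, not rbbt's BioMart code: chunked_query and FAKE_DB are hypothetical names, and the block stands in for the remote query so that the example runs offline.

```ruby
# Hypothetical helper, not part of rbbt: split a large list of ids into
# chunks, run one query per chunk, and merge the partial results.
def chunked_query(ids, chunk_size)
  results = {}
  ids.each_slice(chunk_size) do |chunk|
    # The block stands in for the remote query. It must return a Hash
    # mapping each id in the chunk to an Array of values.
    partial = yield(chunk)
    # If an id shows up in more than one partial result, join its values.
    results.merge!(partial) { |_id, old, new| (old + new).uniq }
  end
  results
end

# Made-up data used in place of a remote service.
FAKE_DB = { 'g1' => ['A'], 'g2' => ['B', 'C'], 'g3' => ['D'] }

merged = chunked_query(%w[g1 g2 g3], 2) do |chunk|
  chunk.each_with_object({}) { |id, h| h[id] = FAKE_DB.fetch(id, []) }
end
puts merged.inspect
```

With a chunk size of 2, the three ids above are fetched in two calls and the merged Hash contains all of them.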

Text mining tasks

BagOfWords: Bag-of-words representation of text. Chunks text into terms, which can be unigrams or bi-grams, removes stopwords, builds a term thesaurus using a TF_IDF (term frequency inverse document frequency) or a KL (Kullback-Leibler divergence) Dictionary, and extracts a bag-of-words representation suitable for the Classifier.
Classifier: Uses R to build classification models and to classify new entries. Currently the models are Support Vector Machines.
NER: Named Entity Extraction. Currently there are four alternatives: Abner, Banner, RegExpNER, and NER. The first two are third-party Java systems that require the rjb (Ruby Java Bridge) gem to be installed. The third one, RegExpNER, is a simple regular-expression based system that can be used when there is not enough data to train a CRF-based system, for example, to find Polysearch terms. The last one, the default, is a reimplementation of a CRF-based system, like Abner and Banner, that is completely configurable using a simple DSL (domain specific language).
Normalizer: Resolve gene mentions to the actual genes they refer to. It compares the gene mention to all possible gene names and synonyms to find the best match. It is configurable using a DSL.
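
As a rough illustration of the bag-of-words step described above (terms as unigrams or bi-grams, stopword removal, term counts), here is a minimal plain Ruby sketch. It does not use rbbt's BagOfWords module, whose tokenization and options may differ; STOPWORDS, terms and bag_of_words are hypothetical names chosen for this example.

```ruby
require 'set'

# Hypothetical stopword list, for illustration only.
STOPWORDS = Set.new(%w[the of a in is and to])

# Split text into lowercase terms, drop stopwords, and optionally add
# bi-grams built from adjacent remaining terms.
def terms(text, bigrams = true)
  words = text.downcase.scan(/[a-z0-9]+/).reject { |w| STOPWORDS.include?(w) }
  return words unless bigrams
  words + words.each_cons(2).map { |a, b| "#{a} #{b}" }
end

# Count how often each term occurs in the text.
def bag_of_words(text, bigrams = true)
  terms(text, bigrams).each_with_object(Hash.new(0)) { |t, bag| bag[t] += 1 }
end

puts bag_of_words('The role of the yeast gene in the yeast cell').inspect
```

Note that in this sketch bi-grams are formed after stopwords are removed, so "gene in the yeast" yields the bi-gram "gene yeast". Whether that is desirable depends on the task.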

Organisms support

Using configuration files, rbbt can support different organisms. The system is prepared to parse organism-specific database files and merge them with Entrez and BioMart, producing the following information:

Lexicon: Listing the synonyms for each gene
Identifiers: Listing different identifiers for each gene like Entrez Gene Ids, Unigene, Affymetrix probe ids, etc. This is not the same as the lexicon which holds names, not identifiers.
GO: Listing associations of genes to GO terms.
PubMed articles: Listing the articles associated with each gene, as listed in Entrez or cited in support of GO associations.

With this information rbbt offers the following functionality via the Organism class

NER and Normalization: Loads custom models for Named Entity Extraction and Gene Mention Normalization
Identifiers translation: Translates gene identifiers between formats.
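
Conceptually, identifier translation amounts to a lookup from any known identifier to the identifier in the target format. The sketch below shows that idea in plain Ruby. It is not how the Organism class is implemented; build_index and the identifiers in ROWS are made up for illustration.

```ruby
# Hypothetical rows: the first field is the identifier in the target
# format, the rest are alternative identifiers for the same gene.
# All values are made up.
ROWS = [
  %w[N001 ALT-A 1001],
  %w[N002 ALT-B 1002]
]

# Build a Hash that maps every identifier to the target-format identifier.
def build_index(rows)
  rows.each_with_object({}) do |(native, *others), index|
    index[native] = native
    others.each { |id| index[id] = native }
  end
end

INDEX = build_index(ROWS)
puts INDEX['ALT-A']
puts INDEX['1002']
```

An identifier that is not in the index simply returns nil, which a caller can use to report untranslatable input.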

Organisms in rbbt are identified using a keyword. This is the list of organisms currently supported with their associated keywords:

Candida albicans: Cal
Mus musculus: Mmu
Rattus norvegicus: Rno
Saccharomyces cerevisiae: Sce
Arabidopsis thaliana: Ata
Caenorhabditis elegans: Cel
Homo sapiens: Hsa
Schizosaccharomyces pombe: Spo

Other

Cache: The system caches PubMed articles and Entrez gene entries in a persistent cache, since these items are unlikely to change. It also caches any other data downloaded from the internet, such as BioMart queries, in a non-persistent cache that can be purged to update the system.
Tab separated file helpers: The data in rbbt is saved in tab separated files and loaded into Hash objects. Modules like Open or ArrayHash help with handling these files and data structures.
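
To make the tab separated file convention concrete, the following plain Ruby sketch loads such data into a Hash keyed by the first field. It does not use the Open or ArrayHash modules; load_tsv is a hypothetical helper, and treating lines that start with # as comments is an assumption of this example.

```ruby
require 'stringio'

# Hypothetical helper: read tab separated lines into a Hash. The first
# field is the key and the remaining fields are stored as an Array.
def load_tsv(io)
  io.each_line.each_with_object({}) do |line, data|
    # Skip blank lines and (by assumption) comment lines starting with '#'.
    next if line.strip.empty? || line.start_with?('#')
    key, *values = line.chomp.split("\t")
    data[key] = values
  end
end

# Made-up content, read from memory instead of a file.
sample = StringIO.new("#id\tnames\nG1\tfoo\tbar\nG2\tbaz\n")
puts load_tsv(sample).inspect
```

The same method works on a File object, for example inside File.open(path) { |f| load_tsv(f) }.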

Installation

Install the gem in the usual way with gem install rbbt. The gem includes a configuration tool, rbbt_config. The first time you run it, it will ask you to configure some paths. After that you may use it to process data for different tasks. Let's look at some scenarios:

Using rbbt to translate identifiers

Run rbbt_config prepare identifiers to deploy the configuration files and download the Entrez data. This needs to be done just once.
Now you may run rbbt_config install organisms to process all the organisms, or rbbt_config install organisms -o Sce to process only yeast (Sce).
You may now use a script like this to translate yeast gene identifiers fed from the standard input:

require 'rbbt/sources/organism'

index = Organism.id_index('Sce', :native => 'Entrez Gene Id')

STDIN.each_line{|l| puts "#{l.chomp} => #{index[l.chomp]}"}

Using rbbt to find gene mentions in text

First prepare the organisms as you did in the previous section. Next, if you want to use the default NER module:

Install the Biocreative data used to train the model and compile the CRF++ plugin with rbbt_config prepare rner. At this point you may need to install ParseTree and ruby2ruby.
Build the module for a particular organism rbbt_config install ner -o Sce. You need to have the gems ParseTree and ruby2ruby for this to work. This process can take a long time.

Or, if you want to use Abner or Banner:

Download and install the packages with rbbt_config prepare java_ner.

You may now, for example, find mentions of genes in articles from a PubMed query using this script:

require 'rbbt/sources/organism'
require 'rbbt/sources/pubmed'

# type = :abner
# type = :banner
type = :rner

ner = Organism.ner('Sce', type )
pmids = PubMed.query(ARGV[0], 500)

PubMed.get_article(pmids).each{|pmid,article|
  mentions = ner.extract(article.text)
  puts pmid
  puts article.text
  puts "Mentions: " << mentions.uniq.join(", ")
  puts
}

More Installation Guidelines

This is the complete list of gem requirements: ParseTree ruby2ruby simpleconsole rjb rsruby stemmer rand rake progress-monitor. Some of these gems did not work with ruby 1.9 at the time of writing, or may be a bit more complicated to install. For that reason *they are not reported as dependencies and are only required when they are about to be used*. Note that some of these gems are in the gemcutter repository, so you may need to install the gemcutter gem and run gem tumble.

Some parts of the API require data that has been processed using rbbt_config. This command is used to install third-party software, download data from the internet, or build models. The command rbbt_config prepare all will install and process everything. This will take a long time, especially building the NER models, so you might want to start with the basic install and add more things as they are needed.

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Add tests for it. This is important so I don’t break it in a future version unintentionally.
Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine, but bump version in a commit by itself that I can ignore when I pull)
Send me a pull request. Bonus points for topic branches.