TreeTagger for Ruby

RubyGems | RTT Project Page | Source Code | Bug Tracker


DESCRIPTION

A Ruby based wrapper for the TreeTagger by Helmut Schmid.

Check it out if you are interested in Natural Language Processing (NLP) and/or Human Language Technology (HLT).

This library provides comprehensive bindings for the TreeTagger, a statistical, language-independent POS tagging and chunking tool.

TreeTagger is language agnostic: it never guesses which language you are going to use.

TODO:

  • References to Schmid’s publications;

  • How to use TreeTagger in the wild;

  • Input and output format, tokenization;

  • The current German parameter file has been estimated on single-byte encoded data.

Implemented Features

Simple tagging.

Please have a look at the CHANGELOG file for details on implemented and planned features.

INSTALLATION

Before you install the treetagger-ruby package please ensure you have downloaded and installed the TreeTagger itself.

The TreeTagger is copyrighted software by Helmut Schmid and IMS. Please read the license agreement before you download the TreeTagger package and language models.

After the installation of the TreeTagger set the environment variable TREETAGGER_BINARY to the location where the binary tree-tagger resides. Usually this binary is located under the bin directory in the main installation directory of the TreeTagger.

You also have to set the variable TREETAGGER_MODEL to the location of the appropriate language model you have acquired in the training step.

For instance, you may add the following lines to your .profile file:

export TREETAGGER_BINARY='/path/to/your/TreeTagger/bin/tree-tagger'
export TREETAGGER_MODEL='/path/to/your/TreeTagger/lib/german.par'

It is convenient to work with a default language model, but you can choose a different one whenever you instantiate a new tagger.

If you want to feed a lexicon file into your tagger, you can do it globally through the environment variable TREETAGGER_LEXICON.
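A quick way to catch configuration mistakes early is to check these variables before creating a tagger. The following is only a sketch: the helper name treetagger_env_problems is hypothetical and not part of treetagger-ruby, and it assumes the variables should point to an executable binary and to readable files.

```ruby
# Hypothetical pre-flight check; not part of treetagger-ruby.
# Returns a list of human-readable problems with the TreeTagger
# environment variables (empty when everything looks fine).
def treetagger_env_problems(env = ENV)
  problems = []

  binary = env['TREETAGGER_BINARY']
  if binary.nil? || binary.empty?
    problems << 'TREETAGGER_BINARY is not set'
  elsif !(File.file?(binary) && File.executable?(binary))
    problems << "TREETAGGER_BINARY is not an executable file: #{binary}"
  end

  model = env['TREETAGGER_MODEL']
  if model.nil? || model.empty?
    problems << 'TREETAGGER_MODEL is not set'
  elsif !File.readable?(model)
    problems << "TREETAGGER_MODEL is not a readable file: #{model}"
  end

  # The lexicon is optional, so it is only checked when present.
  lexicon = env['TREETAGGER_LEXICON']
  if lexicon && !lexicon.empty? && !File.readable?(lexicon)
    problems << "TREETAGGER_LEXICON is not a readable file: #{lexicon}"
  end

  problems
end

problems = treetagger_env_problems
if problems.empty?
  puts 'TreeTagger environment looks fine.'
else
  problems.each { |problem| warn problem }
end
```

The helper accepts any Hash-like object, so it can be exercised with a plain Hash instead of the real environment.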

treetagger-ruby is provided as a .gem package. To install it via RubyGems, issue the following command:

$ gem install treetagger-ruby

If you want to do a system-wide installation, do this as root (possibly using sudo).

Alternatively, use your Gemfile for dependency management.
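A minimal Gemfile for this might look as follows. It is only a sketch: the source URL is the public RubyGems index, which is an assumption about your setup, and no version constraint is given because this README does not name one.

```ruby
# Gemfile (sketch): pulls treetagger-ruby from the public RubyGems index.
source 'https://rubygems.org'

gem 'treetagger-ruby'
```

Running bundle install in the directory that contains this file then installs the gem together with your other dependencies.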

SYNOPSIS

Basic Usage

Basic usage is very simple:

require 'treetagger'
# Instantiate a tagger instance with default values.
tagger = TreeTagger::Tagger.new
# Process an array of tokens.
tagger.process(%w{Ich gehe in die Schule})
# Flush the pipeline.
tagger.flush
# Get the processed data.
tagger.get_output

Input Format

Basically, you have to provide a tokenized sequence, possibly with additional information on the lexical classes of tokens and on their probabilities. Every token has to be on a separate line.

Due to technical limitations, SGML tags (i.e. sequences with a leading < and a trailing >) cannot be valid tokens, since they are used internally to delimit meaningful content from flush tokens. This implies the use of the -sgml option, which cannot be changed by the user. It is a limitation of this library. If you do need to process tags, fall back to using the TreeTagger as a standalone program, possibly employing temp files to store your input and output. This behaviour will also be implemented in future versions of treetagger-ruby.

Every token may occur alone on the line or be followed by additional information:
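Since tokens that look like SGML tags cannot be processed, it may help to separate them from the rest before tagging. The sketch below is not part of treetagger-ruby: the constant SGML_LIKE and the helper split_sgml_like are hypothetical names, and the pattern simply encodes "starts with < and ends with >" as described in this section.

```ruby
# Hypothetical helper; not part of treetagger-ruby.
# Matches tokens that start with "<" and end with ">", which this
# section describes as unusable because of the internal -sgml option.
SGML_LIKE = /\A<.*>\z/m

# Returns two arrays: tokens that can be passed to the tagger,
# and tokens that look like SGML tags.
def split_sgml_like(tokens)
  tokens.partition { |token| !token.match?(SGML_LIKE) }
end

usable, rejected = split_sgml_like(%w[Ich gehe <b> in die Schule </b>])
puts usable.inspect
puts rejected.inspect
```

What to do with the rejected tokens (drop them, escape them, or fall back to the standalone TreeTagger) is left to the caller.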

  • token;

  • token (\tab tag)+;

  • token (\tab tag \space lemma)+;

  • token (\tab tag \space probability)+;

  • token (\tab tag \space probability \space lemma)+.

Your input may look like the following sentence:

Die     ART 0.99
neuen   ADJA neu
Hunde  NN NP
stehen  VVFIN 0.99 stehen
an
den
Mauern  NN Mauer
.

This wrapper accepts the input as String or Array.

If you want to use strings, you are responsible for the proper delimiters inside the string:

"Die\tART 0.99\nneuen\tADJA neu\nHunde\tNN NP\nstehen\tVVFIN 0.99 stehen\nan\nden\nMauern\tNN Mauer\n.\n"

Currently treetagger-ruby does not check your markup for correctness and will possibly report a TreeTagger::ExternalError if the TreeTagger process dies due to input errors.

Using arrays is more convenient since they can be built programmatically.
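Because the string input is not checked, a small validation step on your side can save a debugging session. The following sketch is an assumption-laden illustration, not library code: the names ALTERNATIVE, INPUT_LINE and malformed_lines are hypothetical, and the pattern is inferred from the line formats listed above (a token, then tab separated alternatives of one to three blank separated fields). The TreeTagger itself may be stricter or more lenient.

```ruby
# Hypothetical validator; not part of treetagger-ruby.
# One alternative: a tag, optionally followed by one or two more
# blank separated fields (probability and/or lemma).
ALTERNATIVE = /[^\t\n ]+(?: [^\t\n ]+){0,2}/
# One line: a token, followed by zero or more tab separated alternatives.
INPUT_LINE = /\A[^\t\n]+(?:\t#{ALTERNATIVE})*\z/

# Returns [line_number, content] pairs for every line that does not
# match the assumed format. Line numbers start at 1.
def malformed_lines(input)
  input.each_line.with_index(1).filter_map do |line, number|
    content = line.chomp
    [number, content] unless content.match?(INPUT_LINE)
  end
end

input = "Die\tART 0.99\nneuen\t ADJA neu\n"
malformed_lines(input).each do |number, content|
  warn "line #{number} looks malformed: #{content.inspect}"
end
```

In this example the second line is flagged because of the blank directly after the tab.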

Arrays should have the following structure:

  • ['token', 'token', 'token'];

  • ['token', ['token', ['POS', 'lemma'], ['POS', 'lemma']], 'token'];

  • ['token', ['token', ['POS', prob], ['POS', 'prob']], 'token'];

  • ['token', ['token', ['POS', prob, 'lemma'], ['POS', 'prob', 'lemma']]].

Internally, the array is converted to the sequence token\ntoken\tPOS lemma\tPOS lemma\ntoken\n, i.e. in the enriched version alternatives are tab separated and the entries of an alternative are blank separated.

Note that probabilities may be given as strings or as numbers.

Lexicon lookup is not implemented yet, so the latter three forms of input arrays are not supported for now.
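The conversion described above can be sketched in a few lines. This is an illustration of the target format only: serialize_tokens is a hypothetical helper, and the library's real implementation may differ.

```ruby
# Hypothetical illustration of the conversion described above;
# not the library's actual implementation.
def serialize_tokens(tokens)
  lines = tokens.map do |entry|
    if entry.is_a?(Array)
      token, *alternatives = entry
      # Fields of one alternative are joined by blanks,
      # alternatives are attached to the token with tabs.
      [token, *alternatives.map { |fields| fields.join(' ') }].join("\t")
    else
      entry.to_s
    end
  end
  # Every token line is terminated by a newline.
  lines.map { |line| "#{line}\n" }.join
end

input = ['Die', ['neuen', ['ADJA', 'neu']], ['stehen', ['VVFIN', 0.99, 'stehen']], '.']
print serialize_tokens(input)
```

Array#join converts numbers with to_s, so probabilities given as numbers and as strings end up in the same textual form.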

Output Format

For now you’ll get an array of strings. However, the precise structure of each string depends on the command-line arguments you provided when instantiating the tagger.

For instance, for the input ["Veruntreute", "die", "AWO", "Spendengeld", "?"] you’ll get the following output with the default command-line arguments:

["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>", "Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]
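If you prefer structured data over raw strings, the output shown above can be split on tabs. The sketch below assumes the default layout of token, tag and lemma seen in this example; TaggedToken and parse_output are hypothetical names, not part of treetagger-ruby, and other command-line arguments will need a different split.

```ruby
# Hypothetical helper; not part of treetagger-ruby.
# Assumes the default output layout: token, tag and lemma separated by tabs.
TaggedToken = Struct.new(:token, :tag, :lemma)

def parse_output(lines)
  lines.map { |line| TaggedToken.new(*line.split("\t", 3)) }
end

output = ["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>",
          "Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]

parse_output(output).each do |entry|
  puts format('%-12s %-6s %s', entry.token, entry.tag, entry.lemma)
end
```

Fields that are missing from a line are left as nil in the struct.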

See documentation in the TreeTagger::Tagger class for details on particular methods.

EXCEPTION HIERARCHY

While using TreeTagger you may encounter the following errors:

  • TreeTagger::UserError;

  • TreeTagger::RuntimeError;

  • TreeTagger::ExternalError.

These three kinds of errors all subclass TreeTagger::Error, which in turn is a subclass of StandardError. For an end user this means that all errors from treetagger-ruby can be intercepted with a single rescue clause.
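The single rescue clause can be sketched as follows. To keep the example runnable without the gem, it defines stand-in classes that mirror the hierarchy described above; in real code you would require 'treetagger' instead. The wrapper name with_tagger_errors_handled is hypothetical.

```ruby
# Stand-in definitions so that the sketch runs without the gem installed.
# They only mirror the hierarchy described above; with the gem available,
# require 'treetagger' takes the place of this module.
module TreeTagger
  class Error < StandardError; end
  class UserError < Error; end
  class RuntimeError < Error; end
  class ExternalError < Error; end
end

# Hypothetical wrapper: runs the block and turns any library error
# into a warning and a nil result. Other exceptions propagate.
def with_tagger_errors_handled
  yield
rescue TreeTagger::Error => e
  warn "Tagging failed (#{e.class}): #{e.message}"
  nil
end

result = with_tagger_errors_handled do
  raise TreeTagger::ExternalError, 'the TreeTagger process died'
end
puts result.inspect
```

Rescuing TreeTagger::Error rather than StandardError keeps unrelated errors, such as an ArgumentError in your own code, visible.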

SUPPORT

If you have questions, bug reports or any suggestions, please drop me an email :)

HOW TO CONTRIBUTE

If you want to implement features for future releases of this library, please contact me to suggest your ideas, report bugs or simply talk things through.

Please don’t feel offended if I cannot accept all your pull requests. I have to review them and find the appropriate time and place in the code base to incorporate your valuable changes.

Any help is deeply appreciated!

CHANGELOG

For details on future plans and work in progress see CHANGELOG.

CAUTION

This library is a work in progress! Though the interface is mostly complete, you might run into features that are not implemented yet.

Please contact me with your suggestions, bug reports and feature requests.

LICENSE

RTT is copyrighted software by Andrei Beliankou, 2011-

You may use, redistribute and change it under the terms provided in the LICENSE file.