Lingo - A full-featured automatic indexing system
VERSION
This documentation refers to Lingo version 1.10.2
DESCRIPTION
Lingo is an open source indexing system for research and teaching. The main functions of Lingo are:
- identification of (i.e. reduction to) basic word forms by means of dictionaries and suffix lists
- algorithmic decomposition
- dictionary-based synonymization and identification of phrases
- generic identification of phrases/word sequences based on patterns of word classes
Introduction
Lingo allows flexible and extendable linguistic analysis of text files. Here is a minimal configuration example to analyse this README file:
meeting:
  attendees:
    - text_reader: { files: 'README' }
    - debugger:    { eval: 'true', ceval: 'cmd!=:EOL', prompt: '<debug>: ' }
Lingo is told to hold a meeting with two attendees and to have them talk to each other, hence the name Lingo (= the technical language).
The first attendee is the text_reader. It can read files and communicates their content to the other attendees. For this purpose, the text_reader is given an output channel; everything that the text_reader has to say is sent through this channel. It will do nothing further until Lingo tells the first attendee to speak. Then the text_reader will open the file README (as per the files parameter) and pass its content to the other attendees via its output channel.
The second attendee, the debugger, does nothing but write everything that arrives on its input channel to the console (standard error). If you save the Lingo configuration shown above into the file readme.cfg and then run lingo -c readme -l en, the result will look something like this:
<debug>: *FILE('README')
<debug>: "= Lingo - [...]"
...
<debug>: "Lingo allows flexible and extendable linguistic analysis [...]"
<debug>: "is a minimal configuration example to analyse this README [...]"
...
<debug>: *EOF('README')
What we see are lines beginning with an asterisk (*) and lines without one. That's because Lingo distinguishes between commands and data. The text_reader not only read the content of the file, it also announced, by way of commands, when a file began and when it ended. This can (and will) be an important piece of information for other attendees that will be added later.
To try out Lingo's functionality without installing it first, have a look at Lingo Web. There you can enter some text and see the debug output Lingo generates, including tokenization, word identification, decomposition, etc.
Attendees
The following attendees are available for solving specific tasks (for more information, see each attendee's documentation):
text_reader - Reads files (or standard input) and puts their content into the channels line by line. (see Lingo::Attendee::TextReader)
tokenizer - Dissects lines into defined character strings, i.e. tokens. (see Lingo::Attendee::Tokenizer)
abbreviator - Identifies abbreviations and produces the long form if listed in a dictionary. (see Lingo::Attendee::Abbreviator)
word_searcher - Identifies tokens and turns them into words for further processing. To this end, it consults the dictionaries. (see Lingo::Attendee::WordSearcher)
stemmer - Identifies tokens not identified by the word_searcher by means of stemming. (see Lingo::Attendee::Stemmer)
decomposer - Tests any tokens not identified by the word_searcher for being compounds. (see Lingo::Attendee::Decomposer)
synonymer - Extends words with their synonyms. (see Lingo::Attendee::Synonymer)
vector_filter - Filters out everything and lets through only those tokens that are considered useful for indexing. (see Lingo::Attendee::VectorFilter)
object_filter - Similar to the vector_filter. (see Lingo::Attendee::ObjectFilter)
text_writer - Writes anything that it receives into a file (or to standard output). (see Lingo::Attendee::TextWriter)
formatter - Similar to the text_writer, but allows for custom output formats. (see Lingo::Attendee::Formatter)
debugger - Shows everything for debugging. (see Lingo::Attendee::Debugger)
variator - Tries to correct spelling errors and the like. (see Lingo::Attendee::Variator)
multi_worder - Identifies phrases (word sequences) based on a multiword dictionary. (see Lingo::Attendee::MultiWorder)
sequencer - Identifies phrases (word sequences) based on patterns of word classes. (see Lingo::Attendee::Sequencer)
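As an illustration, a minimal indexing pipeline combining several of these attendees might look like the following sketch; the option values mirror the examples used elsewhere in this README, and the output file extension vec is merely illustrative:
meeting:
  attendees:
    - text_reader:   { files: 'README' }
    - tokenizer:
    - word_searcher: { source: sys-dic, mode: first }
    - vector_filter: { lexicals: y, sort: term_abs }
    - text_writer:   { ext: vec, sep: "\n" }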
Furthermore, it may be useful to have a look at the configuration files lingo.cfg and en.lang.
Filters
Lingo is able to read HTML, XML, and PDF in addition to plain text.
Examples:
Read any file, guessing the correct type automatically:
- text_reader: { files: $(files), filter: true }
Read HTML files specifically (accordingly for XML):
- text_reader: { files: $(files), filter: 'html' }
Read PDF files, either with the pdf-reader gem (default):
- text_reader: { files: $(files), filter: 'pdf' }
or with the pdftotext command line tool:
- text_reader: { files: $(files), filter: 'pdftotext' }
Markup
Lingo is able to parse HTML/XML and MediaWiki markup to a limited extent.
Examples:
Identify HTML/XML tags in the input stream:
- tokenizer: { tags: true }
Identify MediaWiki markup in the input stream:
- tokenizer: { wiki: true }
Inline annotation
Lingo is able to annotate input text inline, instead of printing results out of context to external files.
Example:
# read files
- text_reader: { files: $(files) }
# keep whitespace
- tokenizer: { space: true }
# do processing...
- word_searcher: { source: sys-dic, mode: first }
# insert formatted results (e.g. "[[Name::lingo|Lingo]] finds [[Noun::word|words]].")
- formatter: { ext: out, format: '[[%3$s::%2$s|%1$s]]', map: { e: Name, s: Noun } }
Plugins
Lingo has a plugin system that allows you to implement additional features (e.g. add new attendees) or modify existing ones. Just create a file named lingo_plugin.rb in your Gem's lib directory or any directory that's in $LOAD_PATH. You can also define an environment variable LINGO_PLUGIN_PATH (by default ~/.lingo/plugins) with additional directories to load plugins from (*.rb).
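For example, to load plugins from a project-local directory instead (a sketch; the directory name is illustrative):
> export LINGO_PLUGIN_PATH="$PWD/plugins"
> lingo -c readme -l en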
A dedicated API to support writing and integrating plugins will be added in the future.
Server
Lingo comes with a server daemon, Lingo::Srv, that exposes an HTTP interface to Lingo's functionality. The configuration needs to ensure that input is read from standard input (files: STDIN on the text_reader) and output is written to standard output (ext: STDOUT on the text_writer).
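A minimal sketch of such a server configuration might look like this; the only stated requirements are the STDIN/STDOUT settings, the attendee chain in between is illustrative:
meeting:
  attendees:
    - text_reader:   { files: STDIN }
    - tokenizer:
    - word_searcher: { source: sys-dic, mode: first }
    - vector_filter: { lexicals: y }
    - text_writer:   { ext: STDOUT, sep: "\n" }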
Example: Start the Lingo server on port 6789 with language configuration en and the default configuration file; server options come before --, Lingo options come after.
> lingosrv -p 6789 -- -l en
You can also pass Lingo options through the LINGO_SRV_OPTS environment variable (e.g., LINGO_SRV_OPTS='-l en -c /path/to/your/srv.cfg').
JSON endpoint
Example: Ask the server about “Lingo server”; returns JSON data (output formatted for clarity).
> curl 'http://localhost:6789/?q=Lingo+server'
{
  "Lingo server" : [
    " <Lingo = [(lingo/s), (lingo/e)]>",
    " <server = [(server/s)]>"
  ]
}
Example: Ask the server about “Lingo” and “server”; returns JSON data (output formatted for clarity).
> curl -g 'http://localhost:6789/?q[]=Lingo&q[]=server'
{
  "[\"Lingo\", \"server\"]" : {
    "Lingo" : [
      " <Lingo = [(lingo/s), (lingo/e)]>"
    ],
    "server" : [
      " <server = [(server/s)]>"
    ]
  }
}
Raw endpoint
Example: Ask the server about “Lingo server”; returns raw Lingo response.
> curl --data 'Lingo server' http://localhost:6789/raw
<Lingo = [(lingo/s), (lingo/e)]>
<server = [(server/s)]>
Example: Ask the server about this file; returns raw Lingo response (output truncated for clarity).
> curl --data @README -H 'Content-Type: text/plain' http://localhost:6789/raw
:=/OTHR:
<Lingo = [(lingo/s), (lingo/e)]>
<-|?>
<A|?>
<full-featured|COM = [(full-featured/k), (full/s+), (full/a+), (full/v+), (featured/a+)]>
<automatic = [(automatic/s), (automatic/a)]>
<indexing = [(index/v)]>
<system = [(system/s)]>
[...]
Deployment
Lingo::Srv can be started directly through the provided command-line executable lingosrv (see above) or through any other Rack-compatible deployment option; a rackup file is included (see lingoctl rackup srv).
Example: To deploy Lingo::Srv with Passenger on Apache, create a symlink in the DocumentRoot pointing to the app's public/ directory; adjust the paths according to your environment (you can use current_gem to create a stable gem path):
/var/www
|
+-- lingo-srv -> /usr/lib/ruby/gems/2.1.0/gems/lingo-x.y.z/lib/lingo/srv/public
Then put the following snippet in Apache’s VirtualHost configuration:
<VirtualHost *:80>
  ...
  RackBaseURI /lingo-srv
  <Directory /var/www/lingo-srv>
    Options -MultiViews
    SetEnv LINGO_SRV_OPTS "-l en"  # <-- Optionally set Lingo options
  </Directory>
</VirtualHost>
In order to provide your own rackup file and Lingo configuration, create a directory with those files:
/srv/lingo-srv
|
+-- config.ru
|
+-- lingosrv.cfg
And then point Passenger at it:
<VirtualHost *:80>
  ...
  RackBaseURI /lingo-srv
  <Directory /var/www/lingo-srv>
    Options -MultiViews
    PassengerAppRoot /srv/lingo-srv  # <-- Add this line
  </Directory>
</VirtualHost>
Restart Apache and test the result (output formatted for clarity):
> curl http://localhost/lingo-srv/about
{
  "Lingo::Srv" : {
    "version" : "x.y.z"
  }
}
EXAMPLE
TODO: Full-fledged example to show off Lingo’s features and provide a basis for further discussion.
INSTALLATION AND USAGE
Since version 1.8.0, Lingo is available as a RubyGem, so a simple gem install lingo will install Lingo and its dependencies. You might want to run that command with administrator privileges, depending on your environment. Then you can call the lingo executable to process your text files. See lingo --help for available options.
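For example, installing Lingo and running it with the minimal configuration from the introduction might look like this:
> gem install lingo
> lingo -c readme -l en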
Please note that Lingo requires Ruby version 2.1 or higher to run (2.6 is the currently recommended version).
Since Lingo depends on native extensions, you need to make sure that development files for your Ruby version are installed. On Debian-based Linux platforms, they are included in the package ruby-dev; other distributions may have a similarly named package. On Windows, those development files are currently not required.
On JRuby, install gdbm for efficient database operations: gem install gdbm.
Dictionary and configuration file lookup
Lingo will search different locations to find dictionaries and configuration files. By default, these are the current working directory, your personal Lingo directory (~/.lingo), and the installation directory (in that order). You can control this lookup path by either moving files up the chain (using the lingoctl executable) or by setting various environment variables.
With lingoctl you can copy dictionaries and configuration files from your personal Lingo directory or the installation directory to the current working directory, so you can modify them and they will take precedence over the original ones. See lingoctl --help for usage information.
In order to change the search path itself, you can define the LINGO_PATH environment variable as a whole or its individual parts LINGO_CURR (the local Lingo directory), LINGO_HOME (your personal Lingo directory), and LINGO_BASE (the system-wide Lingo directory).
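For example, to have Lingo look in a project-specific directory before your personal one (a sketch; the directory names are illustrative):
> export LINGO_CURR="$PWD/my-project"
> export LINGO_HOME="$HOME/.lingo"
> lingo -c readme -l en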
Inside of any of these directories, dictionaries and configuration files are typically organized in the following directory structure:
config/ - Configuration files (*.cfg).
dict/ - Dictionary source files (*.txt) in language-specific subdirectories (de/, en/, …).
lang/ - Language definition files (*.lang).
store/ - Compiled dictionaries, generated from source files.
But for compatibility reasons these naming conventions are not enforced.
FILE FORMATS
Lingo uses three different types of files to determine its behaviour: configuration files control the details of the indexing process; language definitions specify grammar rules and dictionaries available for indexing; dictionaries, finally, hold the vocabulary used in indexing the input text and producing the results.
Configuration
Configuration files are defined in YAML syntax. They specify the attendees to call, in order, and the options to provide them with. The first attendee in any indexing process is the text_reader, which reads the input text and passes it on to the other attendees. Every attendee transforms or extends the input stream and automatically sends everything on to the next attendee. This process may be customized by explicitly specifying the input and/or output channels of individual attendees with the in and out options.
Example:
# input is taken from the previous attendee,
# output is sent to the named channel "syn"
- synonymer: { skip: '?,t', source: sys-syn, out: syn }
# input is taken from the named channel "syn",
# output is sent to the next attendee
- vector_filter: { in: syn, lexicals: y, sort: term_abs }
# input is taken from the previous attendee,
# output is sent to the next attendee
- text_writer: { ext: syn, sep: "\n" }
# input is taken from the named channel "syn"
# (ignoring the output of the previous attendee),
# output is sent to the next attendee
- vector_filter: { in: syn, lexicals: m }
# input is taken from the previous attendee,
# output is sent to the next attendee
- text_writer: { ext: mul, sep: "\n" }
Language definition
Language definitions, like configuration files, are defined in YAML syntax. They specify the dictionaries to be used as well as the grammar rules according to which the input shall be processed. These settings do not necessarily have to coincide with an existing language; they are application-specific.
Dictionaries
Dictionaries come in different varieties and encode the knowledge about the vocabulary used for indexing and analysis.
Supported dictionary formats:
SingleWord - One word (projection) per line. E.g. open source. (see Lingo::Database::Source::SingleWord)
MultiValue - Multiple words per line (separated with a unique symbol), all of which are interpreted as belonging to a single equivalence class. E.g. fax;telefax;facsimile. (see Lingo::Database::Source::MultiValue)
MultiKey - Similar to MultiValue, except that the first word will be treated as the preferred term (descriptor). E.g. fax;telefax;facsimile. (see Lingo::Database::Source::MultiKey)
KeyValue - One word and its associated projection per line, separated with a unique symbol. E.g. abfrage*query. (see Lingo::Database::Source::KeyValue)
WordClass - Similar to KeyValue, except that the projection may consist of multiple lexicalizations, each with its own word class and (optional) gender information. E.g. abort,abort #s|v, which is equivalent to abort,abort #s abort #v. (see Lingo::Database::Source::WordClass)
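For illustration, a few entries of a dictionary source file in the WordClass format might look like this (a sketch; the entries are modeled on the examples above rather than taken from an actual Lingo dictionary):
abort,abort #s|v
indexing,index #v
fax,fax #s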
Encoding word classes and gender information
TODO…
Lexicalizing multiword expressions
TODO…
Lexicalizing compounds
TODO…
ISSUES AND CONTRIBUTIONS
If you find bugs or want to suggest new features, please report them on GitHub. Include your Ruby version (ruby --version) and the version of Lingo you are using (lingo --version).
If you want to contribute to Lingo, please fork the project on GitHub and submit a pull request (bonus points for topic branches).
To make sure that Lingo's tests pass, install hen (typically gem install hen) and all development dependencies (either with gem install --development lingo or manually; see rake gem:dependencies). Then run rake test for the basic tests or rake test:all for the full test suite.
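In short, the steps might look like this:
> gem install hen
> gem install --development lingo
> rake test
> rake test:all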
LINKS
- Website
- Demo
- Documentation
- Source code
- RubyGem
- Bug tracker
- Travis CI
LITERATURE
Background and theoretical foundations
- Gödert, W.; Lepsky, K.; Nagelschmidt, M.: Informationserschließung und Automatisches Indexieren: ein Lehr- und Arbeitsbuch [http://dx.doi.org/10.1007/978-3-642-23513-9]. (German) Berlin etc.: Springer, 2012.
- Lepsky, K.; Vorhauer, J.: Lingo – ein open source System für die automatische Indexierung deutschsprachiger Dokumente [http://dx.doi.org/10.1515/ABITECH.2006.26.1.18]. (German) In: ABI Technik 26 (1), 2006. pp 18-29.
- Nohr, H.: Grundlagen der automatischen Indexierung: ein Lehrbuch [http://logos-verlag.de/cgi-bin/buch/isbn/0121]. (German) Berlin: Logos, 2005.
- Hausser, R.: Grundlagen der Computerlinguistik. Mensch-Maschine-Kommunikation in natürlicher Sprache [http://zbmath.org/?q=an:0956.68141]. (German) Berlin etc.: Springer, 2000.
- Allen, J.: Natural language understanding [http://zbmath.org/?q=an:0851.68106]. (English) Redwood City, CA: Benjamin/Cummings, 1995.
- Grishman, R.: Computational linguistics: an introduction [http://cambridge.org/9780521310383]. (English) Cambridge: Cambridge Univ. Press, 1986.
- Salton, G.; McGill, M.: Introduction to modern information retrieval [http://zbmath.org/?q=an:0523.68084]. (English) New York etc.: McGraw-Hill, 1983.
- Porter, M.: An algorithm for suffix stripping [http://tartarus.org/~martin/PorterStemmer/]. (English) In: Program 14 (3), 1980. pp 130-137.
Research publications
- Grün, S.: Mehrwortbegriffe und Latent Semantic Analysis: Bewertung automatisch extrahierter Mehrwortgruppen mit LSA. (German) Düsseldorf: Heinrich-Heine-Universität Düsseldorf, 2017.
- Siebenkäs, A.; Markscheffel, B.: Conception of a workflow for the semi-automatic construction of a thesaurus for the German printing industry [https://zenodo.org/record/17945]. (English) In: Re:inventing Information Science in the Networked Society. Proceedings of the 14th International Symposium on Information Science (ISI 2015), Zadar, Croatia, 19th-21st May 2015. Eds.: F. Pehar, C. Schlögl, C. Wolff. Glückstadt: Verlag Werner Hülsbusch, 2015. pp 217-229.
- Grün, S.: Bildung von Komposita-Indextermen auf der Basis einer algorithmischen Mehrwortgruppenanalyse mit Lingo. (German) Köln: Fachhochschule Köln, 2015.
- Bredack, J.; Lepsky, K.: Automatische Extraktion von Fachterminologie aus Volltexten [http://dx.doi.org/10.1515/abitech-2014-0002]. (German) In: ABI Technik 34 (1), 2014. pp 2-12.
- Bredack, J.: Extraktion von Mehrwortgruppen in kunsthistorischen Fachtexten [http://ixtrieve.fh-koeln.de/lehre/bredack-2013.pdf]. (German) Köln: Fachhochschule Köln, 2013.
- Maylein, L.; Langenstein, A.: Neues vom Relevanz-Ranking im HEIDI-Katalog der Universitätsbibliothek Heidelberg [http://b-i-t-online.de/heft/2013-03-fachbeitrag-maylein.pdf]. (German) In: b.i.t.online 16 (3), 2013. pp 190-200.
- Gödert, W.: Detecting multiword phrases in mathematical text corpora [http://arxiv.org/abs/1210.0852]. (English) arXiv:1210.0852 [cs.CL], 2012.
- Jersek, T.: DDC-Klassifizierung mit Lingo: Vorgehensweise und Ergebnisse [http://www.citeulike.org/user/klaus-lepsky/article/12476139]. (German) Köln: Fachhochschule Köln, 2012.
- Glaesener, L.: Indexieren einer informationswissenschaftlichen Datenbank mit Mehrwortgruppen [http://www.citeulike.org/user/klaus-lepsky/article/12476133]. (German) Köln: Fachhochschule Köln, 2012.
- Schiffer, R.: Indexieren technischer Kongressschriften [http://ixtrieve.fh-koeln.de/lehre/schiffer-2007.pdf]. (German) Köln: Fachhochschule Köln, 2007.
CREDITS
Lingo is based on a collective development by Klaus Lepsky and John Vorhauer.
Authors
- John Vorhauer <[email protected]>
- Jens Wille <[email protected]>
Contributors
- Klaus Lepsky <[email protected]>
- Jan-Helge Jacobs <[email protected]>
- Thomas Müller <[email protected]>
- Yulia Dorokhova <[email protected]>
LICENSE AND COPYRIGHT
Copyright © 2005-2007 John Vorhauer
Copyright © 2007-2019 John Vorhauer, Jens Wille
Lingo is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Lingo is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with Lingo. If not, see <www.gnu.org/licenses/>.