Rley

Linux Build Status Build status Coverage Status Gem Version Dependency Status Inline docs License

A Ruby library for constructing general parsers for any context-free language.

====

What is Rley?

Rley uses the Earley algorithm which is a general parsing algorithm that can handle any context-free grammar. Earley parsers can literally swallow anything that can be described by a context-free grammar. That's why Earley parsers find their place in so many NLP (Natural Language Processing) libraries/toolkits.

In addition, Rley goes beyond most Earley parser implementations by providing support for ambiguous parses. Indeed, it delivers the results of a parse as a Shared Packed Parse Forest (SPPF). A SPPF is a data structure that allows to encode efficiently all the possible parse trees that result from an ambiguous grammar.

As another distinctive mark, Rley is also the first Ruby implementation of a parsing library based on the new Grammar Flow Graph approach References on GFG.

What it can do?

Maybe parsing algorithms and internal implementation details are of lesser interest to you and the good question to ask is "what Rley can really do?".

In a nutshell:

  • Rley can parse context-free languages that other well-known libraries cannot handle
  • Built-in support for ambiguous grammars that typically occur in NLP

In short, the foundations of Rley are strong enough to be useful in a large application range such as:

  • computer languages,
  • artificial intelligence and
  • Natural Language Processing.

Features

  • Simple API for context-free grammar definition,
  • Allows ambiguous grammars,
  • Generates shared packed parse forests,
  • Accepts left-recursive rules/productions,
  • Provides syntax error detection and reporting.

Compatibility

Rley supports the following Ruby implementations:

  • MRI 2.0
  • MRI 2.1
  • MRI 2.2
  • MRI 2.3
  • JRuby 9.0+

Getting Started

Installation

Installing the latest stable version is simple:

$ gem install rley

A whirlwind tour of Rley

The purpose of this section is show how to create a parser for a minimalistic English language subset. The tour is organized as follows:

  1. Defining the language grammar
  2. Creating a lexicon
  3. Creating a tokenizer
  4. Building the parser
  5. Parsing some input
  6. Generating the parse forest

The complete source code of the example used in this tour can be found in the examples directory

Defining the language grammar

The subset of English grammar is based on an example from the NLTK book.

    require 'rley'  # Load Rley library

    # Instantiate a builder object that will build the grammar for us
    builder = Rley::Syntax::GrammarBuilder.new do
      # Terminal symbols (= word categories in lexicon)
      add_terminals('Noun', 'Proper-Noun', 'Verb')
      add_terminals('Determiner', 'Preposition')

      # Here we define the productions (= grammar rules)
      rule 'S' => %w[NP VP]
      rule 'NP' => 'Proper-Noun'
      rule 'NP' => %w[Determiner Noun]
      rule 'NP' => %w[Determiner Noun PP]
      rule 'VP' => %w[Verb NP]
      rule 'VP' => %w[Verb NP PP]
      rule 'PP' => %w[Preposition NP]
    end
    # And now, let's build the grammar...
    grammar = builder.grammar

Creating a lexicon

    # To simplify things, lexicon is implemented as a Hash with pairs of the form:
    # word => terminal symbol name
    Lexicon = {
      'man' => 'Noun',
      'dog' => 'Noun',
      'cat' => 'Noun',
      'telescope' => 'Noun',
      'park' => 'Noun',  
      'saw' => 'Verb',
      'ate' => 'Verb',
      'walked' => 'Verb',
      'John' => 'Proper-Noun',
      'Mary' => 'Proper-Noun',
      'Bob' => 'Proper-Noun',
      'a' => 'Determiner',
      'an' => 'Determiner',
      'the' => 'Determiner',
      'my' => 'Determiner',
      'in' => 'Preposition',
      'on' => 'Preposition',
      'by' => 'Preposition',
      'with' => 'Preposition'
    }

Creating a tokenizer

    # A tokenizer reads the input string and converts it into a sequence of tokens
    # Highly simplified tokenizer implementation.
    def tokenizer(aText, aGrammar)
      tokens = aText.scan(/\S+/).map do |word|
        term_name = Lexicon[word]
        if term_name.nil?
          raise StandardError, "Word '#{word}' not found in lexicon"
        end
        terminal = aGrammar.name2symbol[term_name]
        Rley::Tokens::Token.new(word, terminal)
      end

      return tokens
    end

More ambitious NLP applications will surely rely on a Part-of-Speech tagger instead of creating a lexicon and tokenizer from scratch. Here are a few Ruby Part-of-Speech gems:

Building the parser

  # Easy with Rley...
  parser = Rley::Parser::GFGEarleyParser.new(grammar)

Parsing some input

    input_to_parse = 'John saw Mary with a telescope'
    # Convert input text into a sequence of token objects...
    tokens = tokenizer(input_to_parse, grammar)
    result = parser.parse(tokens)

    puts "Parsing successful? #{result.success?}" # => Parsing successful? true

Generating the parse forest

    pforest = result.parse_forest

Error reporting

Rley is a non-violent parser, that is, it won't throw an exception when it detects a syntax error. Instead, the parse result will be marked as non-successful. The parse error can then be identified by calling the GFGParsing#failure_reason method. This method returns an error reason object which can help to produce an error message.

Consider the example from the Parsing some input section above and, as an error, we delete the verb saw in the sentence to parse.

    # Verb has been removed from the sentence on next line
    input_to_parse = 'John Mary with a telescope'
    # Convert input text into a sequence of token objects...
    tokens = tokenizer(input_to_parse, grammar)
    result = parser.parse(tokens)

    puts "Parsing successful? #{result.success?}" # => Parsing successful? false
    exit(1)

As expected, the parse is now failing.
To get an error message, one just need to retrieve the error reason and ask it to generate a message.

    # Show error message if parse fails...
    puts result.failure_reason.message unless result.success?

Re-running the example with the error, results in the error message:

  Syntax error at or near token 2 >>>Mary<<<
  Expected one 'Verb', found a 'Proper-Noun' instead.

The standard Rley message not only inform about the location of the mistake, it also provides some hint by disclosing its expectations.

Let's experiment again with the original sentence but without the word telescope.

    # Last word has been removed from the sentence on next line
    input_to_parse = 'John saw Mary with a '
    # Convert input text into a sequence of token objects...
    tokens = tokenizer(input_to_parse, grammar)
    result = parser.parse(tokens)

    puts "Parsing successful? #{result.success?}" # => Parsing successful? false
    unless result.success?
      puts result.failure_reason.message
      exit(1)
    end

This time, the following output is displayed:

  Parsing successful? false
  Premature end of input after 'a' at position 5
  Expected one 'Noun'.

Again, the resulting error message is user-friendly.
Remark: currently, Rley reports an error position as the index of the
input token with which the error was detected.

Examples

The project source directory contains several example scripts that demonstrate how grammars are to be constructed and used.

Other similar Ruby projects

Rley isn't the sole implementation of the Earley parser algorithm in Ruby.
Here are a few other ones:

  • Kanocc gem -- Advertised as a Ruby based parsing and translation framework.
    Although the gem dates from 2009, the author still maintains its in a public repository in Github
    The grammar symbols (tokens and non-terminals) must be represented as (sub)classes. Grammar rules are methods of the non-terminal classes. A rule can have a block code argument that specifies the semantic action when that rule is applied.
  • lc1 project -- Advertised as a combination of Earley and Viterbi algorithms for [Probabilistic] Context Free Grammars
    Aimed in parsing brazilian portuguese.
    earley project -- An Earley parser (grammar rules are specified in JSON format).
    The code doesn't seem to be maintained: latest commit dates from Nov. 2011.
  • linguist project -- Advertised as a library for parsing context-free languages.
    It is a recognizer not a parser. In other words it can only tell whether a given input conforms to the grammar rules or not. As such it cannot build parse trees.
    The code doesn't seem to be maintained: latest commit dates from Oct. 2011.

Thanks to:

  • Professor Keshav Pingali, one of the creators of the Grammar Flow Graph parsing approach for his encouraging e-mail exchanges.

References on GFG

Since the Grammar Flow Graph parsing approach is quite new, it has not yet taken a place in standard parser textbooks. Here are a few references (and links) of papers on GFG:

Copyright (c) 2014-2017, Dimitri Geshef.
Rley is released under the MIT License see LICENSE.txt for details.