Class: Slaw::Parse::Builder

Inherits: Object

Includes: Logging, Namespace

Defined in: lib/slaw/parse/builder.rb

Overview

The primary class for building Akoma Ntoso documents from plain text documents.

The builder uses a grammar to break down a plain-text version of an act into a syntax tree. This tree can then be serialized into an Akoma Ntoso compatible XML document.

Examples:

Parse some text into a well-formed document

builder = Slaw::Parse::Builder.new(parser: parser)
xml = builder.parse_text(text)
doc = builder.parse_xml(xml)
builder.postprocess(doc)

A quicker way to build a well-formed document

doc = builder.parse_and_process_text(text)

Constant Summary

@@parsers = {}

Constants included from Namespace

Namespace::NS

Instance Attribute Summary

#fragment_id_prefix ⇒ Object

Prefix to use when generating IDs for fragments.
#parse_options ⇒ Object

Additional hash of options to be provided to the parser when parsing.

Instance Method Summary

#add_terms_to_references(doc, terms) ⇒ Object
#find_definitions(doc) ⇒ Hash{String, String}

Find `def` elements in the document and return a Hash from term ids to the text of each term.
#find_short_title(doc) ⇒ Object

Find the short title and add it as an FRBRalias element in the meta section.
#find_term_references(doc, terms) ⇒ Object

Find and decorate references to terms in the document.
#guess_at_definitions(doc) ⇒ Object
#initialize(opts = {}) ⇒ Builder constructor

Create a new builder.
#link_definitions(doc) ⇒ Object

Find definitions of terms and introduce them into the meta section of the document.
#nest_blocklists(doc) ⇒ Object

Correctly nest blocklists.
#normalise_headings(doc) ⇒ Object

Change CAPCASE headings into Sentence case.
#parse_and_process_text(text, parse_options = {}) ⇒ Nokogiri::XML::Document

Do all the work necessary to parse text into a well-formed XML document.
#parse_text(text, parse_options = {}) ⇒ String

Parse text into XML.
#parse_xml(xml) ⇒ Nokogiri::XML::Document

Parse a string into a Nokogiri::XML::Document.
#postprocess(doc) ⇒ Nokogiri::XML::Document

Postprocess an XML document.
#renumber_terms(doc) ⇒ Object

Recalculate ids for <term> elements.
#text_to_syntax_tree(text, parse_options = {}) ⇒ Object

Parse plain text into a syntax tree.
#to_xml(doc) ⇒ String

Serialise a Nokogiri::XML::Document into a string.
#xml_from_syntax_tree(tree) ⇒ String

Generate an XML document from the given syntax tree.

Methods included from Logging

#logger

Constructor Details

#initialize(opts = {}) ⇒ `Builder`

Create a new builder.

Specify either `:parser`, or both `:grammar_file` and `:grammar_class`.

Parameters:

opts (Hash) (defaults to: {}) —

a customizable set of options

Options Hash (opts):

:parser (Treetop::Runtime::CompiledParser) —

parser to use
:grammar_file (String) —

grammar filename to load a parser from
:grammar_class (String) —

name of the class that the grammar will generate

# File 'lib/slaw/parse/builder.rb', line 42

def initialize(opts={})
  if opts[:parser]
    @parser = opts[:parser]
  elsif opts[:grammar_file] and opts[:grammar_class]
    if @@parsers[opts[:grammar_class]]
      # already compiled the grammar, just use it
      @parser = @@parsers[opts[:grammar_class]]
    else
      # load the grammar
      Treetop.load(opts[:grammar_file])
      cls = eval(opts[:grammar_class])
      @parser = cls.new
    end
  else
    raise ArgumentError.new("Specify either :parser or :grammar_file and :grammar_class")
  end

  @parse_options = {}
end
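
The listing above looks up a previously compiled parser in @@parsers, keyed by the grammar class name. As a rough sketch of that compile-once idea (not the library's code), the snippet below fills a cache on first use and reuses the entry afterwards. The names parsers, build_parser and "HypotheticalGrammarParser" are assumptions made up for illustration, and a plain Object stands in for the parser that Treetop would generate.

```ruby
# Hypothetical cache, keyed by grammar class name (plays the role of @@parsers).
parsers = {}
compile_count = 0

# Hypothetical builder: compiles on the first request, reuses the result afterwards.
build_parser = lambda do |grammar_class|
  parsers[grammar_class] ||= begin
    compile_count += 1 # stands in for Treetop.load + cls.new
    Object.new
  end
end

first  = build_parser.call("HypotheticalGrammarParser")
second = build_parser.call("HypotheticalGrammarParser")

puts first.equal?(second)
puts compile_count
```

Keying by class name means two builders created with the same :grammar_class would share one parser instance under this scheme.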

Instance Attribute Details

#fragment_id_prefix ⇒ `Object`

Prefix to use when generating IDs for fragments




# File 'lib/slaw/parse/builder.rb', line 33

def fragment_id_prefix
  @fragment_id_prefix
end

#parse_options ⇒ `Object`

Additional hash of options to be provided to the parser when parsing.




# File 'lib/slaw/parse/builder.rb', line 30

def parse_options
  @parse_options
end

Instance Method Details

#add_terms_to_references(doc, terms) ⇒ `Object`

# File 'lib/slaw/parse/builder.rb', line 275

def add_terms_to_references(doc, terms)
  refs = doc.at_xpath('//a:meta/a:references', a: NS)
  unless refs
    refs = doc.create_element('references', source: "#this")
    doc.at_xpath('//a:meta/a:identification', a: NS).after(refs)
  end

  # nuke all existing term reference elements
  refs.xpath('a:TLCTerm', a: NS).each { |el| el.remove }

  for id, term in terms
    # <TLCTerm id="term-applicant" href="/ontology/term/this.eng.applicant" showAs="Applicant"/>
    refs << doc.create_element('TLCTerm',
                               id: id,
                               href: "/ontology/term/this.eng.#{id.gsub(/^term-/, '')}",
                               showAs: term)
  end
end
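
To make the shape of each generated TLCTerm entry concrete, here is a small sketch that computes the same attributes from a terms hash without touching an XML document. The sample term is taken from the comment in the listing; the attrs variable is an assumption introduced for illustration.

```ruby
# Term ids mapped to their display text, as returned by find_definitions.
terms = { "term-applicant" => "Applicant" }

# Build the attribute set that each TLCTerm element would carry.
attrs = terms.map do |id, term|
  {
    id: id,
    # The "term-" prefix is stripped before the id is appended to the href.
    href: "/ontology/term/this.eng.#{id.gsub(/^term-/, '')}",
    showAs: term
  }
end

puts attrs.first[:href]
```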

#find_definitions(doc) ⇒ `Hash{String, String}`

Find `def` elements in the document and return a Hash from term ids to the text of each term.

Parameters:

doc (Nokogiri::XML::Document)

Returns:

(Hash{String, String})

# File 'lib/slaw/parse/builder.rb', line 222

def find_definitions(doc)
  guess_at_definitions(doc)

  terms = {}
  doc.xpath('//a:def', a: NS).each do |defn|
    # <p>"<def refersTo="#term-affected_land">affected land</def>" means land in respect of which an application has been lodged in terms of section 17(1);</p>
    id = defn['refersTo'].sub(/^#/, '')
    term = defn.content
    terms[id] = term

    logger.info("+ Found definition for: #{term}")
  end

  terms
end

#find_short_title(doc) ⇒ `Object`

Find the short title and add it as an FRBRalias element in the meta section

Parameters:

doc (Nokogiri::XML::Document)

# File 'lib/slaw/parse/builder.rb', line 182

def find_short_title(doc)
  logger.info("Finding short title")

  # Short title and commencement 
  # 8. This Act shall be called the Legal Aid Amendment Act, 1996, and shall come 
  # into operation on a date fixed by the President by proclamation in the Gazette. 

  doc.xpath('//a:body//a:heading[contains(text(), "hort title")]', a: NS).each do |heading|
    section = heading.parent.at_xpath('a:subsection', a: NS)
    if section and section.text =~ /this act (is|shall be called) the (([a-zA-Z\(\)]\s*)+, \d\d\d\d)/i
      short_title = $2

      logger.info("+ Found title: #{short_title}")

      node = doc.at_xpath('//a:meta//a:FRBRalias', a: NS)
      node['value'] = short_title
      break
    end
  end
end
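
The pattern used above can be tried on plain strings to see what it captures. This sketch applies the same regular expression to the sample sentence from the comment in the listing; the second sentence ("may be cited as") is a made-up example showing a wording the pattern does not cover.

```ruby
# Same pattern as in find_short_title.
short_title_re = /this act (is|shall be called) the (([a-zA-Z\(\)]\s*)+, \d\d\d\d)/i

text = "This Act shall be called the Legal Aid Amendment Act, 1996, and shall come " \
       "into operation on a date fixed by the President by proclamation in the Gazette."

match = short_title_re.match(text)
# Group 2 holds the title together with its year.
short_title = match && match[2]

# A different (hypothetical) phrasing is not recognised.
other = short_title_re.match("This Act may be cited as the Example Act, 1996.")

puts short_title
puts other.inspect
```

Because the character class only allows letters and parentheses, titles containing digits or hyphens before the year would not be picked up by this pattern.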

#find_term_references(doc, terms) ⇒ `Object`

Find and decorate references to terms in the document. The `terms` parameter is a Hash from term id to the text of the term.

# File 'lib/slaw/parse/builder.rb', line 296

def find_term_references(doc, terms)
  logger.info("+ Finding references to terms")

  i = 0

  # sort terms by the length of the defined term, desc,
  # so that we don't find short terms inside longer
  # terms
  terms = terms.to_a.sort_by { |pair| -pair[1].size }

  # look for each term
  for term_id, term in terms
    doc.xpath('//a:body//text()', a: NS).each do |text|
      # replace all occurrences in this text node

      # unless we're already inside a def or term element
      next if (["def", "term"].include?(text.parent.name))

      # don't link to a term inside its own definition
      owner = find_up(text, 'subsection')
      next if owner and owner.at_xpath(".//a:def[@refersTo='##{term_id}']", a: NS)

      while posn = (text.content =~ /\b#{Regexp::escape(term)}\b/)
        # <p>A delegation under subsection (1) shall not prevent the <term refersTo="#term-Minister" id="trm357">Minister</term> from exercising the power himself or herself.</p>
        node = doc.create_element('term', term, refersTo: "##{term_id}", id: "trm#{i}")

        pre = (posn > 0) ? text.content[0..posn-1] : nil
        post = text.content[posn+term.length..-1]

        text.before(node)
        node.before(doc.create_text_node(pre)) if pre
        text.content = post

        i += 1
      end
    end
  end
end
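
Two details of the listing above are easy to check in isolation: terms are tried longest first, and matches must fall on word boundaries. The sketch below shows both on plain strings. The helper first_match and the sample sentences are assumptions introduced for illustration.

```ruby
terms = {
  "term-land"          => "land",
  "term-affected_land" => "affected land"
}

# Longest term first, so "affected land" is tried before "land".
ordered = terms.to_a.sort_by { |pair| -pair[1].size }

# Hypothetical helper: offset of the first whole-word occurrence, or nil.
def first_match(text, term)
  text =~ /\b#{Regexp.escape(term)}\b/
end

sample = "any affected land may be sold"
posn   = first_match(sample, "affected land")

# Split the text around the match the same way the listing does.
pre  = posn > 0 ? sample[0..posn - 1] : nil
post = sample[posn + "affected land".length..-1]

puts ordered.map(&:first).inspect
puts first_match("the landlord", "land").inspect
```

Once the longer term has been wrapped in a term element, the check on the parent element name keeps the shorter term from matching inside it.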

#guess_at_definitions(doc) ⇒ `Object`

# File 'lib/slaw/parse/builder.rb', line 238

def guess_at_definitions(doc)
  doc.xpath('//a:section', a: NS).select do |section|
    # sections with headings like Definitions
    heading = section.at_xpath('a:heading', a: NS)
    heading && heading.content =~ /definitions|interpretation/i
  end.each do |section|
    # find items like "foo" means blah...
    
    section.xpath('.//a:p|.//a:listIntroduction', a: NS).each do |container|
      # only if we don't already have a definition here
      next if container.at_xpath('a:def', a: NS)

      # get first text node
      text = container.children.first
      next if (not text or not text.text?)

      match = /^\s*["“”](.+?)["“”]/.match(text.text)
      if match
        term = match.captures[0]
        term_id = 'term-' + term.gsub(/[^a-zA-Z0-9_-]/, '_')

        # <p>"<def refersTo="#term-affected_land">affected land</def>" means land in respect of which an application has been lodged in terms of section 17(1);</p>
        defn = doc.create_element('def', term, refersTo: "##{term_id}")
        rest = match.post_match

        text.before(defn)
        defn.before(doc.create_text_node('"'))
        text.content = '"' + rest

        # adjust the container's id
        parent = find_up(container, ['blockList', 'point']) || find_up(container, ['subsection', 'section'])
        parent['id'] = "def-#{term_id}"
      end
    end
  end
end
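
This method has no description of its own, so a small sketch may help show what it looks for. The snippet applies the same quoted-term pattern and id derivation to a plain string instead of an XML text node. The sample sentence is shortened from the comment in the listing, and definition_re is a name introduced here for illustration.

```ruby
# Same pattern as in guess_at_definitions: a quoted term at the start of the text.
definition_re = /^\s*["“”](.+?)["“”]/

text  = '"affected land" means land in respect of which an application has been lodged;'
match = definition_re.match(text)

term = match.captures[0]
# Anything outside letters, digits, underscore and hyphen becomes an underscore.
term_id = 'term-' + term.gsub(/[^a-zA-Z0-9_-]/, '_')
# Text after the closing quote, which stays in the original text node.
rest = match.post_match

puts term
puts term_id
```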

#link_definitions(doc) ⇒ `Object`

Find definitions of terms and introduce them into the meta section of the document.

Parameters:

doc (Nokogiri::XML::Document)

# File 'lib/slaw/parse/builder.rb', line 207

def link_definitions(doc)
  logger.info("Finding and linking definitions")

  terms = find_definitions(doc)
  add_terms_to_references(doc, terms)
  find_term_references(doc, terms)
  renumber_terms(doc)
end

#nest_blocklists(doc) ⇒ `Object`

Correctly nest blocklists.

The grammar produces flat blocklists, so the numbering of the list items is inspected to nest them correctly.

Parameters:

doc (Nokogiri::XML::Document)

# File 'lib/slaw/parse/builder.rb', line 350

def nest_blocklists(doc)
  logger.info("Nesting blocklists")

  Slaw::Parse::Blocklists.nest_blocklists(doc)
end

#normalise_headings(doc) ⇒ `Object`

Change CAPCASE headings into Sentence case.

Parameters:

doc (Nokogiri::XML::Document)

# File 'lib/slaw/parse/builder.rb', line 166

def normalise_headings(doc)
  logger.info("Normalising headings")

  nodes = doc.xpath('//a:body//a:heading/text()', a: NS) +
          doc.xpath('//a:component/a:doc[@name="schedules"]//a:heading/text()', a: NS)

  nodes.each do |heading|
    if !(heading.content =~ /[a-z]/)
      heading.content = heading.content.downcase.gsub(/^\w/) { $&.upcase }
    end
  end
end
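
The per-heading rule in the listing can be exercised on plain strings. The sketch below wraps it in a hypothetical helper, sentence_case, which is not part of the class.

```ruby
# Hypothetical helper: applies the same rule as normalise_headings to one string.
def sentence_case(heading)
  # Headings that already contain a lowercase letter are left alone.
  return heading if heading =~ /[a-z]/

  # Lowercase everything, then capitalise the first word character.
  heading.downcase.gsub(/^\w/) { $&.upcase }
end

capcase = sentence_case("SHORT TITLE AND COMMENCEMENT")
mixed   = sentence_case("Short title")

puts capcase
puts mixed
```

Note that the rule lowercases every word after the first, so proper nouns in an all-caps heading lose their capitals as well.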

#parse_and_process_text(text, parse_options = {}) ⇒ `Nokogiri::XML::Document`

Do all the work necessary to parse text into a well-formed XML document.

Parameters:

text (String) —

the text to parse
parse_options (Hash) (defaults to: {}) —

options to pass to the parser

Returns:

(Nokogiri::XML::Document) —

a well-formed document




# File 'lib/slaw/parse/builder.rb', line 68

def parse_and_process_text(text, parse_options={})
  postprocess(parse_xml(parse_text(text, parse_options)))
end

#parse_text(text, parse_options = {}) ⇒ `String`

Parse text into XML. You should still run #postprocess on the resulting XML to normalise it.

Parameters:

text (String) —

the text to parse
parse_options (Hash) (defaults to: {}) —

options to pass to the parser

Returns:

(String) —

an XML string

# File 'lib/slaw/parse/builder.rb', line 79

def parse_text(text, parse_options={})
  tree = text_to_syntax_tree(text, parse_options)
  xml_from_syntax_tree(tree)
end

#parse_xml(xml) ⇒ `Nokogiri::XML::Document`

Parse a string into a Nokogiri::XML::Document

Parameters:

xml (String) —

string to parse

Returns:

(Nokogiri::XML::Document)




# File 'lib/slaw/parse/builder.rb', line 137

def parse_xml(xml)
  Nokogiri::XML(xml, &:noblanks)
end

#postprocess(doc) ⇒ `Nokogiri::XML::Document`

Postprocess an XML document.

Parameters:

doc (Nokogiri::XML::Document)

Returns:

(Nokogiri::XML::Document) —

the updated document

# File 'lib/slaw/parse/builder.rb', line 155

def postprocess(doc)
  normalise_headings(doc)
  find_short_title(doc)
  nest_blocklists(doc)

  doc
end

#renumber_terms(doc) ⇒ `Object`

Recalculate ids for <term> elements.

# File 'lib/slaw/parse/builder.rb', line 336

def renumber_terms(doc)
  logger.info("Renumbering terms")

  doc.xpath('//a:term', a: NS).each_with_index do |term, i|
    term['id'] = "trm#{i}"
  end
end

#text_to_syntax_tree(text, parse_options = {}) ⇒ `Object`

Parse plain text into a syntax tree.

Parameters:

text (String) —

the text to parse
parse_options (Hash) (defaults to: {}) —

options to pass to the parser

Returns:

(Object) —

the root of the resulting parse tree, usually a Treetop::Runtime::SyntaxNode object

# File 'lib/slaw/parse/builder.rb', line 90

def text_to_syntax_tree(text, parse_options={})
  logger.info("Parsing...")
  parse_options = @parse_options.dup.update(parse_options)
  tree = @parser.parse(text, parse_options)
  logger.info("Parsed!")

  if tree.nil?
    raise Slaw::Parse::ParseError.new(@parser.failure_reason || "Couldn't match to grammar",
                                      line: @parser.failure_line || 0,
                                      column: @parser.failure_column || 0)
  end

  tree
end
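
The listing merges the builder-wide parse_options with the options given for a single call. A minimal sketch of that merge on plain hashes follows; the option names :root and :trace are hypothetical and only serve to show the precedence.

```ruby
# Builder-wide defaults (hypothetical option names).
defaults = { root: :act, trace: false }

# Options supplied for one call.
per_call = { root: :body }

# dup keeps the defaults untouched; update lets per-call values win.
merged = defaults.dup.update(per_call)

puts merged.inspect
puts defaults.inspect
```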

#to_xml(doc) ⇒ `String`

Serialise a Nokogiri::XML::Document into a string

Parameters:

doc (Nokogiri::XML::Document) —

document

Returns:

(String) —

pretty printed string




# File 'lib/slaw/parse/builder.rb', line 146

def to_xml(doc)
  doc.to_xml(indent: 2)
end

#xml_from_syntax_tree(tree) ⇒ `String`

Generate an XML document from the given syntax tree. You should still run #postprocess on the resulting XML to normalise it.

Parameters:

tree (Object) —

a Treetop::Runtime::SyntaxNode object

Returns:

(String) —

an XML string

# File 'lib/slaw/parse/builder.rb', line 111

def xml_from_syntax_tree(tree)
  s = ""
  builder = ::Builder::XmlMarkup.new(indent: 2, target: s)

  builder.instruct! :xml, :version=>"1.0", :encoding=>"UTF-8"
  builder.akomaNtoso("xmlns:xsi"=> "http://www.w3.org/2001/XMLSchema-instance", 
                     "xsi:schemaLocation" => "http://www.akomantoso.org/2.0 akomantoso20.xsd",
                     "xmlns" => NS) { |b|
    args = [b]

    # should we provide an id prefix?
    arity = tree.method('to_xml').arity 
    arity = arity.abs-1 if arity < 0
    args << (fragment_id_prefix || "") if arity > 1

    tree.to_xml(*args)
  }

  s
end
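
The arity check in the listing decides whether the tree's to_xml receives the fragment id prefix. The sketch below reproduces that decision with stand-in objects. The helpers required_arg_count and to_xml_args and the three node objects are assumptions made up for illustration; no real syntax tree is involved.

```ruby
# Hypothetical helper: number of required positional arguments of obj#name.
# Method#arity is negative when optional arguments are present: -(required + 1).
def required_arg_count(obj, name)
  arity = obj.method(name).arity
  arity < 0 ? arity.abs - 1 : arity
end

# Hypothetical helper: build the argument list the way the listing does.
def to_xml_args(node, builder, prefix)
  args = [builder]
  args << (prefix || "") if required_arg_count(node, :to_xml) > 1
  args
end

# Stand-in nodes with different to_xml signatures.
plain_node    = Class.new { def to_xml(b); end }.new
prefixed_node = Class.new { def to_xml(b, id_prefix); end }.new
optional_node = Class.new { def to_xml(b, id_prefix, i = 0); end }.new

puts to_xml_args(plain_node, :builder, "schedule-").inspect
puts to_xml_args(prefixed_node, :builder, nil).inspect
```

A node whose prefix parameter is itself optional, for example to_xml(b, id_prefix = ''), has only one required argument and so would not be handed the prefix under this check.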
