Class: Slaw::Parse::Builder

Inherits:
Object
Includes:
Logging, Namespace
Defined in:
lib/slaw/parse/builder.rb

Overview

The primary class for building Akoma Ntoso documents from plain text documents.

The builder uses a grammar to break down a plain-text version of an act into a syntax tree. This tree can then be serialized into an Akoma Ntoso compatible XML document.

Examples:

Parse some text into a well-formed document

builder = Slaw::Parse::Builder.new(parser: parser)
xml = builder.parse_text(text)
doc = builder.parse_xml(xml)
builder.postprocess(doc)

A quicker way to build a well-formed document

doc = builder.parse_and_process_text(text)

Constant Summary

@@parsers =
{}

Constants included from Namespace

Namespace::NS

Instance Attribute Summary

Instance Method Summary

Methods included from Logging

#logger

Constructor Details

#initialize(opts = {}) ⇒ Builder

Create a new builder.

Specify either `:parser`, or both `:grammar_file` and `:grammar_class`.

Parameters:

  • opts (Hash) (defaults to: {})

    a customizable set of options

Options Hash (opts):

  • :parser (Treetop::Runtime::CompiledParser)

    parser to use

  • :grammar_file (String)

    grammar filename to load a parser from

  • :grammar_class (String)

    name of the class that the grammar will generate



# File 'lib/slaw/parse/builder.rb', line 41

def initialize(opts={})
  if opts[:parser]
    @parser = opts[:parser]
  elsif opts[:grammar_file] and opts[:grammar_class]
    if @@parsers[opts[:grammar_class]]
      # already compiled the grammar, just use it
      @parser = @@parsers[opts[:grammar_class]]
    else
      # load the grammar
      Treetop.load(opts[:grammar_file])
      cls = eval(opts[:grammar_class])
      @parser = cls.new
    end
  else
    raise ArgumentError.new("Specify either :parser or :grammar_file and :grammar_class")
  end

  @parse_options = {}
end
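
The grammar branch above consults the class-level `@@parsers` hash before compiling a grammar. A minimal sketch of that lookup-or-build pattern in plain Ruby follows. `ParserCache` and `fetch_or_build` are hypothetical names used only for illustration and are not part of Slaw. The sketch also stores the built object in the cache, a step the listing above does not show.

```ruby
# Hypothetical helper, not part of Slaw: keep one parser per grammar class name.
class ParserCache
  def initialize
    @parsers = {}
  end

  # Return the cached object for +name+, building it with the block on first use.
  def fetch_or_build(name)
    @parsers[name] ||= yield
  end
end

cache  = ParserCache.new
builds = 0

# Object.new stands in for an expensive compiled parser.
first  = cache.fetch_or_build("ActGrammarParser") { builds += 1; Object.new }
second = cache.fetch_or_build("ActGrammarParser") { builds += 1; Object.new }

puts first.equal?(second)
puts builds
```

Keying the cache by class name means two builders that ask for the same grammar class can share one parser instance.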

Instance Attribute Details

#fragment_id_prefix ⇒ Object

Prefix to use when generating IDs for fragments



# File 'lib/slaw/parse/builder.rb', line 32

def fragment_id_prefix
  @fragment_id_prefix
end

#parse_options ⇒ Object

Additional hash of options to be provided to the parser when parsing.



# File 'lib/slaw/parse/builder.rb', line 29

def parse_options
  @parse_options
end

Instance Method Details

#add_terms_to_references(doc, terms) ⇒ Object



# File 'lib/slaw/parse/builder.rb', line 287

def add_terms_to_references(doc, terms)
  refs = doc.at_xpath('//a:meta/a:references', a: NS)
  unless refs
    refs = doc.create_element('references', source: "#this")
    doc.at_xpath('//a:meta/a:identification', a: NS).after(refs)
  end

  # nuke all existing term reference elements
  refs.xpath('a:TLCTerm', a: NS).each { |el| el.remove }

  for id, term in terms
    # <TLCTerm id="term-applicant" href="/ontology/term/this.eng.applicant" showAs="Applicant"/>
    refs << doc.create_element('TLCTerm',
                               id: id,
                               href: "/ontology/term/this.eng.#{id.gsub(/^term-/, '')}",
                               showAs: term)
  end
end
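
The attributes of each `TLCTerm` element are derived from the term id and term text alone. The sketch below reproduces that derivation on plain strings, without Nokogiri. `term_href` is a hypothetical helper name, and the sample terms are illustrative.

```ruby
# Hypothetical helper mirroring the href expression in add_terms_to_references.
def term_href(id)
  "/ontology/term/this.eng.#{id.gsub(/^term-/, '')}"
end

terms = {
  "term-applicant"     => "Applicant",
  "term-affected_land" => "affected land"
}

# One attribute hash per term, in the shape passed to create_element above.
attrs = terms.map { |id, term| { id: id, href: term_href(id), showAs: term } }
attrs.each { |a| puts a.inspect }
```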

#adjust_blocklists(doc) ⇒ Object

Adjust blocklists:

  • nest them correctly

  • change preceding p tags into listIntroductions

Parameters:

  • doc (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 362

def adjust_blocklists(doc)
  logger.info("Adjusting blocklists")

  Slaw::Parse::Blocklists.nest_blocklists(doc)
  Slaw::Parse::Blocklists.fix_intros(doc)
end

#find_definitions(doc) ⇒ Hash{String, String}

Find `def` elements in the document and return a Hash from term ids to the text of each term.

Parameters:

  • doc (Nokogiri::XML::Document)

Returns:

  • (Hash{String, String})


# File 'lib/slaw/parse/builder.rb', line 219

def find_definitions(doc)
  guess_at_definitions(doc)

  terms = {}
  doc.xpath('//a:def', a: NS).each do |defn|
    # <p>"<def refersTo="#term-affected_land">affected land</def>" means land in respect of which an application has been lodged in terms of section 17(1);</p>
    if defn['refersTo']
      id = defn['refersTo'].sub(/^#/, '')
      term = defn.content
      terms[id] = term

      logger.info("+ Found definition for: #{term}")
    end
  end

  terms
end

#find_short_title(doc) ⇒ Object

Find the short title and add it as an FRBRalias element in the meta section

Parameters:

  • doc (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 179

def find_short_title(doc)
  logger.info("Finding short title")

  # Short title and commencement 
  # 8. This Act shall be called the Legal Aid Amendment Act, 1996, and shall come 
  # into operation on a date fixed by the President by proclamation in the Gazette. 

  doc.xpath('//a:body//a:heading[contains(text(), "hort title")]', a: NS).each do |heading|
    section = heading.parent.at_xpath('a:subsection', a: NS)
    if section and section.text =~ /this act (is|shall be called) the (([a-zA-Z\(\)]\s*)+, \d\d\d\d)/i
      short_title = $2

      logger.info("+ Found title: #{short_title}")

      node = doc.at_xpath('//a:meta//a:FRBRalias', a: NS)
      node['value'] = short_title
      break
    end
  end
end
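
To see what the title pattern captures, the sketch below applies the same regular expression to the sample sentence quoted in the method's comment. It uses plain strings only; the variable names are illustrative.

```ruby
# Same pattern as in find_short_title; group 2 holds the title and year.
short_title_re = /this act (is|shall be called) the (([a-zA-Z\(\)]\s*)+, \d\d\d\d)/i

text = "8. This Act shall be called the Legal Aid Amendment Act, 1996, and shall come " \
       "into operation on a date fixed by the President by proclamation in the Gazette."

match       = short_title_re.match(text)
short_title = match && match[2]
puts short_title

# The character class allows only letters and parentheses, so a hyphenated
# title is not matched by this pattern.
hyphenated = short_title_re.match("This Act is the Value-Added Tax Act, 1991.")
puts hyphenated.inspect
```

Because the pattern accepts only letters, parentheses, and whitespace before the comma and year, titles containing digits, hyphens, or apostrophes are skipped and the FRBRalias value is left unchanged for them.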

#find_term_references(doc, terms) ⇒ Object

Find and decorate references to terms in the document. The terms param is a hash from term_id to actual term.



# File 'lib/slaw/parse/builder.rb', line 308

def find_term_references(doc, terms)
  logger.info("+ Finding references to terms")

  i = 0

  # sort terms by the length of the defined term, desc,
  # so that we don't find short terms inside longer
  # terms
  terms = terms.to_a.sort_by { |pair| -pair[1].size }

  # look for each term
  for term_id, term in terms
    doc.xpath('//a:body//text()', a: NS).each do |text|
      # replace all occurrences in this text node

      # unless we're already inside a def or term element
      next if (["def", "term"].include?(text.parent.name))

      # don't link to a term inside its own definition
      owner = find_up(text, 'subsection')
      next if owner and owner.at_xpath(".//a:def[@refersTo='##{term_id}']", a: NS)

      while posn = (text.content =~ /\b#{Regexp::escape(term)}\b/)
        # <p>A delegation under subsection (1) shall not prevent the <term refersTo="#term-Minister" id="trm357">Minister</term> from exercising the power himself or herself.</p>
        node = doc.create_element('term', term, refersTo: "##{term_id}", id: "trm#{i}")

        pre = (posn > 0) ? text.content[0..posn-1] : nil
        post = text.content[posn+term.length..-1]

        text.before(node)
        node.before(doc.create_text_node(pre)) if pre
        text.content = post

        i += 1
      end
    end
  end
end
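
Two details of this method are easy to check in isolation: the longest-first ordering of terms and the whole-word match. The sketch below exercises both on plain strings; the sample terms and sentence are made up for illustration.

```ruby
terms = {
  "term-land"          => "land",
  "term-affected_land" => "affected land",
  "term-Minister"      => "Minister"
}

# Longest term text first, as in find_term_references, so that "affected land"
# is handled before the shorter "land" it contains.
ordered = terms.to_a.sort_by { |pair| -pair[1].size }
ordered.each { |id, term| puts "#{id}: #{term}" }

# Whole-word matching with an escaped term, as in the while loop.
sentence = "The Minister may declare affected land by notice."
posn     = sentence =~ /\b#{Regexp.escape("affected land")}\b/
found    = sentence[posn, "affected land".length]
puts found

# \b prevents a match inside a longer word.
inside_word = "landlord" =~ /\b#{Regexp.escape("land")}\b/
puts inside_word.inspect
```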

#guess_at_definitions(doc) ⇒ Object

Find defined terms in the document.

This looks for heading elements with the words ‘definitions’ or ‘interpretation’, and then looks for phrases like

"this word" means something...

It identifies “this word” as a defined term and wraps it in a def tag with a refersTo attribute referencing the term being defined. The surrounding block structure also has its refersTo attribute set to the term. This way, the term is marked as defined, and the container element holding the full definition of the term is identified.



# File 'lib/slaw/parse/builder.rb', line 249

def guess_at_definitions(doc)
  doc.xpath('//a:section', a: NS).select do |section|
    # sections with headings like Definitions
    heading = section.at_xpath('a:heading', a: NS)
    heading && heading.content =~ /definitions|interpretation/i
  end.each do |section|
    # find items like "foo" means blah...
    
    section.xpath('.//a:p|.//a:listIntroduction', a: NS).each do |container|
      # only if we don't already have a definition here
      next if container.at_xpath('a:def', a: NS)

      # get first text node
      text = container.children.first
      next if (not text or not text.text?)

      match = /^\s*["“”](.+?)["“”]/.match(text.text)
      if match
        term = match.captures[0]
        term_id = 'term-' + term.gsub(/[^a-zA-Z0-9_-]/, '_')

        # <p>"<def refersTo="#term-affected_land">affected land</def>" means land in respect of which an application has been lodged in terms of section 17(1);</p>
        refersTo = "##{term_id}"
        defn = doc.create_element('def', term, refersTo: refersTo)
        rest = match.post_match

        text.before(defn)
        defn.before(doc.create_text_node('"'))
        text.content = '"' + rest

        # adjust the container's refersTo attribute
        parent = find_up(container, ['item', 'point', 'blockList', 'list', 'paragraph', 'subsection', 'section', 'chapter', 'part'])
        parent['refersTo'] = refersTo
      end
    end
  end
end
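
The term detection and id generation can be tried on a plain string. The sketch below reuses the same two expressions from the listing; the sample sentence is shortened from the comment in the method, and the variable names are illustrative.

```ruby
# Same pattern as in guess_at_definitions: a quoted phrase at the start of the text.
definition_re = /^\s*["“”](.+?)["“”]/

text  = '"affected land" means land in respect of which an application has been lodged;'
match = definition_re.match(text)

term    = match.captures[0]
# Anything other than letters, digits, underscore, or hyphen becomes an underscore.
term_id = 'term-' + term.gsub(/[^a-zA-Z0-9_-]/, '_')
rest    = match.post_match

puts term
puts term_id
puts rest
```

Since every disallowed character maps to an underscore, two terms that differ only in punctuation or spacing would produce the same id.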

#link_definitions(doc) ⇒ Object

Find definitions of terms and introduce them into the meta section of the document.

Parameters:

  • doc (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 204

def link_definitions(doc)
  logger.info("Finding and linking definitions")

  terms = find_definitions(doc)
  add_terms_to_references(doc, terms)
  find_term_references(doc, terms)
  renumber_terms(doc)
end

#normalise_headings(doc) ⇒ Object

Change ALL-CAPS headings into sentence case. Headings that already contain a lowercase letter are left unchanged.

Parameters:

  • doc (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 163

def normalise_headings(doc)
  logger.info("Normalising headings")

  nodes = doc.xpath('//a:body//a:heading/text()', a: NS) +
          doc.xpath('//a:component/a:doc[@name="schedules"]//a:heading/text()', a: NS)

  nodes.each do |heading|
    if !(heading.content =~ /[a-z]/)
      heading.content = heading.content.downcase.gsub(/^\w/) { $&.upcase }
    end
  end
end
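
The case conversion itself needs no XML. A minimal sketch on plain strings follows, using the same test and substitution as the listing; `sentence_case` is a hypothetical helper name.

```ruby
# Hypothetical helper applying the same rule as normalise_headings to a string.
def sentence_case(heading)
  # Headings that already contain a lowercase letter are left alone.
  return heading if heading =~ /[a-z]/

  heading.downcase.gsub(/^\w/) { $&.upcase }
end

puts sentence_case("SHORT TITLE AND COMMENCEMENT")
puts sentence_case("Short title")
puts sentence_case("POWERS OF THE MINISTER")
```

The rule lowercases everything after the first character, so proper nouns in an all-caps heading lose their capital letters.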

#parse_and_process_text(text, parse_options = {}) ⇒ Nokogiri::XML::Document

Do all the work necessary to parse text into a well-formed XML document.

Parameters:

  • text (String)

    the text to parse

  • parse_options (Hash) (defaults to: {})

    options to pass to the parser

Returns:

  • (Nokogiri::XML::Document)

    a well-formed document



# File 'lib/slaw/parse/builder.rb', line 67

def parse_and_process_text(text, parse_options={})
  postprocess(parse_xml(parse_text(text, parse_options)))
end

#parse_text(text, parse_options = {}) ⇒ String

Parse text into XML. You should still run #postprocess on the resulting XML to normalise it.

Parameters:

  • text (String)

    the text to parse

  • parse_options (Hash) (defaults to: {})

    options to pass to the parser

Returns:

  • (String)

    an XML string



# File 'lib/slaw/parse/builder.rb', line 78

def parse_text(text, parse_options={})
  tree = text_to_syntax_tree(text, parse_options)
  xml_from_syntax_tree(tree)
end

#parse_xml(xml) ⇒ Nokogiri::XML::Document

Parse a string into a Nokogiri::XML::Document

Parameters:

  • xml (String)

    string to parse

Returns:

  • (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 134

def parse_xml(xml)
  Nokogiri::XML(xml, &:noblanks)
end

#postprocess(doc) ⇒ Nokogiri::XML::Document

Postprocess an XML document.

Parameters:

  • doc (Nokogiri::XML::Document)

Returns:

  • (Nokogiri::XML::Document)

    the updated document



# File 'lib/slaw/parse/builder.rb', line 152

def postprocess(doc)
  normalise_headings(doc)
  find_short_title(doc)
  adjust_blocklists(doc)

  doc
end

#renumber_terms(doc) ⇒ Object

Recalculate ids for <term> elements, numbering them in document order.



# File 'lib/slaw/parse/builder.rb', line 348

def renumber_terms(doc)
  logger.info("Renumbering terms")

  doc.xpath('//a:term', a: NS).each_with_index do |term, i|
    term['id'] = "trm#{i}"
  end
end

#text_to_syntax_tree(text, parse_options = {}) ⇒ Object

Parse plain text into a syntax tree.

Parameters:

  • text (String)

    the text to parse

  • parse_options (Hash) (defaults to: {})

    options to pass to the parser

Returns:

  • (Object)

    the root of the resulting parse tree, usually a Treetop::Runtime::SyntaxNode object



# File 'lib/slaw/parse/builder.rb', line 89

def text_to_syntax_tree(text, parse_options={})
  logger.info("Parsing...")
  parse_options = @parse_options.dup.update(parse_options)
  tree = @parser.parse(text, parse_options)
  logger.info("Parsed!")

  if tree.nil?
    raise Slaw::Parse::ParseError.new(@parser.failure_reason || "Couldn't match to grammar",
                                      line: @parser.failure_line || 0,
                                      column: @parser.failure_column || 0)
  end

  tree
end
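
The per-call options are layered over the builder's stored `parse_options` without modifying them. The sketch below shows that merge on plain hashes; the option keys and values are placeholders chosen for illustration, not a list of options Slaw or its grammar defines.

```ruby
# Stand-in for the builder's @parse_options (keys are illustrative only).
defaults = { root: :act }

# Options supplied for a single call.
per_call = { root: :bill, verbose: true }

# Same expression shape as in text_to_syntax_tree: copy, then overlay.
merged = defaults.dup.update(per_call)

puts merged.inspect
puts defaults.inspect
```

Because the stored hash is duplicated first, an option passed to one call does not carry over to the next.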

#to_xml(doc) ⇒ String

Serialise a Nokogiri::XML::Document into a string

Parameters:

  • doc (Nokogiri::XML::Document)

    document

Returns:

  • (String)

    pretty printed string



# File 'lib/slaw/parse/builder.rb', line 143

def to_xml(doc)
  doc.to_xml(indent: 2)
end

#xml_from_syntax_tree(tree) ⇒ String

Generate an XML document from the given syntax tree. You should still run #postprocess on the resulting XML to normalise it.

Parameters:

  • tree (Object)

    a Treetop::Runtime::SyntaxNode object

Returns:

  • (String)

    an XML string



# File 'lib/slaw/parse/builder.rb', line 110

def xml_from_syntax_tree(tree)
  builder = ::Nokogiri::XML::Builder.new

  builder.akomaNtoso("xmlns:xsi"=> "http://www.w3.org/2001/XMLSchema-instance", 
                     "xsi:schemaLocation" => "http://www.akomantoso.org/2.0 akomantoso20.xsd",
                     "xmlns" => NS) do |b|
    args = [b]

    # should we provide an id prefix?
    arity = tree.method('to_xml').arity 
    arity = arity.abs-1 if arity < 0
    args << (fragment_id_prefix || "") if arity > 1

    tree.to_xml(*args)
  end

  builder.to_xml(encoding: 'UTF-8')
end
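
The arity check decides whether `fragment_id_prefix` is passed to the tree's `to_xml`. The sketch below applies the same arithmetic to three hypothetical node objects whose `to_xml` signatures differ; none of these objects come from Slaw, and `wants_prefix` is an illustrative name.

```ruby
# Hypothetical stand-ins for syntax-tree nodes; only the signatures matter.
one_required    = Class.new { def to_xml(b); end }.new
two_required    = Class.new { def to_xml(b, id_prefix); end }.new
optional_prefix = Class.new { def to_xml(b, id_prefix = ''); end }.new

# Same arithmetic as xml_from_syntax_tree. Ruby reports -n-1 for a method with
# n required arguments plus optional ones, so abs - 1 recovers n.
wants_prefix = lambda do |node|
  arity = node.method('to_xml').arity
  arity = arity.abs - 1 if arity < 0
  arity > 1
end

puts wants_prefix.call(one_required)
puts wants_prefix.call(two_required)
puts wants_prefix.call(optional_prefix)
```

Under this arithmetic only the count of required arguments matters, so a `to_xml` that declares the prefix as an optional parameter is called without it.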