Class: Slaw::Parse::Builder

Inherits:
Object
Includes:
Logging, Namespace
Defined in:
lib/slaw/parse/builder.rb

Overview

The primary class for building Akoma Ntoso documents from plain text documents.

The builder uses a grammar to break down a plain-text version of an act into a syntax tree. This tree can then be serialized into an Akoma Ntoso compatible XML document.

Examples:

Parse some text into a well-formed document

builder = Slaw::Parse::Builder.new(parser: parser)
xml = builder.parse_text(text)
doc = builder.parse_xml(xml)
builder.postprocess(doc)

A quicker way to build a well-formed document

doc = builder.parse_and_process_text(text)

Constant Summary

@@parsers =
{}

Constants included from Namespace

Namespace::NS

Methods included from Logging

#logger

Constructor Details

#initialize(opts = {}) ⇒ Builder

Create a new builder.

Specify either `:parser`, or both `:grammar_file` and `:grammar_class`.

Parameters:

  • opts (Hash) (defaults to: {})

    a customizable set of options

Options Hash (opts):

  • :parser (Treetop::Runtime::CompiledParser)

    parser to use

  • :grammar_file (String)

    grammar filename to load a parser from

  • :grammar_class (String)

    name of the class that the grammar will generate



# File 'lib/slaw/parse/builder.rb', line 42

def initialize(opts={})
  if opts[:parser]
    @parser = opts[:parser]
  elsif opts[:grammar_file] and opts[:grammar_class]
    if @@parsers[opts[:grammar_class]]
      # already compiled the grammar, just use it
      @parser = @@parsers[opts[:grammar_class]]
    else
      # load the grammar, then cache the parser so later builders can reuse it
      Treetop.load(opts[:grammar_file])
      cls = eval(opts[:grammar_class])
      @parser = cls.new
      @@parsers[opts[:grammar_class]] = @parser
    end
  else
    raise ArgumentError.new("Specify either :parser or :grammar_file and :grammar_class")
  end

  @parse_options = {}
end
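
The cache lookup in the constructor keys parsers by grammar class name. A minimal sketch of that memoisation pattern follows. `ParserRegistry` and `FakeParser` are hypothetical stand-ins and are not part of Slaw or Treetop.

```ruby
# Hypothetical stand-in for a compiled parser class; not part of Slaw.
class FakeParser
  @instances = 0

  class << self
    attr_accessor :instances
  end

  def initialize
    # count constructions so reuse can be observed
    self.class.instances += 1
  end
end

# Hypothetical registry illustrating memoisation by class name.
class ParserRegistry
  def initialize
    @parsers = {}
  end

  # Returns the cached parser for class_name, building it on first use.
  def fetch(class_name)
    @parsers[class_name] ||= Object.const_get(class_name).new
  end
end

registry = ParserRegistry.new
first = registry.fetch('FakeParser')
second = registry.fetch('FakeParser')
```

Both calls return the same object, so the parser class is instantiated only once.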

Instance Attribute Details

#fragment_id_prefix ⇒ Object

Prefix to use when generating IDs for fragments



# File 'lib/slaw/parse/builder.rb', line 33

def fragment_id_prefix
  @fragment_id_prefix
end

#parse_options ⇒ Object

Additional hash of options to be provided to the parser when parsing.



# File 'lib/slaw/parse/builder.rb', line 30

def parse_options
  @parse_options
end

Instance Method Details

#add_terms_to_references(doc, terms) ⇒ Object



# File 'lib/slaw/parse/builder.rb', line 275

def add_terms_to_references(doc, terms)
  refs = doc.at_xpath('//a:meta/a:references', a: NS)
  unless refs
    refs = doc.create_element('references', source: "#this")
    doc.at_xpath('//a:meta/a:identification', a: NS).after(refs)
  end

  # nuke all existing term reference elements
  refs.xpath('a:TLCTerm', a: NS).each { |el| el.remove }

  for id, term in terms
    # <TLCTerm id="term-applicant" href="/ontology/term/this.eng.applicant" showAs="Applicant"/>
    refs << doc.create_element('TLCTerm',
                               id: id,
                               href: "/ontology/term/this.eng.#{id.gsub(/^term-/, '')}",
                               showAs: term)
  end
end
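
The attributes of each TLCTerm element are derived from the term id and the term text. The sketch below reproduces only that derivation as a plain hash, without Nokogiri. The helper name `tlc_term_attributes` is hypothetical.

```ruby
# Hypothetical helper: builds the attribute hash for one TLCTerm element,
# mirroring the string handling in add_terms_to_references.
def tlc_term_attributes(id, term)
  {
    id: id,
    # strip the leading "term-" prefix to form the last part of the href
    href: "/ontology/term/this.eng.#{id.gsub(/^term-/, '')}",
    showAs: term
  }
end

attrs = tlc_term_attributes('term-applicant', 'Applicant')
# attrs[:href] => "/ontology/term/this.eng.applicant"
```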

#find_definitions(doc) ⇒ Hash{String, String}

Find `def` elements in the document and return a Hash from term ids to the text of each term.

Parameters:

  • doc (Nokogiri::XML::Document)

Returns:

  • (Hash{String, String})


# File 'lib/slaw/parse/builder.rb', line 222

def find_definitions(doc)
  guess_at_definitions(doc)

  terms = {}
  doc.xpath('//a:def', a: NS).each do |defn|
    # <p>"<def refersTo="#term-affected_land">affected land</def>" means land in respect of which an application has been lodged in terms of section 17(1);</p>
    id = defn['refersTo'].sub(/^#/, '')
    term = defn.content
    terms[id] = term

    logger.info("+ Found definition for: #{term}")
  end

  terms
end

#find_short_title(doc) ⇒ Object

Find the short title and add it as an FRBRalias element in the meta section

Parameters:

  • doc (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 182

def find_short_title(doc)
  logger.info("Finding short title")

  # Short title and commencement 
  # 8. This Act shall be called the Legal Aid Amendment Act, 1996, and shall come 
  # into operation on a date fixed by the President by proclamation in the Gazette. 

  doc.xpath('//a:body//a:heading[contains(text(), "hort title")]', a: NS).each do |heading|
    section = heading.parent.at_xpath('a:subsection', a: NS)
    if section and section.text =~ /this act (is|shall be called) the (([a-zA-Z\(\)]\s*)+, \d\d\d\d)/i
      short_title = $2

      logger.info("+ Found title: #{short_title}")

      node = doc.at_xpath('//a:meta//a:FRBRalias', a: NS)
      node['value'] = short_title
      break
    end
  end
end
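
To see what the short-title pattern captures, it can be tried on a plain string outside any XML document. This is only a sketch: the regular expression is copied from the listing, while `SHORT_TITLE_RE` and `extract_short_title` are hypothetical names.

```ruby
# The same pattern find_short_title applies to the text of a subsection.
SHORT_TITLE_RE = /this act (is|shall be called) the (([a-zA-Z\(\)]\s*)+, \d\d\d\d)/i

# Hypothetical helper: returns the short title, or nil if none is found.
def extract_short_title(text)
  match = SHORT_TITLE_RE.match(text)
  match && match[2]
end

sample = 'This Act shall be called the Legal Aid Amendment Act, 1996, and shall ' \
         'come into operation on a date fixed by the President.'
title = extract_short_title(sample)
# title => "Legal Aid Amendment Act, 1996"
```

The second capture group holds the title together with its year, which is the value the method writes to the FRBRalias element.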

#find_term_references(doc, terms) ⇒ Object

Find and decorate references to terms in the document. The terms param is a hash from term_id to actual term.



# File 'lib/slaw/parse/builder.rb', line 296

def find_term_references(doc, terms)
  logger.info("+ Finding references to terms")

  i = 0

  # sort terms by the length of the defined term, desc,
  # so that we don't find short terms inside longer
  # terms
  terms = terms.to_a.sort_by { |pair| -pair[1].size }

  # look for each term
  for term_id, term in terms
    doc.xpath('//a:body//text()', a: NS).each do |text|
      # replace all occurrences in this text node

      # unless we're already inside a def or term element
      next if (["def", "term"].include?(text.parent.name))

      # don't link to a term inside its own definition
      owner = find_up(text, 'subsection')
      next if owner and owner.at_xpath(".//a:def[@refersTo='##{term_id}']", a: NS)

      while posn = (text.content =~ /\b#{Regexp::escape(term)}\b/)
        # <p>A delegation under subsection (1) shall not prevent the <term refersTo="#term-Minister" id="trm357">Minister</term> from exercising the power himself or herself.</p>
        node = doc.create_element('term', term, refersTo: "##{term_id}", id: "trm#{i}")

        pre = (posn > 0) ? text.content[0..posn-1] : nil
        post = text.content[posn+term.length..-1]

        text.before(node)
        node.before(doc.create_text_node(pre)) if pre
        text.content = post

        i += 1
      end
    end
  end
end
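
The longest-term-first ordering and the pre/term/post split can be illustrated on plain strings. The sketch below assumes a simplified model in which text is held as tagged segments instead of XML nodes. `mark_terms` is a hypothetical helper, not part of Slaw.

```ruby
# Hypothetical helper: splits text into [:text, str] and [:term, id, str]
# segments, trying longer terms first so that "land" is not matched inside
# "affected land".
def mark_terms(text, terms)
  segments = [[:text, text]]

  terms.sort_by { |_id, term| -term.size }.each do |term_id, term|
    pattern = /\b#{Regexp.escape(term)}\b/

    segments = segments.flat_map do |segment|
      # segments already marked as terms are left alone
      next [segment] unless segment.first == :text

      pieces = []
      rest = segment.last
      while (posn = rest =~ pattern)
        pieces << [:text, rest[0...posn]] if posn > 0
        pieces << [:term, term_id, term]
        rest = rest[(posn + term.length)..-1]
      end
      pieces << [:text, rest] unless rest.empty?
      pieces
    end
  end

  segments
end

terms = { 'term-land' => 'land', 'term-affected_land' => 'affected land' }
result = mark_terms('The affected land adjoins other land.', terms)
```

Here "affected land" is marked as one term, and only the later, standalone "land" is linked to `term-land`.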

#guess_at_definitions(doc) ⇒ Object



# File 'lib/slaw/parse/builder.rb', line 238

def guess_at_definitions(doc)
  doc.xpath('//a:section', a: NS).select do |section|
    # sections with headings like Definitions
    heading = section.at_xpath('a:heading', a: NS)
    heading && heading.content =~ /definitions|interpretation/i
  end.each do |section|
    # find items like "foo" means blah...
    
    section.xpath('.//a:p|.//a:listIntroduction', a: NS).each do |container|
      # only if we don't already have a definition here
      next if container.at_xpath('a:def', a: NS)

      # get first text node
      text = container.children.first
      next if (not text or not text.text?)

      match = /^\s*["“”](.+?)["“”]/.match(text.text)
      if match
        term = match.captures[0]
        term_id = 'term-' + term.gsub(/[^a-zA-Z0-9_-]/, '_')

        # <p>"<def refersTo="#term-affected_land">affected land</def>" means land in respect of which an application has been lodged in terms of section 17(1);</p>
        defn = doc.create_element('def', term, refersTo: "##{term_id}")
        rest = match.post_match

        text.before(defn)
        defn.before(doc.create_text_node('"'))
        text.content = '"' + rest

        # adjust the container's id
        parent = find_up(container, ['blockList', 'point']) || find_up(container, ['subsection', 'section'])
        parent['id'] = "def-#{term_id}"
      end
    end
  end
end
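
The quoted-term match and the id derivation in guess_at_definitions can be tried on a plain string. In this sketch the regular expressions are copied from the listing, while `QUOTED_TERM_RE` and `guess_definition` are hypothetical names.

```ruby
# The pattern guess_at_definitions uses to spot a quoted term at the start
# of a paragraph (straight or curly double quotes).
QUOTED_TERM_RE = /^\s*["“”](.+?)["“”]/

# Hypothetical helper: returns [term, term_id, rest], or nil when the text
# does not open with a quoted term.
def guess_definition(text)
  match = QUOTED_TERM_RE.match(text)
  return nil unless match

  term = match.captures[0]
  # anything outside letters, digits, "_" and "-" becomes an underscore
  term_id = 'term-' + term.gsub(/[^a-zA-Z0-9_-]/, '_')
  [term, term_id, match.post_match]
end

term, term_id, rest = guess_definition(
  '"affected land" means land in respect of which an application has been lodged;'
)
# term    => "affected land"
# term_id => "term-affected_land"
```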

#link_definitions(doc) ⇒ Object

Find definitions of terms and introduce them into the meta section of the document.

Parameters:

  • doc (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 207

def link_definitions(doc)
  logger.info("Finding and linking definitions")

  terms = find_definitions(doc)
  add_terms_to_references(doc, terms)
  find_term_references(doc, terms)
  renumber_terms(doc)
end

#nest_blocklists(doc) ⇒ Object

Correctly nest blocklists.

The grammar produces flat blocklists, so the numbering of the list items is inspected to nest them correctly.

Parameters:

  • doc (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 350

def nest_blocklists(doc)
  logger.info("Nesting blocklists")

  Slaw::Parse::Blocklists.nest_blocklists(doc)
end

#normalise_headings(doc) ⇒ Object

Change ALL-CAPS headings into sentence case.

Parameters:

  • doc (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 166

def normalise_headings(doc)
  logger.info("Normalising headings")

  nodes = doc.xpath('//a:body//a:heading/text()', a: NS) +
          doc.xpath('//a:component/a:doc[@name="schedules"]//a:heading/text()', a: NS)

  nodes.each do |heading|
    if !(heading.content =~ /[a-z]/)
      heading.content = heading.content.downcase.gsub(/^\w/) { $&.upcase }
    end
  end
end
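
The case conversion itself is a one-line string operation. A minimal sketch on plain strings follows. `sentence_case` is a hypothetical helper name.

```ruby
# Hypothetical helper: converts an all-caps heading to sentence case and
# leaves any heading that already contains a lowercase letter untouched.
def sentence_case(heading)
  return heading if heading =~ /[a-z]/

  heading.downcase.gsub(/^\w/) { |first| first.upcase }
end

sentence_case('SHORT TITLE AND COMMENCEMENT') # => "Short title and commencement"
sentence_case('Short Title')                  # => "Short Title"
```

Only the first character is capitalised again, so proper nouns inside an all-caps heading come out in lowercase.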

#parse_and_process_text(text, parse_options = {}) ⇒ Nokogiri::XML::Document

Do all the work necessary to parse text into a well-formed XML document.

Parameters:

  • text (String)

    the text to parse

  • parse_options (Hash) (defaults to: {})

    options to pass to the parser

Returns:

  • (Nokogiri::XML::Document)

    a well-formed document



# File 'lib/slaw/parse/builder.rb', line 68

def parse_and_process_text(text, parse_options={})
  postprocess(parse_xml(parse_text(text, parse_options)))
end

#parse_text(text, parse_options = {}) ⇒ String

Parse text into XML. You should still run #postprocess on the resulting XML to normalise it.

Parameters:

  • text (String)

    the text to parse

  • parse_options (Hash) (defaults to: {})

    options to pass to the parser

Returns:

  • (String)

    an XML string



# File 'lib/slaw/parse/builder.rb', line 79

def parse_text(text, parse_options={})
  tree = text_to_syntax_tree(text, parse_options)
  xml_from_syntax_tree(tree)
end

#parse_xml(xml) ⇒ Nokogiri::XML::Document

Parse a string into a Nokogiri::XML::Document

Parameters:

  • xml (String)

    string to parse

Returns:

  • (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 137

def parse_xml(xml)
  Nokogiri::XML(xml, &:noblanks)
end

#postprocess(doc) ⇒ Nokogiri::XML::Document

Postprocess an XML document.

Parameters:

  • doc (Nokogiri::XML::Document)

Returns:

  • (Nokogiri::XML::Document)

    the updated document



# File 'lib/slaw/parse/builder.rb', line 155

def postprocess(doc)
  normalise_headings(doc)
  find_short_title(doc)
  nest_blocklists(doc)

  doc
end

#renumber_terms(doc) ⇒ Object

Recalculate ids for <term> elements.



# File 'lib/slaw/parse/builder.rb', line 336

def renumber_terms(doc)
  logger.info("Renumbering terms")

  doc.xpath('//a:term', a: NS).each_with_index do |term, i|
    term['id'] = "trm#{i}"
  end
end

#text_to_syntax_tree(text, parse_options = {}) ⇒ Object

Parse plain text into a syntax tree.

Parameters:

  • text (String)

    the text to parse

  • parse_options (Hash) (defaults to: {})

    options to pass to the parser

Returns:

  • (Object)

    the root of the resulting parse tree, usually a Treetop::Runtime::SyntaxNode object



# File 'lib/slaw/parse/builder.rb', line 90

def text_to_syntax_tree(text, parse_options={})
  logger.info("Parsing...")
  parse_options = @parse_options.dup.update(parse_options)
  tree = @parser.parse(text, parse_options)
  logger.info("Parsed!")

  if tree.nil?
    raise Slaw::Parse::ParseError.new(@parser.failure_reason || "Couldn't match to grammar",
                                      line: @parser.failure_line || 0,
                                      column: @parser.failure_column || 0)
  end

  tree
end
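
Two details of this method are easy to miss: per-call options override the builder-level defaults without modifying them, and a nil parse result is turned into an error that carries the failure position. The sketch below shows both with stand-ins. `StubParser`, `SketchParseError` and `sketch_parse` are hypothetical, and the `:root` option key is used purely for illustration.

```ruby
# Hypothetical stand-in for Slaw::Parse::ParseError.
class SketchParseError < StandardError
  attr_reader :line, :column

  def initialize(message, line: 0, column: 0)
    super(message)
    @line = line
    @column = column
  end
end

# A stub exposing the parser methods text_to_syntax_tree relies on.
class StubParser
  attr_reader :failure_reason, :failure_line, :failure_column, :last_options

  def parse(text, options = {})
    @last_options = options
    return [:tree, text] unless text.empty?

    # record why and where the parse failed, then signal failure with nil
    @failure_reason = 'Expected some text'
    @failure_line = 1
    @failure_column = 1
    nil
  end
end

def sketch_parse(parser, defaults, text, overrides = {})
  # per-call options win; the defaults hash itself is not modified
  options = defaults.dup.update(overrides)
  tree = parser.parse(text, options)
  return tree unless tree.nil?

  raise SketchParseError.new(parser.failure_reason || "Couldn't match to grammar",
                             line: parser.failure_line || 0,
                             column: parser.failure_column || 0)
end

parser = StubParser.new
defaults = { root: :act }
tree = sketch_parse(parser, defaults, 'Some text', { root: :body })
```

Calling `sketch_parse(parser, defaults, '')` raises `SketchParseError` with `line` and `column` both set to 1.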

#to_xml(doc) ⇒ String

Serialise a Nokogiri::XML::Document into a string

Parameters:

  • doc (Nokogiri::XML::Document)

    document

Returns:

  • (String)

    pretty printed string



# File 'lib/slaw/parse/builder.rb', line 146

def to_xml(doc)
  doc.to_xml(indent: 2)
end

#xml_from_syntax_tree(tree) ⇒ String

Generate an XML document from the given syntax tree. You should still run #postprocess on the resulting XML to normalise it.

Parameters:

  • tree (Object)

    a Treetop::Runtime::SyntaxNode object

Returns:

  • (String)

    an XML string



# File 'lib/slaw/parse/builder.rb', line 111

def xml_from_syntax_tree(tree)
  s = ""
  builder = ::Builder::XmlMarkup.new(indent: 2, target: s)

  builder.instruct! :xml, :version=>"1.0", :encoding=>"UTF-8"
  builder.akomaNtoso("xmlns:xsi"=> "http://www.w3.org/2001/XMLSchema-instance", 
                     "xsi:schemaLocation" => "http://www.akomantoso.org/2.0 akomantoso20.xsd",
                     "xmlns" => NS) { |b|
    args = [b]

    # should we provide an id prefix?
    arity = tree.method('to_xml').arity 
    arity = arity.abs-1 if arity < 0
    args << (fragment_id_prefix || "") if arity > 1

    tree.to_xml(*args)
  }

  s
end
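
The arity check decides whether the tree's to_xml receives an id prefix. The sketch below isolates that arithmetic with stand-in node classes. `to_xml_args` and the three node classes are hypothetical and only show how Ruby's Method#arity behaves for different signatures.

```ruby
# Hypothetical helper: decides which arguments to pass to an object's
# to_xml, using the same arity arithmetic as xml_from_syntax_tree.
def to_xml_args(tree, builder, prefix)
  args = [builder]
  arity = tree.method(:to_xml).arity
  # a negative arity of -n - 1 means n required arguments plus optional ones
  arity = arity.abs - 1 if arity < 0
  args << (prefix || '') if arity > 1
  args
end

# Stand-in tree nodes with different to_xml signatures.
class PlainNode
  def to_xml(b); end
end

class PrefixedNode
  def to_xml(b, id_prefix); end
end

class OptionalPrefixNode
  def to_xml(b, id_prefix = ''); end
end

to_xml_args(PlainNode.new, :b, 'frag-')          # => [:b]
to_xml_args(PrefixedNode.new, :b, nil)           # => [:b, ""]
to_xml_args(OptionalPrefixNode.new, :b, 'frag-') # => [:b]
```

Under this arithmetic only required parameters are counted, so a to_xml that declares the prefix as an optional parameter does not receive it.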