Class: Slaw::Parse::Builder

Inherits:
Object
Includes:
Logging, Namespace
Defined in:
lib/slaw/parse/builder.rb

Overview

The primary class for building Akoma Ntoso documents from plain text documents.

The builder uses a grammar to break down a plain-text version of an act into a syntax tree. This tree can then be serialized into an Akoma Ntoso-compatible XML document.

Examples:

Parse some text into a well-formed document

builder = Slaw::Parse::Builder.new(parser: parser)
xml = builder.parse_text(text)
doc = builder.parse_xml(xml)
builder.postprocess(doc)

A quicker way to build a well-formed document

doc = builder.parse_and_process_text(text)

Constant Summary

Constants included from Namespace

Namespace::NS

Instance Attribute Summary

Instance Method Summary

Methods included from Logging

#logger

Constructor Details

#initialize(opts = {}) ⇒ Builder

Create a new builder.

Specify either `:parser` or `:grammar_file` and `:grammar_class`.

Parameters:

  • opts (Hash) (defaults to: {})

    a customizable set of options

Options Hash (opts):

  • :parser (Treetop::Runtime::CompiledParser)

    parser to use

  • :parse_options (Hash)

    options to pass to the parser



# File 'lib/slaw/parse/builder.rb', line 45

def initialize(opts={})
  @parser = opts[:parser]
  @parse_options = opts[:parse_options] || {}
  @force_ascii = false
end

Instance Attribute Details

#force_ascii ⇒ Object

Should parsing re-encode the string as ASCII?



# File 'lib/slaw/parse/builder.rb', line 37

def force_ascii
  @force_ascii
end

#fragment_id_prefix ⇒ Object

Prefix to use when generating IDs for fragments



# File 'lib/slaw/parse/builder.rb', line 34

def fragment_id_prefix
  @fragment_id_prefix
end

#parse_options ⇒ Object

Additional hash of options to be provided to the parser when parsing.



# File 'lib/slaw/parse/builder.rb', line 28

def parse_options
  @parse_options
end

#parser ⇒ Object

The parser to use



# File 'lib/slaw/parse/builder.rb', line 31

def parser
  @parser
end

Instance Method Details

#adjust_blocklists(doc) ⇒ Object

Adjust blocklists:

  • nest them correctly

  • change preceding p tags into listIntroductions

Parameters:

  • doc (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 199

def adjust_blocklists(doc)
  logger.info("Adjusting blocklists")

  Slaw::Parse::Blocklists.nest_blocklists(doc)
  Slaw::Parse::Blocklists.fix_intros(doc)
end

#escape_utf8(text) ⇒ Object

Use %-encoding to escape every character outside the US-ASCII range, as well as % itself.

This can improve parsing performance substantially: in Ruby, indexing into a UTF-8 string takes linear time, while indexing into a US-ASCII string takes constant time.

This option can only be used if the grammar doesn’t include non-ASCII literals.

See github.com/cjheath/treetop/issues/31



# File 'lib/slaw/parse/builder.rb', line 106

def escape_utf8(text)
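  # Code points 0..126 except '%' are left as-is; anything else is escaped.
  unsafe = (0..126).to_a - ['%'.ord]
  unsafe = unsafe.map { |i| '\u%04x' % i }
  # Negated character class: matches every character that needs escaping.
  unsafe = Regexp.new('[^' + unsafe.join('') + ']')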

  URI::DEFAULT_PARSER.escape(text, unsafe)
end
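
The escape and unescape round trip can be sketched without the URI module. This is a minimal illustration, not the library's implementation: percent_escape and percent_unescape are hypothetical helper names, and the character class is an assumption chosen to mirror the range described above (everything outside code points 0..126, plus % itself).

```ruby
# Hypothetical stand-in for #escape_utf8, using only core String methods.
def percent_escape(text)
  # Match "%" and every character outside the range 0x00..0x7E.
  escaped = text.gsub(/[^\x00-\x24\x26-\x7E]/) do |ch|
    # Replace the character with %XX for each of its bytes.
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
  # The result contains only ASCII bytes, so it can be tagged as US-ASCII.
  escaped.force_encoding(Encoding::US_ASCII)
end

# Hypothetical stand-in for #unescape_utf8.
def percent_unescape(text)
  # Work on raw bytes, turn each %XX back into one byte, then re-tag as UTF-8.
  text.b.gsub(/%(\h\h)/) { Regexp.last_match(1).hex.chr }
      .force_encoding(Encoding::UTF_8)
end

escaped = percent_escape("Café 100%")
puts escaped                     # Caf%C3%A9 100%25
puts escaped.encoding            # US-ASCII
puts percent_unescape(escaped)   # Café 100%
```

Because % is itself escaped, a literal "%C3" in the input cannot be confused with an escaped byte when the text is unescaped afterwards.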

#parse_and_process_text(text, parse_options = {}) ⇒ Nokogiri::XML::Document

Do all the work necessary to parse text into a well-formed XML document.

Parameters:

  • text (String)

    the text to parse

  • parse_options (Hash) (defaults to: {})

    options to pass to the parser

Returns:

  • (Nokogiri::XML::Document)

    a well-formed document



# File 'lib/slaw/parse/builder.rb', line 57

def parse_and_process_text(text, parse_options={})
  postprocess(parse_xml(parse_text(text, parse_options)))
end

#parse_text(text, parse_options = {}) ⇒ String

Parse text into XML. You should still run #postprocess on the resulting XML to normalise it.

Parameters:

  • text (String)

    the text to parse

  • parse_options (Hash) (defaults to: {})

    options to pass to the parser

Returns:

  • (String)

    an XML string



# File 'lib/slaw/parse/builder.rb', line 83

def parse_text(text, parse_options={})
  text = preprocess(text)

  text = escape_utf8(text) if @force_ascii

  tree = text_to_syntax_tree(text, parse_options)
  xml = xml_from_syntax_tree(tree)

  xml = unescape_utf8(xml) if @force_ascii

  xml
end

#parse_xml(xml) ⇒ Nokogiri::XML::Document

Parse a string into a Nokogiri::XML::Document

Parameters:

  • xml (String)

    string to parse

Returns:

  • (Nokogiri::XML::Document)


# File 'lib/slaw/parse/builder.rb', line 169

def parse_xml(xml)
  Nokogiri::XML(xml, &:noblanks)
end

#postprocess(doc) ⇒ Nokogiri::XML::Document

Postprocess an XML document.

Parameters:

  • doc (Nokogiri::XML::Document)

Returns:

  • (Nokogiri::XML::Document)

    the updated document



# File 'lib/slaw/parse/builder.rb', line 187

def postprocess(doc)
  adjust_blocklists(doc)

  doc
end

#preprocess(text) ⇒ String

Pre-process text just before parsing it using the grammar.

Parameters:

  • text (String)

    the text to preprocess

Returns:

  • (String)

    text ready to parse



# File 'lib/slaw/parse/builder.rb', line 65

def preprocess(text)
  # our grammar doesn't handle inline table cells; instead, we break
  # inline cells into block-style cells

  # first, find all the tables
  text.gsub(/{\|(?!\|}).*?\|}/m) do |table|
    # on each table line, split inline cells into block cells
    table.split("\n").map { |line| line.gsub(/(\|\||!!)/) { |m| "\n" + m[0]} }.join("\n")
  end
end
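
To see what the preprocessing step does to a table, the same two regular expressions can be run on a small sample. This is a sketch for illustration: split_inline_cells is a hypothetical standalone copy of the method body, and the sample table is made up.

```ruby
# Hypothetical standalone copy of the preprocessing step, for illustration only.
def split_inline_cells(text)
  # Find each table, from "{|" up to the nearest "|}".
  text.gsub(/\{\|(?!\|\}).*?\|\}/m) do |table|
    table.split("\n").map do |line|
      # Break before each inline separator and keep a single marker character,
      # so "||" starts a new "|" cell and "!!" starts a new "!" header cell.
      line.gsub(/\|\||!!/) { |marker| "\n" + marker[0] }
    end.join("\n")
  end
end

sample = <<~TEXT.chomp
  Intro text
  {|
  ! Col A !! Col B
  |-
  | one || two
  |}
TEXT

puts split_inline_cells(sample)
```

Each inline cell ends up on its own line, and text outside the table is left untouched. Any space that preceded the separator stays at the end of the previous line.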

#text_to_syntax_tree(text, parse_options = {}) ⇒ Object

Parse plain text into a syntax tree.

Parameters:

  • text (String)

    the text to parse

  • parse_options (Hash) (defaults to: {})

    options to pass to the parser

Returns:

  • (Object)

    the root of the resulting parse tree, usually a Treetop::Runtime::SyntaxNode object



# File 'lib/slaw/parse/builder.rb', line 124

def text_to_syntax_tree(text, parse_options={})
  logger.info("Parsing...")
  parse_options = @parse_options.dup.update(parse_options)
  tree = @parser.parse(text, parse_options)
  logger.info("Parsed!")

  if tree.nil?
    raise Slaw::Parse::ParseError.new(@parser.failure_reason || "Couldn't match to grammar",
                                      line: @parser.failure_line || 0,
                                      column: @parser.failure_column || 0)
  end

  tree
end
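
The option merging and error handling above can be tried in isolation with a stub parser. Everything in this sketch is hypothetical: DemoParser only mimics the four parser methods the builder calls (parse, failure_reason, failure_line, failure_column), DemoParseError stands in for Slaw::Parse::ParseError, and the option keys are illustrative.

```ruby
# Hypothetical stand-in for Slaw::Parse::ParseError: carries a line and column.
class DemoParseError < StandardError
  attr_reader :line, :column

  def initialize(message, line: 0, column: 0)
    super(message)
    @line = line
    @column = column
  end
end

# Hypothetical stub that mimics the parser interface the builder relies on.
class DemoParser
  attr_reader :failure_reason, :failure_line, :failure_column, :last_options

  # Returns a fake tree for text starting with "1.", otherwise nil.
  def parse(text, options = {})
    @last_options = options
    return [:document, text] if text.start_with?("1.")

    @failure_reason = "Expected a numbered section"
    @failure_line = 1
    @failure_column = 1
    nil
  end
end

# Mirrors the method above: merge options, parse, raise if no tree came back.
def demo_text_to_syntax_tree(parser, defaults, text, overrides = {})
  options = defaults.merge(overrides) # per-call options win over defaults
  tree = parser.parse(text, options)

  if tree.nil?
    raise DemoParseError.new(parser.failure_reason || "Couldn't match to grammar",
                             line: parser.failure_line || 0,
                             column: parser.failure_column || 0)
  end

  tree
end

parser = DemoParser.new
demo_text_to_syntax_tree(parser, { root: :act }, "1. Title", { root: :chapter })
p parser.last_options # the per-call :root replaced the default

begin
  demo_text_to_syntax_tree(parser, {}, "no section number")
rescue DemoParseError => e
  puts "#{e.message} at line #{e.line}, column #{e.column}"
end
```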

#to_xml(doc) ⇒ String

Serialise a Nokogiri::XML::Document into a string

Parameters:

  • doc (Nokogiri::XML::Document)

    document

Returns:

  • (String)

    pretty printed string



# File 'lib/slaw/parse/builder.rb', line 178

def to_xml(doc)
  doc.to_xml(indent: 2)
end

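#unescape_utf8(xml) ⇒ Object

Reverse the %-encoding applied by #escape_utf8.

# File 'lib/slaw/parse/builder.rb', line 114

def unescape_utf8(xml)
  # URI.unescape was removed in Ruby 3.0; use the parser object, as #escape_utf8 does
  URI::DEFAULT_PARSER.unescape(xml)
end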

#xml_from_syntax_tree(tree) ⇒ String

Generate an XML document from the given syntax tree. You should still run #postprocess on the resulting XML to normalise it.

Parameters:

  • tree (Object)

    a Treetop::Runtime::SyntaxNode object

Returns:

  • (String)

    an XML string



# File 'lib/slaw/parse/builder.rb', line 145

def xml_from_syntax_tree(tree)
  builder = ::Nokogiri::XML::Builder.new

  builder.akomaNtoso("xmlns:xsi"=> "http://www.w3.org/2001/XMLSchema-instance", 
                     "xsi:schemaLocation" => "http://www.akomantoso.org/2.0 akomantoso20.xsd",
                     "xmlns" => NS) do |b|
    args = [b]

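    # Should we provide an id prefix? Only if to_xml requires a second argument.
    # A negative arity signals optional arguments; abs - 1 is the required count.
    arity = tree.method('to_xml').arity
    arity = arity.abs - 1 if arity < 0
    args << (fragment_id_prefix || "") if arity > 1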

    tree.to_xml(*args)
  end

  builder.to_xml(encoding: 'UTF-8')
end
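
The arity check above decides whether the id prefix is passed to to_xml. A small sketch with made-up node classes shows the three cases. The class names and call_to_xml are hypothetical; only Method#arity is real Ruby behaviour.

```ruby
# Hypothetical node classes; only the to_xml signatures matter here.
class RequiredOnly
  def to_xml(b)
    [b]
  end
end

class RequiresPrefix
  def to_xml(b, id_prefix)
    [b, id_prefix]
  end
end

class OptionalPrefix
  def to_xml(b, id_prefix = 'default-')
    [b, id_prefix]
  end
end

# Mirrors the argument-building logic shown above.
def call_to_xml(node, b, prefix)
  args = [b]
  arity = node.method(:to_xml).arity
  # Negative arity means optional arguments: -(required + 1).
  arity = arity.abs - 1 if arity < 0
  args << (prefix || "") if arity > 1
  node.to_xml(*args)
end

p call_to_xml(RequiredOnly.new, :b, "sec-")    # [:b]
p call_to_xml(RequiresPrefix.new, :b, "sec-")  # [:b, "sec-"]
p call_to_xml(OptionalPrefix.new, :b, "sec-")  # [:b, "default-"]
```

Note the third case: a to_xml whose prefix parameter is optional has only one required argument, so the prefix is not passed and the method's own default applies.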