Class: Slaw::Parse::Builder
- Inherits: Object
- Defined in: lib/slaw/parse/builder.rb
Overview
The primary class for building Akoma Ntoso documents from plain text documents.
The builder uses a grammar to break down a plain-text version of an act into a syntax tree. This tree can then be serialized into an Akoma Ntoso compatible XML document.
Constant Summary
Constants included from Namespace
Instance Attribute Summary
- #force_ascii ⇒ Object
  Should parsing re-encode the string as ASCII?
- #fragment_id_prefix ⇒ Object
  Prefix to use when generating IDs for fragments.
- #parse_options ⇒ Object
  Additional hash of options to be provided to the parser when parsing.
- #parser ⇒ Object
  The parser to use.
Instance Method Summary
- #adjust_blocklists(doc) ⇒ Object
  Adjust blocklists.
- #escape_utf8(text) ⇒ Object
  Use %-encoding to escape everything outside of the US_ASCII range, including encoding % itself.
- #initialize(opts = {}) ⇒ Builder (constructor)
  Create a new builder.
- #parse_and_process_text(text, parse_options = {}) ⇒ Nokogiri::XML::Document
  Do all the work necessary to parse text into a well-formed XML document.
- #parse_text(text, parse_options = {}) ⇒ String
  Parse text into XML.
- #parse_xml(xml) ⇒ Nokogiri::XML::Document
  Parse a string into a Nokogiri::XML::Document.
- #postprocess(doc) ⇒ Nokogiri::XML::Document
  Postprocess an XML document.
- #preprocess(text) ⇒ String
  Pre-process text just before parsing it using the grammar.
- #text_to_syntax_tree(text, parse_options = {}) ⇒ Object
  Parse plain text into a syntax tree.
- #to_xml(doc) ⇒ String
  Serialise a Nokogiri::XML::Document into a string.
- #unescape_utf8(xml) ⇒ Object
- #xml_from_syntax_tree(tree) ⇒ String
  Generate an XML document from the given syntax tree.
Methods included from Logging
Constructor Details
#initialize(opts = {}) ⇒ Builder
Create a new builder.
Specify either `:parser` or `:grammar_file` and `:grammar_class`.
    # File 'lib/slaw/parse/builder.rb', line 45
    def initialize(opts={})
      @parser = opts[:parser]
      @parse_options = opts[:parse_options] || {}
      @force_ascii = false
    end
Instance Attribute Details
#force_ascii ⇒ Object
Should parsing re-encode the string as ASCII?
    # File 'lib/slaw/parse/builder.rb', line 37
    def force_ascii
      @force_ascii
    end
#fragment_id_prefix ⇒ Object
Prefix to use when generating IDs for fragments
    # File 'lib/slaw/parse/builder.rb', line 34
    def fragment_id_prefix
      @fragment_id_prefix
    end
#parse_options ⇒ Object
Additional hash of options to be provided to the parser when parsing.
    # File 'lib/slaw/parse/builder.rb', line 28
    def parse_options
      @parse_options
    end
#parser ⇒ Object
The parser to use
    # File 'lib/slaw/parse/builder.rb', line 31
    def parser
      @parser
    end
Instance Method Details
#adjust_blocklists(doc) ⇒ Object
Adjust blocklists:
- nest them correctly
- change preceding p tags into listIntroductions
    # File 'lib/slaw/parse/builder.rb', line 199
    def adjust_blocklists(doc)
      logger.info("Adjusting blocklists")

      Slaw::Parse::Blocklists.nest_blocklists(doc)
      Slaw::Parse::Blocklists.fix_intros(doc)
    end
#escape_utf8(text) ⇒ Object
Use %-encoding to escape everything outside of the US_ASCII range, including encoding % itself.
This can have a huge performance benefit: looking up a character by index is linear-time on UTF-8 strings in Ruby, but constant-time on US_ASCII encoded strings.
This option can only be used if the grammar doesn't include non-ASCII literals.
    # File 'lib/slaw/parse/builder.rb', line 106
    def escape_utf8(text)
      unsafe = (0..126).to_a - ['%'.ord]
      unsafe = unsafe.map { |i| '\u%04x' % i }
      unsafe = Regexp.new('[^' + unsafe.join('') + ']')

      URI::DEFAULT_PARSER.escape(text, unsafe)
    end
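To make the escaping scheme concrete, here is a minimal round-trip sketch using only core Ruby. The helper names `escape_non_ascii` and `unescape_non_ascii` are hypothetical and the code uses `String#gsub` rather than the `URI` helpers, so it illustrates the idea rather than reproducing the library's implementation.

```ruby
# Hypothetical stand-ins for Builder#escape_utf8 / #unescape_utf8.

# Percent-encode "%" itself plus every byte from 0x7F upwards.
def escape_non_ascii(text)
  bytes = text.b # binary copy, so the regexp sees individual bytes
  escaped = bytes.gsub(/[%\x7F-\xFF]/n) { |byte| format('%%%02X', byte.ord) }
  escaped.force_encoding(Encoding::US_ASCII)
end

# Reverse the escaping and reinterpret the resulting bytes as UTF-8.
def unescape_non_ascii(text)
  bytes = text.b
  raw = bytes.gsub(/%(\h\h)/n) { Regexp.last_match(1).hex.chr }
  raw.force_encoding(Encoding::UTF_8)
end

original = "Café 100% §1"
escaped = escape_non_ascii(original)

puts escaped                               # Caf%C3%A9 100%25 %C2%A71
puts escaped.ascii_only?                   # true
puts unescape_non_ascii(escaped) == original # true
```

Because "%" is escaped as well, a literal "%41" in the input survives the round trip instead of being decoded to "A".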
#parse_and_process_text(text, parse_options = {}) ⇒ Nokogiri::XML::Document
Do all the work necessary to parse text into a well-formed XML document.
    # File 'lib/slaw/parse/builder.rb', line 57
    def parse_and_process_text(text, parse_options={})
      postprocess(parse_xml(parse_text(text, parse_options)))
    end
#parse_text(text, parse_options = {}) ⇒ String
Parse text into XML. You should still run #postprocess on the resulting XML to normalise it.
    # File 'lib/slaw/parse/builder.rb', line 83
    def parse_text(text, parse_options={})
      text = preprocess(text)

      text = escape_utf8(text) if @force_ascii

      tree = text_to_syntax_tree(text, parse_options)
      xml = xml_from_syntax_tree(tree)

      xml = unescape_utf8(xml) if @force_ascii

      xml
    end
#parse_xml(xml) ⇒ Nokogiri::XML::Document
Parse a string into a Nokogiri::XML::Document
    # File 'lib/slaw/parse/builder.rb', line 169
    def parse_xml(xml)
      Nokogiri::XML(xml, &:noblanks)
    end
#postprocess(doc) ⇒ Nokogiri::XML::Document
Postprocess an XML document.
    # File 'lib/slaw/parse/builder.rb', line 187
    def postprocess(doc)
      adjust_blocklists(doc)

      doc
    end
#preprocess(text) ⇒ String
Pre-process text just before parsing it using the grammar.
    # File 'lib/slaw/parse/builder.rb', line 65
    def preprocess(text)
      # our grammar doesn't handle inline table cells; instead, we break
      # inline cells into block-style cells

      # first, find all the tables
      text.gsub(/{\|(?!\|}).*?\|}/m) do |table|
        # on each table line, split inline cells into block cells
        table.split("\n").map { |line| line.gsub(/(\|\||!!)/) { |m| "\n" + m[0]} }.join("\n")
      end
    end
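The effect of this substitution is easiest to see on a small table. The sketch below copies the same two regular expressions into a standalone function so it runs without the gem. The name `split_inline_cells` and the sample table are assumptions made for illustration.

```ruby
# Standalone copy of the substitution performed by #preprocess.
# `split_inline_cells` is a hypothetical name, not part of the library.
def split_inline_cells(text)
  # find each table, from "{|" up to the first "|}"
  text.gsub(/\{\|(?!\|\}).*?\|\}/m) do |table|
    # on each line, start a new line before every inline "||" or "!!",
    # keeping only the first character of the separator
    table.split("\n").map { |line| line.gsub(/\|\||!!/) { |m| "\n" + m[0] } }.join("\n")
  end
end

table = <<~TEXT.chomp
  {|
  ! Name !! Value
  |-
  | a || b
  |}
TEXT

puts split_inline_cells(table)
```

Each inline separator becomes a line break followed by a single `!` or `|`, so every cell ends up on its own line. The space that preceded the separator is left in place as trailing whitespace.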
#text_to_syntax_tree(text, parse_options = {}) ⇒ Object
Parse plain text into a syntax tree.
    # File 'lib/slaw/parse/builder.rb', line 124
    def text_to_syntax_tree(text, parse_options={})
      logger.info("Parsing...")
      parse_options = @parse_options.dup.update(parse_options)
      tree = @parser.parse(text, parse_options)
      logger.info("Parsed!")

      if tree.nil?
        raise Slaw::Parse::ParseError.new(@parser.failure_reason || "Couldn't match to grammar",
                                          line: @parser.failure_line || 0,
                                          column: @parser.failure_column || 0)
      end

      tree
    end
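Two behaviours in this method are worth illustrating: per-call options are merged over the builder-level defaults, and a `nil` tree is turned into an error carrying the parser's failure details. The sketch below imitates both with stand-in classes. `SketchParseError`, `StubParser`, `sketch_text_to_syntax_tree`, and the `:root` and `:verbose` option names are all hypothetical, chosen only so the example runs without the gem.

```ruby
# Stand-in for Slaw::Parse::ParseError (hypothetical, same keyword arguments).
class SketchParseError < StandardError
  attr_reader :line, :column

  def initialize(message, line: 0, column: 0)
    super(message)
    @line = line
    @column = column
  end
end

# A fake parser exposing the interface the method relies on:
# #parse, #failure_reason, #failure_line and #failure_column.
class StubParser
  attr_reader :failure_reason, :failure_line, :failure_column, :last_options

  def parse(text, options = {})
    @last_options = options
    return [:tree, text] unless text.empty?

    @failure_reason = 'Expected some text'
    @failure_line = 1
    @failure_column = 1
    nil
  end
end

def sketch_text_to_syntax_tree(parser, defaults, text, parse_options = {})
  # per-call options win over the defaults; the defaults hash is not mutated
  options = defaults.dup.update(parse_options)
  tree = parser.parse(text, options)
  return tree unless tree.nil?

  raise SketchParseError.new(parser.failure_reason || "Couldn't match to grammar",
                             line: parser.failure_line || 0,
                             column: parser.failure_column || 0)
end

parser = StubParser.new
defaults = { root: :act, verbose: false }

sketch_text_to_syntax_tree(parser, defaults, 'Some text', { verbose: true })
p parser.last_options # :verbose comes from the call, :root from the defaults

begin
  sketch_text_to_syntax_tree(parser, defaults, '')
rescue SketchParseError => e
  puts "#{e.message} at line #{e.line}, column #{e.column}"
end
```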
#to_xml(doc) ⇒ String
Serialise a Nokogiri::XML::Document into a string
    # File 'lib/slaw/parse/builder.rb', line 178
    def to_xml(doc)
      doc.to_xml(indent: 2)
    end
#unescape_utf8(xml) ⇒ Object
    # File 'lib/slaw/parse/builder.rb', line 114
    def unescape_utf8(xml)
      URI.unescape(xml)
    end
#xml_from_syntax_tree(tree) ⇒ String
Generate an XML document from the given syntax tree. You should still run #postprocess on the resulting XML to normalise it.
    # File 'lib/slaw/parse/builder.rb', line 145
    def xml_from_syntax_tree(tree)
      builder = ::Nokogiri::XML::Builder.new

      builder.akomaNtoso("xmlns:xsi"=> "http://www.w3.org/2001/XMLSchema-instance", "xsi:schemaLocation" => "http://www.akomantoso.org/2.0 akomantoso20.xsd",
                         "xmlns" => NS) do |b|
        args = [b]

        # should we provide an id prefix?
        arity = tree.method('to_xml').arity
        arity = arity.abs-1 if arity < 0
        args << (fragment_id_prefix || "") if arity > 1

        tree.to_xml(*args)
      end

      builder.to_xml(encoding: 'UTF-8')
    end
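The arity check decides whether the fragment ID prefix is passed to the root node's `to_xml`. The following sketch isolates that logic so its behaviour can be checked against different method signatures. `to_xml_args` and the three node classes are hypothetical names used only for illustration.

```ruby
# Mirrors the arity check in #xml_from_syntax_tree.
def to_xml_args(tree, builder, fragment_id_prefix)
  args = [builder]
  arity = tree.method(:to_xml).arity
  # a negative arity of -n-1 means n required arguments plus optional ones
  arity = arity.abs - 1 if arity < 0
  args << (fragment_id_prefix || '') if arity > 1
  args
end

class NodeWithoutPrefix
  def to_xml(b); end
end

class NodeWithPrefix
  def to_xml(b, id_prefix); end
end

class NodeWithOptionalPrefix
  def to_xml(b, id_prefix = ''); end
end

p to_xml_args(NodeWithoutPrefix.new, :b, 'frag-')      # [:b]
p to_xml_args(NodeWithPrefix.new, :b, 'frag-')         # [:b, "frag-"]
p to_xml_args(NodeWithPrefix.new, :b, nil)             # [:b, ""]
p to_xml_args(NodeWithOptionalPrefix.new, :b, 'frag-') # [:b]
```

Under this logic only required parameters count, so a node that declares the prefix as an optional argument does not receive it.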