Class: Agio::Broker
- Inherits: Nokogiri::XML::SAX::Document
- Inheritance chain: Object, Nokogiri::XML::SAX::Document, Agio::Broker
- Defined in: lib/agio/broker.rb
Overview
The Broker class transforms HTML into an intermediate format that Agio can then convert into Markdown text.
The Broker keeps two primary data structures: the block list (#blocks) and the block stack (#stack).
The block list is an array of completed blocks for the document; processing these blocks in order is what produces the Markdown text.
The block stack is where the blocks reside during creation.
Agio::Broker is a Nokogiri::XML::SAX::Document and can be used by the Nokogiri SAX parser to fill the block list.
Algorithm
Assume a fairly simple HTML document:
<h1>Title</h1>
<p>Lorem ipsum dolor sit amet,
<strong>consectetur</strong> adipiscing.</p>
When the first element (“h1”) is observed, a new block will be created on the stack:
Blocks[ ]
Stack [ block(h1) ]
The text will be appended to the block:
Blocks[ ]
Stack [ block(h1, Title) ]
When the closing tag for the element is observed, the block will be popped from the stack and pushed to the end of the blocks list.
Blocks[ block(h1, Title) ]
Stack [ ]
The same happens for the second element (“p”) and its text:
Blocks[ block(h1, Title) ]
Stack [ block(p, Lorem ipsum dolor sit amet) ]
When the “strong” element is received, though, it and its text are pushed onto the stack:
Blocks[ block(h1, Title) ]
Stack [ block(p, Lorem ipsum dolor sit amet),
block(strong, consectetur)
]
When the closing tag for the “strong” element is received, the “strong” block is popped off the stack and appended to the block at the top of the stack.
Blocks[ block(h1, Title) ]
Stack [ block(p, Lorem ipsum dolor sit amet,
block(strong, consectetur))
]
Finally, the text is appended, the closing tag for the “p” element shows up, and that block is popped off the stack and appended to the blocks list:
Blocks[ block(h1, Title),
block(p, Lorem ipsum dolor sit amet,
block(strong, consectetur), adipiscing)
]
Stack [ ]
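The walkthrough above can be sketched in a few lines of plain Ruby. This is only an illustrative model, not Agio's implementation: ToyBlock and ToyBroker are hypothetical names, and the sketch assumes well-formed input with none of the broken-HTML handling described below.

```ruby
# Hypothetical stand-in for Agio's block objects; not the library's own class.
ToyBlock = Struct.new(:name, :contents) do
  def to_s
    parts = contents.map(&:to_s).join(", ")
    parts.empty? ? "block(#{name})" : "block(#{name}, #{parts})"
  end
end

# Hypothetical model of the block list and block stack.
class ToyBroker
  attr_reader :blocks, :stack

  def initialize
    @blocks = []
    @stack = []
  end

  # An opening tag puts a new, empty block on the stack.
  def start_element(name)
    @stack.push(ToyBlock.new(name, []))
  end

  # Text is appended to the block at the top of the stack.
  def characters(text)
    @stack.last.contents << text
  end

  # A closing tag pops the top block. It joins its parent when one is
  # still open; otherwise it is complete and moves to the block list.
  def end_element(_name)
    block = @stack.pop
    if @stack.empty?
      @blocks << block
    else
      @stack.last.contents << block
    end
  end
end

broker = ToyBroker.new
broker.start_element("h1")
broker.characters("Title")
broker.end_element("h1")
broker.start_element("p")
broker.characters("Lorem ipsum dolor sit amet,")
broker.start_element("strong")
broker.characters("consectetur")
broker.end_element("strong")
broker.characters("adipiscing.")
broker.end_element("p")

puts broker.blocks.map(&:to_s)
```

Keeping open blocks on a stack means a closing tag only ever touches the top of the stack, which is why nested elements such as “strong” end up inside their parent rather than in the block list.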
Handling Broken HTML
Agio tries to be sane when dealing with broken HTML.
Missing Block Elements
It is possible to have missing block elements. In this case, an implicit “p” block element will be assumed.
Lorem ipsum dolor sit amet,
When encountered, this will be treated as:
Stack [ block(p, Lorem ipsum dolor sit amet,) ]
If a span element is encountered, an implicit “p” block element will still be assumed.
<em>Lorem ipsum dolor sit amet,</em>
Will produce:
Stack [ block(p),
block(em, Lorem ipsum dolor sit amet,)
]
A special case exists for the “li”, “dt”, and “dd” tags; if they are encountered outside of lists (“ul”, “ol”, or “dl”), implicit list tags will be inserted (“ul” for “li”; “dl” for “dt” or “dd”).
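One way to picture these rules is as a lookup that decides which implicit parent, if any, must be opened first. The sketch below is an assumption-laden illustration: implicit_wrapper, the constant names, and the (incomplete) list of block-level tags are all hypothetical and are not taken from Agio's source.

```ruby
# Hypothetical tables; the block-tag list is illustrative and incomplete.
IMPLICIT_LIST = { "li" => "ul", "dt" => "dl", "dd" => "dl" }.freeze
LIST_TAGS     = %w[ul ol dl].freeze
BLOCK_TAGS    = %w[p h1 h2 h3 h4 h5 h6 blockquote pre ul ol dl li dt dd].freeze

# Returns the name of the implicit wrapper to open before `name`, or nil
# when none is needed. `open_names` lists the currently open tags,
# outermost first. Pass name = nil to ask about bare text.
def implicit_wrapper(name, open_names)
  if IMPLICIT_LIST.key?(name)
    inside_list = open_names.any? { |n| LIST_TAGS.include?(n) }
    return inside_list ? nil : IMPLICIT_LIST[name]
  end
  # Text and span elements only need a wrapper when nothing is open.
  return nil unless open_names.empty?
  BLOCK_TAGS.include?(name) ? nil : "p"
end

p implicit_wrapper(nil, [])      # bare text at the top level
p implicit_wrapper("em", [])     # span element at the top level
p implicit_wrapper("li", [])     # list item outside any list
p implicit_wrapper("li", ["ul"]) # list item inside a list
```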
Unclosed Elements Inside a Block
Things are a little more complex when an element is left unclosed inside a block. Assume the following HTML:
<p>Lorem ipsum dolor sit amet,
<strong>consectetur adipiscing.</p>
Before the closing “p” tag is observed, the stack looks like this:
Stack [ block(p, Lorem ipsum dolor sit amet),
block(strong, consectetur adipiscing)
]
When the closing “p” tag is observed, the Broker sees that the topmost block was not opened with a “p” tag, so it implicitly closes the topmost block as defined above, resulting in:
Blocks[ block(p, Lorem ipsum dolor sit amet,
block(strong, consectetur adipiscing))
]
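The implicit close can be sketched as a loop that keeps popping until the block matching the closing tag has itself been closed. This is a hypothetical model, not the Broker's actual pop method: Node, close_element, and the `implicit` log are illustrative names, and ignoring a stray closing tag is an assumption of the sketch.

```ruby
# Hypothetical block type for this sketch.
Node = Struct.new(:name, :contents)

# Pops blocks until one opened by `name` has been closed. Blocks sitting
# above it are closed implicitly and their names recorded in `implicit`.
def close_element(name, stack, blocks, implicit)
  # Assumption: a closing tag with no matching open block is ignored.
  return if stack.none? { |b| b.name == name }

  loop do
    block = stack.pop
    implicit << block.name unless block.name == name
    if stack.empty?
      blocks << block               # complete: move to the block list
    else
      stack.last.contents << block  # nested: append to the parent
    end
    break if block.name == name
  end
end

stack = [
  Node.new("p", ["Lorem ipsum dolor sit amet,"]),
  Node.new("strong", ["consectetur adipiscing."])
]
blocks = []
implicit = []
close_element("p", stack, blocks, implicit)

puts blocks.first.contents.last.name
puts implicit.inspect
```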
Unclosed Elements Between Blocks
If an HTML element is not nestable (see below), then observing another start tag of that type will cause the existing block to be closed and a new one to be opened. For example:
<p>Lorem ipsum dolor sit amet,
<p>consectetur adipiscing.</p>
If the Broker has processed the first “p” element:
Blocks[ ]
Stack [ block(p, Lorem ipsum dolor sit amet,) ]
When the second “p” opening tag is seen, Agio::Broker treats this as having an implicit closing “p” tag:
Blocks[ block(p, Lorem ipsum dolor sit amet,) ]
Stack [ block(p) ]
Stack [ block(p) ]
This behaviour does not apply to a nestable element.
Nestable HTML Elements
Some HTML elements are considered nestable by Agio::Broker. These currently include “blockquote”, “ol”, and “ul”. When an opening tag for one of these elements is observed, it does not cause an open block of the same type to be closed and shifted to the block list as outlined above. Nestable elements can contain other HTML block elements; “li” elements are special in that they cannot directly contain another “li”, but they can contain other HTML block elements.
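The difference between nestable and non-nestable elements can be sketched as a check made when an opening tag arrives. The names here (NESTABLE, OpenBlock, open_element) are hypothetical, and the sketch makes a simplifying assumption that only the block at the top of the stack is compared; the real Broker may look deeper.

```ruby
# The nestable elements named in the text above.
NESTABLE = %w[blockquote ol ul].freeze

# Hypothetical block type for this sketch.
OpenBlock = Struct.new(:name, :contents)

# Opens a block for `name`. If the top of the stack is a non-nestable
# block of the same type, that block is implicitly closed first.
def open_element(name, stack, blocks)
  top = stack.last
  if top && top.name == name && !NESTABLE.include?(name)
    closed = stack.pop
    if stack.empty?
      blocks << closed
    else
      stack.last.contents << closed
    end
  end
  stack.push(OpenBlock.new(name, []))
end

stack = []
blocks = []
open_element("p", stack, blocks)
stack.last.contents << "Lorem ipsum dolor sit amet,"
open_element("p", stack, blocks)   # implicit </p> for the first paragraph

list_stack = []
list_blocks = []
open_element("ul", list_stack, list_blocks)
open_element("ul", list_stack, list_blocks) # nestable: both stay open

puts blocks.length
puts list_stack.length
```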
Instance Attribute Summary
- #blocks ⇒ Object (readonly)
  The array of completed document subsections.
- #errors ⇒ Object (readonly)
  Errors found while parsing the document.
- #warnings ⇒ Object (readonly)
  Warnings found while parsing the document.
Instance Method Summary
- #cdata_block(string) ⇒ Object
- #characters(string) ⇒ Object
- #comment(string) ⇒ Object
- #end_document ⇒ Object
- #end_element(name) ⇒ Object
- #end_element_namespace(name, prefix = nil, uri = nil) ⇒ Object
- #error(string) ⇒ Object
- #initialize ⇒ Broker (constructor)
  A new instance of Broker.
- #start_document ⇒ Object
  When we.
- #start_element(name, attrs = []) ⇒ Object
- #start_element_namespace(name, attrs = [], prefix = nil, uri = nil, ns = []) ⇒ Object
- #warning(string) ⇒ Object
- #xmldecl(version, encoding, standalone) ⇒ Object
Constructor Details
#initialize ⇒ Broker
Returns a new instance of Broker.
# File 'lib/agio/broker.rb', line 180
def initialize
  @blocks = []
  @stack = []
  @warnings = []
  @errors = []
end
Instance Attribute Details
#blocks ⇒ Object (readonly)
The array of completed document subsections. Each entry is a root object for contained contents. When HTML parsing is complete, this attribute should be read for the structures that must be translated into Markdown.
# File 'lib/agio/broker.rb', line 162
def blocks
  @blocks
end
#errors ⇒ Object (readonly)
Errors found while parsing the document. For example, “<p><em>Foo</p>” will produce an error when the “</p>” is encountered because the “<em>” has not been closed. The logic of Agio::Broker is such that this sort of error is not a problem; it implicitly closes the “<em>”.
# File 'lib/agio/broker.rb', line 175
def errors
  @errors
end
#warnings ⇒ Object (readonly)
Warnings found while parsing the document.
# File 'lib/agio/broker.rb', line 178
def warnings
  @warnings
end
Instance Method Details
#cdata_block(string) ⇒ Object
# File 'lib/agio/broker.rb', line 356
def cdata_block(string)
  push Agio::CData.new(string)
end
#characters(string) ⇒ Object
# File 'lib/agio/broker.rb', line 360
def characters(string)
  return if (stack.empty? or stack[-1].pre?) and string =~ /\A\s+\Z/
  push Agio::Data.new(string)
end
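The guard in #characters can be read in isolation. The sketch below restates the same condition as a standalone predicate; skippable? and its boolean parameters are hypothetical names introduced for illustration, standing in for stack.empty? and stack[-1].pre?.

```ruby
# Matches strings made up entirely of whitespace (at least one character).
WHITESPACE_ONLY = /\A\s+\Z/

# Mirrors the guard: whitespace-only text is dropped when either flag is set.
def skippable?(string, stack_empty, top_is_pre)
  (stack_empty || top_is_pre) && !!(string =~ WHITESPACE_ONLY)
end

p skippable?("\n  ", true, false)   # whitespace between top-level elements
p skippable?(" text ", true, false) # real text is never skipped
p skippable?("\n  ", false, false)  # whitespace inside an open block is kept
```

Note that the pattern requires at least one whitespace character, so an empty string is not treated as skippable.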
#comment(string) ⇒ Object
# File 'lib/agio/broker.rb', line 365
def comment(string)
  push Agio::Comment.new(string)
end
#end_document ⇒ Object
# File 'lib/agio/broker.rb', line 369
def end_document
  pop while not stack.empty?
end
#end_element(name) ⇒ Object
# File 'lib/agio/broker.rb', line 373
def end_element(name)
  pop(name)
end
#end_element_namespace(name, prefix = nil, uri = nil) ⇒ Object
# File 'lib/agio/broker.rb', line 377
def end_element_namespace(name, prefix = nil, uri = nil)
  pop(name, :prefix => prefix, :uri => uri)
end
#error(string) ⇒ Object
# File 'lib/agio/broker.rb', line 381
def error(string)
  errors << string
end
#start_document ⇒ Object
When we
# File 'lib/agio/broker.rb', line 386
def start_document
  pop while not stack.empty?
end
#start_element(name, attrs = []) ⇒ Object
# File 'lib/agio/broker.rb', line 390
def start_element(name, attrs = [])
  # The local variable name was lost in extraction; "options" is assumed here.
  options = if attrs.empty?
              { }
            else
              { :attrs => Hash[attrs] }
            end
  push Agio::Block.new(name, options)
end
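The attribute handling in #start_element relies on Hash[] turning a list of pairs into a hash. The sketch below isolates that step; block_options is a hypothetical helper, and it assumes the SAX parser delivers attrs as an array of [name, value] pairs.

```ruby
# Builds the options hash passed along with a new block.
# Assumption: attrs is an array of [name, value] pairs.
def block_options(attrs = [])
  attrs.empty? ? {} : { :attrs => Hash[attrs] }
end

p block_options([["href", "/about"], ["title", "About"]])
p block_options([])
```

Returning an empty hash when there are no attributes keeps attribute-free blocks from carrying an empty :attrs entry.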
#start_element_namespace(name, attrs = [], prefix = nil, uri = nil, ns = []) ⇒ Object
# File 'lib/agio/broker.rb', line 400
def start_element_namespace(name, attrs = [], prefix = nil, uri = nil, ns = [])
  push Agio::Block.new(name, :attrs => attrs, :prefix => prefix, :uri => uri, :ns => ns)
end
#warning(string) ⇒ Object
# File 'lib/agio/broker.rb', line 405
def warning(string)
  warnings << string
end