Class: Agio::Broker

Inherits:
Nokogiri::XML::SAX::Document
  • Object
Defined in:
lib/agio/broker.rb

Overview

The Broker class transforms HTML into an intermediate format that Agio can then convert into Markdown text.

The Broker keeps two primary data structures: the block list (#blocks) and the block stack (#stack).

The block list is an array of completed blocks for the document that, when processed correctly, will allow the meaningful creation of the Markdown text.

The block stack is where the blocks reside during creation.

Agio::Broker is a Nokogiri::XML::SAX::Document and can be used by the Nokogiri SAX parser to fill the block list.

Algorithm

Assume a fairly simple HTML document:

<h1>Title</h1>
<p>Lorem ipsum dolor sit amet,
<strong>consectetur</strong> adipiscing.</p>

When the first element (“h1”) is observed, a new block will be created on the stack:

Blocks[ ]
Stack [ block(h1) ]

The text will be appended to the block:

Blocks[ ]
Stack [ block(h1, Title) ]

When the closing tag for the element is observed, the block will be popped from the stack and pushed to the end of the blocks list.

Blocks[ block(h1, Title) ]
Stack [ ]

The same happens for the second element (“p”) and its text:

Blocks[ block(h1, Title) ]
Stack [ block(p, Lorem ipsum dolor sit amet) ]

When the “strong” element is received, though, it and its text are pushed onto the stack:

Blocks[ block(h1, Title) ]
Stack [ block(p, Lorem ipsum dolor sit amet),
        block(strong, consectetur)
      ]

When the closing tag for the “strong” element is received, the “strong” block is popped off the stack and appended to the block at the top of the stack.

Blocks[ block(h1, Title) ]
Stack [ block(p, Lorem ipsum dolor sit amet,
              block(strong, consectetur))
      ]

Finally, the text is appended, the closing tag for the “p” element shows up, and that block is popped off the stack and appended to the blocks list:

Blocks[ block(h1, Title),
        block(p, Lorem ipsum dolor sit amet,
              block(strong, consectetur), adipiscing)
      ]
Stack [ ]
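
The walk-through above can be modelled with a few lines of Ruby. This is only a toy sketch: MiniBlock and MiniBroker are hypothetical names, text is stored as plain stripped strings rather than Agio's own objects, and none of the broken-HTML handling is included.

```ruby
# Toy model of the block list / block stack algorithm.
# MiniBlock and MiniBroker are illustrative names, not part of Agio.
MiniBlock = Struct.new(:name, :contents) do
  def to_s
    "block(#{[name, *contents].join(', ')})"
  end
end

class MiniBroker
  attr_reader :blocks, :stack

  def initialize
    @blocks = []
    @stack  = []
  end

  # Opening tag: a new block goes on top of the stack.
  def start_element(name)
    @stack.push(MiniBlock.new(name, []))
  end

  # Text: appended to the block on top of the stack.
  # (This toy ignores text that appears outside any element.)
  def characters(string)
    return if @stack.empty?
    text = string.strip
    @stack.last.contents << text unless text.empty?
  end

  # Closing tag: pop the block; append it to its parent if one is open,
  # otherwise it is complete and moves to the block list.
  def end_element(_name)
    block = @stack.pop
    if @stack.empty?
      @blocks << block
    else
      @stack.last.contents << block
    end
  end
end

# Drive the callbacks by hand in the order a SAX parser would emit them.
broker = MiniBroker.new
broker.start_element("h1")
broker.characters("Title")
broker.end_element("h1")
broker.start_element("p")
broker.characters("Lorem ipsum dolor sit amet,\n")
broker.start_element("strong")
broker.characters("consectetur")
broker.end_element("strong")
broker.characters(" adipiscing.")
broker.end_element("p")

puts broker.blocks.map(&:to_s)
```

After the last call the stack is empty and the block list holds the "h1" block followed by the "p" block, with the "strong" block nested inside the latter.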

Handling Broken HTML

Agio tries to be sane when dealing with broken HTML.

Missing Block Elements

It is possible to have missing block elements. In this case, an implicit “p” block element will be assumed.

Lorem ipsum dolor sit amet,

When encountered, this will be treated as:

Stack [ block(p, Lorem ipsum dolor sit amet,) ]

If a span element is encountered, an implicit “p” block element will still be assumed.

<em>Lorem ipsum dolor sit amet,</em>

Will produce:

Stack [ block(p),
        block(em, Lorem ipsum dolor sit amet,)
      ]

A special case exists for the “li”, “dt”, and “dd” tags; if they are encountered outside of lists (“ul”, “ol”, or “dl”), implicit list tags will be inserted (“ul” for “li”; “dl” for “dt” or “dd”).
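
The implicit-element rules can be summarised as a small decision function. The sketch below is an assumption about how such a check might look, not Agio's code: implicit_parent, the constant names, and the short sample of span-level tags are all hypothetical, and "outside of lists" is interpreted here as "no list element is currently open".

```ruby
# Hypothetical helper: which implicit block (if any) should be opened
# before the incoming item is pushed onto the stack?
LIST_PARENTS = { "li" => "ul", "dt" => "dl", "dd" => "dl" }.freeze
LIST_TAGS    = %w[ul ol dl].freeze
# Assumption: a small sample of span-level tags, for illustration only.
SPAN_TAGS    = %w[em strong a code].freeze

# stack_names: names of the currently open blocks, bottom first.
# item: a tag name, or :text for bare character data.
def implicit_parent(stack_names, item)
  if LIST_PARENTS.key?(item)
    inside_list = stack_names.any? { |name| LIST_TAGS.include?(name) }
    return inside_list ? nil : LIST_PARENTS[item]
  end

  # Bare text or a span element with nothing open gets an implicit "p".
  return "p" if stack_names.empty? && (item == :text || SPAN_TAGS.include?(item))

  nil
end

p implicit_parent([], :text)      # bare text at the top level
p implicit_parent([], "em")       # span element at the top level
p implicit_parent([], "li")       # list item outside any list
p implicit_parent(["ol"], "li")   # list item inside a list
```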

Unclosed Elements Inside a Block

Unclosed elements inside a block are a little more complex, but Agio::Broker tries to deal with them sanely. Assume the following HTML:

<p>Lorem ipsum dolor sit amet,
<strong>consectetur adipiscing.</p>

Before the closing “p” tag is observed, the stack looks like this:

Stack [ block(p, Lorem ipsum dolor sit amet),
        block(strong, consectetur adipiscing)
      ]

When the closing “p” tag is observed, the Broker sees that the topmost block was not opened with a “p” tag, so it implicitly closes the topmost block as described above, resulting in:

Blocks[ block(p, Lorem ipsum dolor sit amet,
              block(strong, consectetur adipiscing))
      ]
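
One way to picture the implicit close is a loop that keeps popping until the block named by the closing tag has itself been popped. The following is a hedged sketch only: Node and close_element are hypothetical names, and ignoring a closing tag that matches nothing on the stack is an assumption, not documented behaviour.

```ruby
# Hypothetical sketch of closing-tag handling; Agio's real #pop may differ.
Node = Struct.new(:name, :contents)

def close_element(stack, blocks, name)
  # Assumption: a closing tag that matches nothing open is ignored.
  return unless stack.any? { |node| node.name == name }

  loop do
    node = stack.pop
    # A finished node joins its parent, or the block list if it has none.
    (stack.empty? ? blocks : stack.last.contents) << node
    break if node.name == name
  end
end

blocks = []
stack  = [
  Node.new("p", ["Lorem ipsum dolor sit amet,"]),
  Node.new("strong", ["consectetur adipiscing."])
]

# "</p>" arrives while "strong" is still open.
close_element(stack, blocks, "p")
```

After the call, the stack is empty and the single "p" block in the block list contains the implicitly closed "strong" block as its last entry.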

Unclosed Elements Between Blocks

If an HTML element is not nestable (see below), then observing another opening tag of that type will cause the existing block to be closed and a new one to be opened. For example:

<p>Lorem ipsum dolor sit amet,
<p>consectetur adipiscing.</p>

If the Broker has processed the first “p” element:

Blocks[ ]
Stack [ block(p, Lorem ipsum dolor sit amet,) ]

When the second “p” opening tag is seen, Agio::Broker treats this as having an implicit closing “p” tag:

Blocks[ block(p, Lorem ipsum dolor sit amet,) ]
Stack [ block(p) ]

This behaviour does not apply to a nestable element.

Nestable HTML Elements

Some HTML elements are considered nestable by Agio::Broker. These currently include “blockquote”, “ol”, and “ul”. When an opening tag for one of these types is observed, it does not cause a current block of the same type to be implicitly closed as outlined above. Nestable elements can contain other HTML block elements; “li” elements are special in that they cannot directly contain another “li”, but they can contain other HTML block elements.
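
The difference between nestable and non-nestable elements comes down to one check when an opening tag arrives. The sketch below uses plain tag names instead of block objects to stay short; NESTABLE and implicitly_closes? are hypothetical names, and only the three element names come from the text above.

```ruby
# Elements that may directly contain another element of the same type.
NESTABLE = %w[blockquote ol ul].freeze

# Does an incoming opening tag implicitly close the block that is
# currently on top of the stack?
def implicitly_closes?(open_name, incoming_name)
  open_name == incoming_name && !NESTABLE.include?(incoming_name)
end

# A second "<p>" arrives while the first "p" is still open.
blocks   = []
stack    = ["p"]
incoming = "p"
blocks << stack.pop if implicitly_closes?(stack.last, incoming)
stack.push(incoming)

# A "<ul>" inside an open "ul" simply nests.
list_stack = ["ul"]
list_stack.push("ul") unless implicitly_closes?(list_stack.last, "ul")
```

In the first case the open "p" moves to the block list before the new one is pushed; in the second both "ul" entries stay on the stack.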

Constructor Details

#initialize ⇒ Broker

Returns a new instance of Broker.



# File 'lib/agio/broker.rb', line 180

def initialize
  @blocks   = []
  @stack    = []
  @warnings = []
  @errors   = []
end

Instance Attribute Details

#blocks ⇒ Object (readonly)

The array of completed document subsections. Each entry is a root object for contained contents. When HTML parsing is complete, this attribute should be read for the structures that must be translated into Markdown.



# File 'lib/agio/broker.rb', line 162

def blocks
  @blocks
end

#errors ⇒ Object (readonly)

Errors found while parsing the document. For example, “<p><em>Foo</p>” will produce an error when the “</p>” is encountered because the “<em>” has not been closed. The logic of Agio::Broker is such that this sort of error is not a problem; it implicitly closes the “<em>”.



# File 'lib/agio/broker.rb', line 175

def errors
  @errors
end

#warnings ⇒ Object (readonly)

Warnings found while parsing the document.



# File 'lib/agio/broker.rb', line 178

def warnings
  @warnings
end

Instance Method Details

#cdata_block(string) ⇒ Object



# File 'lib/agio/broker.rb', line 356

def cdata_block(string)
  push Agio::CData.new(string)
end

#characters(string) ⇒ Object



# File 'lib/agio/broker.rb', line 360

def characters(string)
  return if (stack.empty? or stack[-1].pre?) and string =~ /\A\s+\Z/
  push Agio::Data.new(string)
end
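
The guard in #characters depends on the pattern /\A\s+\Z/, which matches strings made up entirely of whitespace. The snippet below is a small illustration of which strings that pattern accepts; WHITESPACE_ONLY is a hypothetical constant name used only here.

```ruby
# Same pattern as in #characters, given a name for the example.
WHITESPACE_ONLY = /\A\s+\Z/

samples = ["\n  ", " adipiscing.", ""]
results = samples.map { |s| !!(s =~ WHITESPACE_ONLY) }

samples.zip(results).each do |sample, matched|
  puts "#{sample.inspect} whitespace-only? #{matched}"
end
```

Only the first sample matches. Text with surrounding whitespace is not affected, and the empty string does not match because \s+ requires at least one character.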

#comment(string) ⇒ Object



# File 'lib/agio/broker.rb', line 365

def comment(string)
  push Agio::Comment.new(string)
end

#end_document ⇒ Object



# File 'lib/agio/broker.rb', line 369

def end_document
  pop while not stack.empty?
end

#end_element(name) ⇒ Object



# File 'lib/agio/broker.rb', line 373

def end_element(name)
  pop(name)
end

#end_element_namespace(name, prefix = nil, uri = nil) ⇒ Object



# File 'lib/agio/broker.rb', line 377

def end_element_namespace(name, prefix = nil, uri = nil)
  pop(name, :prefix => prefix, :uri => uri)
end

#error(string) ⇒ Object



# File 'lib/agio/broker.rb', line 381

def error(string)
  errors << string
end

#start_document ⇒ Object

When we



# File 'lib/agio/broker.rb', line 386

def start_document
  pop while not stack.empty?
end

#start_element(name, attrs = []) ⇒ Object



# File 'lib/agio/broker.rb', line 390

def start_element(name, attrs = [])
  options = if attrs.empty?
              { }
            else
              { :attrs => Hash[attrs] }
            end

  push Agio::Block.new(name, options)
end
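
The Hash[attrs] call assumes the attributes arrive as an array of [name, value] pairs, which is how Nokogiri's SAX document handler passes them to #start_element. The snippet below repeats the option-building step outside the class to show the resulting shape; the attribute values and the build_options helper are made up for illustration.

```ruby
# Mirrors the option-building logic of #start_element.
def build_options(attrs)
  attrs.empty? ? {} : { :attrs => Hash[attrs] }
end

# Hypothetical attribute pairs for an "a" element.
attrs   = [["href", "http://example.com/"], ["title", "Example"]]
options = build_options(attrs)

puts options[:attrs]["href"]
puts build_options([]).inspect
```

An element without attributes produces an empty options hash rather than an :attrs key pointing at an empty hash.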

#start_element_namespace(name, attrs = [], prefix = nil, uri = nil, ns = []) ⇒ Object



# File 'lib/agio/broker.rb', line 400

def start_element_namespace(name, attrs = [], prefix = nil, uri = nil, ns = [])
  push Agio::Block.new(name, :attrs => attrs, :prefix => prefix,
                       :uri => uri, :ns => ns)
end

#warning(string) ⇒ Object



# File 'lib/agio/broker.rb', line 405

def warning(string)
  warnings << string
end

#xmldecl(version, encoding, standalone) ⇒ Object



# File 'lib/agio/broker.rb', line 409

def xmldecl(version, encoding, standalone)
  push Agio::XMLDecl.new(:version => version, :encoding => encoding,
                         :standalone => standalone)
end