Class: Oga::XML::Lexer

Inherits:
Object
Defined in:
lib/oga/xml/lexer.rb

Overview

Low-level lexer that supports both XML and HTML (using an extra option). To lex HTML input, set the :html option to true when creating an instance of the lexer:

lexer = Oga::XML::Lexer.new('...', :html => true)

This lexer can process both String and IO instances. IO instances are processed line by line, which can greatly reduce memory usage in exchange for a slightly slower runtime.

Thread Safety

Since this class keeps track of internal state, you cannot share the same instance between multiple threads at the same time. For example, the following will not work reliably:

# Don't do this!
lexer   = Oga::XML::Lexer.new('....')
threads = []

2.times do
  threads << Thread.new do
    lexer.advance do |*args|
      p args
    end
  end
end

threads.each(&:join)

However, it is perfectly safe to use a different instance per thread. There is no global state used by this lexer.
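A minimal sketch of the safe pattern, one lexer per thread. `stand_in_lexer` is a hypothetical placeholder so the snippet runs without the gem, and the `:T_TEXT` token it returns is illustrative only. With Oga installed you would call `Oga::XML::Lexer.new(input).lex` inside each thread instead.

```ruby
# Hypothetical stand-in for Oga::XML::Lexer so the sketch runs without the gem.
# It only mimics the shape of #lex: an Array of [type, value, line] entries.
stand_in_lexer = Struct.new(:data) do
  def lex
    [[:T_TEXT, data, 1]]
  end
end

inputs = ['<a>one</a>', '<b>two</b>']

# Each thread builds its own instance, so no lexer state is shared.
threads = inputs.map do |input|
  Thread.new { stand_in_lexer.new(input).lex }
end

# Thread#value joins the thread and returns the result of its block.
results = threads.map(&:value)
```

Creating the instance inside the thread block, rather than outside it, is what keeps the internal state private to that thread.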

Strict Mode

By default the lexer is rather permissive regarding the input. For example, missing closing tags are inserted by default. To disable this behaviour the lexer can be run in "strict mode" by setting :strict to true:

lexer = Oga::XML::Lexer.new('...', :strict => true)

Strict mode only applies to XML documents.

Constant Summary

HTML_SCRIPT =

These are all constant/frozen to remove the need for String allocations every time they are referenced in the lexer.

'script'.freeze
HTML_STYLE =
'style'.freeze
HTML_TABLE_ALLOWED =

Elements that are allowed directly in a <table> element.

Whitelist.new(
  %w{thead tbody tfoot tr caption colgroup col}
)
HTML_SCRIPT_ELEMENTS =
Whitelist.new(%w{script template})
HTML_TABLE_ROW_ELEMENTS =
Whitelist.new(%w{tr}) + HTML_SCRIPT_ELEMENTS
HTML_CLOSE_SELF =

Elements that should be closed automatically before a new opening tag is processed.

{
  'head' => Blacklist.new(%w{head body}),
  'body' => Blacklist.new(%w{head body}),
  'li'   => Blacklist.new(%w{li}),
  'dt'   => Blacklist.new(%w{dt dd}),
  'dd'   => Blacklist.new(%w{dt dd}),
  'p'    => Blacklist.new(%w{
    address article aside blockquote details div dl fieldset figcaption
    figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr main menu nav
    ol p pre section table ul
  }),
  'rb'       => Blacklist.new(%w{rb rt rtc rp}),
  'rt'       => Blacklist.new(%w{rb rt rtc rp}),
  'rtc'      => Blacklist.new(%w{rb rtc}),
  'rp'       => Blacklist.new(%w{rb rt rtc rp}),
  'optgroup' => Blacklist.new(%w{optgroup}),
  'option'   => Blacklist.new(%w{optgroup option}),
  'colgroup' => Whitelist.new(%w{col template}),
  'caption'  => HTML_TABLE_ALLOWED.to_blacklist,
  'table'    => HTML_TABLE_ALLOWED + HTML_SCRIPT_ELEMENTS,
  'thead'    => HTML_TABLE_ROW_ELEMENTS,
  'tbody'    => HTML_TABLE_ROW_ELEMENTS,
  'tfoot'    => HTML_TABLE_ROW_ELEMENTS,
  'tr'       => Whitelist.new(%w{td th}) + HTML_SCRIPT_ELEMENTS,
  'td'       => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED,
  'th'       => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED
}
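One plausible reading of this table, sketched with stand-ins: each entry answers whether an incoming tag may appear inside the current element, and the current element is closed first when the answer is no. The `whitelist` and `blacklist` lambdas and the `close_before` rule below are assumptions made for illustration, not the library's actual classes or lookup code.

```ruby
# Hypothetical stand-ins for the Whitelist and Blacklist classes: each call
# returns a predicate that answers "is this element name allowed?".
whitelist = ->(names) { ->(name) { names.include?(name) } }
blacklist = ->(names) { ->(name) { !names.include?(name) } }

# Two entries copied from the table above, rebuilt with the stand-ins.
close_self = {
  'li' => blacklist.call(%w{li}),
  'tr' => whitelist.call(%w{td th})
}

# Assumed rule: the current element is closed first when the incoming tag
# is not allowed inside it. Elements without an entry are never auto-closed.
close_before = lambda do |current, incoming|
  allowed = close_self[current]
  !allowed.nil? && !allowed.call(incoming)
end

li_then_li   = close_before.call('li', 'li')    # a sibling <li> closes the open <li>
li_then_span = close_before.call('li', 'span')  # <span> nests inside <li>
tr_then_td   = close_before.call('tr', 'td')    # <td> nests inside <tr>
tr_then_div  = close_before.call('tr', 'div')   # <div> closes the open <tr>
```

Under this reading a blacklist entry names the few tags that force a close, while a whitelist entry names the few tags that may nest.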
LITERAL_HTML_ELEMENTS =

Names of HTML tags of which the content should be lexed as-is.

Whitelist.new([HTML_SCRIPT, HTML_STYLE])

Instance Method Summary

Constructor Details

#initialize(data, options = {}) ⇒ Lexer

Returns a new instance of Lexer.

Parameters:

  • data (String|IO)

    The data to lex. This can either be a String or an IO instance.

  • options (Hash) (defaults to: {})

Options Hash (options):

  • :html (TrueClass|FalseClass)

    When set to true the lexer will treat the input as HTML instead of XML. This makes it possible to lex HTML void elements such as <link href="">.

  • :strict (TrueClass|FalseClass)

    Enables/disables strict parsing of XML documents, disabled by default.



# File 'lib/oga/xml/lexer.rb', line 117

def initialize(data, options = {})
  @data   = data
  @html   = options[:html]
  @strict = options[:strict] || false

  reset
end

Instance Method Details

#advance {|type, value, line| ... } ⇒ Object

Advances through the input and generates the corresponding tokens. Each token is yielded to the supplied block.

Each token is an Array in the following format:

[TYPE, VALUE, LINE]

The type is a Symbol, the value is either nil or a String, and the line is the current line number.

This method stores the supplied block in @block and resets it after the lexer loop has finished.

This method does not reset the internal state of the lexer.

Yield Parameters:

  • type (Symbol)
  • value (String)
  • line (Fixnum)


# File 'lib/oga/xml/lexer.rb', line 200

def advance(&block)
  @block = block

  read_data do |chunk|
    advance_native(chunk)
  end

  # Add any missing closing tags
  if !strict? and !@elements.empty?
    @elements.length.times { on_element_end }
  end
ensure
  @block = nil
end
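The closing-tag step at the end of #advance can be sketched in isolation. This is a simplified model, not the library's code: `open_elements` stands in for `@elements`, and the `:T_ELEM_END` token name and the fixed line number are assumptions made for illustration.

```ruby
open_elements = %w{root item} # names of elements opened but not yet closed
tokens        = []
strict        = false

# Mirrors the tail of #advance: outside strict mode, emit one closing token
# per element that is still open. The count is taken once, before popping.
unless strict
  open_elements.length.times do
    tokens << [:T_ELEM_END, nil, 1]
    open_elements.pop
  end
end
```

With `strict` set to true the loop is skipped, so the unclosed elements stay on the stack and no closing tokens are produced.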

#html? ⇒ TrueClass|FalseClass

Returns:

  • (TrueClass|FalseClass)


# File 'lib/oga/xml/lexer.rb', line 218

def html?
  @html == true
end

#html_script? ⇒ TrueClass|FalseClass

Returns:

  • (TrueClass|FalseClass)


# File 'lib/oga/xml/lexer.rb', line 232

def html_script?
  html? && current_element == HTML_SCRIPT
end

#html_style? ⇒ TrueClass|FalseClass

Returns:

  • (TrueClass|FalseClass)


# File 'lib/oga/xml/lexer.rb', line 239

def html_style?
  html? && current_element == HTML_STYLE
end

#lex ⇒ Array

Gathers all the tokens for the input and returns them as an Array.

This method resets the internal state of the lexer after consuming the input.

Returns:

  • (Array)



# File 'lib/oga/xml/lexer.rb', line 169

def lex
  tokens = []

  advance do |type, value, line|
    tokens << [type, value, line]
  end

  reset

  tokens
end

#read_data {|chunk| ... } ⇒ String

Yields the data to lex to the supplied block.

Yield Parameters:

  • (String)

Returns:

  • (String)


# File 'lib/oga/xml/lexer.rb', line 145

def read_data
  if @data.is_a?(String)
    yield @data

  # IO, StringIO, etc
  # THINK: read(N) would be nice, but currently this screws up the C code
  elsif @data.respond_to?(:each_line)
    @data.each_line { |line| yield line }

  # Enumerator, Array, etc
  elsif @data.respond_to?(:each)
    @data.each { |chunk| yield chunk }
  end
end
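The dispatch above can be tried outside the lexer. In this sketch `each_chunk` and `chunks_for` are hypothetical helpers that copy the same three branches, to show how a String, an IO-like object and an Array are each split into chunks.

```ruby
require 'stringio'

# Same branching as #read_data, taking the data as an argument.
def each_chunk(data)
  if data.is_a?(String)
    yield data
  elsif data.respond_to?(:each_line)
    data.each_line { |line| yield line }
  elsif data.respond_to?(:each)
    data.each { |chunk| yield chunk }
  end
end

# Collects every yielded chunk so the splitting can be inspected.
def chunks_for(data)
  chunks = []
  each_chunk(data) { |chunk| chunks << chunk }
  chunks
end

string_chunks = chunks_for("<a>\n</a>\n")               # one chunk: the whole String
io_chunks     = chunks_for(StringIO.new("<a>\n</a>\n")) # one chunk per line
array_chunks  = chunks_for(['<a>', '</a>'])             # one chunk per element
```

The String check comes first because String also responds to each_line; without it, a String would be split into lines like an IO.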

#reset ⇒ Object

Resets the internal state of the lexer. You typically don't need to call this method yourself, as it is called by #lex after the input has been lexed.



# File 'lib/oga/xml/lexer.rb', line 130

def reset
  @line     = 1
  @elements = []

  @data.rewind if @data.respond_to?(:rewind)

  reset_native
end
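The rewind call matters for IO input: once an IO has been read to the end, reading it again yields nothing until it is rewound. A small sketch using only StringIO from the standard library, independent of the lexer:

```ruby
require 'stringio'

io = StringIO.new("<a>\n</a>\n")

first_pass = io.each_line.to_a # consumes the IO up to the end
exhausted  = io.each_line.to_a # nothing left to read

# Same guard as in #reset: only rewind objects that support it.
io.rewind if io.respond_to?(:rewind)

second_pass = io.each_line.to_a # the full input is available again
```

A plain String has no rewind method, which is why the call in #reset is guarded with respond_to?.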

#strict? ⇒ TrueClass|FalseClass

Returns:

  • (TrueClass|FalseClass)


# File 'lib/oga/xml/lexer.rb', line 225

def strict?
  @strict
end