Class: Oga::XML::Lexer

Inherits:
Object
Defined in:
lib/oga/xml/lexer.rb

Overview

Low-level lexer that supports both XML and HTML (using an extra option). To lex HTML input, set the :html option to true when creating an instance of the lexer:

lexer = Oga::XML::Lexer.new('<link href="">', :html => true)

This lexer can process both String and IO instances. IO instances are processed line by line, which can greatly reduce memory usage at the cost of a slightly slower runtime.

Thread Safety

Since this class keeps track of internal state, you cannot share a single instance between multiple threads at the same time. For example, the following will not work reliably:

# Don't do this!
lexer   = Oga::XML::Lexer.new('....')
threads = []

2.times do
  threads << Thread.new do
    lexer.advance do |*args|
      p args
    end
  end
end

threads.each(&:join)

However, it is perfectly safe to use a different instance per thread. This lexer uses no global state.
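The safe pattern can be sketched as follows. `CountingLexer` is a hypothetical stand-in that keeps per-instance state; it is not part of Oga. With the real library, each thread body would presumably build its own `Oga::XML::Lexer` in the same way.

```ruby
# Hypothetical stand-in for a stateful lexer; not part of Oga.
class CountingLexer
  def initialize(data)
    @data = data
    @line = 1 # per-instance state, similar to the lexer's line counter
  end

  # Yields each line of the input together with its line number.
  def advance
    @data.each_line do |text|
      yield text.chomp, @line
      @line += 1
    end
  end
end

inputs = ["a\nb\n", "c\nd\n"]

# One lexer per thread, so no state is shared between threads.
threads = inputs.map do |input|
  Thread.new do
    tokens = []
    CountingLexer.new(input).advance { |text, line| tokens << [text, line] }
    tokens
  end
end

# Thread#value joins the thread and returns the result of its block.
results = threads.map(&:value)
```

Because every thread owns its instance, the line counters never interfere with each other.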

Constructor Details

#initialize(data, options = {}) ⇒ Lexer

Returns a new instance of Lexer

Parameters:

  • data (String|IO)

    The data to lex. This can either be a String or an IO instance.

  • options (Hash) (defaults to: {})

Options Hash (options):

  • :html (Boolean)

    When set to true the lexer will treat the input as HTML instead of SGML/XML. This makes it possible to lex HTML void elements such as <link href="">.


# File 'lib/oga/xml/lexer.rb', line 53

def initialize(data, options = {})
  @data = data
  @html = options[:html]

  reset
end

Instance Attribute Details

#html ⇒ TrueClass|FalseClass (readonly)

Returns:

  • (TrueClass|FalseClass)

# File 'lib/oga/xml/lexer.rb', line 40

class Lexer
  attr_reader :html

  ##
  # @param [String|IO] data The data to lex. This can either be a String or
  #  an IO instance.
  #
  # @param [Hash] options
  #
  # @option options [Symbol] :html When set to `true` the lexer will treat
  #  the input as HTML instead of SGML/XML. This makes it possible to lex
  #  HTML void elements such as `<link href="">`.
  #
  def initialize(data, options = {})
    @data = data
    @html = options[:html]

    reset
  end

  ##
  # Resets the internal state of the lexer. Typically you don't need to
  # call this method yourself as it's called by #lex after lexing a given
  # String.
  #
  def reset
    @line     = 1
    @elements = []

    @data.rewind if @data.respond_to?(:rewind)

    reset_native
  end

  ##
  # Yields the data to lex to the supplied block.
  #
  # @return [String]
  # @yieldparam [String]
  #
  def read_data
    if @data.is_a?(String)
      yield @data

    # IO, StringIO, etc
    # THINK: read(N) would be nice, but currently this screws up the C code
    elsif @data.respond_to?(:each_line)
      @data.each_line { |line| yield line }

    # Enumerator, Array, etc
    elsif @data.respond_to?(:each)
      @data.each { |chunk| yield chunk }
    end
  end

  ##
  # Gathers all the tokens for the input and returns them as an Array.
  #
  # This method resets the internal state of the lexer after consuming the
  # input.
  #
  # @see #advance
  # @return [Array]
  #
  def lex
    tokens = []

    advance do |type, value, line|
      tokens << [type, value, line]
    end

    reset

    return tokens
  end

  ##
  # Advances through the input and generates the corresponding tokens. Each
  # token is yielded to the supplied block.
  #
  # Each token is yielded as three block arguments, in the following order:
  #
  #     TYPE, VALUE, LINE
  #
  # The type is a Symbol, the value is either nil or a String, and the line
  # is the lexer's current line number.
  #
  # This method stores the supplied block in `@block` and resets it after
  # the lexer loop has finished.
  #
  # This method does *not* reset the internal state of the lexer.
  #
  # @yieldparam [Symbol] type
  # @yieldparam [String] value
  # @yieldparam [Fixnum] line
  #
  def advance(&block)
    @block = block

    read_data do |chunk|
      advance_native(chunk)
    end
  ensure
    @block = nil
  end

  ##
  # @return [TrueClass|FalseClass]
  #
  def html?
    return !!html
  end

  private

  ##
  # @param [Fixnum] amount The amount of lines to advance.
  #
  def advance_line(amount = 1)
    @line += amount
  end

  ##
  # Calls the supplied block with the information of the current token.
  #
  # @param [Symbol] type The token type.
  # @param [String] value The token value.
  #
  # @yieldparam [Symbol] type
  # @yieldparam [String] value
  # @yieldparam [Fixnum] line
  #
  def add_token(type, value = nil)
    @block.call(type, value, @line)
  end

  ##
  # Returns the name of the element we're currently in.
  #
  # @return [String]
  #
  def current_element
    return @elements.last
  end

  ##
  # Called when processing a single quote.
  #
  def on_string_squote
    add_token(:T_STRING_SQUOTE)
  end

  ##
  # Called when processing a double quote.
  #
  def on_string_dquote
    add_token(:T_STRING_DQUOTE)
  end

  ##
  # Called when processing the body of a string.
  #
  # @param [String] value The data between the quotes.
  #
  def on_string_body(value)
    add_token(:T_STRING_BODY, Entities.decode(value))
  end

  ##
  # Called when a doctype starts.
  #
  def on_doctype_start
    add_token(:T_DOCTYPE_START)
  end

  ##
  # Called on the identifier specifying the type of the doctype.
  #
  # @param [String] value
  #
  def on_doctype_type(value)
    add_token(:T_DOCTYPE_TYPE, value)
  end

  ##
  # Called on the identifier specifying the name of the doctype.
  #
  # @param [String] value
  #
  def on_doctype_name(value)
    add_token(:T_DOCTYPE_NAME, value)
  end

  ##
  # Called on the end of a doctype.
  #
  def on_doctype_end
    add_token(:T_DOCTYPE_END)
  end

  ##
  # Called on an inline doctype block.
  #
  # @param [String] value
  #
  def on_doctype_inline(value)
    add_token(:T_DOCTYPE_INLINE, value)
  end

  ##
  # Called on a CDATA tag.
  #
  def on_cdata(value)
    add_token(:T_CDATA, value)
  end

  ##
  # Called on a comment.
  #
  # @param [String] value
  #
  def on_comment(value)
    add_token(:T_COMMENT, value)
  end

  ##
  # Called on the start of an XML declaration tag.
  #
  def on_xml_decl_start
    add_token(:T_XML_DECL_START)
  end

  ##
  # Called on the end of an XML declaration tag.
  #
  def on_xml_decl_end
    add_token(:T_XML_DECL_END)
  end

  ##
  # Called on the start of an element.
  #
  def on_element_start
    add_token(:T_ELEM_START)
  end

  ##
  # Called on the start of a processing instruction.
  #
  def on_proc_ins_start
    add_token(:T_PROC_INS_START)
  end

  ##
  # Called on a processing instruction name.
  #
  # @param [String] value
  #
  def on_proc_ins_name(value)
    add_token(:T_PROC_INS_NAME, value)
  end

  ##
  # Called on the end of a processing instruction.
  #
  def on_proc_ins_end
    add_token(:T_PROC_INS_END)
  end

  ##
  # Called on the name of an element.
  #
  # @param [String] name The name of the element, including namespace.
  #
  def on_element_name(name)
    @elements << name if html?

    add_token(:T_ELEM_NAME, name)
  end

  ##
  # Called on the element namespace.
  #
  # @param [String] namespace
  #
  def on_element_ns(namespace)
    add_token(:T_ELEM_NS, namespace)
  end

  ##
  # Called on the closing `>` of the open tag of an element.
  #
  def on_element_open_end
    return unless html?

    # Only downcase the name if we can't find an all lower/upper version of
    # the element name. This can save us a *lot* of String allocations.
    if HTML_VOID_ELEMENTS.include?(current_element) \
    or HTML_VOID_ELEMENTS.include?(current_element.downcase)
      add_token(:T_ELEM_END)
      @elements.pop
    end
  end

  ##
  # Called on the closing tag of an element.
  #
  def on_element_end
    add_token(:T_ELEM_END)

    @elements.pop if html?
  end

  ##
  # Called on regular text values.
  #
  # @param [String] value
  #
  def on_text(value)
    return if value.empty?

    add_token(:T_TEXT, Entities.decode(value))
  end

  ##
  # Called on attribute namespaces.
  #
  # @param [String] value
  #
  def on_attribute_ns(value)
    add_token(:T_ATTR_NS, value)
  end

  ##
  # Called on tag attributes.
  #
  # @param [String] value
  #
  def on_attribute(value)
    add_token(:T_ATTR, value)
  end
end

Instance Method Details

#add_token(type, value = nil) {|type, value, line| ... } ⇒ Object (private)

Calls the supplied block with the information of the current token.

Parameters:

  • type (Symbol)

    The token type.

  • value (String) (defaults to: nil)

    The token value.

Yield Parameters:

  • type (Symbol)
  • value (String)
  • line (Fixnum)

# File 'lib/oga/xml/lexer.rb', line 171

def add_token(type, value = nil)
  @block.call(type, value, @line)
end

#advance {|type, value, line| ... } ⇒ Object

Advances through the input and generates the corresponding tokens. Each token is yielded to the supplied block.

Each token is yielded as three block arguments, in the following order:

TYPE, VALUE, LINE

The type is a Symbol, the value is either nil or a String, and the line is the lexer's current line number.

This method stores the supplied block in @block and resets it after the lexer loop has finished.

This method does not reset the internal state of the lexer.

Yield Parameters:

  • type (Symbol)
  • value (String)
  • line (Fixnum)

# File 'lib/oga/xml/lexer.rb', line 135

def advance(&block)
  @block = block

  read_data do |chunk|
    advance_native(chunk)
  end
ensure
  @block = nil
end
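The store-and-clear handling of the block can be illustrated with a small sketch. `MiniLexer` and its `block_set?` helper are hypothetical and only mimic the shape of #advance and #add_token; the native `advance_native` call is replaced by a plain line loop.

```ruby
# Hypothetical miniature of the pattern used by #advance; not Oga code.
class MiniLexer
  def initialize(data)
    @data = data
    @line = 1
  end

  # Stores the block, emits tokens through it, and always clears it again,
  # even when the block raises.
  def advance(&block)
    @block = block

    @data.each_line do |text|
      add_token(:T_TEXT, text.chomp)
      @line += 1
    end
  ensure
    @block = nil
  end

  # Illustration-only helper to inspect whether a block is still stored.
  def block_set?
    !@block.nil?
  end

  private

  # Mirrors add_token: every token is yielded as type, value, line.
  def add_token(type, value = nil)
    @block.call(type, value, @line)
  end
end

lexer  = MiniLexer.new("foo\nbar\n")
tokens = []

lexer.advance { |type, value, line| tokens << [type, value, line] }
```

The `ensure` clause is what guarantees that no stale block survives a failed run.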

#advance_line(amount = 1) ⇒ Object (private)

Parameters:

  • amount (Fixnum) (defaults to: 1)

    The amount of lines to advance.


# File 'lib/oga/xml/lexer.rb', line 157

def advance_line(amount = 1)
  @line += amount
end

#current_element ⇒ String (private)

Returns the name of the element we're currently in.

Returns:

  • (String)

# File 'lib/oga/xml/lexer.rb', line 180

def current_element
  return @elements.last
end

#html? ⇒ TrueClass|FalseClass

Returns:

  • (TrueClass|FalseClass)

# File 'lib/oga/xml/lexer.rb', line 148

def html?
  return !!html
end
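The double negation coerces whatever was passed as the :html option into a strict boolean. A minimal sketch of that coercion, using a hypothetical `html_flag` helper rather than the lexer itself:

```ruby
# Hypothetical helper that mirrors the `!!html` coercion in #html?.
def html_flag(options = {})
  !!options[:html]
end

html_flag                  # => false (a missing key gives nil)
html_flag(:html => true)   # => true
html_flag(:html => 'yes')  # => true (any truthy value counts)
```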

#lex ⇒ Array

Gathers all the tokens for the input and returns them as an Array.

This method resets the internal state of the lexer after consuming the input.

Returns:

  • (Array)

See Also:

  • #advance

# File 'lib/oga/xml/lexer.rb', line 104

def lex
  tokens = []

  advance do |type, value, line|
    tokens << [type, value, line]
  end

  reset

  return tokens
end
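Since #lex collects every token as a `[type, value, line]` triple, the result can be processed with ordinary Array methods. The sketch below works on a hand-written sample in that shape; the sample is an assumption for illustration and not captured lexer output.

```ruby
# Hand-written sample in the [type, value, line] shape built by #lex.
# It is an assumption for illustration, not captured lexer output.
tokens = [
  [:T_ELEM_START, nil, 1],
  [:T_ELEM_NAME, 'p', 1],
  [:T_TEXT, 'hello', 1],
  [:T_ELEM_END, nil, 1]
]

# Collect the values of all text tokens.
text = tokens.select { |type, _value, _line| type == :T_TEXT }
             .map { |_type, value, _line| value }
             .join

# Count how often each token type occurs.
counts = Hash.new(0)
tokens.each { |type, _value, _line| counts[type] += 1 }
```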

#on_attribute(value) ⇒ Object (private)

Called on tag attributes.

Parameters:

  • value (String)

# File 'lib/oga/xml/lexer.rb', line 377

def on_attribute(value)
  add_token(:T_ATTR, value)
end

#on_attribute_ns(value) ⇒ Object (private)

Called on attribute namespaces.

Parameters:

  • value (String)

# File 'lib/oga/xml/lexer.rb', line 368

def on_attribute_ns(value)
  add_token(:T_ATTR_NS, value)
end

#on_cdata(value) ⇒ Object (private)

Called on a CDATA tag.


# File 'lib/oga/xml/lexer.rb', line 251

def on_cdata(value)
  add_token(:T_CDATA, value)
end

#on_comment(value) ⇒ Object (private)

Called on a comment.

Parameters:

  • value (String)

# File 'lib/oga/xml/lexer.rb', line 260

def on_comment(value)
  add_token(:T_COMMENT, value)
end

#on_doctype_end ⇒ Object (private)

Called on the end of a doctype.


# File 'lib/oga/xml/lexer.rb', line 235

def on_doctype_end
  add_token(:T_DOCTYPE_END)
end

#on_doctype_inline(value) ⇒ Object (private)

Called on an inline doctype block.

Parameters:

  • value (String)

# File 'lib/oga/xml/lexer.rb', line 244

def on_doctype_inline(value)
  add_token(:T_DOCTYPE_INLINE, value)
end

#on_doctype_name(value) ⇒ Object (private)

Called on the identifier specifying the name of the doctype.

Parameters:

  • value (String)

# File 'lib/oga/xml/lexer.rb', line 228

def on_doctype_name(value)
  add_token(:T_DOCTYPE_NAME, value)
end

#on_doctype_start ⇒ Object (private)

Called when a doctype starts.


# File 'lib/oga/xml/lexer.rb', line 210

def on_doctype_start
  add_token(:T_DOCTYPE_START)
end

#on_doctype_type(value) ⇒ Object (private)

Called on the identifier specifying the type of the doctype.

Parameters:

  • value (String)

# File 'lib/oga/xml/lexer.rb', line 219

def on_doctype_type(value)
  add_token(:T_DOCTYPE_TYPE, value)
end

#on_element_end ⇒ Object (private)

Called on the closing tag of an element.


# File 'lib/oga/xml/lexer.rb', line 346

def on_element_end
  add_token(:T_ELEM_END)

  @elements.pop if html?
end

#on_element_name(name) ⇒ Object (private)

Called on the name of an element.

Parameters:

  • name (String)

    The name of the element, including namespace.


# File 'lib/oga/xml/lexer.rb', line 313

def on_element_name(name)
  @elements << name if html?

  add_token(:T_ELEM_NAME, name)
end

#on_element_ns(namespace) ⇒ Object (private)

Called on the element namespace.

Parameters:

  • namespace (String)

# File 'lib/oga/xml/lexer.rb', line 324

def on_element_ns(namespace)
  add_token(:T_ELEM_NS, namespace)
end

#on_element_open_end ⇒ Object (private)

Called on the closing > of the open tag of an element.


# File 'lib/oga/xml/lexer.rb', line 331

def on_element_open_end
  return unless html?

  # Only downcase the name if we can't find an all lower/upper version of
  # the element name. This can save us a *lot* of String allocations.
  if HTML_VOID_ELEMENTS.include?(current_element) \
  or HTML_VOID_ELEMENTS.include?(current_element.downcase)
    add_token(:T_ELEM_END)
    @elements.pop
  end
end
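The lookup order above (exact name first, downcased name only as a fallback) together with the element stack kept in HTML mode can be sketched in isolation. `VOID_ELEMENTS` below is an assumed subset of void element names; the real `HTML_VOID_ELEMENTS` constant is defined elsewhere in Oga and may differ. `void_element?` is a hypothetical helper.

```ruby
require 'set'

# Assumed subset of void element names, in lower and upper case. The real
# HTML_VOID_ELEMENTS constant is defined elsewhere in Oga and may differ.
names = %w[br hr img input link meta]
VOID_ELEMENTS = Set.new(names + names.map(&:upcase))

# Checks the name as given first and only downcases when that lookup fails,
# which avoids allocating a new String for all-lower or all-upper names.
def void_element?(name)
  VOID_ELEMENTS.include?(name) || VOID_ELEMENTS.include?(name.downcase)
end

# Element stack as kept in HTML mode: push on name, pop on close.
elements = []
closed   = []

%w[div BR Img].each do |name|
  elements << name

  # A void element is closed as soon as its open tag ends.
  closed << elements.pop if void_element?(elements.last)
end
```

After the loop only `div` remains open, since `BR` and `Img` were closed immediately.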

#on_element_start ⇒ Object (private)

Called on the start of an element.


# File 'lib/oga/xml/lexer.rb', line 281

def on_element_start
  add_token(:T_ELEM_START)
end

#on_proc_ins_end ⇒ Object (private)

Called on the end of a processing instruction.


# File 'lib/oga/xml/lexer.rb', line 304

def on_proc_ins_end
  add_token(:T_PROC_INS_END)
end

#on_proc_ins_name(value) ⇒ Object (private)

Called on a processing instruction name.

Parameters:

  • value (String)

# File 'lib/oga/xml/lexer.rb', line 297

def on_proc_ins_name(value)
  add_token(:T_PROC_INS_NAME, value)
end

#on_proc_ins_start ⇒ Object (private)

Called on the start of a processing instruction.


# File 'lib/oga/xml/lexer.rb', line 288

def on_proc_ins_start
  add_token(:T_PROC_INS_START)
end

#on_string_body(value) ⇒ Object (private)

Called when processing the body of a string.

Parameters:

  • value (String)

    The data between the quotes.


# File 'lib/oga/xml/lexer.rb', line 203

def on_string_body(value)
  add_token(:T_STRING_BODY, Entities.decode(value))
end

#on_string_dquote ⇒ Object (private)

Called when processing a double quote.


# File 'lib/oga/xml/lexer.rb', line 194

def on_string_dquote
  add_token(:T_STRING_DQUOTE)
end

#on_string_squote ⇒ Object (private)

Called when processing a single quote.


# File 'lib/oga/xml/lexer.rb', line 187

def on_string_squote
  add_token(:T_STRING_SQUOTE)
end

#on_text(value) ⇒ Object (private)

Called on regular text values.

Parameters:

  • value (String)

# File 'lib/oga/xml/lexer.rb', line 357

def on_text(value)
  return if value.empty?

  add_token(:T_TEXT, Entities.decode(value))
end

#on_xml_decl_end ⇒ Object (private)

Called on the end of an XML declaration tag.


# File 'lib/oga/xml/lexer.rb', line 274

def on_xml_decl_end
  add_token(:T_XML_DECL_END)
end

#on_xml_decl_start ⇒ Object (private)

Called on the start of an XML declaration tag.


# File 'lib/oga/xml/lexer.rb', line 267

def on_xml_decl_start
  add_token(:T_XML_DECL_START)
end

#read_data {|| ... } ⇒ String

Yields the data to lex to the supplied block.

Yield Parameters:

  • (String)

Returns:

  • (String)

# File 'lib/oga/xml/lexer.rb', line 80

def read_data
  if @data.is_a?(String)
    yield @data

  # IO, StringIO, etc
  # THINK: read(N) would be nice, but currently this screws up the C code
  elsif @data.respond_to?(:each_line)
    @data.each_line { |line| yield line }

  # Enumerator, Array, etc
  elsif @data.respond_to?(:each)
    @data.each { |chunk| yield chunk }
  end
end
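To see what the three branches yield for different inputs, the dispatch can be reproduced outside the class. `each_chunk` and `chunks_of` are hypothetical stand-alone helpers that copy the branching of #read_data; they are not part of Oga.

```ruby
require 'stringio'

# Hypothetical stand-alone version of the dispatch in #read_data.
def each_chunk(data)
  if data.is_a?(String)
    yield data
  elsif data.respond_to?(:each_line)
    data.each_line { |line| yield line }
  elsif data.respond_to?(:each)
    data.each { |chunk| yield chunk }
  end
end

# Collects everything each_chunk yields into an Array.
def chunks_of(data)
  chunks = []
  each_chunk(data) { |chunk| chunks << chunk }
  chunks
end

xml = "<a>\n<b/>\n</a>"

from_string = chunks_of(xml)                  # one chunk: the whole String
from_io     = chunks_of(StringIO.new(xml))    # one chunk per line
from_array  = chunks_of(['<a>', '<b/></a>'])  # one chunk per element
```

A String is handed over in one piece, which is why only IO-like inputs benefit from the lower memory usage of line-by-line processing.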

#reset ⇒ Object

Resets the internal state of the lexer. Typically you don't need to call this method yourself, as it is called by #lex after the input has been lexed.


# File 'lib/oga/xml/lexer.rb', line 65

def reset
  @line     = 1
  @elements = []

  @data.rewind if @data.respond_to?(:rewind)

  reset_native
end
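The `respond_to?(:rewind)` guard matters because a consumed IO has to be rewound before it can be read again, while a plain String has no such method. A small sketch using only StringIO from the standard library (the variable names are illustrative):

```ruby
require 'stringio'

io = StringIO.new("<a/>\n<b/>\n")

first_pass = io.each_line.to_a  # consumes the stream
exhausted  = io.each_line.to_a  # nothing is left to read

# Same guard as in #reset: only rewind inputs that support it.
io.rewind if io.respond_to?(:rewind)
second_pass = io.each_line.to_a

# A plain String does not respond to rewind, so the guard skips it.
string_rewinds = 'plain string'.respond_to?(:rewind)
```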