Class: HTML::Tokenizer

Inherits:
Object
Defined in:
lib/rails/deprecated_sanitizer/html-scanner/html/tokenizer.rb

Overview

A simple HTML tokenizer. It breaks a stream of text into tokens, where each token is a string representing either “text” or an HTML element.

This currently assumes valid XHTML, which means the input must not contain stray (unescaped) < or > characters.

Usage:

tokenizer = HTML::Tokenizer.new(text)
while token = tokenizer.next
  p token
end
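
The loop above relies only on #next returning nil once the input is exhausted. As a minimal sketch, the same pattern can collect every token into an array. The names drain_tokens and ArrayBackedTokenizer are hypothetical and not part of the library; the array-backed stand-in is there only so the snippet runs without the gem installed.

```ruby
# Hypothetical stand-in: hands out pre-made tokens, then nil,
# mirroring the #next contract described above.
class ArrayBackedTokenizer
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def next
    @tokens.shift
  end
end

# Hypothetical helper: calls #next until it returns nil and collects the tokens.
def drain_tokens(tokenizer)
  tokens = []
  while (token = tokenizer.next)
    tokens << token
  end
  tokens
end

tokens = drain_tokens(ArrayBackedTokenizer.new(["<p>", "Hello", "</p>"]))
```

Any object that responds to #next in the same way, including an HTML::Tokenizer instance, could presumably be passed to the helper.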

Constructor Details

#initialize(text) ⇒ Tokenizer

Create a new Tokenizer for the given text.



# File 'lib/rails/deprecated_sanitizer/html-scanner/html/tokenizer.rb', line 25

def initialize(text)
  text.encode!
  @scanner = StringScanner.new(text)
  @position = 0
  @line = 0
  @current_line = 1
end

Instance Attribute Details

#line ⇒ Object (readonly)

The line number on which the most recently returned token starts (0 before the first call to #next).



# File 'lib/rails/deprecated_sanitizer/html-scanner/html/tokenizer.rb', line 22

def line
  @line
end

#position ⇒ Object (readonly)

The byte position in the text at which the most recently returned token starts.



# File 'lib/rails/deprecated_sanitizer/html-scanner/html/tokenizer.rb', line 19

def position
  @position
end
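
The position is a byte offset because it is taken from StringScanner#pos, which counts bytes rather than characters. The short sketch below shows the difference on a string containing a multibyte character; the sample string is an assumption chosen for illustration, and StringScanner#charpos is used only for comparison.

```ruby
require "strscan"

# "\u00E9" (e with acute accent) occupies two bytes in UTF-8.
scanner = StringScanner.new("\u00E9<b>")
scanner.scan(/[^<]+/)          # consume the text before the tag

byte_offset = scanner.pos      # bytes consumed so far
char_offset = scanner.charpos  # characters consumed so far
```

Here byte_offset is 2 while char_offset is 1, so a position reported for a token after multibyte text should be used with byte-oriented methods such as String#byteslice rather than as a character index.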

Instance Method Details

#next ⇒ Object

Returns the next token in the sequence, or nil if there are no more tokens in the stream.



# File 'lib/rails/deprecated_sanitizer/html-scanner/html/tokenizer.rb', line 35

def next
  return nil if @scanner.eos?
  @position = @scanner.pos
  @line = @current_line
  if @scanner.check(/<\S/)
    update_current_line(scan_tag)
  else
    update_current_line(scan_text)
  end
end
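
The control flow of #next can be tried out with a simplified, self-contained sketch. SketchTokenizer is a hypothetical stand-in, not the library's code: scan_tag, scan_text and update_current_line are not shown in this section, so the regular expressions and the newline counting below are assumptions, and the sketch ignores comments, CDATA sections and quoted attribute values.

```ruby
require "strscan"

# Hypothetical, simplified stand-in for HTML::Tokenizer (not the library's code).
class SketchTokenizer
  attr_reader :position, :line

  def initialize(text)
    @scanner = StringScanner.new(text)
    @position = 0
    @line = 0
    @current_line = 1
  end

  def next
    return nil if @scanner.eos?
    # Record where this token starts before consuming it.
    @position = @scanner.pos
    @line = @current_line
    token =
      if @scanner.check(/<\S/)
        @scanner.scan(/<[^>]*>?/)  # assumption: a tag ends at the next ">"
      else
        @scanner.scan(/<?[^<]*/)   # assumption: text runs up to the next "<"
      end
    @current_line += token.count("\n")  # assumption: lines are counted by "\n"
    token
  end
end

tokenizer = SketchTokenizer.new("<p>Hello\n<b>world</b></p>")
seen = []
while (token = tokenizer.next)
  seen << [token, tokenizer.position, tokenizer.line]
end
```

In this sketch the text token "Hello\n" starts at byte 3 on line 1, and the following "<b>" token starts at byte 9 on line 2, which illustrates that line and position describe the start of the token just returned rather than the scanner's current location.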