Class: Arrow::HTMLTokenizer

Inherits:
Object
  • Object
Includes:
Enumerable
Defined in:
lib/arrow/htmltokenizer.rb

Overview

The Arrow::HTMLTokenizer class – a simple HTML parser that can be used to break HTML down into tokens.

Some of the code and design were stolen from the excellent HTMLTokenizer library by Ben Giddings.

VCS Id

$Id$

Authors

:include: LICENSE

Please see the file LICENSE in the top-level directory for licensing details.

Constant Summary

SVNRev =

SVN Revision

%q$Rev$
SVNId =

SVN Id

%q$Id$

Instance Attribute Summary

Instance Method Summary

Methods inherited from Object

deprecate_class_method, deprecate_method, inherited

Constructor Details

#initialize(source) ⇒ HTMLTokenizer

Create a new Arrow::HTMLTokenizer object that will tokenize the given HTML source.



# File 'lib/arrow/htmltokenizer.rb', line 41

def initialize( source )
	@source = source
	@scanner = StringScanner.new( source )
end
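
The constructor wraps the source in a StringScanner from Ruby's standard strscan library. As a rough illustration of the scanner primitives involved, here is a minimal sketch; `scan_fragment` is a hypothetical helper written for this example and is not part of Arrow, and `eos?` is used as the current name for the end-of-string check.

```ruby
require 'strscan'

# Walk a tiny fragment by hand with StringScanner.
# `scan_fragment` is a hypothetical helper, not part of Arrow.
def scan_fragment( html )
  scanner = StringScanner.new( html )
  steps = []
  steps << scanner.peek( 1 )          # look ahead one character, consuming nothing
  steps << scanner.scan_until( />/ )  # consume up to and including the next '>'
  steps << scanner.scan( /[^<]+/ )    # consume text up to (not including) the next '<'
  steps << scanner.scan_until( />/ )  # consume the closing tag
  steps << scanner.eos?               # true once the whole string is consumed
  steps
end

p scan_fragment( "<p>Hi</p>" )   # prints ["<", "<p>", "Hi", "</p>", true]
```

Both `scan` and `scan_until` return nil when nothing matches and leave the scan position unchanged, so a caller that loops on them should make sure each pass consumes something.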

Instance Attribute Details

#scanner ⇒ Object (readonly)

The StringScanner doing the tokenizing



# File 'lib/arrow/htmltokenizer.rb', line 55

def scanner
  @scanner
end

#source ⇒ Object (readonly)

The HTML source being tokenized



# File 'lib/arrow/htmltokenizer.rb', line 52

def source
  @source
end

Instance Method Details

#each ⇒ Object

Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.



# File 'lib/arrow/htmltokenizer.rb', line 60

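def each
	# Rewind so the tokenizer can be iterated more than once
	@scanner.reset

	# eos? is the current name for the obsolete StringScanner#empty?
	until @scanner.eos?
		if @scanner.peek(1) == '<'
			# Markup: consume everything up to and including the next '>'
			tag = @scanner.scan_until( />/ )

			# Classify the markup by its opening characters
			case tag
			when /^<!--/
				token = HTMLComment.new( tag )
			when /^<!/
				token = DocType.new( tag )
			when /^<\?/
				token = ProcessingInstruction.new( tag )
			else
				token = HTMLTag.new( tag )
			end
		else
			# Plain text: consume everything up to the next '<'
			text = @scanner.scan( /[^<]+/ )
			token = HTMLText.new( text )
		end

		yield( token )
	end
end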
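
Because the class includes Enumerable and defines #each, methods such as map and select work on a tokenizer. The following is a self-contained sketch that mirrors the loop above rather than the library's own code: `SketchToken` and `SketchTokenizer` are hypothetical stand-ins, the token kinds are plain symbols instead of Arrow's token classes, and the handling of an unterminated tag (swallowing the rest of the input) is an assumption made only so the sketch always terminates.

```ruby
require 'strscan'

# Stand-in token type; the real class yields Arrow's own token objects.
SketchToken = Struct.new( :kind, :text )

# Hypothetical, simplified tokenizer that mirrors the loop above.
class SketchTokenizer
  include Enumerable

  def initialize( source )
    @scanner = StringScanner.new( source )
  end

  def each
    return enum_for( :each ) unless block_given?
    @scanner.reset

    until @scanner.eos?
      if @scanner.peek( 1 ) == '<'
        # Assumption: an unterminated tag swallows the rest of the input.
        tag = @scanner.scan_until( />/ ) || @scanner.scan( /.+/m )
        kind = case tag
               when /\A<!--/ then :comment
               when /\A<!/   then :doctype
               when /\A<\?/  then :processing_instruction
               else               :tag
               end
        yield SketchToken.new( kind, tag )
      else
        yield SketchToken.new( :text, @scanner.scan( /[^<]+/ ) )
      end
    end
  end
end

tokens = SketchTokenizer.new( "<!DOCTYPE html><p>Hi<!-- note --></p>" )
p tokens.map( &:kind )   # prints [:doctype, :tag, :text, :comment, :tag]
```

Note that scanning to the first '>' means a comment containing a '>' would be cut short; the sketch keeps that behavior to stay close to the loop it illustrates.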