Class: HTMLTokenizer
- Inherits: Object
- Defined in: lib/openid/yadis/htmltokenizer.rb
Overview
A class to tokenize HTML.
Example:
page = "<HTML>
<HEAD>
<TITLE>This is the title</TITLE>
</HEAD>
<!-- Here comes the <a href=\"missing.link\">blah</a>
comment body
-->
<BODY>
<H1>This is the header</H1>
<P>
This is the paragraph, it contains
<a href=\"link.html\">links</a>,
<img src=\"blah.gif\" optional alt='images
are
really cool'>. Ok, here is some more text and
<A href=\"http://another.link.com/\" target=\"_blank\">another link</A>.
</P>
</body>
</HTML>
"
toke = HTMLTokenizer.new(page)
assert("<h1>" == toke.getTag("h1", "h2", "h3").to_s.downcase)
assert(HTMLTag.new("<a href=\"link.html\">") == toke.getTag("IMG", "A"))
assert("links" == toke.getTrimmedText)
assert(toke.getTag("IMG", "A").attr_hash['optional'])
assert("_blank" == toke.getTag("IMG", "A").attr_hash['target'])
Constant Summary
- @@version = 1.0
Instance Attribute Summary
- #page ⇒ Object (readonly)
  Returns the value of attribute page.
Class Method Summary
- .version ⇒ Object
  Get the version of the HTMLTokenizer library.
Instance Method Summary
- #getNextToken ⇒ Object
  Get the next token and advance past it; returns an instance of HTMLText, HTMLComment, or HTMLTag, or nil at the end of the page.
- #getTag(*sought_tags) ⇒ Object
  Get the next tag whose name is in the specified set of desired tags.
- #getText(until_tag = nil) ⇒ Object
  Get the text between the current position and the next tag or, if until_tag is given, the next occurrence of that tag.
- #getTrimmedText(until_tag = nil) ⇒ Object
  Like getText, but strips leading and trailing whitespace and collapses each run of whitespace into a single space.
- #initialize(content) ⇒ HTMLTokenizer (constructor)
  Create a new tokenizer for the given content, which is converted to a string with to_s.
- #peekNextToken ⇒ Object
  Look at the next token without consuming it.
- #reset ⇒ Object
  Reset the parser, setting the current position back to the start.
Constructor Details
#initialize(content) ⇒ HTMLTokenizer
Create a new tokenizer for the given content, which is converted to a string with to_s.
# File 'lib/openid/yadis/htmltokenizer.rb', line 57
def initialize(content)
  @page = content.to_s
  @cur_pos = 0
end
Instance Attribute Details
#page ⇒ Object (readonly)
Returns the value of attribute page.
# File 'lib/openid/yadis/htmltokenizer.rb', line 54
def page
  @page
end
Class Method Details
.version ⇒ Object
Get the version of the HTMLTokenizer library.
# File 'lib/openid/yadis/htmltokenizer.rb', line 50
def self.version
  @@version
end
Instance Method Details
#getNextToken ⇒ Object
Get the next token and advance the current position past it. Returns nil at the end of the page, otherwise an instance of one of:
- HTMLText
- HTMLComment
- HTMLTag
# File 'lib/openid/yadis/htmltokenizer.rb', line 101
def getNextToken
  token = peekNextToken
  if token
    # @page = @page[token.raw.length .. -1]
    # @page.slice!(0, token.raw.length)
    @cur_pos += token.raw.length
  end
  # p token
  # print token.raw
  token
end
#getTag(*sought_tags) ⇒ Object
Get a tag from the specified set of desired tags. For example:
  foo = toke.getTag("h1", "h2", "h3")
will return the next header tag encountered.
# File 'lib/openid/yadis/htmltokenizer.rb', line 117
def getTag(*sought_tags)
  sought_tags.collect! { |elm| elm.downcase }

  while (tag = getNextToken)
    if tag.is_a?(HTMLTag) and
       (0 == sought_tags.length or sought_tags.include?(tag.tag_name))
      break
    end
  end
  tag
end
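The matching rule used by getTag can be illustrated in isolation. The sketch below is not part of the library: first_matching is a hypothetical helper that works on plain tag-name strings, and it assumes (as the overview example suggests) that tag names are already lowercase when they are compared.

```ruby
# Hypothetical helper, not part of HTMLTokenizer: returns the first name in
# tag_names that is one of sought_tags, or simply the first name when no
# sought tags are given. Returns nil when nothing matches.
def first_matching(tag_names, *sought_tags)
  sought = sought_tags.map(&:downcase) # sought names are compared in lowercase
  tag_names.find { |name| sought.empty? || sought.include?(name) }
end

names = %w[html head title body h1 p a img]

header  = first_matching(names, "H1", "h2", "h3") # case of the arguments is ignored
any_tag = first_matching(names)                   # no arguments: any tag matches
missing = first_matching(names, "table")          # nothing matches
```

As in getTag, an empty list of sought tags accepts any tag, and an unsuccessful search yields nil.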
#getText(until_tag = nil) ⇒ Object
Get the text between the current position and the next tag. If until_tag is given, instead collect all text up to the next occurrence of that tag, appending a single space after each piece of text.
# File 'lib/openid/yadis/htmltokenizer.rb', line 131
def getText(until_tag = nil)
  if until_tag.nil?
    if "<" == @page[@cur_pos]
      # Next token is a tag, not text
      ""
    else
      # Next token is text
      getNextToken.text
    end
  else
    ret_str = ""

    while (tag = peekNextToken)
      break if tag.is_a?(HTMLTag) and tag.tag_name == until_tag

      ret_str << (tag.text + " ") if "" != tag.text
      getNextToken
    end

    ret_str
  end
end
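To make the until_tag branch easier to follow, here is a minimal standalone sketch of the same accumulation loop. Token and text_until are hypothetical stand-ins, not library names, and the sketch assumes that non-text tokens report an empty string as their text, which is what the `"" != tag.text` check in the listing implies.

```ruby
# Hypothetical stand-in for the library's token objects.
Token = Struct.new(:kind, :tag_name, :text)

# Mirrors the until_tag branch of getText: collect text until the named tag,
# appending one space after every non-empty piece of text.
def text_until(tokens, until_tag)
  ret_str = +""
  tokens.each do |token|
    break if token.kind == :tag && token.tag_name == until_tag
    ret_str << (token.text + " ") unless token.text.empty?
  end
  ret_str
end

tokens = [
  Token.new(:text, nil, "This is the paragraph, it contains\n"),
  Token.new(:tag,  "a",  ""),
  Token.new(:text, nil, "links"),
  Token.new(:tag,  "/a", ""),
  Token.new(:text, nil, ",\n"),
  Token.new(:tag,  "img", "")
]

collected = text_until(tokens, "img")
normalized = collected.strip.gsub(/\s+/m, " ")
```

The collected string keeps the original line breaks and ends with a trailing space, which is the kind of residue the whitespace normalization in getTrimmedText removes.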
#getTrimmedText(until_tag = nil) ⇒ Object
Like getText, but strips leading and trailing whitespace and collapses each run of whitespace (including newlines) into a single space.
# File 'lib/openid/yadis/htmltokenizer.rb', line 157
def getTrimmedText(until_tag = nil)
  getText(until_tag).strip.gsub(/\s+/m, " ")
end
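The normalization step can be tried on its own with plain Ruby strings. In this small sketch, squeeze_whitespace is a hypothetical helper name; the expression inside it is the one from the listing above.

```ruby
# Hypothetical helper wrapping the expression used by getTrimmedText.
def squeeze_whitespace(text)
  text.strip.gsub(/\s+/m, " ")
end

raw = "  images\nare\n   really cool \t\n"
trimmed = squeeze_whitespace(raw) # => "images are really cool"
```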
#peekNextToken ⇒ Object
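Look at the next token without consuming it; the current position is left unchanged. Returns nil at the end of the page and raises HTMLTokenizerError if a tag or comment is never closed.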
# File 'lib/openid/yadis/htmltokenizer.rb', line 68
def peekNextToken
  return if @cur_pos == @page.length

  if "<" == @page[@cur_pos]
    # Next token is a tag of some kind
    if "!--" == @page[(@cur_pos + 1), 3]
      # Token is a comment
      tag_end = @page.index("-->", (@cur_pos + 1))
      raise HTMLTokenizerError, "No end found to started comment:\n#{@page[@cur_pos, 80]}" if tag_end.nil?

      # p @page[@cur_pos .. (tag_end+2)]
      HTMLComment.new(@page[@cur_pos..(tag_end + 2)])
    else
      # Token is a html tag
      tag_end = @page.index(">", (@cur_pos + 1))
      raise HTMLTokenizerError, "No end found to started tag:\n#{@page[@cur_pos, 80]}" if tag_end.nil?

      # p @page[@cur_pos .. tag_end]
      HTMLTag.new(@page[@cur_pos..tag_end])
    end
  else
    # Next token is text
    text_end = @page.index("<", @cur_pos)
    text_end = text_end.nil? ? -1 : (text_end - 1)
    # p @page[@cur_pos .. text_end]
    HTMLText.new(@page[@cur_pos..text_end])
  end
end
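The relationship between peeking and consuming may be easier to see in a self-contained sketch. MiniTokenizer below is a hypothetical, simplified stand-in for HTMLTokenizer: it follows the same cursor logic as the listing above but returns raw strings instead of token objects and raises ArgumentError rather than HTMLTokenizerError.

```ruby
# Hypothetical stand-in for HTMLTokenizer; tokens are plain strings.
class MiniTokenizer
  def initialize(content)
    @page = content.to_s
    @cur_pos = 0
  end

  # Return the raw text of the next token without moving the cursor.
  def peek
    return nil if @cur_pos == @page.length

    if @page[@cur_pos] == "<"
      # A comment ends at "-->", any other tag at the first ">".
      closer = @page[@cur_pos + 1, 3] == "!--" ? "-->" : ">"
      tag_end = @page.index(closer, @cur_pos + 1)
      raise ArgumentError, "unterminated tag or comment" if tag_end.nil?

      @page[@cur_pos..(tag_end + closer.length - 1)]
    else
      # Text runs up to the next "<" or to the end of the page.
      text_end = @page.index("<", @cur_pos)
      text_end.nil? ? @page[@cur_pos..-1] : @page[@cur_pos...text_end]
    end
  end

  # Return the next token and advance the cursor past it.
  def next_token
    token = peek
    @cur_pos += token.length if token
    token
  end
end

toke = MiniTokenizer.new("<p>Hi <!-- a > b --><b>there</b>")
first_peek  = toke.peek
second_peek = toke.peek # peeking twice yields the same token

tokens = []
while (t = toke.next_token)
  tokens << t
end
```

Because the comment check runs before the generic tag check, the `>` inside the comment body does not end the token early.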
#reset ⇒ Object
Reset the parser, setting the current position back to the start.
# File 'lib/openid/yadis/htmltokenizer.rb', line 63
def reset
  @cur_pos = 0
end