Class: Ferret::Analysis::Token
- Inherits:
- Object
  - Object
    - Ferret::Analysis::Token
- Includes:
- Comparable
- Defined in:
- lib/ferret/analysis/token.rb
Overview
A Token is an occurrence of a term from the text of a field. It consists of a term’s text, the start and end offsets of the term in the text of the field, and a type string.
The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc.
The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type “eos”. The default token type is “word”.
- start_offset: the position of the first character corresponding to this token in the source text.
- end_offset: one greater than the position of the last character corresponding to this token.
Note that the difference between @end_offset and @start_offset may not be equal to @term_text.length(), as the term text may have been altered by a stemmer or some other filter.
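The relationship between the offsets and the source text can be sketched as follows. This is a minimal illustration rather than Ferret's own code: `OffsetToken` is a hypothetical stand-in that keeps only the fields described above so the example runs without Ferret installed, and the sample sentence and the stemmed term "run" are assumptions.

```ruby
# Hypothetical stand-in for Ferret::Analysis::Token (assumption: only the
# fields described in the overview are modelled here).
OffsetToken = Struct.new(:term_text, :start_offset, :end_offset, :type)

source = "The ferrets were running"

# Assume a stemming filter reduced "running" (characters 17..23) to "run".
token = OffsetToken.new("run", 17, 24, "word")

# end_offset is one past the last character, so an exclusive range
# recovers the original surface form from the source text.
surface = source[token.start_offset...token.end_offset]

# Wrap the matched span in markers, as a highlighter might.
highlighted = source.dup
highlighted[token.start_offset...token.end_offset] = "<b>#{surface}</b>"

span_length = token.end_offset - token.start_offset

puts surface            # => running
puts highlighted        # => The ferrets were <b>running</b>
puts span_length        # => 7
puts token.term_text.length # => 3
```

The span covers seven characters while the term text has only three, which is why highlighting should rely on the offsets rather than on the length of the term text.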
Instance Attribute Summary
- #end_offset ⇒ Object (readonly)
  Returns the value of attribute end_offset.
- #position_increment ⇒ Object
  Returns the value of attribute position_increment.
- #start_offset ⇒ Object (readonly)
  Returns the value of attribute start_offset.
- #term_text ⇒ Object
  Returns the value of attribute term_text.
- #type ⇒ Object (readonly)
  Returns the value of attribute type.
Instance Method Summary
- #<=>(o) ⇒ Object
  Tokens are sorted by the position in the text at which they occur, i.e., the start_offset.
- #eql?(o) ⇒ Boolean (also: #==)
- #initialize(txt, so, eo, typ = "word", pos_inc = 1) ⇒ Token (constructor)
  Constructs a Token with the given term text and start and end offsets.
- #to_s ⇒ Object
  Returns a string representation of the token with all the attributes.
Constructor Details
#initialize(txt, so, eo, typ = "word", pos_inc = 1) ⇒ Token
Constructs a Token with the given term text and start and end offsets. The type defaults to “word” and the position increment defaults to 1.
# File 'lib/ferret/analysis/token.rb', line 30
def initialize(txt, so, eo, typ="word", pos_inc=1)
  @term_text = txt
  @start_offset = so
  @end_offset = eo
  @type = typ # lexical type
  @position_increment = pos_inc
end
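A short sketch of how the constructor defaults behave. `SketchToken` is a hypothetical stand-in that copies the constructor signature listed above so the example runs without Ferret; the sample terms and offsets are assumptions.

```ruby
# Hypothetical stand-in mirroring the constructor listed above.
class SketchToken
  # Per the attribute summary, the offsets and type are read-only,
  # while term_text and position_increment are writable.
  attr_reader :start_offset, :end_offset, :type
  attr_accessor :term_text, :position_increment

  def initialize(txt, so, eo, typ = "word", pos_inc = 1)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ
    @position_increment = pos_inc
  end
end

# Only the term text and offsets are required.
plain = SketchToken.new("ferret", 4, 10)

# An end-of-sentence marker with an explicit type.
marker = SketchToken.new(".", 23, 24, "eos")

puts plain.type               # => word
puts plain.position_increment # => 1
puts marker.type              # => eos
```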
Instance Attribute Details
#end_offset ⇒ Object (readonly)
Returns the value of attribute end_offset.
# File 'lib/ferret/analysis/token.rb', line 26
def end_offset
  @end_offset
end
#position_increment ⇒ Object
Returns the value of attribute position_increment.
# File 'lib/ferret/analysis/token.rb', line 26
def position_increment
  @position_increment
end
#start_offset ⇒ Object (readonly)
Returns the value of attribute start_offset.
# File 'lib/ferret/analysis/token.rb', line 26
def start_offset
  @start_offset
end
#term_text ⇒ Object
Returns the value of attribute term_text.
# File 'lib/ferret/analysis/token.rb', line 25
def term_text
  @term_text
end
#type ⇒ Object (readonly)
Returns the value of attribute type.
# File 'lib/ferret/analysis/token.rb', line 26
def type
  @type
end
Instance Method Details
#<=>(o) ⇒ Object
Tokens are sorted by the position in the text at which they occur, i.e., the start_offset. If two tokens have the same start offset (see position_increment=), they are sorted by the end_offset and then lexically by the token text.
# File 'lib/ferret/analysis/token.rb', line 48
def <=>(o)
  r = @start_offset <=> o.start_offset
  return r if r != 0
  r = @end_offset <=> o.end_offset
  return r if r != 0
  r = @term_text <=> o.term_text
  return r
end
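The three-level ordering can be sketched as follows. `SortableToken` is a hypothetical stand-in that reproduces the comparison described above (start offset, then end offset, then term text) and includes Comparable as the class does; the sample tokens, including a synonym sharing the span of "quick", are assumptions.

```ruby
# Hypothetical stand-in reproducing the ordering described above.
class SortableToken
  include Comparable
  attr_reader :term_text, :start_offset, :end_offset

  def initialize(txt, so, eo)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
  end

  # Compare by start offset, then end offset, then term text.
  def <=>(o)
    r = start_offset <=> o.start_offset
    return r unless r == 0
    r = end_offset <=> o.end_offset
    return r unless r == 0
    term_text <=> o.term_text
  end
end

tokens = [
  SortableToken.new("quick", 4, 9),
  SortableToken.new("the", 0, 3),
  SortableToken.new("fast", 4, 9), # same span as "quick", e.g. a synonym
  SortableToken.new("qu", 4, 6),   # same start, shorter span
]

sorted_terms = tokens.sort.map(&:term_text)
puts sorted_terms.inspect # => ["the", "qu", "fast", "quick"]

# Comparable derives <, >, between? and friends from <=>.
puts tokens[1] < tokens[0] # => true
```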
#eql?(o) ⇒ Boolean Also known as: ==
# File 'lib/ferret/analysis/token.rb', line 38
def eql?(o)
  # Equal when the other object is a Token with the same offsets and term text.
  return (o.instance_of?(Token) and @start_offset == o.start_offset and
          @end_offset == o.end_offset and @term_text == o.term_text)
end
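A sketch of the equality semantics, assuming the check is a field-by-field comparison of the offsets and the term text with `==`. `EqToken` is a hypothetical stand-in, not Ferret's class, and the sample values are assumptions. Note that the type takes no part in the comparison.

```ruby
# Hypothetical stand-in: equality compares class, offsets and term text.
class EqToken
  attr_reader :term_text, :start_offset, :end_offset, :type

  def initialize(txt, so, eo, typ = "word")
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ
  end

  def eql?(o)
    o.instance_of?(EqToken) &&
      start_offset == o.start_offset &&
      end_offset == o.end_offset &&
      term_text == o.term_text
  end
  alias_method :==, :eql?
end

a = EqToken.new("run", 17, 24)
b = EqToken.new("run", 17, 24, "verb") # differs only in type
c = EqToken.new("ran", 17, 24)         # differs in term text

puts a == b     # => true
puts a == c     # => false
puts a == "run" # => false
```

The last comparison is false because the `instance_of?` check rejects anything that is not a token before any field is read.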
#to_s ⇒ Object
Returns a string representation of the token with all the attributes.
# File 'lib/ferret/analysis/token.rb', line 86
def to_s
  buf = "#{term_text}:#{start_offset}->#{end_offset}"
  buf << "(pos_inc=#{@position_increment})" if (@position_increment != 1)
  buf << "(type=#{@type})" if (@type != "word")
  buf
end
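The output format can be illustrated with a small sketch. `PrintableToken` is a hypothetical stand-in that copies the formatting logic listed above so the example runs without Ferret; the sample tokens are assumptions. Non-default values of the position increment and type are the only ones that appear in the string.

```ruby
# Hypothetical stand-in copying the to_s logic listed above.
class PrintableToken
  attr_reader :term_text, :start_offset, :end_offset

  def initialize(txt, so, eo, typ = "word", pos_inc = 1)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ
    @position_increment = pos_inc
  end

  def to_s
    buf = "#{term_text}:#{start_offset}->#{end_offset}"
    # Defaults are omitted to keep the common case short.
    buf << "(pos_inc=#{@position_increment})" if @position_increment != 1
    buf << "(type=#{@type})" if @type != "word"
    buf
  end
end

default_str = PrintableToken.new("ferret", 4, 10).to_s
custom_str = PrintableToken.new("run", 17, 24, "verb", 0).to_s

puts default_str # => ferret:4->10
puts custom_str  # => run:17->24(pos_inc=0)(type=verb)
```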