Class: Ferret::Analysis::Token

Inherits:
Object
Includes:
Comparable
Defined in:
lib/ferret/analysis/token.rb

Overview

A Token is an occurrence of a term from the text of a field. It consists of a term’s text, the start and end offset of the term in the text of the field, and a type string.

The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc.

The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might be implemented with type “eos”. The default token type is “word”.

start_offset

is the position of the first character corresponding to this token in the source text

end_offset

is one greater than the position of the last character corresponding to this token. Note that the difference between @end_offset and @start_offset may not be equal to @term_text.length(), as the term text may have been altered by a stemmer or some other filter.
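To illustrate the note on offsets, here is a minimal sketch in which a stemmer has shortened the term text while the offsets still address the source. SpanToken is a hypothetical stand-in Struct, not the Ferret class, and the sample sentence and stemmed form are assumptions made for the example.

```ruby
# Hypothetical stand-in for Ferret::Analysis::Token, reduced to the
# three fields this example needs. It is not the library class.
SpanToken = Struct.new(:term_text, :start_offset, :end_offset)

source = "The ferrets were running"

# Assume a stemming filter reduced "running" (characters 17...24) to "run".
token = SpanToken.new("run", 17, 24)

# end_offset is one past the last character, so an exclusive range
# recovers the original span from the source text.
original_span = source[token.start_offset...token.end_offset]
span_length   = token.end_offset - token.start_offset

puts original_span            # the text as it appeared in the field
puts span_length              # length of the source span
puts token.term_text.length   # length of the (stemmed) term text
```

Because the offsets refer to the source rather than to the term text, a highlighter can mark the original word even when the indexed term differs from it.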


Constructor Details

#initialize(txt, so, eo, typ = "word", pos_inc = 1) ⇒ Token

Constructs a Token with the given term text and start and end offsets. The type defaults to “word” and the position increment defaults to 1.



# File 'lib/ferret/analysis/token.rb', line 30

def initialize(txt, so, eo, typ="word", pos_inc=1)
  @term_text = txt
  @start_offset = so
  @end_offset = eo
  @type = typ # lexical type
  @position_increment = pos_inc
end

Instance Attribute Details

#end_offset ⇒ Object (readonly)

Returns the value of attribute end_offset.



# File 'lib/ferret/analysis/token.rb', line 26

def end_offset
  @end_offset
end

#position_increment ⇒ Object

Returns the value of attribute position_increment.



# File 'lib/ferret/analysis/token.rb', line 26

def position_increment
  @position_increment
end

#start_offset ⇒ Object (readonly)

Returns the value of attribute start_offset.



# File 'lib/ferret/analysis/token.rb', line 26

def start_offset
  @start_offset
end

#term_text ⇒ Object

Returns the value of attribute term_text.



# File 'lib/ferret/analysis/token.rb', line 25

def term_text
  @term_text
end

#type ⇒ Object (readonly)

Returns the value of attribute type.



# File 'lib/ferret/analysis/token.rb', line 26

def type
  @type
end

Instance Method Details

#<=>(o) ⇒ Object

Tokens are sorted by the position in the text at which they occur, i.e. by start_offset. If two tokens have the same start offset (see position_increment=), they are sorted by end_offset and then lexically by the token text.



# File 'lib/ferret/analysis/token.rb', line 48

def <=>(o)
  r = @start_offset <=> o.start_offset
  return r if r != 0
  r = @end_offset <=> o.end_offset
  return r if r != 0
  r = @term_text <=> o.term_text
  return r
end
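As a rough illustration of this ordering, the sketch below sorts a few tokens, two of which share the same span. SortableToken is a hypothetical stand-in rather than the Ferret class, and it expresses the three-step comparison as an array comparison, which should behave the same way for integer offsets and string text.

```ruby
# Hypothetical stand-in mirroring the documented comparison order.
# It is not the Ferret class itself.
class SortableToken
  include Comparable
  attr_reader :term_text, :start_offset, :end_offset

  def initialize(txt, so, eo)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
  end

  # Compare by start offset, then end offset, then term text.
  # Array#<=> compares element by element, left to right.
  def <=>(other)
    [start_offset, end_offset, term_text] <=>
      [other.start_offset, other.end_offset, other.term_text]
  end
end

tokens = [
  SortableToken.new("quick", 4, 9),
  SortableToken.new("the", 0, 3),
  SortableToken.new("fast", 4, 9), # e.g. a synonym placed over the same span
  SortableToken.new("qu", 4, 6)
]

sorted_terms = tokens.sort.map(&:term_text)
p sorted_terms
```

Including Comparable also provides <, >, between? and similar operators on top of <=>.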

#eql?(o) ⇒ Boolean Also known as: ==



# File 'lib/ferret/analysis/token.rb', line 38

def eql?(o)
  return (o.instance_of?(Token) and @start_offset == o.start_offset and
          @end_offset == o.end_offset and @term_text == o.term_text)
end
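Reading the source above, equality appears intended to consider only the start offset, end offset and term text, so the type and position increment play no part. The sketch below shows that consequence using EqToken, a hypothetical stand-in that is not the Ferret class.

```ruby
# Hypothetical stand-in, not the Ferret class: equality over three fields only.
class EqToken
  attr_reader :term_text, :start_offset, :end_offset, :type

  def initialize(txt, so, eo, typ = "word")
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ
  end

  # Equal when the class, both offsets and the term text all match.
  def eql?(other)
    other.instance_of?(EqToken) &&
      start_offset == other.start_offset &&
      end_offset == other.end_offset &&
      term_text == other.term_text
  end
  alias_method :==, :eql?
end

a = EqToken.new("stop", 10, 14)
b = EqToken.new("stop", 10, 14, "eos") # same span and text, different type

same_despite_type = (a == b)
differs_on_text   = (a == EqToken.new("halt", 10, 14))

puts same_despite_type
puts differs_on_text
```

Code that needs to distinguish tokens by type would have to compare the type attribute separately.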

#to_s ⇒ Object

Returns a string representation of the token with all the attributes.



# File 'lib/ferret/analysis/token.rb', line 86

def to_s
  buf = "#{term_text}:#{start_offset}->#{end_offset}"
  buf << "(pos_inc=#{@position_increment})" if (@position_increment != 1)
  buf << "(type=#{@type})" if (@type != "word")
  buf
end
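To show what this format looks like in practice, the sketch below reproduces the constructor and to_s on PrintableToken, a hypothetical stand-in rather than the Ferret class. The sample terms, offsets and the “synonym” type are assumptions chosen for the example.

```ruby
# Hypothetical stand-in that copies the documented constructor and to_s.
# It is not the Ferret class itself.
class PrintableToken
  def initialize(txt, so, eo, typ = "word", pos_inc = 1)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ
    @position_increment = pos_inc
  end

  # Always prints text and offsets; the position increment and type
  # are appended only when they differ from their defaults.
  def to_s
    buf = "#{@term_text}:#{@start_offset}->#{@end_offset}"
    buf << "(pos_inc=#{@position_increment})" if @position_increment != 1
    buf << "(type=#{@type})" if @type != "word"
    buf
  end
end

plain   = PrintableToken.new("run", 17, 24).to_s
stacked = PrintableToken.new("jog", 17, 24, "synonym", 0).to_s

puts plain   # run:17->24
puts stacked # jog:17->24(pos_inc=0)(type=synonym)
```

Omitting default values keeps the common case short while still surfacing anything unusual about a token.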