Class: Ferret::Analysis::Token

Inherits: Object

Includes: Comparable

Defined in: lib/ferret/analysis/token.rb

Overview

A Token is an occurrence of a term from the text of a field. It consists of a term’s text, the start and end offset of the term in the text of the field, and a type string.

The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc.

The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be given the type “eos”. The default token type is “word”.

start_offset: the position of the first character corresponding to this token in the source text.
end_offset: one greater than the position of the last character corresponding to this token.

Note that the difference between @end_offset and @start_offset may not be equal to @term_text.length, as the term text may have been altered by a stemmer or some other filter.
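As a rough illustration of how the offsets map a token back to its source text, the sketch below uses a minimal stand-in class that mirrors the constructor documented on this page (it is not the Ferret class itself). The sample sentence and the stemmed term "run" are assumptions chosen for the example.

```ruby
# Minimal stand-in mirroring the constructor shown on this page;
# an assumption for illustration, not the Ferret class itself.
class Token
  attr_reader :term_text, :start_offset, :end_offset, :type

  def initialize(txt, so, eo, typ = "word", pos_inc = 1)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ
    @position_increment = pos_inc
  end
end

source = "The ferrets were running"

# Hypothetical output of a stemming analyzer: the word "running"
# (characters 17 through 23) has been reduced to the term "run".
token = Token.new("run", 17, 24)

# end_offset is one past the last character, so an exclusive range
# recovers the original text.
original = source[token.start_offset...token.end_offset]

highlighted = source.dup
highlighted[token.start_offset...token.end_offset] = "<b>#{original}</b>"

span = token.end_offset - token.start_offset

puts original                 # running
puts highlighted              # The ferrets were <b>running</b>
puts span                     # 7
puts token.term_text.length   # 3
```

The span covered in the source (7 characters) differs from the length of the term text (3), which is why highlighting should rely on the offsets rather than on the term itself.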

Instance Attribute Summary

#end_offset ⇒ Object readonly

Returns the value of attribute end_offset.
#position_increment ⇒ Object

Returns the value of attribute position_increment.
#start_offset ⇒ Object readonly

Returns the value of attribute start_offset.
#term_text ⇒ Object

Returns the value of attribute term_text.
#type ⇒ Object readonly

Returns the value of attribute type.

Instance Method Summary

#<=>(o) ⇒ Object

Tokens are sorted by the position in the text at which they occur, i.e. the start_offset.
#eql?(o) ⇒ Boolean (also: #==)

Returns true if o is a Token with the same start offset, end offset, and term text.
#initialize(txt, so, eo, typ = "word", pos_inc = 1) ⇒ Token constructor

Constructs a Token with the given term text, and start & end offsets.
#to_s ⇒ Object

Returns a string representation of the token with all the attributes.

Constructor Details

#initialize(txt, so, eo, typ = "word", pos_inc = 1) ⇒ `Token`

Constructs a Token with the given term text and start and end offsets. The type defaults to “word” and the position increment defaults to 1.

# File 'lib/ferret/analysis/token.rb', line 30

def initialize(txt, so, eo, typ="word", pos_inc=1)
  @term_text = txt
  @start_offset = so
  @end_offset = eo
  @type = typ # lexical type
  @position_increment = pos_inc
end
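A short sketch of the constructor defaults may help here. The class below is a stand-in that copies the constructor above; the writable accessors for term_text and position_increment are an assumption based on the attribute summary (which does not mark them readonly), and the sample terms and offsets are invented for illustration.

```ruby
# Stand-in copying the documented constructor; the accessor
# declarations are an assumption based on the attribute summary.
class Token
  attr_accessor :term_text, :position_increment
  attr_reader :start_offset, :end_offset, :type

  def initialize(txt, so, eo, typ = "word", pos_inc = 1)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ # lexical type
    @position_increment = pos_inc
  end
end

# Only the term text and offsets are required.
word = Token.new("ferret", 4, 10)

# An end-of-sentence marker with an explicit type.
eos = Token.new(".", 24, 25, "eos")

# A second token over the same offsets, given a position increment
# of 0 here purely for illustration.
stacked = Token.new("weasel", 4, 10, "word", 0)

puts word.type                   # word
puts word.position_increment     # 1
puts eos.type                    # eos
puts stacked.position_increment  # 0
```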

Instance Attribute Details

#end_offset ⇒ `Object` (readonly)

Returns the value of attribute end_offset.




# File 'lib/ferret/analysis/token.rb', line 26

def end_offset
  @end_offset
end

#position_increment ⇒ `Object`

Returns the value of attribute position_increment.




# File 'lib/ferret/analysis/token.rb', line 26

def position_increment
  @position_increment
end

#start_offset ⇒ `Object` (readonly)

Returns the value of attribute start_offset.




# File 'lib/ferret/analysis/token.rb', line 26

def start_offset
  @start_offset
end

#term_text ⇒ `Object`

Returns the value of attribute term_text.




# File 'lib/ferret/analysis/token.rb', line 25

def term_text
  @term_text
end

#type ⇒ `Object` (readonly)

Returns the value of attribute type.




# File 'lib/ferret/analysis/token.rb', line 26

def type
  @type
end

Instance Method Details

#<=>(o) ⇒ `Object`

Tokens are sorted by the position in the text at which they occur, i.e. the start_offset. If two tokens have the same start offset (see position_increment=), they are sorted by the end_offset and then lexically by the term text.

# File 'lib/ferret/analysis/token.rb', line 48

def <=>(o)
  r = @start_offset <=> o.start_offset
  return r if r != 0
  r = @end_offset <=> o.end_offset
  return r if r != 0
  r = @term_text <=> o.term_text
  return r
end
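The three-level ordering can be sketched as follows. The class is a stand-in that reuses the comparison logic above together with Comparable; the sample tokens and their offsets are assumptions made up for the example.

```ruby
# Stand-in reusing the documented comparison; not the Ferret class.
class Token
  include Comparable
  attr_reader :term_text, :start_offset, :end_offset

  def initialize(txt, so, eo, typ = "word", pos_inc = 1)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ
    @position_increment = pos_inc
  end

  # Order by start offset, then end offset, then term text.
  def <=>(o)
    r = @start_offset <=> o.start_offset
    return r if r != 0
    r = @end_offset <=> o.end_offset
    return r if r != 0
    @term_text <=> o.term_text
  end
end

tokens = [
  Token.new("quick brown", 4, 15),
  Token.new("quick", 4, 9),
  Token.new("the", 0, 3),
  Token.new("fast", 4, 9, "word", 0)
]

sorted = tokens.sort.map(&:term_text)
p sorted  # ["the", "fast", "quick", "quick brown"]

# Comparable derives the relational operators from <=>.
puts Token.new("the", 0, 3) < Token.new("quick", 4, 9)  # true
```

"fast" and "quick" share both offsets, so the term text breaks the tie; "quick brown" starts at the same position but ends later, so it sorts after both.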

#eql?(o) ⇒ `Boolean` Also known as: ==

# File 'lib/ferret/analysis/token.rb', line 38

# Equal when o is a Token with the same offsets and the same term text.
def eql?(o)
  return (o.instance_of?(Token) and @start_offset == o.start_offset and
          @end_offset == o.end_offset and @term_text == o.term_text)
end
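One consequence worth seeing in practice is that equality looks only at the offsets and the term text, not at the type or position increment. The sketch below is a stand-in class that assumes the term text is compared with == (an equality test) and that == is an alias of eql?, as the "Also known as" note indicates; the sample tokens are invented.

```ruby
# Stand-in for illustration; assumes term text is compared with ==
# and that == is an alias of eql?.
class Token
  attr_reader :term_text, :start_offset, :end_offset

  def initialize(txt, so, eo, typ = "word", pos_inc = 1)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ
    @position_increment = pos_inc
  end

  # True only for another Token with matching offsets and term text.
  def eql?(o)
    o.instance_of?(Token) && @start_offset == o.start_offset &&
      @end_offset == o.end_offset && @term_text == o.term_text
  end
  alias_method :==, :eql?
end

a = Token.new("run", 17, 24)
b = Token.new("run", 17, 24, "verb", 0) # differs only in type and pos_inc
c = Token.new("ran", 17, 24)            # differs in term text

puts a == b      # true
puts a == c      # false
puts a == "run"  # false
```

Because the instance_of? check comes first, comparing a token with an object of another class returns false without calling offset methods on it.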

#to_s ⇒ `Object`

Returns a string representation of the token with all the attributes.

# File 'lib/ferret/analysis/token.rb', line 86

def to_s
  buf = "#{term_text}:#{start_offset}->#{end_offset}"
  buf << "(pos_inc=#{@position_increment})" if (@position_increment != 1)
  buf << "(type=#{@type})" if (@type != "word")
  buf
end
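To show the resulting format, the sketch below pairs the to_s method above with a stand-in class built from the documented constructor; the sample tokens are assumptions. The position increment and type are appended only when they differ from their defaults.

```ruby
# Stand-in combining the documented constructor and to_s;
# not the Ferret class itself.
class Token
  attr_reader :term_text, :start_offset, :end_offset

  def initialize(txt, so, eo, typ = "word", pos_inc = 1)
    @term_text = txt
    @start_offset = so
    @end_offset = eo
    @type = typ
    @position_increment = pos_inc
  end

  # "text:start->end", plus any non-default attributes.
  def to_s
    buf = "#{term_text}:#{start_offset}->#{end_offset}"
    buf << "(pos_inc=#{@position_increment})" if @position_increment != 1
    buf << "(type=#{@type})" if @type != "word"
    buf
  end
end

plain = Token.new("run", 17, 24).to_s
marked = Token.new(".", 24, 25, "eos", 0).to_s

puts plain   # run:17->24
puts marked  # .:24->25(pos_inc=0)(type=eos)
```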
