Method: Classifier::LSI#build_index

Defined in:
lib/classifier/lsi.rb

#build_index(cutoff = 0.75) ⇒ Object

Rebuilds the index if needs_rebuild? returns true; otherwise it returns immediately without doing any work. For very large document spaces, indexing may take some time to complete, so it may be wise to run it in another thread.
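The threading advice can be sketched as follows. This is a minimal illustration, not the library's own code: StandInIndex is a hypothetical stand-in so the example runs without the gem, its version counters are assumptions, and only the method names needs_rebuild? and build_index come from this page.

```ruby
# Hypothetical stand-in for an LSI index; the sleep simulates a slow rebuild.
class StandInIndex
  def initialize
    @version = 1           # assumed: bumped whenever documents change
    @built_at_version = 0  # assumed: version the index was last built at
  end

  def needs_rebuild?
    @version != @built_at_version
  end

  def build_index(_cutoff = 0.75)
    return unless needs_rebuild?

    sleep 0.05 # placeholder for the expensive indexing work
    @built_at_version = @version
  end
end

index = StandInIndex.new

# Run the rebuild off the main thread so the caller is not blocked.
worker = Thread.new { index.build_index }

# ... the main thread is free to do other work here ...

# Wait for the rebuild to finish before querying the index.
worker.join
```

Joining the worker before any search or classification call avoids reading a half-built index. Whether the real class is safe to mutate while another thread is indexing is not stated on this page, so it is safest to avoid adding items until the thread has finished.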

As a rule, indexing will be fairly swift on modern machines until you have well over 500 documents indexed, or have an incredibly diverse vocabulary for your documents.

The optional parameter “cutoff” is a tuning parameter (default 0.75). When the index is built, a certain number of s-values (the singular values of the term-document matrix) are discarded from the system. The cutoff parameter tells the indexer what proportion of these values to keep. A value of 1 for cutoff keeps all of them, which means that no semantic analysis will take place, turning the LSI class into a simple vector search engine.
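The effect of cutoff can be illustrated with a small sketch. It assumes that cutoff is interpreted as the fraction of singular values to keep, largest first, with the rest set to zero; truncate_singular_values is a hypothetical helper written for this example and is not part of the library.

```ruby
# Hypothetical helper: keep roughly `cutoff` of the singular values
# (the largest ones) and zero out the remainder.
def truncate_singular_values(values, cutoff = 0.75)
  keep = (values.size * cutoff).round
  return values.map { 0.0 } if keep <= 0

  # Smallest value that still survives the cut.
  threshold = values.sort.reverse[keep - 1]
  values.map { |v| v < threshold ? 0.0 : v }
end

s_values = [9.0, 4.0, 2.0, 0.5]

reduced = truncate_singular_values(s_values, 0.75) # => [9.0, 4.0, 2.0, 0.0]
full    = truncate_singular_values(s_values, 1)    # => [9.0, 4.0, 2.0, 0.5]
```

With cutoff = 1 nothing is discarded, so the rebuilt matrix equals the original term-document matrix, which is why the index then behaves like a plain vector search.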



# File 'lib/classifier/lsi.rb', line 113

def build_index(cutoff = 0.75)
  return unless needs_rebuild?

  make_word_list

  doc_list = @items.values
  tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

  if $GSL
    tdm = GSL::Matrix.alloc(*tda).trans
    ntdm = build_reduced_matrix(tdm, cutoff)

    ntdm.size[1].times do |col|
      vec = GSL::Vector.alloc(ntdm.column(col)).row
      doc_list[col].lsi_vector = vec
      doc_list[col].lsi_norm = vec.normalize
    end
  else
    tdm = Matrix.rows(tda).trans
    ntdm = build_reduced_matrix(tdm, cutoff)

    ntdm.row_size.times do |col|
      doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col]
      doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col]
    end
  end

  @built_at_version = @version
end
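
The non-GSL branch can be pictured with Ruby's bundled matrix library. The sketch below is a simplification under stated assumptions: the word list and documents are made-up toy data, raw term counts stand in for whatever raw_vector_with actually computes, and the reduction step (build_reduced_matrix) is skipped entirely.

```ruby
require 'matrix'

# Hypothetical toy data; the library derives its own word list and
# per-document vectors internally.
word_list = %w[cat dog fish]
documents = {
  'doc_a' => %w[cat dog cat],
  'doc_b' => %w[fish dog]
}

# One row per document: raw term counts in word_list order.
tda = documents.values.map do |words|
  word_list.map { |term| words.count(term).to_f }
end

# Transpose so that rows are terms and columns are documents.
tdm = Matrix.rows(tda).transpose

# Each column is one document's vector; keep it with its unit-length form.
vectors = {}
documents.keys.each_with_index do |name, col|
  vec = tdm.column(col)
  vectors[name] = { vector: vec, norm: vec.normalize }
end
```

After the transpose, the number of columns equals the number of documents and the number of rows equals the size of the word list, so per-document work iterates over columns.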