Class: ClassifierReborn::LSI

Inherits:

Object

Object
ClassifierReborn::LSI

show all

Defined in:: lib/classifier-reborn/lsi.rb

Overview

This class implements a Latent Semantic Indexer, which can search, classify and cluster data based on underlying semantic relations. For more information on the algorithms used, please consult Wikipedia.

Instance Attribute Summary collapse

#auto_rebuild ⇒ Object

Returns the value of attribute auto_rebuild.
#cache_node_vectors ⇒ Object readonly

Returns the value of attribute cache_node_vectors.
#word_list ⇒ Object readonly

Returns the value of attribute word_list.

Instance Method Summary collapse

#<<(item) ⇒ Object

A less flexible shorthand for add_item that assumes you are passing in a string with no categorries.
#add_item(item, *categories, &block) ⇒ Object

Adds an item to the index.
#build_index(cutoff = 0.75) ⇒ Object

This function rebuilds the index if needs_rebuild? returns true.
#categories_for(item) ⇒ Object

Returns categories for a given indexed item.
#classify(doc, cutoff = 0.30, &block) ⇒ Object

Return the most obvious category without the score.
#classify_with_score(doc, cutoff = 0.30, &block) ⇒ Object

Return the most obvious category with the score.
#content_node_norms(content_node) ⇒ Object
#find_related(doc, max_nearest = 3, &block) ⇒ Object

This function takes content and finds other documents that are semantically “close”, returning an array of documents sorted from most to least relavant.
#highest_ranked_stems(doc, count = 3) ⇒ Object

Prototype, only works on indexed documents.
#highest_relative_content(max_chunks = 10) ⇒ Object

This method returns max_chunks entries, ordered by their average semantic rating.
#initialize(options = {}) ⇒ LSI constructor

Create a fresh index.
#items ⇒ Object

Returns an array of items that are indexed.
#needs_rebuild? ⇒ Boolean

Returns true if the index needs to be rebuilt.
#proximity_array_for_content(doc, &block) ⇒ Object

This function is the primitive that find_related and classify build upon.
#proximity_norms_for_content(doc, &block) ⇒ Object

Similar to proximity_array_for_content, this function takes similar arguments and returns a similar array.
#remove_item(item) ⇒ Object

Removes an item from the database, if it is indexed.
#reset ⇒ Object
#scored_categories(doc, cutoff = 0.30, &block) ⇒ Object

This function uses a voting system to categorize documents, based on the categories of other documents.
#search(string, max_nearest = 3) ⇒ Object

This function allows for text-based search of your index.

Constructor Details

#initialize(options = {}) ⇒ `LSI`

Create a fresh index. If you want to call #build_index manually, use

ClassifierReborn::LSI.new :auto_rebuild => false

If you want to use ContentNodes with cached vector transpositions, use

lsi = ClassifierReborn::LSI.new :cache_node_vectors => true

# File 'lib/classifier-reborn/lsi.rb', line 52

def initialize(options = {})
  @auto_rebuild = options[:auto_rebuild] != false
  @word_list = WordList.new
  @items = {}
  @version = 0
  @built_at_version = -1
  @language = options[:language] || 'en'
  @token_filters = [
    TokenFilter::Stopword,
    TokenFilter::Symbol
  ]
  TokenFilter::Stopword.language = @language
  extend CachedContentNode::InstanceMethods if @cache_node_vectors = options[:cache_node_vectors]
end

Instance Attribute Details

#auto_rebuild ⇒ `Object`

Returns the value of attribute auto_rebuild.



44
45
46

# File 'lib/classifier-reborn/lsi.rb', line 44

def auto_rebuild
  @auto_rebuild
end

#cache_node_vectors ⇒ `Object` (readonly)

Returns the value of attribute cache_node_vectors.



43
44
45

# File 'lib/classifier-reborn/lsi.rb', line 43

def cache_node_vectors
  @cache_node_vectors
end

#word_list ⇒ `Object` (readonly)

Returns the value of attribute word_list.



43
44
45

# File 'lib/classifier-reborn/lsi.rb', line 43

def word_list
  @word_list
end

Instance Method Details

#<<(item) ⇒ `Object`

A less flexible shorthand for add_item that assumes you are passing in a string with no categorries. item will be duck typed via to_s .



107
108
109

# File 'lib/classifier-reborn/lsi.rb', line 107

def <<(item)
  add_item(item)
end

#add_item(item, *categories, &block) ⇒ `Object`

Adds an item to the index. item is assumed to be a string, but any item may be indexed so long as it responds to #to_s or if you provide an optional block explaining how the indexer can fetch fresh string data. This optional block is passed the item, so the item may only be a reference to a URL or file name.

For example:

lsi = ClassifierReborn::LSI.new
lsi.add_item "This is just plain text"
lsi.add_item "/home/me/filename.txt" { |x| File.read x }
ar = ActiveRecordObject.find( :all )
lsi.add_item ar, *ar.categories { |x| ar.content }

# File 'lib/classifier-reborn/lsi.rb', line 87

def add_item(item, *categories, &block)
  clean_word_hash = Hasher.word_hash((block ? yield(item) : item.to_s),
                                     token_filters: @token_filters)
  if clean_word_hash.empty?
    puts "Input: '#{item}' is entirely stopwords or words with 2 or fewer characters. Classifier-Reborn cannot handle this document properly."
  else
    @items[item] = if @cache_node_vectors
                     CachedContentNode.new(clean_word_hash, *categories)
                   else
                     ContentNode.new(clean_word_hash, *categories)
                   end
    @version += 1
    build_index if @auto_rebuild
  end
end

#build_index(cutoff = 0.75) ⇒ `Object`

This function rebuilds the index if needs_rebuild? returns true. For very large document spaces, this indexing operation may take some time to complete, so it may be wise to place the operation in another thread.

As a rule, indexing will be fairly swift on modern machines until you have well over 500 documents indexed, or have an incredibly diverse vocabulary for your documents.

The optional parameter “cutoff” is a tuning parameter. When the index is built, a certain number of s-values are discarded from the system. The cutoff parameter tells the indexer how many of these values to keep. A value of 1 for cutoff means that no semantic analysis will take place, turning the LSI class into a simple vector search engine.

# File 'lib/classifier-reborn/lsi.rb', line 147

def build_index(cutoff = 0.75)
  return unless needs_rebuild?

  make_word_list

  doc_list = @items.values
  tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

  if $SVD == :numo
    tdm = Numo::NArray.asarray(tda.map(&:to_a)).transpose
    ntdm = numo_build_reduced_matrix(tdm, cutoff)

    ntdm.each_over_axis(1).with_index do |col_vec, i|
      doc_list[i].lsi_vector = col_vec
      doc_list[i].lsi_norm = col_vec / Numo::Linalg.norm(col_vec)
    end
  elsif $SVD == :gsl
    tdm = GSL::Matrix.alloc(*tda).trans
    ntdm = build_reduced_matrix(tdm, cutoff)

    ntdm.size[1].times do |col|
      vec = GSL::Vector.alloc(ntdm.column(col)).row
      doc_list[col].lsi_vector = vec
      doc_list[col].lsi_norm = vec.normalize
    end
  else
    tdm = Matrix.rows(tda).trans
    ntdm = build_reduced_matrix(tdm, cutoff)

    ntdm.column_size.times do |col|
      doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col]
      if ntdm.column(col).zero?
        doc_list[col].lsi_norm = ntdm.column(col) if doc_list[col]
      else
        doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col]
      end
    end
  end

  @built_at_version = @version
end

#categories_for(item) ⇒ `Object`

Returns categories for a given indexed item. You are free to add and remove items from this as you see fit. It does not invalide an index to change its categories.

# File 'lib/classifier-reborn/lsi.rb', line 113

def categories_for(item)
  return [] unless @items[item]

  @items[item].categories
end

#classify(doc, cutoff = 0.30, &block) ⇒ `Object`

Return the most obvious category without the score



307
308
309

# File 'lib/classifier-reborn/lsi.rb', line 307

def classify(doc, cutoff = 0.30, &block)
  scored_categories(doc, cutoff, &block).last.first
end

#classify_with_score(doc, cutoff = 0.30, &block) ⇒ `Object`

Return the most obvious category with the score



302
303
304

# File 'lib/classifier-reborn/lsi.rb', line 302

def classify_with_score(doc, cutoff = 0.30, &block)
  scored_categories(doc, cutoff, &block).last
end

#content_node_norms(content_node) ⇒ `Object`

# File 'lib/classifier-reborn/lsi.rb', line 253

def content_node_norms(content_node)
  result =
    @items.keys.collect do |item|
      val = if $SVD == :numo
              content_node.search_norm.dot(@items[item].search_norm)
            elsif $SVD == :gsl
              content_node.search_norm * @items[item].search_norm.col
            else
              (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
            end
      [item, val]
    end
  result.sort_by { |x| x[1] }.reverse
end

#find_related(doc, max_nearest = 3, &block) ⇒ `Object`

This function takes content and finds other documents that are semantically “close”, returning an array of documents sorted from most to least relavant. max_nearest specifies the number of documents to return. A value of 0 means that it returns all the indexed documents, sorted by relavence.

This is particularly useful for identifing clusters in your document space. For example you may want to identify several “What’s Related” items for weblog articles, or find paragraphs that relate to each other in an essay.

# File 'lib/classifier-reborn/lsi.rb', line 294

def find_related(doc, max_nearest = 3, &block)
  carry =
    proximity_array_for_content(doc, &block).reject { |pair| pair[0].eql? doc }
  result = carry.collect { |x| x[0] }
  result[0..max_nearest - 1]
end

#highest_ranked_stems(doc, count = 3) ⇒ `Object`

Prototype, only works on indexed documents. I have no clue if this is going to work, but in theory it’s supposed to.

# File 'lib/classifier-reborn/lsi.rb', line 337

def highest_ranked_stems(doc, count = 3)
  raise 'Requested stem ranking on non-indexed content!' unless @items[doc]

  content_vector_array = node_for_content(doc).lsi_vector.to_a
  top_n = content_vector_array.sort.reverse[0..count - 1]
  top_n.collect { |x| @word_list.word_for_index(content_vector_array.index(x)) }
end

#highest_relative_content(max_chunks = 10) ⇒ `Object`

This method returns max_chunks entries, ordered by their average semantic rating. Essentially, the average distance of each entry from all other entries is calculated, the highest are returned.

This can be used to build a summary service, or to provide more information about your dataset’s general content. For example, if you were to use categorize on the results of this data, you could gather information on what your dataset is generally about.

# File 'lib/classifier-reborn/lsi.rb', line 197

def highest_relative_content(max_chunks = 10)
  return [] if needs_rebuild?

  avg_density = {}
  @items.each_key { |item| avg_density[item] = proximity_array_for_content(item).inject(0.0) { |x, y| x + y[1] } }

  avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..max_chunks - 1].map
end

#items ⇒ `Object`

Returns an array of items that are indexed.



129
130
131

# File 'lib/classifier-reborn/lsi.rb', line 129

def items
  @items.keys
end

#needs_rebuild? ⇒ `Boolean`

Returns true if the index needs to be rebuilt. The index needs to be built after all informaton is added, but before you start using it for search, classification and cluster detection.

Returns:

(Boolean)



70
71
72

# File 'lib/classifier-reborn/lsi.rb', line 70

def needs_rebuild?
  (@items.size > 1) && (@version != @built_at_version)
end

#proximity_array_for_content(doc, &block) ⇒ `Object`

This function is the primitive that find_related and classify build upon. It returns an array of 2-element arrays. The first element of this array is a document, and the second is its “score”, defining how “close” it is to other indexed items.

These values are somewhat arbitrary, having to do with the vector space created by your content, so the magnitude is interpretable but not always meaningful between indexes.

The parameter doc is the content to compare. If that content is not indexed, you can pass an optional block to define how to create the text data. See add_item for examples of how this works.

# File 'lib/classifier-reborn/lsi.rb', line 218

def proximity_array_for_content(doc, &block)
  return [] if needs_rebuild?

  content_node = node_for_content(doc, &block)
  result =
    @items.keys.collect do |item|
      val = if $SVD == :numo
              content_node.search_vector.dot(@items[item].transposed_search_vector)
            elsif $SVD == :gsl
              content_node.search_vector * @items[item].transposed_search_vector
            else
              (Matrix[content_node.search_vector] * @items[item].search_vector)[0]
            end
      [item, val]
    end
  result.sort_by { |x| x[1] }.reverse
end

#proximity_norms_for_content(doc, &block) ⇒ `Object`

Similar to proximity_array_for_content, this function takes similar arguments and returns a similar array. However, it uses the normalized calculated vectors instead of their full versions. This is useful when you’re trying to perform operations on content that is much smaller than the text you’re working with. search uses this primitive.

# File 'lib/classifier-reborn/lsi.rb', line 241

def proximity_norms_for_content(doc, &block)
  return [] if needs_rebuild?

  content_node = node_for_content(doc, &block)
  if ($SVD == :gsl && content_node.raw_norm.isnan?.all?) ||
      ($SVD == :numo && content_node.raw_norm.isnan.all?)
    puts "There are no documents that are similar to #{doc}"
  else
    content_node_norms(content_node)
  end
end

#remove_item(item) ⇒ `Object`

Removes an item from the database, if it is indexed.

# File 'lib/classifier-reborn/lsi.rb', line 121

def remove_item(item)
  return unless @items.key? item

  @items.delete item
  @version += 1
end

#reset ⇒ `Object`



345
346
347

# File 'lib/classifier-reborn/lsi.rb', line 345

def reset
  initialize(auto_rebuild: @auto_rebuild, cache_node_vectors: @cache_node_vectors)
end

#scored_categories(doc, cutoff = 0.30, &block) ⇒ `Object`

This function uses a voting system to categorize documents, based on the categories of other documents. It uses the same logic as the find_related function to find related documents, then returns the list of sorted categories.

cutoff signifies the number of documents to consider when clasifying text. A cutoff of 1 means that every document in the index votes on what category the document is in. This may not always make sense.

# File 'lib/classifier-reborn/lsi.rb', line 320

def scored_categories(doc, cutoff = 0.30, &block)
  icutoff = (@items.size * cutoff).round
  carry = proximity_array_for_content(doc, &block)
  carry = carry[0..icutoff - 1]
  votes = Hash.new(0.0)
  carry.each do |pair|
    @items[pair[0]].categories.each do |category|
      votes[category] += pair[1]
    end
  end

  votes.sort_by { |_, score| score }
end

#search(string, max_nearest = 3) ⇒ `Object`

This function allows for text-based search of your index. Unlike other functions like find_related and classify, search only takes short strings. It will also ignore factors like repeated words. It is best for short, google-like search terms. A search will first priortize lexical relationships, then semantic ones.

While this may seem backwards compared to the other functions that LSI supports, it is actually the same algorithm, just applied on a smaller document.

# File 'lib/classifier-reborn/lsi.rb', line 275

def search(string, max_nearest = 3)
  return [] if needs_rebuild?

  carry = proximity_norms_for_content(string)
  unless carry.nil?
    result = carry.collect { |x| x[0] }
    result[0..max_nearest - 1]
  end
end

Class: ClassifierReborn::LSI

Overview

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(options = {}) ⇒ LSI

Instance Attribute Details

#auto_rebuild ⇒ Object

#cache_node_vectors ⇒ Object (readonly)

#word_list ⇒ Object (readonly)

Instance Method Details

#<<(item) ⇒ Object

#add_item(item, *categories, &block) ⇒ Object

#build_index(cutoff = 0.75) ⇒ Object

#categories_for(item) ⇒ Object

#classify(doc, cutoff = 0.30, &block) ⇒ Object

#classify_with_score(doc, cutoff = 0.30, &block) ⇒ Object

#content_node_norms(content_node) ⇒ Object

#find_related(doc, max_nearest = 3, &block) ⇒ Object

#highest_ranked_stems(doc, count = 3) ⇒ Object

#highest_relative_content(max_chunks = 10) ⇒ Object

#items ⇒ Object

#needs_rebuild? ⇒ Boolean

#proximity_array_for_content(doc, &block) ⇒ Object

#proximity_norms_for_content(doc, &block) ⇒ Object

#remove_item(item) ⇒ Object

#reset ⇒ Object

#scored_categories(doc, cutoff = 0.30, &block) ⇒ Object

#search(string, max_nearest = 3) ⇒ Object