Class: Treat::Workers::Extractors::TopicWords::LDA

Inherits:
Object
Defined in:
lib/treat/workers/extractors/topic_words/lda.rb

Overview

Retrieves topic words through a thin wrapper over a C implementation of Latent Dirichlet Allocation (LDA). LDA is a statistical model that posits that each document is a mixture of a small number of topics and that each word in a document is attributable to one of that document’s topics.

Original paper: Blei, David, Ng, Andrew, and Jordan, Michael. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Mar. 2003), 993-1022.
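
The generative story described in the overview can be illustrated with a toy sketch. This is not the C implementation wrapped by this class; the topics, words, probabilities, and helper names (`sample`, `generate_document`) are invented for illustration, and the topic mixture is fixed rather than drawn from a Dirichlet prior as in the actual model.

```ruby
# Toy illustration of the generative story behind LDA. All names and
# numbers here are hypothetical and exist only for this sketch.

# Each topic is a probability distribution over words.
TOPICS = {
  sports:  { 'goal' => 0.5, 'team' => 0.3, 'match' => 0.2 },
  cooking: { 'salt' => 0.4, 'oven' => 0.4, 'recipe' => 0.2 }
}.freeze

# Draw one outcome from a hash of { outcome => probability }.
def sample(distribution, rng)
  threshold = rng.rand
  cumulative = 0.0
  distribution.each do |outcome, probability|
    cumulative += probability
    return outcome if threshold < cumulative
  end
  distribution.keys.last # guard against floating-point rounding
end

# Generate a document: for every word position, pick a topic from the
# document's topic mixture, then pick a word from that topic.
def generate_document(topic_mixture, length, rng)
  Array.new(length) do
    topic = sample(topic_mixture, rng)
    word  = sample(TOPICS.fetch(topic), rng)
    [topic, word]
  end
end

rng = Random.new(42)
# In LDA the mixture is drawn from a Dirichlet prior; here it is fixed.
mixture  = { sports: 0.8, cooking: 0.2 }
document = generate_document(mixture, 12, rng)
puts document.map { |_topic, word| word }.join(' ')
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document topic mixtures and the per-topic word distributions.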

Constant Summary

DefaultOptions =

Default options for the LDA algorithm.

{
  :num_topics => 20,
  :words_per_topic => 10,
  :iterations => 20,
  :vocabulary => nil
}
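
The `topic_words` method combines caller-supplied options with these defaults through `DefaultOptions.merge(options)`. The sketch below shows the effect of that merge on a standalone copy of the hash; the constant name `DEFAULT_OPTIONS` and the sample values in `user_options` are assumptions made so the example runs without the library.

```ruby
# Standalone copy of the defaults listed above, renamed so this sketch
# does not depend on the Treat library being loaded.
DEFAULT_OPTIONS = {
  :num_topics => 20,
  :words_per_topic => 10,
  :iterations => 20,
  :vocabulary => nil
}

# Hypothetical caller-supplied options.
user_options = { :num_topics => 5, :iterations => 50 }

# Hash#merge returns a new hash: caller values win, and keys the caller
# omitted fall back to the defaults. The defaults hash is left unchanged.
options = DEFAULT_OPTIONS.merge(user_options)

p options
```

Because `merge` is non-destructive, repeated calls with different options do not alter the defaults.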

Class Method Summary

Class Method Details

.topic_words(collection, options = {}) ⇒ Object

Retrieve the topic words of a collection.



# File 'lib/treat/workers/extractors/topic_words/lda.rb', line 41

def self.topic_words(collection, options = {})

  options = DefaultOptions.merge(options)
  
  docs = collection.documents.map { |d| d.to_s }
  # Create a corpus with the collection
  corpus = Lda::TextCorpus.new(docs)
  
  # Create an Lda object for training
  lda = Lda::Lda.new(corpus)
  lda.num_topics = options[:num_topics]
  lda.max_iter = options[:iterations]
  # Run the EM algorithm using random starting points,
  # silencing stdout when verbosity is turned off.
  if Treat.core.verbosity.silence
    silence_stdout { lda.em('random') }
  else
    lda.em('random')
  end
  
  # Load the vocabulary.
  if options[:vocabulary]
    lda.load_vocabulary(options[:vocabulary])
  end
  
  # Get the topic words.
  lda.top_words(options[:words_per_topic]).values
  
end
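
As a rough sketch of how a caller might consume the result, the example below assumes the method returns an array containing one array of words per topic, which is what calling `.values` on a topic-indexed hash would produce. The `topics` data is a hypothetical stand-in, not output from a real LDA run, and the commented-out call assumes a `collection` object that responds to `documents`.

```ruby
# With the library loaded, the call would look something like this
# (collection is assumed to respond to #documents):
#
#   topics = Treat::Workers::Extractors::TopicWords::LDA.topic_words(
#     collection, :num_topics => 3, :words_per_topic => 4
#   )
#
# Hypothetical stand-in for the return value; the words are invented.
topics = [
  %w[market stock trade price],
  %w[game team season player],
  %w[market game price team]
]

# Label each topic by its position and join its words for display.
topics.each_with_index do |words, index|
  puts "Topic #{index}: #{words.join(', ')}"
end

# Count in how many topics each word appears; words shared by several
# topics are less distinctive. Assumes no word repeats within a topic.
topic_counts = topics.flatten.each_with_object(Hash.new(0)) do |word, counts|
  counts[word] += 1
end
shared = topic_counts.select { |_word, count| count > 1 }.keys.sort
puts "Shared across topics: #{shared.join(', ')}"
```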