Class: Treat::Workers::Extractors::TopicWords::LDA
- Inherits: Object
- Defined in: lib/treat/workers/extractors/topic_words/lda.rb
Overview
Retrieves topic words using a thin wrapper over a C implementation of Latent Dirichlet Allocation (LDA). LDA is a statistical model that posits that each document is a mixture of a small number of topics and that each word in a document is attributable to one of that document’s topics.
Original paper: Blei, David, Ng, Andrew, and Jordan, Michael. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research. 3 (Mar. 2003), 993-1022.
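For reference, the generative process that the overview describes can be written as follows, using the notation of the cited paper. Here θ_d is the topic mixture of document d, z_{d,n} is the topic assigned to the n-th word, w_{d,n} is that word, and α and β are the model parameters (the Dirichlet prior and the per-topic word distributions). This is a summary of the model only, not of how the wrapped C code estimates it.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), \qquad n = 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
\end{align*}
```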
Constant Summary
- DefaultOptions =
Default options for the LDA algorithm.
{ :num_topics => 20, :words_per_topic => 10, :iterations => 20, :vocabulary => nil }
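The method merges caller-supplied options over these defaults with Hash#merge, so any key the caller omits keeps its default value. The sketch below illustrates that behavior in isolation; DEFAULT_OPTIONS is a local copy of the hash above and resolve_options is a hypothetical helper, neither of which is part of Treat.

```ruby
# Local copy of the defaults listed above, so the sketch runs
# without the treat gem installed.
DEFAULT_OPTIONS = {
  :num_topics => 20,
  :words_per_topic => 10,
  :iterations => 20,
  :vocabulary => nil
}

# Hypothetical helper (not part of Treat): overlay the caller's
# options on the defaults without mutating either hash.
def resolve_options(options = {})
  DEFAULT_OPTIONS.merge(options)
end

resolved = resolve_options(:num_topics => 5)
puts resolved[:num_topics]       # the caller's value wins
puts resolved[:words_per_topic]  # falls back to the default
```

Because Hash#merge returns a new hash, the defaults themselves are left unchanged between calls.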
Class Method Summary
- .topic_words(collection, options = {}) ⇒ Object
Retrieve the topic words of a collection.
Class Method Details
.topic_words(collection, options = {}) ⇒ Object
Retrieve the topic words of a collection.
# File 'lib/treat/workers/extractors/topic_words/lda.rb', line 41
def self.topic_words(collection, options = {})
  options = DefaultOptions.merge(options)
  docs = collection.documents.map { |d| d.to_s }
  # Create a corpus with the collection
  corpus = Lda::TextCorpus.new(docs)
  # Create an Lda object for training
  lda = Lda::Lda.new(corpus)
  lda.num_topics = options[:num_topics]
  lda.max_iter = options[:iterations]
  # Run the EM algorithm using random
  # starting points
  Treat.core.verbosity.silence ?
    silence_stdout { lda.em('random') } :
    lda.em('random')
  # Load the vocabulary.
  if options[:vocabulary]
    lda.load_vocabulary(options[:vocabulary])
  end
  # Get the topic words.
  lda.top_words(
    options[:words_per_topic]
  ).values
end
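The last step of the method calls .values on the result of lda.top_words. The sketch below shows what that step does, assuming top_words returns a Hash that maps each topic index to that topic's list of words (this return shape is an assumption about the wrapped library, and the words are made-up placeholders, not real output).

```ruby
# Stand-in for the Hash that lda.top_words(n) is assumed to return:
# topic index => top words for that topic. Placeholder data only.
top_words = {
  0 => ['market', 'stock', 'price'],
  1 => ['team', 'game', 'season']
}

# .values drops the topic indices and keeps one word list per topic,
# in the hash's insertion order.
topic_words = top_words.values

topic_words.each_with_index do |words, i|
  puts "Topic #{i}: #{words.join(', ')}"
end
```

Under that assumption, the method's return value is an Array with one Array of words per topic.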