Class: TfIdfSimilarity::BM25Model
- Inherits: Object
- Extended by: Forwardable
- Includes: MatrixMethods
- Defined in: lib/tf-idf-similarity/bm25_model.rb
Overview
A document-term matrix using the BM25 function.
Instance Method Summary
- #initialize(documents, opts = {}) ⇒ BM25Model (constructor)
  Returns a new instance of BM25Model.
- #inverse_document_frequency(term) ⇒ Float (also: #idf)
  Returns the term's inverse document frequency.
- #similarity_matrix ⇒ GSL::Matrix, ...
  Returns a similarity matrix for the documents in the corpus.
- #term_frequency(document, term) ⇒ Float (also: #tf)
  Returns the term's frequency in the document.
- #term_frequency_inverse_document_frequency(document, term) ⇒ Float (also: #tfidf)
  Returns the term frequency–inverse document frequency.
Constructor Details
#initialize(documents, opts = {}) ⇒ BM25Model
Returns a new instance of BM25Model.
# File 'lib/tf-idf-similarity/bm25_model.rb', line 14
def initialize(documents, opts = {})
  @model = TfIdfSimilarity::TermCountModel.new(documents, opts)
  @library = (opts[:library] || :matrix).to_sym

  array = Array.new(terms.size) do |i|
    idf = inverse_document_frequency(terms[i])
    Array.new(documents.size) do |j|
      term_frequency(documents[j], terms[i]) * idf
    end
  end

  @matrix = initialize_matrix(array)
end
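The constructor's nested loop produces one row per term and one column per document. A minimal sketch of that layout, assuming a hypothetical corpus of pre-tokenized arrays and a stand-in weight (raw counts) in place of `term_frequency * idf`; the names `documents`, `terms`, and `weight` below are illustrative and are not the gem's objects:

```ruby
# Hypothetical tokenized corpus; the real model receives document objects.
documents = [%w[a b a], %w[b c]]
terms = documents.flatten.uniq # ["a", "b", "c"]

# Stand-in weight: the raw term count. BM25Model multiplies
# term_frequency by the term's idf at this point instead.
weight = ->(document, term) { document.count(term) }

# Outer loop walks terms (rows), inner loop walks documents (columns).
rows = Array.new(terms.size) do |i|
  Array.new(documents.size) do |j|
    weight.call(documents[j], terms[i])
  end
end

rows.each_with_index do |row, i|
  puts "#{terms[i]}: #{row.inspect}"
end
```

Each inner array is one term's weights across the whole corpus, so a document vector is a column of the resulting matrix.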
Instance Method Details
#inverse_document_frequency(term) ⇒ Float Also known as: idf
Returns the term's inverse document frequency.
# File 'lib/tf-idf-similarity/bm25_model.rb', line 32
def inverse_document_frequency(term)
  df = @model.document_count(term)
  log((documents.size - df + 0.5) / (df + 0.5))
end
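To see how this weight behaves, the same expression can be evaluated outside the model. This is a sketch only: `bm25_idf` is a hypothetical standalone helper, with `n` standing in for `documents.size` and `df` for `@model.document_count(term)`.

```ruby
# Hypothetical helper mirroring the expression in the listing above.
# n  - total number of documents in the corpus
# df - number of documents that contain the term
def bm25_idf(n, df)
  Math.log((n - df + 0.5) / (df + 0.5))
end

rare   = bm25_idf(10, 1) # term in 1 of 10 documents: positive weight
half   = bm25_idf(10, 5) # term in exactly half the corpus # => 0.0
common = bm25_idf(10, 9) # term in 9 of 10 documents: negative weight

puts [rare, half, common].inspect
```

The weight turns negative once a term appears in more than half of the documents, which is a known property of this form of the BM25 idf.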
#similarity_matrix ⇒ GSL::Matrix, ...
Note:
Columns are normalized to unit vectors, so we can calculate the cosine similarity of all document vectors.
Returns a similarity matrix for the documents in the corpus.
# File 'lib/tf-idf-similarity/bm25_model.rb', line 66
def similarity_matrix
  multiply_self(normalize)
end
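The note above can be illustrated with Ruby's `matrix` library. This is a sketch under assumptions, not the gem's implementation: `normalize_columns` and `cosine_similarity_matrix` are hypothetical helpers, and the product is assumed to be the transpose of the normalized matrix times the normalized matrix.

```ruby
require 'matrix'

# Hypothetical helper: scale each column (one document vector) to unit length.
def normalize_columns(matrix)
  columns = matrix.column_vectors.map do |column|
    length = column.magnitude
    length.zero? ? column : column / length # leave all-zero columns untouched
  end
  Matrix.columns(columns.map(&:to_a))
end

# Hypothetical helper: entry [i, j] is the dot product of unit document
# vectors i and j, which is their cosine similarity.
def cosine_similarity_matrix(matrix)
  normalized = normalize_columns(matrix)
  normalized.transpose * normalized
end

# Rows are terms, columns are documents.
weights = Matrix[[1.0, 2.0, 0.0],
                 [0.0, 0.0, 3.0]]
similarity = cosine_similarity_matrix(weights)

# Documents 0 and 1 point in the same direction (similarity 1.0);
# document 2 is orthogonal to both (similarity 0.0).
puts similarity
```

The result is a square, symmetric matrix with one row and one column per document.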
#term_frequency(document, term) ⇒ Float Also known as: tf
Note:
Like Lucene, we use a b value of 0.75 and a k1 value of 1.2.
Returns the term's frequency in the document.
# File 'lib/tf-idf-similarity/bm25_model.rb', line 45
def term_frequency(document, term)
  tf = document.term_count(term)
  (tf * 2.2) / (tf + 0.3 + 0.9 * documents.size / @model.average_document_size)
end
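The constants in the listing follow from the parameters named in the note: with k1 = 1.2 and b = 0.75, k1 + 1 = 2.2, k1 * (1 - b) = 0.3, and k1 * b = 0.9. Below is a minimal sketch of the textbook BM25 term-frequency component with those parameters left explicit. `bm25_tf`, `doc_length`, and `avg_doc_length` are hypothetical names, not part of the gem. The textbook form scales by the length of the document being scored, whereas the listing above uses `documents.size`, so the two are not guaranteed to agree.

```ruby
K1 = 1.2  # term-frequency saturation parameter
B  = 0.75 # document-length normalization parameter

# Hypothetical helper: textbook BM25 term-frequency component.
# tf             - raw count of the term in the document
# doc_length     - length of the document being scored (assumption)
# avg_doc_length - average document length in the corpus
def bm25_tf(tf, doc_length, avg_doc_length, k1: K1, b: B)
  (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_length / avg_doc_length))
end

average = 100.0
puts bm25_tf(3, 50, average)  # shorter than average: boosted
puts bm25_tf(3, 100, average) # average length
puts bm25_tf(3, 200, average) # longer than average: dampened
```

For a fixed count, the value falls as the document gets longer, and for a fixed length it approaches k1 + 1 as the count grows.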
#term_frequency_inverse_document_frequency(document, term) ⇒ Float Also known as: tfidf
Returns the term frequency–inverse document frequency.
# File 'lib/tf-idf-similarity/bm25_model.rb', line 56
def term_frequency_inverse_document_frequency(document, term)
  inverse_document_frequency(term) * term_frequency(document, term)
end
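Written out from the listings on this page, the weight for term $t$ in document $d$ is the product below. The symbols are introduced here for illustration only: $N$ stands for `documents.size`, $\mathrm{df}_t$ for `@model.document_count(term)`, $f_{t,d}$ for `document.term_count(term)`, and $\overline{L}$ for `@model.average_document_size`.

```latex
\[
w_{t,d}
  = \log\!\left(\frac{N - \mathrm{df}_t + 0.5}{\mathrm{df}_t + 0.5}\right)
    \cdot
    \frac{2.2\, f_{t,d}}{f_{t,d} + 0.3 + 0.9\, N / \overline{L}}
\]
```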