Class: TfIdfSimilarity::BM25Model

Inherits:
Object
Extended by:
Forwardable
Includes:
MatrixMethods
Defined in:
lib/tf-idf-similarity/bm25_model.rb

Overview

A document-term matrix using the BM25 function.

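Instance Method Summary

  • #initialize(documents, opts = {}) ⇒ BM25Model

    Returns a new instance of BM25Model.

  • #inverse_document_frequency(term) ⇒ Float (also: #idf)

    Returns the term's inverse document frequency.

  • #similarity_matrix ⇒ GSL::Matrix, ...

    Returns a similarity matrix for the documents in the corpus.

  • #term_frequency(document, term) ⇒ Float (also: #tf)

    Returns the term's frequency in the document.

  • #term_frequency_inverse_document_frequency(document, term) ⇒ Float (also: #tfidf)

    Returns the term frequency–inverse document frequency.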

Constructor Details

#initialize(documents, opts = {}) ⇒ BM25Model

Returns a new instance of BM25Model.

Parameters:

  • documents (Array<Document>)

    the documents in the corpus

Options Hash (opts):

  • :library (Symbol)

    :gsl, :narray, :nmatrix or :matrix (default)



# File 'lib/tf-idf-similarity/bm25_model.rb', line 14

def initialize(documents, opts = {})
  @model = TfIdfSimilarity::TermCountModel.new(documents, opts)
  @library = (opts[:library] || :matrix).to_sym

  array = Array.new(terms.size) do |i|
    idf = inverse_document_frequency(terms[i])
    Array.new(documents.size) do |j|
      term_frequency(documents[j], terms[i]) * idf
    end
  end

  @matrix = initialize_matrix(array)
end
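
The constructor above builds one row per term and one column per document. A minimal sketch of that layout in plain Ruby, assuming documents are simple token arrays and using raw counts as a stand-in for the BM25 weight. The helper `build_term_document_array` and the toy corpus are hypothetical and not part of the library:

```ruby
# Hypothetical stand-in for the constructor's nested Array.new calls.
# Yields (document, term) and stores the block's result at [term][document].
def build_term_document_array(documents, terms)
  Array.new(terms.size) do |i|
    Array.new(documents.size) do |j|
      yield documents[j], terms[i]
    end
  end
end

# Toy corpus: each document is just an array of tokens (an assumption).
documents = [%w[a b a], %w[b c], %w[a c c]]
terms = documents.flatten.uniq.sort

# Raw counts stand in for term_frequency(document, term) * idf.
array = build_term_document_array(documents, terms) do |document, term|
  document.count(term)
end

# Prints one row per term, one column per document.
array.each_with_index { |row, i| puts "#{terms[i]}: #{row.inspect}" }
```

Each inner array is a term's weights across all documents, so a document's vector is a column of the resulting matrix.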

Instance Method Details

#inverse_document_frequency(term) ⇒ Float Also known as: idf

Returns the term's inverse document frequency.

Parameters:

  • term (String)

    a term

Returns:

  • (Float)

    the term's inverse document frequency



# File 'lib/tf-idf-similarity/bm25_model.rb', line 32

def inverse_document_frequency(term)
  df = @model.document_count(term)
  log((documents.size - df + 0.5) / (df + 0.5))
end
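
To see how this formula behaves, here is a standalone sketch that takes the corpus size and document frequency as plain numbers. The name `bm25_idf` and the sample values are hypothetical, chosen only for illustration:

```ruby
# Hypothetical standalone version of the formula in the listing above.
# n_docs: number of documents in the corpus
# df:     number of documents that contain the term
def bm25_idf(n_docs, df)
  Math.log((n_docs - df + 0.5) / (df + 0.5))
end

rare   = bm25_idf(10, 1) # term appears in 1 of 10 documents
common = bm25_idf(10, 9) # term appears in 9 of 10 documents

puts rare   # positive: rare terms are weighted up
puts common # negative: the ratio inside the log drops below 1
```

With this form of the formula, a term that occurs in more than half of the documents gets a negative value.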

#similarity_matrix ⇒ GSL::Matrix, ...

Note:

Columns are normalized to unit vectors, so we can calculate the cosine similarity of all document vectors.

Returns a similarity matrix for the documents in the corpus.

Returns:

  • (GSL::Matrix, NMatrix, Matrix)

    a similarity matrix



# File 'lib/tf-idf-similarity/bm25_model.rb', line 66

def similarity_matrix
  multiply_self(normalize)
end
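
The note above relies on the fact that the dot product of two unit vectors is their cosine similarity. A minimal sketch with Ruby's `matrix` library, assuming that multiplying the normalized matrix by itself means transpose times matrix. The helpers `normalize_columns` and `cosine_similarity_matrix` are hypothetical and are not the library's `normalize` and `multiply_self`:

```ruby
require 'matrix'

# Hypothetical helper: scale each column (one document) to unit length.
# All-zero columns are left unchanged to avoid dividing by zero.
def normalize_columns(matrix)
  unit_columns = matrix.column_vectors.map do |column|
    length = column.magnitude
    length.zero? ? column : column / length
  end
  Matrix.columns(unit_columns.map(&:to_a))
end

# Entry [i, j] is the dot product of unit columns i and j,
# which is the cosine similarity of documents i and j.
def cosine_similarity_matrix(matrix)
  normalized = normalize_columns(matrix)
  normalized.transpose * normalized
end

# Rows are terms, columns are documents (toy weights, not real BM25 output).
weights = Matrix[[1.0, 2.0, 0.0],
                 [0.0, 0.0, 3.0],
                 [1.0, 2.0, 0.0]]

similarity = cosine_similarity_matrix(weights)
puts similarity.to_a.map { |row| row.map { |v| v.round(3) }.inspect }
```

In this toy input the first two documents point in the same direction and the third shares no terms with them, so the result is a 3×3 matrix with ones on the diagonal, a similarity of 1 between the first two documents, and 0 against the third.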

#term_frequency(document, term) ⇒ Float Also known as: tf

Note:

Like Lucene, we use a b value of 0.75 and a k1 value of 1.2.

Returns the term's frequency in the document.

Parameters:

  • document (Document)

    a document

  • term (String)

    a term

Returns:

  • (Float)

    the term's frequency in the document



# File 'lib/tf-idf-similarity/bm25_model.rb', line 45

def term_frequency(document, term)
  tf = document.term_count(term)
  (tf * 2.2) / (tf + 0.3 + 0.9 * documents.size / @model.average_document_size)
end
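
The literals 2.2, 0.3 and 0.9 in the listing follow from the k1 and b values in the note: k1 + 1, k1 × (1 − b) and k1 × b. A standalone sketch that makes this explicit. The names `bm25_tf`, `listing_tf` and `length_ratio` are hypothetical; `length_ratio` stands for the quantity the listing computes as `documents.size / @model.average_document_size`, whereas the textbook BM25 formulation uses the document's own length divided by the average document length at that position:

```ruby
# Hypothetical general form; k1 and b default to the values in the note.
def bm25_tf(tf, length_ratio, k1: 1.2, b: 0.75)
  (tf * (k1 + 1)) / (tf + k1 * (1 - b) + k1 * b * length_ratio)
end

# The same expression with the literal constants from the listing.
def listing_tf(tf, length_ratio)
  (tf * 2.2) / (tf + 0.3 + 0.9 * length_ratio)
end

# Compare both forms for a few raw term counts at a length ratio of 1.0.
[0, 1, 3, 10].each do |tf|
  puts format('tf=%-2d general=%.4f listing=%.4f',
              tf, bm25_tf(tf, 1.0), listing_tf(tf, 1.0))
end
```

The value grows with the raw count but saturates: it never reaches k1 + 1, so repeated occurrences of a term add progressively less weight.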

#term_frequency_inverse_document_frequency(document, term) ⇒ Float Also known as: tfidf

Returns the term frequency–inverse document frequency.

Parameters:

  • document (Document)

    a document

  • term (String)

    a term

Returns:

  • (Float)

    the term frequency–inverse document frequency



# File 'lib/tf-idf-similarity/bm25_model.rb', line 56

def term_frequency_inverse_document_frequency(document, term)
  inverse_document_frequency(term) * term_frequency(document, term)
end