Class: TfIdfSimilarity::Model

Inherits:
Object
Extended by:
Forwardable
Includes:
MatrixMethods
Defined in:
lib/tf-idf-similarity/model.rb

Direct Known Subclasses

BM25Model, TfIdfModel

Instance Method Summary

Constructor Details

#initialize(documents, opts = {}) ⇒ Model

Returns a new instance of Model.

Parameters:

  • documents (Array<Document>)

    the documents in the corpus

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :library (Symbol)

    the matrix library to use: :gsl, :narray, :nmatrix or :matrix (default)



# File 'lib/tf-idf-similarity/model.rb', line 11

def initialize(documents, opts = {})
  @model = TermCountModel.new(documents, opts)
  @library = (opts[:library] || :matrix).to_sym

  array = Array.new(terms.size) do |i|
    idf = inverse_document_frequency(terms[i])
    Array.new(documents.size) do |j|
      (term_frequency(documents[j], terms[i]) * idf).to_f
    end
  end

  @matrix = initialize_matrix(array)
end
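
The nested Array.new calls above build one row per term and one column per document, with each cell holding term frequency times inverse document frequency. The following is a minimal standalone sketch of that layout, not the library's code: the token arrays stand in for Document objects, and the tf and idf lambdas use a simple raw-count and log(N / df) weighting chosen only for illustration (the subclasses define their own formulas).

```ruby
# Hypothetical stand-ins: plain token arrays instead of Document objects.
docs = [
  %w[cat sat mat],
  %w[cat cat dog],
  %w[dog barked]
]
terms = docs.flatten.uniq

# Assumed weighting, for illustration only:
# raw count for tf, log(N / document frequency) for idf.
tf  = ->(doc, term) { doc.count(term) }
idf = ->(term) { Math.log(docs.size.to_f / docs.count { |d| d.include?(term) }) }

# Same nesting as the constructor: outer array over terms, inner over documents.
rows = Array.new(terms.size) do |i|
  weight = idf.call(terms[i])
  Array.new(docs.size) { |j| (tf.call(docs[j], terms[i]) * weight).to_f }
end
```

Computing the idf once per term, outside the inner block, avoids recomputing it for every document in that row.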

Instance Method Details

#document_index(document) ⇒ Integer?

Return the index of the document in the corpus.

Parameters:

  • document (Document)

    a document

Returns:

  • (Integer, nil)

    the index of the document



# File 'lib/tf-idf-similarity/model.rb', line 52

def document_index(document)
  @model.documents.index(document)
end

#similarity_matrix ⇒ GSL::Matrix, ...

Note:

Columns are normalized to unit vectors, so we can calculate the cosine similarity of all document vectors.

Returns a similarity matrix for the documents in the corpus.

Returns:

  • (GSL::Matrix, NMatrix, Matrix)

    a similarity matrix, or an empty Array if the corpus has no documents



# File 'lib/tf-idf-similarity/model.rb', line 40

def similarity_matrix
  if documents.empty?
    []
  else
    multiply_self(normalize)
  end
end
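
The note above says that unit-length columns turn a matrix product into cosine similarities. A minimal sketch of that idea using Ruby's standard matrix library follows. It assumes that normalize scales each column to unit length and that multiply_self computes the transpose times the matrix; the weights are made up, and the sketch does not reproduce the MatrixMethods implementation.

```ruby
require 'matrix'

# Term-document matrix: rows are terms, columns are documents (made-up weights).
m = Matrix[
  [1.0, 2.0, 0.0],
  [0.0, 1.0, 0.0],
  [2.0, 0.0, 3.0]
]

# Scale each column (document vector) to unit length.
unit_columns = m.column_vectors.map(&:normalize)
normalized = Matrix.columns(unit_columns.map(&:to_a))

# Entry (i, j) is the dot product of unit columns i and j,
# which is the cosine of the angle between documents i and j.
similarity = normalized.transpose * normalized
```

The result is square and symmetric, with one row and one column per document; diagonal entries equal 1.0 up to floating-point rounding. Vector#normalize raises on an all-zero column, so a document with no weighted terms would need separate handling in this sketch.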

#term_frequency_inverse_document_frequency(document, term) ⇒ Float Also known as: tfidf

Return the term frequency–inverse document frequency.

Parameters:

  • document (Document)

    a document

  • term (String)

    a term

Returns:

  • (Float)

    the term frequency–inverse document frequency



# File 'lib/tf-idf-similarity/model.rb', line 30

def term_frequency_inverse_document_frequency(document, term)
  inverse_document_frequency(term) * term_frequency(document, term)
end

#text_index(text) ⇒ Integer?

Return the index of the document with matching text.

Parameters:

  • text (String)

    a text

Returns:

  • (Integer, nil)

    the index of the document



# File 'lib/tf-idf-similarity/model.rb', line 60

def text_index(text)
  @model.documents.index do |document|
    document.text == text
  end
end
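
Both #document_index and #text_index rely on Array#index, which is why they return nil rather than raising when nothing matches. A minimal sketch of the two call forms follows; the Doc struct is a hypothetical stand-in for Document that exposes only a text reader.

```ruby
# Hypothetical stand-in for Document: only the `text` reader is needed here.
Doc = Struct.new(:text)

documents = [Doc.new('first text'), Doc.new('second text')]

# Value form, as in #document_index: compares elements with ==.
by_object = documents.index(documents[0])

# Block form, as in #text_index: returns the first index whose block is truthy.
found   = documents.index { |document| document.text == 'second text' }

# No match yields nil, hence the Integer? return type.
missing = documents.index { |document| document.text == 'no such text' }
```

Callers that use the index to read from the similarity matrix should check for nil first.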