Class: TfIdfSimilarity::Document

Inherits:
Object
Defined in:
lib/tf-idf-similarity/document.rb,
lib/tf-idf-similarity/extras/document.rb

Constructor Details

#initialize(text, opts = {}) ⇒ Document

Returns a new instance of Document.

Parameters:

  • text (String)

    the document's text

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :id (String)

    the document's identifier

  • :tokens (Array)

    the document's tokenized text

  • :term_counts (Hash)

    the number of times each term appears

  • :size (Integer)

    the number of tokens in the document

  • :tokenizer (Object)

    the tokenizer to use (defaults to Tokenizer.new)



# File 'lib/tf-idf-similarity/document.rb', line 21

def initialize(text, opts = {})
  @text   = text
  @id     = opts[:id] || object_id
  @tokens = Array(opts[:tokens]).map { |t| Token.new(t) } if opts[:tokens]
  @tokenizer = opts[:tokenizer] || Tokenizer.new

  if opts[:term_counts]
    @term_counts = opts[:term_counts]
    @size = opts[:size] || term_counts.values.reduce(0, :+)
    # Nothing to do.
  else
    @term_counts = Hash.new(0)
    @size = 0
    set_term_counts_and_size
  end
end
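
As a rough illustration of the two branches in the constructor, the sketch below mirrors them with plain hashes rather than the library's classes. `count_terms` and `derive_size` are hypothetical helpers introduced only for this example, and whitespace splitting is an assumption standing in for the library's tokenizer, whose behavior is not shown on this page.

```ruby
# Hypothetical helper: builds term counts starting from Hash.new(0), as the
# `else` branch does. Splitting on whitespace is an assumption; the real
# tokenizer may behave differently.
def count_terms(text)
  counts = Hash.new(0)
  text.split.each { |token| counts[token] += 1 }
  counts
end

# Mirrors `opts[:size] || term_counts.values.reduce(0, :+)`:
# an explicit size wins, otherwise the counts are summed.
def derive_size(term_counts, size = nil)
  size || term_counts.values.reduce(0, :+)
end

counts = count_terms("the cat saw the dog")
puts counts["the"]            # 2
puts derive_size(counts)      # 5 (sum of all counts)
puts derive_size(counts, 10)  # 10 (explicit size is used as given)
```

Note that when `:term_counts` is supplied, an explicit `:size` is taken at face value, so it need not equal the sum of the counts.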

Instance Attribute Details

#id ⇒ Object (readonly)

The document's identifier.



# File 'lib/tf-idf-similarity/document.rb', line 7

def id
  @id
end

#size ⇒ Object (readonly)

The number of tokens in the document.



# File 'lib/tf-idf-similarity/document.rb', line 13

def size
  @size
end

#term_counts ⇒ Object (readonly)

The number of times each term appears in the document.



# File 'lib/tf-idf-similarity/document.rb', line 11

def term_counts
  @term_counts
end

#text ⇒ Object (readonly)

The document's text.



# File 'lib/tf-idf-similarity/document.rb', line 9

def text
  @text
end

Instance Method Details

#average_term_count ⇒ Float

Returns the average term count of all terms in the document.

Returns:

  • (Float)

    the average term count of all terms in the document



# File 'lib/tf-idf-similarity/extras/document.rb', line 9

def average_term_count
  @average_term_count ||= term_counts.values.reduce(0, :+) / term_counts.size.to_f
end
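
A small worked example may help here: the average is the total number of term occurrences divided by the number of unique terms, and `to_f` forces floating-point division. The standalone method below takes a plain Hash as an argument, which is an adaptation for illustration and not the library's signature.

```ruby
# Standalone version of the calculation, taking a Hash of term => count.
def average_term_count(term_counts)
  term_counts.values.reduce(0, :+) / term_counts.size.to_f
end

counts = { "the" => 2, "cat" => 1, "dog" => 1, "saw" => 1 }
puts average_term_count(counts)   # 5 occurrences / 4 unique terms = 1.25
puts average_term_count({}).nan?  # true: with no terms, 0 / 0.0 is NaN
```

As written, a document with no terms yields `NaN` instead of raising, so callers may want to check for that case.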

#maximum_term_count ⇒ Float

Returns the maximum term count of any term in the document.

Returns:

  • (Float)

    the maximum term count of any term in the document



# File 'lib/tf-idf-similarity/extras/document.rb', line 4

def maximum_term_count
  @maximum_term_count ||= term_counts.values.max.to_f
end

#term_count(term) ⇒ Integer

Returns the number of occurrences of the term in the document.

Parameters:

  • term (String)

    a term

Returns:

  • (Integer)

    the number of times the term appears in the document



# File 'lib/tf-idf-similarity/document.rb', line 49

def term_count(term)
  term_counts[term].to_i # need #to_i if unmarshalled
end
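
The `to_i` call matters when `term_counts` is a Hash without a default value of 0, which the source comment associates with unmarshalled data (the exact serialization path is not stated on this page). The sketch below, using a standalone method that takes the Hash as an argument for illustration, shows that both kinds of Hash give 0 for a missing term.

```ruby
# Standalone version of the lookup, taking the Hash explicitly.
def term_count(term_counts, term)
  term_counts[term].to_i
end

with_default = Hash.new(0)          # as built by the constructor's else branch
with_default["cat"] = 2
without_default = { "cat" => 2 }    # e.g. a Hash restored without its default

puts term_count(with_default, "dog")     # 0 (default value)
puts without_default["dog"].inspect      # nil (no default)
puts term_count(without_default, "dog")  # 0 (nil.to_i)
puts term_count(without_default, "cat")  # 2
```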

#terms ⇒ Array<String>

Returns the set of terms in the document.

Returns:

  • (Array<String>)

    the unique terms in the document



# File 'lib/tf-idf-similarity/document.rb', line 41

def terms
  term_counts.keys
end