Module: Eco::Data::FuzzyMatch::NGramsScore

Included in:
ClassMethods
Defined in:
lib/eco/data/fuzzy_match/ngrams_score.rb

Instance Method Summary

Instance Method Details

#ngrams_score(str1, str2, range: 3..5, normalized: false) ⇒ Score

Note:

This algorithm is best suited to matching sentences, or to comparing 'firstname lastname' with 'lastname firstname' combinations.

The score reflects how many of the n-grams generated from str2 also occur in str1.
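
A minimal sketch of the idea in the note above, independent of the library. The helpers `char_ngrams` and `ngram_overlap` are hypothetical names introduced here for illustration. The sketch assumes n-grams are taken per word, and it returns a plain ratio rather than a Score object.

```ruby
# Hypothetical helper (not part of the library): all character n-grams of length n.
def char_ngrams(str, n)
  return [] if str.length < n
  (0..str.length - n).map { |i| str[i, n] }
end

# Fraction of the n-grams of str2 (taken word by word) that occur anywhere in str1.
# Assumption: a simple ratio stands in for the library's Score object.
def ngram_overlap(str1, str2, n: 3)
  grams = str2.split.flat_map { |word| char_ngrams(word, n) }
  return 0.0 if grams.empty?
  matches = grams.count { |gram| str1.include?(gram) }
  matches.to_f / grams.length
end

ngram_overlap("smith john", "john smith") # => 1.0, word order does not matter
ngram_overlap("john smith", "john smyth") # => 0.4, only "joh" and "ohn" match
```

Because each n-gram of str2 is looked up anywhere in str1, swapping the order of the words leaves the result unchanged, which is why the approach suits 'firstname lastname' comparisons.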

Parameters:

  • range (Integer, Range) (defaults to: 3..5)

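    Determines the length of the generated n-grams: an Integer yields n-grams of that single length, a Range yields n-grams of every length in the range.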
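
To illustrate the difference between an Integer and a Range value for range, here is a small sketch. The helpers `char_ngrams` and `ngrams_for` are hypothetical and only approximate what the library's `word_ngrams` helper (not shown on this page) may produce for a single word.

```ruby
# Hypothetical helper (not part of the library): all character n-grams of length n.
def char_ngrams(str, n)
  return [] if str.length < n
  (0..str.length - n).map { |i| str[i, n] }
end

# Assumption: an Integer means one n-gram length, a Range means every length in it.
def ngrams_for(word, range)
  lengths = range.is_a?(Integer) ? [range] : range.to_a
  lengths.flat_map { |n| char_ngrams(word, n) }
end

ngrams_for("smith", 3)    # => ["smi", "mit", "ith"]
ngrams_for("smith", 3..5) # => ["smi", "mit", "ith", "smit", "mith", "smith"]
```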

Returns:

  • (Score)

    the score object with the result.



# File 'lib/eco/data/fuzzy_match/ngrams_score.rb', line 42

def ngrams_score(str1, str2, range: 3..5, normalized: false)
  str1, str2 = normalize_string([str1, str2]) unless normalized
  len1 = str1 && str1.length; len2 = str2 && str2.length

  Score.new(0, len1 || 0).tap do |score|
    next if !str2 || !str1
    next if str2.empty? || str1.empty?
    score.total = len1
    next score.increase(score.total) if str1 == str2
    next if str1.length < 2 || str2.length < 2

    grams     = word_ngrams(str2, range, normalized: true)
    grams_count = grams.length
    next unless grams_count > 0

    if range.is_a?(Integer)
      item_weight = score.total.to_f / grams_count
      matches     = grams.select {|gram| str1.include?(gram)}.length
      score.increase(matches * item_weight)
    else
      groups       = grams.group_by {|gram| gram.length}
      sorted_lens  = groups.keys.sort.reverse
      lens         = sorted_lens.length
      group_weight = (1.0 / lens).round(3)

      groups.each do |len, grams|
        len_max_score  = score.total * group_weight
        item_weight    = len_max_score / grams_count
        matches        = grams.select {|gram| str1.include?(gram)}.length
        #pp "(#{len}) match: #{matches} (of #{grams.length} of total #{grams_count}) || max_score: #{len_max_score} (over #{score.total})"
        score.increase(matches * item_weight)
      end
    end

  end
end

#words_ngrams_score(str1, str2, range: 3..5, normalized: false) ⇒ Score

It does the following:

  1. It splits both strings into words
  2. Pairs all words by best ngrams_score match
  3. Gives 0 score to those words of str2 that lost their pair (a word of str1 cannot be paired twice)
  4. Merges the ngrams_score of all the paired words of str2 against their str1 word pair
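
Steps 1 to 3 above can be sketched roughly as follows. This is not the library's `paired_words` implementation, which is not shown on this page. The greedy best-first pairing, the helper names (`char_ngrams`, `word_score`, `pair_words`) and the plain float scores are all assumptions made for illustration, and repeated words are not handled.

```ruby
# Hypothetical helper (not part of the library): character n-grams of one word.
def char_ngrams(word, n)
  return [] if word.length < n
  (0..word.length - n).map { |i| word[i, n] }
end

# Hypothetical per-word score: fraction of item's n-grams found in needle.
def word_score(needle, item, n: 3)
  grams = char_ngrams(item, n)
  return 0.0 if grams.empty?
  grams.count { |gram| needle.include?(gram) }.to_f / grams.length
end

# Assumption: pair greedily, best score first, using each word at most once.
def pair_words(str1, str2)
  words1 = str1.split # step 1: split both strings into words
  words2 = str2.split
  candidates = words1.product(words2).map { |w1, w2| [w1, w2, word_score(w1, w2)] }

  pairs = {}
  used1 = []
  # Step 2: take the best-scoring candidates first.
  candidates.sort_by { |_w1, _w2, score| -score }.each do |w1, w2, score|
    next if used1.include?(w1) || pairs.key?(w2)
    pairs[w2] = [w1, score]
    used1 << w1
  end

  # Step 3: words of str2 left without a partner score 0.
  words2.each { |w2| pairs[w2] ||= [nil, 0.0] }
  pairs
end

pairs = pair_words("smith john", "john smith junior")
pairs["john"]   # => ["john", 1.0]
pairs["junior"] # => [nil, 0.0]
```

Step 4, merging the per-word results into one Score, depends on the behaviour of `Score#merge!` and is left out of the sketch.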

Parameters:

  • range (Integer, Range) (defaults to: 3..5)

    Determines the length of the n-grams generated for each word.

Returns:

  • (Score)

    the score object with the result.



# File 'lib/eco/data/fuzzy_match/ngrams_score.rb', line 13

def words_ngrams_score(str1, str2, range: 3..5, normalized: false)
  str1, str2 = normalize_string([str1, str2]) unless normalized
  len1 = str1 && str1.length; len2 = str2 && str2.length

  Score.new(0, 0).tap do |score|
    next if !str2 || !str1
    next score.increase_total(len1) if str2.empty? || str1.empty?
    if str1 == str2
      score.total = len1
      score.increase(score.total)
    end
    if str1.length < 2 || str2.length < 2
      score.increase_total(len1)
    end

    pairs = paired_words(str1, str2, normalized: true) do |needle, item|
      ngrams_score(needle, item, range: range, normalized: true)
    end.each do |sub_str1, data|
      item, iscore = data
      score.merge!(iscore)
    end
  end
end