Module: Edits::Levenshtein

Defined in:
lib/edits/levenshtein.rb

Overview

Implementation of Levenshtein distance algorithm.

Determines distance between two string by counting edits, identifying:

  • Insertion
  • Deletion
  • Substitution

Class Method Summary collapse

Class Method Details

.distance(seq1, seq2) ⇒ Integer

Note:

A true distance metric, satisfies triangle inequality.

Calculate the Levenshtein (edit) distance of two sequences.

Examples:

Levenshtein.distance('sand', 'hands')
# => 2

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)

Returns:

  • (Integer)


20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
# File 'lib/edits/levenshtein.rb', line 20

def self.distance(seq1, seq2)
  if seq1.length > seq2.length
    temp = seq1
    seq1 = seq2
    seq2 = temp
  end

  # array of Integer codepoints outperforms String
  seq1 = seq1.codepoints if seq1.is_a? String
  seq2 = seq2.codepoints if seq2.is_a? String

  rows = seq1.length
  cols = seq2.length
  return cols if rows.zero?
  return rows if cols.zero?

  # Initialize first row of cost matrix.
  # The full initial state where cols=3, rows=2 would be:
  #   [[0, 1, 2, 3],
  #    [1, 0, 0, 0],
  #    [2, 0, 0, 0]]
  last_row = 0.upto(cols).to_a

  rows.times do |row|
    last_col = row + 1

    seq1_item = seq1[row]

    cols.times do |col|
      deletion = last_row[col + 1] + 1
      insertion = last_col + 1
      substitution = last_row[col] + (seq1_item == seq2[col] ? 0 : 1)

      # step cost is min of operation costs
      cost = deletion < insertion ? deletion : insertion
      cost = substitution if substitution < cost

      # overwrite previous row as we progress
      last_row[col] = last_col
      last_col = cost
    end
    last_row[cols] = last_col
  end

  last_row[cols]
end

.distance_with_max(seq1, seq2, max) ⇒ Integer

Calculate the Levenshtein (edit) distance of two sequences, bounded by a maximum value.

Examples:

Edits::Levenshtein.distance("cloud", "crayon")
# => 5
Edits::Levenshtein.distance_with_max("cloud", "crayon", 2)
# => 2

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)
  • max (Integer)

    maximum distance

Returns:

  • (Integer)


79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
# File 'lib/edits/levenshtein.rb', line 79

def self.distance_with_max(seq1, seq2, max)
  if seq1.length > seq2.length
    temp = seq1
    seq1 = seq2
    seq2 = temp
  end

  rows = seq1.length
  cols = seq2.length
  return cols if rows.zero?
  return rows if cols.zero?
  return max if (rows - cols).abs >= max

  seq1 = seq1.codepoints if seq1.is_a? String
  seq2 = seq2.codepoints if seq2.is_a? String

  last_row = 0.upto(cols).to_a

  rows.times do |row|
    last_col_cost = row + 1
    seq1_item = seq1[row]

    min_col = row > max ? row - max : 0
    max_col = row + max
    max_col = cols - 1 if max_col > cols - 1
    diagonal = cols - rows + row

    cols.times do |col|
      return max if diagonal == col && last_row[col] >= max
      col_cost =
        if col < min_col || col > max_col
          max + 1
        else
          # step cost is min of operation costs
          deletion = last_row[col + 1] + 1
          insertion = last_col_cost + 1
          substitution = last_row[col] + (seq1_item == seq2[col] ? 0 : 1)

          cost = deletion < insertion ? deletion : insertion
          substitution < cost ? substitution : cost
        end

      last_row[col] = last_col_cost
      last_col_cost = col_cost
    end

    last_row[cols] = last_col_cost
  end

  last_row[cols] > max ? max : last_row[cols]
end

.most_similar(prototype, strings) ⇒ String?

Given a prototype string and an array of strings, determines which string is most similar to the prototype.

Levenshtein.most_similar("foo", strings) is functionally equivalent to strings.min_by { |s| Levenshtein.distance("foo", s) }, leveraging distance_with_max.

Examples:

Edits::Levenshtein.most_similar("atom", %w[tram atlas rota racer])
# => "atlas"

Parameters:

  • prototype (String)
  • strings (<String>)

Returns:

  • (String, nil)

    most similar string, or nil for empty array



144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
# File 'lib/edits/levenshtein.rb', line 144

def self.most_similar(prototype, strings)
  return nil if strings.empty?
  min_s = strings[0]
  min_d = distance(prototype, min_s)

  strings[1..-1].each do |s|
    return min_s if min_d.zero?
    d = distance_with_max(prototype, s, min_d)
    if d < min_d
      min_d = d
      min_s = s
    end
  end

  min_s
end