Module: Edits::Levenshtein
- Defined in:
- lib/edits/levenshtein.rb
Overview
Implementation of Levenshtein distance algorithm.
Determines distance between two string by counting edits, identifying:
- Insertion
- Deletion
- Substitution
Class Method Summary collapse
-
.distance(seq1, seq2) ⇒ Integer
Calculate the Levenshtein (edit) distance of two sequences.
-
.distance_with_max(seq1, seq2, max) ⇒ Integer
Calculate the Levenshtein (edit) distance of two sequences, bounded by a maximum value.
-
.most_similar(prototype, strings) ⇒ String?
Given a prototype string and an array of strings, determines which string is most similar to the prototype.
Class Method Details
.distance(seq1, seq2) ⇒ Integer
A true distance metric, satisfies triangle inequality.
Calculate the Levenshtein (edit) distance of two sequences.
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 |
# File 'lib/edits/levenshtein.rb', line 20 def self.distance(seq1, seq2) if seq1.length > seq2.length temp = seq1 seq1 = seq2 seq2 = temp end # array of Integer codepoints outperforms String seq1 = seq1.codepoints if seq1.is_a? String seq2 = seq2.codepoints if seq2.is_a? String rows = seq1.length cols = seq2.length return cols if rows.zero? return rows if cols.zero? # Initialize first row of cost matrix. # The full initial state where cols=3, rows=2 would be: # [[0, 1, 2, 3], # [1, 0, 0, 0], # [2, 0, 0, 0]] last_row = 0.upto(cols).to_a rows.times do |row| last_col = row + 1 seq1_item = seq1[row] cols.times do |col| deletion = last_row[col + 1] + 1 insertion = last_col + 1 substitution = last_row[col] + (seq1_item == seq2[col] ? 0 : 1) # step cost is min of operation costs cost = deletion < insertion ? deletion : insertion cost = substitution if substitution < cost # overwrite previous row as we progress last_row[col] = last_col last_col = cost end last_row[cols] = last_col end last_row[cols] end |
.distance_with_max(seq1, seq2, max) ⇒ Integer
Calculate the Levenshtein (edit) distance of two sequences, bounded by a maximum value.
79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 |
# File 'lib/edits/levenshtein.rb', line 79 def self.distance_with_max(seq1, seq2, max) if seq1.length > seq2.length temp = seq1 seq1 = seq2 seq2 = temp end rows = seq1.length cols = seq2.length return cols if rows.zero? return rows if cols.zero? return max if (rows - cols).abs >= max seq1 = seq1.codepoints if seq1.is_a? String seq2 = seq2.codepoints if seq2.is_a? String last_row = 0.upto(cols).to_a rows.times do |row| last_col_cost = row + 1 seq1_item = seq1[row] min_col = row > max ? row - max : 0 max_col = row + max max_col = cols - 1 if max_col > cols - 1 diagonal = cols - rows + row cols.times do |col| return max if diagonal == col && last_row[col] >= max col_cost = if col < min_col || col > max_col max + 1 else # step cost is min of operation costs deletion = last_row[col + 1] + 1 insertion = last_col_cost + 1 substitution = last_row[col] + (seq1_item == seq2[col] ? 0 : 1) cost = deletion < insertion ? deletion : insertion substitution < cost ? substitution : cost end last_row[col] = last_col_cost last_col_cost = col_cost end last_row[cols] = last_col_cost end last_row[cols] > max ? max : last_row[cols] end |
.most_similar(prototype, strings) ⇒ String?
Given a prototype string and an array of strings, determines which string is most similar to the prototype.
Levenshtein.most_similar("foo", strings) is functionally equivalent to
strings.min_by { |s| Levenshtein.distance("foo", s) }, leveraging
distance_with_max.
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 |
# File 'lib/edits/levenshtein.rb', line 144 def self.most_similar(prototype, strings) return nil if strings.empty? min_s = strings[0] min_d = distance(prototype, min_s) strings[1..-1].each do |s| return min_s if min_d.zero? d = distance_with_max(prototype, s, min_d) if d < min_d min_d = d min_s = s end end min_s end |