Module: Edits::Levenshtein
- Extended by:
- Compare
- Defined in:
- lib/edits/levenshtein.rb
Overview
Implements Levenshtein distance algorithm.
Determines distance between two string by counting edits, identifying:
- Insertion
- Deletion
- Substitution
Class Method Summary collapse
-
.distance(seq1, seq2) ⇒ Integer
Calculate the Levenshtein (edit) distance of two sequences.
-
.distance_with_max(seq1, seq2, max) ⇒ Integer
Calculate the Levenshtein (edit) distance of two sequences, bounded by a maximum value.
Methods included from Compare
Class Method Details
.distance(seq1, seq2) ⇒ Integer
Note:
A true distance metric, satisfies triangle inequality.
Calculate the Levenshtein (edit) distance of two sequences.
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 |
# File 'lib/edits/levenshtein.rb', line 24 def self.distance(seq1, seq2) seq1, seq2 = seq2, seq1 if seq1.length > seq2.length # array of codepoints outperforms String seq1 = seq1.codepoints if seq1.is_a? String seq2 = seq2.codepoints if seq2.is_a? String rows = seq1.length cols = seq2.length return cols if rows == 0 return rows if cols == 0 # Initialize first row of cost matrix. # The full initial state where cols=3, rows=2 would be: # [[0, 1, 2, 3], # [1, 0, 0, 0], # [2, 0, 0, 0]] last_row = 0.upto(cols).to_a rows.times do |row| prev_col_cost = row + 1 seq1_item = seq1[row] cols.times do |col| # | Xs | Xd | # | Xi | ? | # step cost is min of operation costs # substitution, deletion, insertion cost = [ last_row[col] + (seq1_item == seq2[col] ? 0 : 1), last_row[col + 1] + 1, prev_col_cost + 1 ].min # overwrite previous row as we progress last_row[col] = prev_col_cost prev_col_cost = cost end last_row[cols] = prev_col_cost end last_row[cols] end |
.distance_with_max(seq1, seq2, max) ⇒ Integer
Calculate the Levenshtein (edit) distance of two sequences, bounded by a maximum value.
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 |
# File 'lib/edits/levenshtein.rb', line 80 def self.distance_with_max(seq1, seq2, max) seq1, seq2 = seq2, seq1 if seq1.length > seq2.length rows = seq1.length cols = seq2.length return cols > max ? max : cols if rows == 0 return rows > max ? max : rows if cols == 0 return max if (cols - rows) >= max # array of codepoints outperforms String seq1 = seq1.codepoints if seq1.is_a? String seq2 = seq2.codepoints if seq2.is_a? String # 'infinite' edit distance for padding cost matrix. # Can be any value > max[rows, cols] inf = cols + 1 # retain previous row of cost matrix last_row = 0.upto(cols).to_a rows.times do |row| # Ukkonen cut-off min_col = row > max ? row - max : 0 max_col = row + max max_col = cols - 1 if max_col > cols - 1 prev_col_cost = min_col == 0 ? row + 1 : inf seq1_item = seq1[row] diagonal = cols - rows + row min_col.upto(max_col) do |col| return max if diagonal == col && last_row[col] >= max # | Xs | Xd | # | Xi | ? | # substitution, deletion, insertion cost = [ last_row[col] + (seq1_item == seq2[col] ? 0 : 1), last_row[col + 1] + 1, prev_col_cost + 1 ].min # overwrite previous row as we progress last_row[col] = prev_col_cost prev_col_cost = cost end last_row[cols] = prev_col_cost end last_row[cols] > max ? max : last_row[cols] end |