Module: Edits::DamerauLevenshtein
- Defined in:
- lib/edits/damerau_levenshtein.rb
Overview
Implements the Damerau/Levenshtein distance algorithm.
Determines distance between two strings by counting edits, identifying:
- Insertion
- Deletion
- Substitution
- Adjacent transposition
Class Method Summary collapse
-
.distance(seq1, seq2) ⇒ Integer
Calculate the Damerau/Levenshtein distance of two sequences.
Class Method Details
.distance(seq1, seq2) ⇒ Integer
Calculate the Damerau/Levenshtein distance of two sequences.
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 |
# File 'lib/edits/damerau_levenshtein.rb', line 20 def self.distance(seq1, seq2) seq1, seq2 = seq2, seq1 if seq1.length > seq2.length # array of codepoints outperforms String if seq1.is_a?(String) && seq2.is_a?(String) seq1 = seq1.codepoints seq2 = seq2.codepoints end rows = seq1.length cols = seq2.length return cols if rows == 0 return rows if cols == 0 # 'infinite' edit distance to pad cost matrix. # Any value > max[rows, cols] inf = cols + 1 # element => last row seen row_history = Hash.new(0) # initialize alphabet-keyed cost matrix matrix = {} curr_row = 0.upto(cols).to_a rows.times do |row| seq1_item = seq1[row] match_col = 0 # rotate row arrays & generate next matrix[seq1_item] = last_row = curr_row curr_row = Array.new(cols + 1, inf) curr_row[0] = row + 1 cols.times do |col| seq2_item = seq2[col] sub_cost = seq1_item == seq2_item ? 0 : 1 # | Xs | Xd | # | Xi | ? | # substitution, deletion, insertion cost = [ last_row[col] + sub_cost, last_row[col + 1] + 1, curr_row[col] + 1 ].min # transposition cost # skip missed matrix lookup (inf cost) if sub_cost > 0 && row > 0 && (m = matrix[seq2_item]) transpose = 1 + m[match_col] \ + (row - row_history[seq2_item] - 1) \ + (col - match_col - 1) cost = transpose if transpose < cost end match_col = col if sub_cost == 0 curr_row[col + 1] = cost end row_history[seq1_item] = row end curr_row[cols] end |