Module: Edits::RestrictedEdit
- Extended by:
- Compare
- Defined in:
- lib/edits/restricted_edit.rb
Overview
Implements Restricted Damerau-Levenshtein distance (Optimal Alignment) algorithm.
Determines distance between two strings by counting edits, identifying:
- Insertion
- Deletion
- Substitution
- Adjacent transposition
This variant is restricted by the condition that no sub-string is edited more than once.
Class Method Summary collapse
-
.distance(seq1, seq2) ⇒ Integer
Calculate the Restricted Damerau-Levenshtein distance (Optimal Alignment) of two sequences.
-
.distance_with_max(seq1, seq2, max) ⇒ Integer
Calculate the Restricted Damerau-Levenshtein distance (Optimal Alignment) of two sequences, bounded by a maximum value.
Methods included from Compare
Class Method Details
.distance(seq1, seq2) ⇒ Integer
Note:
Not a true distance metric, fails to satisfy triangle inequality.
Calculate the Restricted Damerau-Levenshtein distance (Optimal Alignment) of two sequences.
28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 |
# File 'lib/edits/restricted_edit.rb', line 28 def self.distance(seq1, seq2) seq1, seq2 = seq2, seq1 if seq1.length > seq2.length # array of codepoints outperforms String seq1 = seq1.codepoints if seq1.is_a? String seq2 = seq2.codepoints if seq2.is_a? String rows = seq1.length cols = seq2.length return cols if rows == 0 return rows if cols == 0 # retain previous two rows of cost matrix lastlast_row = [] last_row = [] # Initialize first row of cost matrix. # The full initial state where cols=3, rows=2 would be: # [[0, 1, 2, 3], # [1, 0, 0, 0], # [2, 0, 0, 0]] curr_row = 0.upto(cols).to_a rows.times do |row| # rotate row arrays curr_row, last_row, lastlast_row = lastlast_row, curr_row, last_row curr_row[0] = row + 1 seq1_item = seq1[row] cols.times do |col| sub_cost = seq1_item == seq2[col] ? 0 : 1 is_swap = sub_cost > 0 && row > 0 && col > 0 && seq1_item == seq2[col - 1] && seq1[row - 1] == seq2[col] # | Xt | | | # | | Xs | Xd | # | | Xi | ? | # step cost is min of operation costs # substitution, deletion, insertion, transposition cost = [ last_row[col] + sub_cost, last_row[col + 1] + 1, curr_row[col] + 1 ].min if is_swap swap = lastlast_row[col - 1] + 1 cost = swap if swap < cost end curr_row[col + 1] = cost end end curr_row[cols] end |
.distance_with_max(seq1, seq2, max) ⇒ Integer
Calculate the Restricted Damerau-Levenshtein distance (Optimal Alignment) of two sequences, bounded by a maximum value.
100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 |
# File 'lib/edits/restricted_edit.rb', line 100 def self.distance_with_max(seq1, seq2, max) seq1, seq2 = seq2, seq1 if seq1.length > seq2.length rows = seq1.length cols = seq2.length return cols > max ? max : cols if rows == 0 return rows > max ? max : rows if cols == 0 return max if (cols - rows) >= max # array of codepoints outperforms String seq1 = seq1.codepoints if seq1.is_a? String seq2 = seq2.codepoints if seq2.is_a? String # 'infinite' edit distance for padding cost matrix. # Can be any value > max[rows, cols] inf = cols + 1 # retain previous two rows of cost matrix, # padded with "inf" as matrix is not fully evaluated lastlast_row = Array.new(inf, inf) last_row = Array.new(inf, inf) curr_row = 0.upto(cols).to_a rows.times do |row| # rotate row arrays curr_row, last_row, lastlast_row = lastlast_row, curr_row, last_row # Ukkonen cut-off min_col = row > max ? row - max : 0 max_col = row + max max_col = cols - 1 if max_col > cols - 1 curr_row[min_col] = min_col == 0 ? row + 1 : inf seq1_item = seq1[row] diagonal = cols - rows + row min_col.upto(max_col) do |col| return max if diagonal == col && last_row[col] >= max sub_cost = seq1_item == seq2[col] ? 0 : 1 is_swap = sub_cost > 0 && row > 0 && col > 0 && seq1_item == seq2[col - 1] && seq1[row - 1] == seq2[col] # | Xt | | | # | | Xs | Xd | # | | Xi | ? | # substitution, deletion, insertion, transposition cost = [ last_row[col] + sub_cost, last_row[col + 1] + 1, curr_row[col] + 1 ].min if is_swap swap = lastlast_row[col - 1] + 1 cost = swap if swap < cost end curr_row[col + 1] = cost end end curr_row[cols] > max ? max : curr_row[cols] end |