Module: Rababa::Reconcile
- Included in:
- Diacritizer
- Defined in:
- lib/rababa/reconcile.rb
Overview
ALGORITHM IDEA #
Given strings: s_original and s_diacritized
-
Build pivot map of matches, so for instance:
(24 abc, axbycz) -> [(3,0), (4,2), (5,4)]
-
Build a string from “pivots“:
a. first with the original string if needed b. then with diacritized
-
finalize by writing:
a. end of diacritics b. end of original
Instance Method Summary collapse
-
#build_pivot_map(d_original, d_diacritized) ⇒ Object
build_pivot_map: This function takes 2 strings and finds the pivot points, i.e the points where both strings are identical.
-
#reconcile_strings(str_original, str_diacritized) ⇒ Object
reconcile_strings: This function takes original and diacritized string and merge them into a sensible output.
Instance Method Details
#build_pivot_map(d_original, d_diacritized) ⇒ Object
build_pivot_map: This function takes 2 strings and finds the pivot points, i.e the points where both strings are identical. args:
d_original: dictionary modelling the original string abc -> {0:a,1:b,2:c}
d_diacritized: dictionary modelling diacritized as above
return: list of ids tuple where strings match
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
# File 'lib/rababa/reconcile.rb', line 25 def build_pivot_map(d_original, d_diacritized) l_map = [] idx_dia, idx_ori = 0, 0 while idx_dia < d_diacritized.length c_dia = d_diacritized[idx_dia] (idx_ori..d_original.length).each {|i| if (c_dia == d_original[i]) idx_ori = i l_map << [idx_dia, idx_ori] break end } idx_dia += 1 end l_map end |
#reconcile_strings(str_original, str_diacritized) ⇒ Object
reconcile_strings: This function takes original and diacritized string and merge them into a sensible output. For instance:
original string:
# گيله پسمير الجديد 34
diacritised string (with non arabic removed by the nnets preprocessing):
يَلِهُ سُمِيْرٌ الجَدِيدُ
reconcile_strings -->
'# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34'
Other examples and tests can be found in the commented section below. args:
str_original: original string
str_diacritized: diacritized string
return: reconciled string
59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 |
# File 'lib/rababa/reconcile.rb', line 59 def reconcile_strings(str_original, str_diacritized) # we model the strings as dict d_original = Hash[*str_original.chars().select \ {|n| not Rababa::ArabicConstants::HARAQAT.include? n} \ .map.with_index \ {|c, i| [i, c] }.flatten] d_diacritized = Hash[*str_diacritized.chars().map.with_index \ {|c, i| [i, c] }.flatten] # matching positions l_pivot_map = build_pivot_map(d_original, d_diacritized) str__ = '' # "accumulated" chars pt_dia, pt_ori = 0, 0 # pointers for resp diacr and orig. strings for x_dia, x_ori in l_pivot_map # We start to write characters from original strings if (pt_ori < x_ori) (pt_ori..x_ori-1).each {|i| str__ += d_original[i]} end # We then add chars from diacritized strings if (pt_dia < x_dia) (pt_dia..x_dia-1).each {|i| str__ += d_diacritized[i]} end # append matches str__ += d_original[x_ori] pt_dia, pt_ori = x_dia + 1, x_ori + 1 end # Finalize by adding first last diacritized chars and then (pt_dia..d_diacritized.length-1).each {|i| str__ += d_diacritized[i]} # remaining chars for original string (pt_ori..d_original.length-1).each {|i| str__ += d_original[i]} str__ end |