Class: Merge3

Inherits:
Object
  • Object
show all
Defined in:
lib/merge3.rb

Overview

so here comes the main algorithm. We have the class to carry the original file and it’s hash codes we need for the matching of the files

Class Method Summary collapse

Class Method Details

.all_matches(chunk) ⇒ Object

collect all matches (for the start file, so it doesn’t have to be done twice)



182
183
184
185
186
187
188
189
# File 'lib/merge3.rb', line 182

def Merge3::all_matches( chunk )  
  # order counts, so large chunks overwrites small for hash collisions
  matches =  Merge3::match(chunk,16)
  matches.update(match(chunk,32))
  matches.update(match(chunk,48))
  matches.update(match_newlines(chunk))
  matches
end

.match(chunk, around) ⇒ Object

find matches for around the given length. Exactly the length for binary files, but if there is whitespace, we use that to break within 70% of the given value



226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
# File 'lib/merge3.rb', line 226

def Merge3::match( chunk , around )
  match = {} # the hash of matches , using the strings as key, chunks as value
  index = 0 #count through the string length
  while index < chunk.length  # also possible here to do half from the 
    left = chunk.length - index   #front and half from the back (?)
    if left > 0.7*around and left <= 1.5*around   # are we at the end 
      len = chunk.length - index
    else
      sub = chunk.str[index ,  1.5*around]  # take the maximum length we want
      if (white = sub.index( /\W/ ,  0.7 * around )).nil? # and check for non char
        len = around    # either taking the whole 
      else     #or the "words" we found   
        len = white   #hoping this is better for text, so we have no allignment
      end           # issues
    end  #then create a chunk with the indexes and string
    cop = chunk.subchunk( index , len , index )
    match[cop.str] =  cop  # and store in our hash (no collision detection)
    index += len
  end   
  # now we only take one from the back,because that's often a match and will be 
  at = (chunk.length - 0.7*around).to_i
  cop = chunk.subchunk( at , (0.7*around).to_i , at  )
  match[cop.str] =  cop
  match
end

.match_newlines(chunk) ⇒ Object

find matches according to newline seperation.



192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
# File 'lib/merge3.rb', line 192

def Merge3::match_newlines( chunk )
  match = {} # the hash of matches , using the strings as key, chunk as value
  index = 0 #count through the string length
  if chunk.whitespace.empty? # whitespace compression on or not ?
    # because there are only newline if whitespace compression is off
    chunk.str.split("\n").each do |line|
      # this logic strips the whitespace off both sides of the line, in an 
      # effort to match things like changed indentation
      offset = 0 
      offset += 1 while line[offset,1] =~ /\s/
      index += offset
      line.strip!
      cop = chunk.subchunk( index , line.length   , index )
      match[cop.str] =  cop  # and store in our hash (no collision detection)
      #puts "LINE #{index} #{line.length} #{offset}--#{cop.str}--"
      index += line.length + 1  # 1 is the newline
    end
  else  #whitespace compression is on, so the whitespace table contains
    # the newlines, just have to find them
    chunk.whitespace.each do | at , str |
      #puts "###{str}##"
      next unless  str.include?( "\n" ) 
      cop = chunk.subchunk( index + 1 , at - index   , index )
      index =at
      #puts "###{cop.str}##" if cop.str.include?("Four")
      match[cop.str] =  cop  # and store in our hash (no collision detection)
    end  
  end
  match
end

.matches(chunk, size) ⇒ Object

build a hash of chunks( hash of the chunks string against the chunk). This is the backbone of the matching algorithm, so that finding a match is basically a hash lookup Matches may be of different sizes, we use 16,32,48 and -1 for line matching



173
174
175
176
177
178
179
# File 'lib/merge3.rb', line 173

def Merge3::matches( chunk , size ) 
  if size == -1 then 
    return  Merge3::match_newlines( chunk )
  else 
    return Merge3::match( chunk , size )
  end
end

.merge(one_seq, two_seq) ⇒ Object

given the Sequences of both files, attempt merging them. the chunks in the sequences are non-overlapping, so we can always work on whole chunks. In a sentence, the algorithm is now: Follow the line of change



111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
# File 'lib/merge3.rb', line 111

def Merge3::merge( one_seq , two_seq )
  # we go along the original file's order and change to the other sequence
  #  if there has been a change compared to the original file.
  out = []  # this is the array of string to be outputted, the result
  # this doesn't consider changes at the beginning of the file
  # which just goes to prove that we're researchers
  one_chunk = one_seq.chunks.first  
  two_chunk = two_seq.chunks.first 
  last_choosen  = one_chunk
  break_flag = nil
  while break_flag.nil?
    if DEBUG and one_chunk.from != two_chunk.from
      #chunks are non overlapping, so they must be the same
      raise "CHUNKS differ #{one_chunk}\n#{two_chunk}" 
    end 
    # we start with the next in the edited file order and then check where
    next_one = one_seq.get_next( one_chunk )         # the change is
    next_two = two_seq.get_next( two_chunk )
    break_flag = true  if next_one.nil? or next_two.nil? #eof 
    # so if the change is in the first file,
    if break_flag.nil? and one_chunk.org_stop != next_one.from
      raise "conflict at #{one_chunk.org_stop}" if two_chunk.org_stop != next_two.from
      # we find the one in file two that starts where it (one) ends 
      next_two = two_seq.find_from( next_one.from ) 
      puts "ONE #{one_chunk}\n    #{two_chunk} nNEXT #{next_one}\n     #{next_two}" if DEBUG
      next_choosen = next_one
      # otherwise the change must be in file two 
    elsif break_flag.nil?
      # and we look for the chunk in one, that starts where two ends
      next_one = one_seq.find_from( next_two.from ) 
      puts "TWO #{two_chunk}\n    #{one_chunk}\nNEXT #{next_two}\n     #{next_one}" if DEBUG
      next_choosen = next_two
    else 
      next_one , next_two = nil ,nil
    end
    this_del = ( one_chunk.kind_of?(Deleted) or two_chunk.kind_of?(Deleted) ) 
    # Output the associated (real==whitespace expanded) string if not deleted
    out << last_choosen.real_string unless this_del
    puts "OUT #{Chunk::short_string(one_chunk.real_string)}" if DEBUG and not this_del 
    # we ignore adds (from one file) that happen to fall in the middle of a deleted in the other
    one_del = ( one_chunk.kind_of?( Deleted) and next_one.kind_of?( Deleted) )
    two_del = ( two_chunk.kind_of?( Deleted) and next_two.kind_of?( Deleted) )  
    unless one_del or two_del
      # otherwise we output added text from both files
      puts "ADD one #{one_chunk.added}" if DEBUG and not one_chunk.added.nil? 
      out << one_chunk.added.real_string  unless one_chunk.added.nil? 
      # and if both have added text create an asymetric result, hmm
      puts "ADD two #{ two_chunk.added}" if DEBUG  and not two_chunk.added.nil?
      out << two_chunk.added.real_string unless two_chunk.added.nil?
    end
    one_chunk = next_one
    two_chunk = next_two
    last_choosen = next_choosen
  end
  puts "RESULT" , out.join , "END" if DEBUG
  return out.join    # joins the pieces and returns the merged string  
end

.three_way(start_file, one_file, two_file, ignore_white_space = false) ⇒ Object

make a three way merge, passing in strings and returning a string strings are byte arrays (not line arrays as in diff)



80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
# File 'lib/merge3.rb', line 80

def Merge3::three_way( start_file , one_file , two_file , ignore_white_space = false )
  # We also create chunks of the whole files, in case whitespace 
  # handling is on. Chunk::subchunk passed compressed whitespace on
  start_chunk = Chunk.new( 0 , start_file.length , start_file )
  one_chunk = Chunk.new( 0 , one_file.length , one_file )
  two_chunk = Chunk.new( 0 , two_file.length , two_file )
  if ignore_white_space  
    start_chunk.squeeze  # all consecutive whitespace is then squeezed 
    one_chunk.squeeze   # into one, and a talbe create to unsqueeze later
    two_chunk.squeeze  # Yes, this is expensive
  end
  # only create the macthes hash once for the start file
  start_matches = Merge3::all_matches(start_chunk)
  # Create the chunks( moves first , then added and deleted)
  one_seq = Sequence.new( one_chunk , start_chunk , start_matches ) 
  two_seq = Sequence.new( two_chunk , start_chunk , start_matches )
  puts "Diff original 0" , one_seq if DEBUG
  puts "Diff original 1" , two_seq if DEBUG
  # break all chunks into non overlapping regions (in the original file order)
  one_seq.no_overlap( two_seq )
  two_seq.no_overlap( one_seq )
  puts "Diff non-overlapping 0" , one_seq if DEBUG
  puts "Diff non-overlapping 1" , two_seq if DEBUG
  # and finally (and most simply) do the merge
  Merge3::merge( one_seq , two_seq  )
end