Class: DNA
- Inherits: Object
- Defined in: lib/langa/dna.rb
Overview
The DNA class builds a characteristic fingerprint from a Unicode character stream.
This fingerprint can be compared with the fingerprints of other test streams to
support automatic language recognition.
The fingerprint is a statistical analysis of how frequently single characters
occur. During the analysis, non-letter characters are filtered out and
uppercase letters are mapped to lowercase.
The distance between two fingerprints is the sum of the per-letter differences
in frequency.
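To make the counting step concrete, here is a minimal sketch in plain Ruby. It is not the library's implementation: `letter_counts` is a hypothetical helper, and it assumes `String#downcase` as a stand-in for the case mapping that the class loads from CaseFolding.txt.

```ruby
# Hypothetical helper (not part of DNA): count letters by code point,
# skipping non-letters and folding upper case to lower case.
def letter_counts(text)
  counts = Hash.new(0)
  text.each_char do |ch|
    next unless ch.match?(/\p{L}/)   # ignore digits, punctuation, whitespace
    counts[ch.downcase.ord] += 1     # assumption: downcase approximates case folding
  end
  counts
end

counts = letter_counts("Hello, World!")
puts counts['l'.ord]   # prints 3
puts counts['o'.ord]   # prints 2
```

Keying the hash by code point mirrors the numeric keys that appear in the serialized fingerprint.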
Constant Summary
- @@gene_map = Hash.new
Class Method Summary
-
.fill_gene_map ⇒ Object
Instance Method Summary
-
#add_gene(unicode) ⇒ Object
Add a Unicode character to the DNA chain.
-
#distance(dna) ⇒ Object
Calculate the distance between two fingerprints to measure how similar they are.
-
#feed(filename, codepage) ⇒ Object
With feed you can pass a complete file as input to the DNA object.
-
#fingerprint ⇒ Object
The fingerprint is the significant extract of a file, which is essential to the language recognition process.
-
#initialize(*parm) ⇒ DNA
constructor
Create a new DNA object.
-
#reset ⇒ Object
Reset the DNA object.
- #size ⇒ Object
-
#to_s ⇒ Object
Convert the fingerprint to a string.
-
#to_utf8 ⇒ Object
Convert the fingerprint to a UTF-8 string.
Constructor Details
#initialize(*parm) ⇒ DNA
# File 'lib/langa/dna.rb', line 91
def initialize(*parm)
  # => initialize class variable
  @@gene_map.empty? && DNA.fill_gene_map
  # => check parameters
  case parm.size
  when 0
    @dna_chain = Hash.new(0)
    # start the counter at zero so add_gene and size work without a prior reset
    @dna_size = 0
  when 1
    if parm[0].is_a?(String)
      # => create dna object from fingerprint
      @fingerprint = Hash.new
      parm[0].scan(/([^+-]+)-([^+-]+)/).each do |gene|
        # assign in two steps; the index must be known before it is used as a key
        idx, weight = gene.collect {|var| var.to_i}
        @fingerprint[idx] = weight
      end
    else
      raise ArgumentError, "wrong type of argument (String expected)"
    end
  else
    raise ArgumentError, "wrong number of argument (#{parm.size} for 0/1)"
  end
end
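The string form accepted by the constructor can be illustrated outside the class. The sketch below applies the same regular expression to the sample string shown in the to_s documentation; the variable names are assumptions made for illustration, and nothing here calls the DNA class itself.

```ruby
# Parse "code-weight+code-weight+..." into a Hash of Integer => Integer.
serialized = '101-16251+110-9918+105-7865'

pairs = serialized.scan(/([^+-]+)-([^+-]+)/).map do |code, weight|
  [code.to_i, weight.to_i]
end
parsed = pairs.to_h

p parsed   # prints {101=>16251, 110=>9918, 105=>7865} (exact spacing depends on the Ruby version)
```

Each key is a Unicode code point and each value is that character's normalized weight.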
Class Method Details
.fill_gene_map ⇒ Object
# File 'lib/langa/dna.rb', line 56
def DNA.fill_gene_map
  # => find local CaseFolding.txt
  case_fold = File.join(File.dirname(__FILE__), '..', '..', 'unicode', 'CaseFolding.txt')
  # => load upper-/lowercase mappings (File.foreach closes the file when done)
  File.foreach(case_fold) do |line|
    # Line format looks like
    # 0041; C; 0061; # LATIN CAPITAL LETTER A
    code, stat, mapp = line.gsub(/ /, '').split(';')
    # only common (C) and simple (S) foldings map one code point to one code point
    if stat=='C' || stat=='S'
      code, mapp = code.hex, mapp.hex
      @@gene_map[code] = @@gene_map[mapp] = mapp
    end
  end
  # complete mapping for use as legal character identification
  [0x130, 0x131, 0x138, 0x149, 0x180, 0x18d, 0x19b, 0x1aa,
   0x1ab, 0x1ba, 0x1bb, 0x1be, 0x1f0, 0x221, 0x234, 0x235,
   0x236, 0x237, 0x238, 0x239, 0x23a, 0x23e, 0x23f, 0x240].each { |code|
    @@gene_map[code] = code
  }
end
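As a rough illustration of how one CaseFolding.txt line turns into map entries, the sketch below runs the same split on three sample lines. `fold_entry` and the local `gene_map` are hypothetical names used only here; the sample lines follow the published file format, but the file itself is not read.

```ruby
# Hypothetical helper: return [code, mapping] for C/S lines, nil otherwise.
def fold_entry(line)
  code, status, mapping = line.gsub(/ /, '').split(';')
  return nil unless status == 'C' || status == 'S'
  [code.hex, mapping.hex]
end

gene_map = {}
[
  '0041; C; 0061; # LATIN CAPITAL LETTER A',
  '00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S',  # full folding, skipped
  '# a comment line'                                   # no status field, skipped
].each do |line|
  entry = fold_entry(line)
  next if entry.nil?
  code, mapping = entry
  # both the upper-case and the lower-case code point map to the lower-case one
  gene_map[code] = gene_map[mapping] = mapping
end

puts gene_map[0x41]   # prints 97
```

Registering the lower-case code point as its own target is what lets the map double as a test for "is this a known letter".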
Instance Method Details
#add_gene(unicode) ⇒ Object
Add a Unicode character to the DNA chain. This must be done before the fingerprint is calculated. Once the fingerprint has been calculated, you have to reset the DNA object before you can add another character.
add_gene(0x123)
# File 'lib/langa/dna.rb', line 121
def add_gene(unicode)
  raise "fingerprint already calculated, try reset first" unless @fingerprint.nil?
  if unicode > 0x0250
    # NOTE: as written, this exclusion never triggers: an Integer is never
    # === a Range, and 0x2b0..0x2af is an empty range
    @dna_chain[unicode] += 1 unless unicode === (0x2b0..0x2af)
  else
    # below 0x0250 only code points known to the gene map are counted
    @dna_chain[@@gene_map[unicode]] += 1 if @@gene_map.has_key?(unicode)
  end
  @dna_size += 1
end
#distance(dna) ⇒ Object
Calculate the distance between two fingerprints to measure how similar they are.
dna.distance(other_dna) -> distance
# File 'lib/langa/dna.rb', line 167
def distance(dna)
  fp = dna.fingerprint
  dst = 0
  @fingerprint.each do |gene|
    char, freq = gene
    # a character missing from the other fingerprint counts with its full weight
    dst += (fp.has_key?(char) ? (fp[char]-freq).abs : freq)
  end
  dst / 1000.0
end
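A small worked example may help when reading the loop above. The sketch applies the same rule to two plain hashes; `fingerprint_distance` is a hypothetical stand-alone function, and the weights are invented for illustration rather than measured from real text.

```ruby
# Hypothetical stand-alone version of the distance rule:
# sum |other - own| for shared characters, own weight for missing ones.
def fingerprint_distance(own, other)
  total = own.sum { |char, freq| other.key?(char) ? (other[char] - freq).abs : freq }
  total / 1000.0
end

# Invented weights, keyed by code point (101 = e, 116 = t, 97 = a, 110 = n).
sample_a = { 101 => 12000, 116 => 9000,  97 => 8000 }
sample_b = { 101 => 16000, 110 => 10000, 97 => 6000 }

puts fingerprint_distance(sample_a, sample_b)   # prints 15.0
puts fingerprint_distance(sample_b, sample_a)   # prints 16.0
```

The two results differ because only the characters of the receiving fingerprint are visited, so under this rule the measure is not symmetric.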
#feed(filename, codepage) ⇒ Object
# File 'lib/langa/dna.rb', line 138
def feed(filename, codepage)
  self.reset
  File.open([filename, codepage]).each_unicode {|uc| add_gene(uc) }
end
#fingerprint ⇒ Object
The fingerprint is the significant extract of a file, which is essential to the language recognition process.
# File 'lib/langa/dna.rb', line 146
def fingerprint
  if @fingerprint.nil?
    # => filter out the genes that are least significant
    filter = (@dna_chain.size > 1000) ? 100 : 10
    # => check the length of the chain, i.e. number of characters
    length = weight = 0
    @dna_chain.each { |pair| length += pair[1] }
    @size = length
    # => normalize the frequency of characters
    @fingerprint = @dna_chain.collect { |gene|
      char, freq = gene
      weight = (freq * 100000.0 / length).to_i
      (weight > filter) ? [char, weight] : nil
    }.compact.sort {|a,b| b[1]<=>a[1]}
  end
  @fingerprint
end
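The normalization can be followed by hand on a tiny input. The sketch below repeats the arithmetic of the listing on plain hashes; `normalize` is a hypothetical helper and the counts are made up, so it illustrates the scaling and the cut-off only.

```ruby
# Hypothetical helper: scale raw counts to occurrences per 100,000 characters,
# drop weights at or below the threshold, and sort by descending weight.
def normalize(counts)
  length = counts.values.sum
  threshold = counts.size > 1000 ? 100 : 10
  counts.map { |char, freq| [char, (freq * 100000.0 / length).to_i] }
        .select { |_char, weight| weight > threshold }
        .sort { |a, b| b[1] <=> a[1] }
end

small = normalize({ 110 => 3, 101 => 5, 105 => 2 })
p small    # prints [[101, 50000], [110, 30000], [105, 20000]]

# A character seen 5 times in 100,000 gets weight 5 and is filtered out.
sparse = normalize({ 101 => 99_995, 113 => 5 })
p sparse   # prints [[101, 99995]]
```

Scaling to a fixed base makes fingerprints of texts with different lengths comparable, and the threshold keeps rare characters from adding noise.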
#reset ⇒ Object
Reset the DNA object.
# File 'lib/langa/dna.rb', line 177
def reset
  @dna_chain.clear
  @dna_size = 0
  @fingerprint = nil
end
#size ⇒ Object
# File 'lib/langa/dna.rb', line 183
def size
  @dna_size
end
#to_s ⇒ Object
Convert the fingerprint to a string.
dna.to_s -> '101-16251+110-9918+105-7865+...'
# File 'lib/langa/dna.rb', line 195
def to_s
  fingerprint.collect { |gene| gene.join('-') }.join('+')
end
#to_utf8 ⇒ Object
Convert the fingerprint to a UTF-8 string.
dna.to_utf8 -> 'enirtsadhlugcmobfkwzpvüäjöyxq'
# File 'lib/langa/dna.rb', line 189
def to_utf8
  fingerprint.collect {|pair| pair[0]}.to_utf8
end
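Both string conversions can be tried on a literal fingerprint. In this sketch the array of pairs is written by hand, and `Array#pack('U*')` from core Ruby is used as an assumed stand-in for the `to_utf8` array helper called in the listing, which is not a core Ruby method and is presumably defined elsewhere in the library.

```ruby
# A hand-written fingerprint: [code point, weight] pairs, heaviest first.
pairs = [[101, 16251], [110, 9918], [105, 7865]]

# Same join logic as to_s: "code-weight" items separated by "+".
serialized = pairs.collect { |gene| gene.join('-') }.join('+')
puts serialized   # prints 101-16251+110-9918+105-7865

# Assumption: pack('U*') approximates the library's array-to-UTF-8 helper.
letters = pairs.collect { |pair| pair[0] }.pack('U*')
puts letters      # prints eni
```

The numeric form keeps the weights and can be passed back to the constructor, while the letter form only shows the ranking of characters.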