Class: Linguist::Classifier
- Inherits: Object
  - Object
  - Linguist::Classifier
- Defined in: lib/linguist/classifier.rb
Overview
Language content classifier.
Constant Summary
- CLASSIFIER_CONSIDER_BYTES = 50 * 1024
  Maximum number of bytes to consider for classification. This limit applies only at evaluation time; during training, the full content of each sample is used.
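A minimal sketch of how such a limit can be applied at evaluation time, mirroring the `blob.data[0...CLASSIFIER_CONSIDER_BYTES]` slice in `.call`. The constant and method names here are hypothetical stand-ins, not part of Linguist. Note that `String#[]` counts characters, so the slice corresponds to a byte count only for single-byte content.

```ruby
# Hypothetical stand-in for CLASSIFIER_CONSIDER_BYTES (same value as above).
CONSIDER_LIMIT = 50 * 1024

# Return only the leading part of the content, mirroring the
# blob.data[0...CLASSIFIER_CONSIDER_BYTES] slice used at evaluation time.
def head_for_classification(data, limit = CONSIDER_LIMIT)
  data[0...limit]
end

large = "x" * (60 * 1024)
puts head_for_classification(large).length            # 51200
puts head_for_classification("def hello; end").length # 14
```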
Class Method Summary
- .call(blob, possible_languages) ⇒ Object
  Public: Use the classifier to detect language of the blob.
- .classify(db, tokens, languages = nil) ⇒ Object
  Public: Guess language of data.
- .filter_vocab_by_freq!(db, min_freq) ⇒ Object
  Filter vocabulary by minimum document frequency.
- .finalize_train!(db) ⇒ Object
  Public: Finalize training.
- .get_centroids(db) ⇒ Object
- .inverse_class_freqs(db) ⇒ Object
  Compute inverse class frequency (ICF) for every term.
- .l2_norm(vec) ⇒ Object
- .l2_normalize!(vec) ⇒ Object
- .normalize_samples!(db) ⇒ Object
- .similarity(a, b) ⇒ Object
- .sort_vocab!(db) ⇒ Object
  Sort vocabulary lexicographically.
- .to_vocabulary_index_termfreq(vocab, tokens) ⇒ Object
- .to_vocabulary_index_termfreq_gaps(vocab, tokens) ⇒ Object
- .train!(db, language, data) ⇒ Object
  Public: Train classifier that data is a certain language.
Instance Method Summary
- #classify(tokens, languages) ⇒ Object
  Internal: Guess language of data.
- #initialize(db = {}) ⇒ Classifier (constructor)
  Internal: Initialize a Classifier.
Constructor Details
#initialize(db = {}) ⇒ Classifier
Internal: Initialize a Classifier.
    # File 'lib/linguist/classifier.rb', line 110
    def initialize(db = {})
      @vocabulary = db['vocabulary']
      @centroids = db['centroids']
      @icf = db['icf']
    end
Class Method Details
.call(blob, possible_languages) ⇒ Object
Public: Use the classifier to detect language of the blob.
blob - An object that quacks like a blob.
possible_languages - Array of Language objects.
Examples
Classifier.call(FileBlob.new("path/to/file"), [
Language["Ruby"], Language["Python"]
])
Returns an Array of Language objects, most probable first.
    # File 'lib/linguist/classifier.rb', line 24
    def self.call(blob, possible_languages)
      language_names = possible_languages.map(&:name)
      classify(Samples.cache, blob.data[0...CLASSIFIER_CONSIDER_BYTES], language_names).map do |name, _|
        Language[name] # Return the actual Language objects
      end
    end
.classify(db, tokens, languages = nil) ⇒ Object
Public: Guess language of data.
db - Hash of classifier tokens database.
tokens - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to (defaults to every language with a centroid in db).
Examples
Classifier.classify(db, "def hello; end")
# => [['Ruby', 0.90], ['Python', 0.2], ...]
Returns sorted Array of result pairs. Each pair contains the String language name and a Float score between 0.0 and 1.0.
    # File 'lib/linguist/classifier.rb', line 104
    def self.classify(db, tokens, languages = nil)
      languages ||= db['centroids'].keys
      new(db).classify(tokens, languages)
    end
.filter_vocab_by_freq!(db, min_freq) ⇒ Object
Filter vocabulary by minimum document frequency.
    # File 'lib/linguist/classifier.rb', line 320
    def self.filter_vocab_by_freq!(db, min_freq)
      vocabulary = db['vocabulary']

      # Get document frequencies
      docfreq = Array.new(vocabulary.size, 0)
      db['samples'].each_value do |samples|
        samples.each do |sample|
          sample.each_key do |idx|
            docfreq[idx] += 1
          end
        end
      end

      vocabulary.select! do |_, idx|
        docfreq[idx] >= min_freq
      end

      nil
    end
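The filtering step can be tried on a toy database. This is a sketch only: the method name and data are hypothetical, and unlike the listing above it returns a filtered copy instead of mutating the vocabulary. The data shapes follow the listing: the vocabulary maps term to index, and samples map language to an array of `{index => count}` hashes.

```ruby
# Sketch of document-frequency filtering (hypothetical helper, not Linguist API).
def filter_by_document_frequency(vocabulary, samples, min_freq)
  # Count the number of samples in which each vocabulary index occurs.
  docfreq = Array.new(vocabulary.size, 0)
  samples.each_value do |language_samples|
    language_samples.each do |sample|
      sample.each_key { |idx| docfreq[idx] += 1 }
    end
  end
  # Keep terms that occur in at least min_freq samples (non-mutating here).
  vocabulary.select { |_, idx| docfreq[idx] >= min_freq }
end

vocabulary = { 'def' => 0, 'end' => 1, 'lambda' => 2 }
samples = {
  'Ruby'   => [{ 0 => 2, 1 => 2 }, { 0 => 1, 1 => 1 }],
  'Python' => [{ 0 => 3, 2 => 1 }]
}
kept = filter_by_document_frequency(vocabulary, samples, 2)
puts kept.keys.inspect # ["def", "end"]
```

After filtering, the surviving indices are generally no longer contiguous; in `.finalize_train!` the call to `.sort_vocab!` that follows reassigns them.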
.finalize_train!(db) ⇒ Object
Public: Finalize training.
db - Hash classifier database object
Examples:
Classifier.finalize_train!(db)
Returns nil.
This method must be called after the last .train! call.
    # File 'lib/linguist/classifier.rb', line 75
    def self.finalize_train!(db)
      db['vocabulary'] ||= {}

      # Unset hash autoincrement
      db['vocabulary'].default_proc = nil

      db['samples'] ||= []
      filter_vocab_by_freq! db, MIN_DOCUMENT_FREQUENCY
      sort_vocab! db
      db['icf'] = inverse_class_freqs db
      normalize_samples! db
      db['centroids'] = get_centroids db
      db.delete 'samples'
      nil
    end
.get_centroids(db) ⇒ Object
    # File 'lib/linguist/classifier.rb', line 393
    def self.get_centroids(db)
      centroids = {}
      db['samples'].each do |language, samples|
        centroid = Hash.new(0.0)
        samples.each do |sample|
          sample.each do |idx, val|
            centroid[idx] += val
          end
        end
        centroid.each_key do |idx|
          centroid[idx] = centroid[idx] / samples.length
        end
        l2_normalize! centroid
        centroids[language] = centroid
      end
      centroids
    end
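As the listing shows, a centroid is the element-wise mean of a language's sample vectors, rescaled to unit length. A standalone sketch for a single language follows; `centroid_of` is a hypothetical helper written for illustration, not a Linguist method.

```ruby
# Sketch: mean of sparse {index => weight} vectors, then L2 normalization.
def centroid_of(samples)
  centroid = Hash.new(0.0)
  samples.each { |sample| sample.each { |idx, val| centroid[idx] += val } }
  # Average each accumulated component over the number of samples.
  centroid.transform_values! { |sum| sum / samples.length }
  # Rescale to unit length.
  norm = Math.sqrt(centroid.values.sum { |x| x**2 })
  centroid.transform_values! { |x| x / norm }
end

centroid = centroid_of([{ 0 => 1.0 }, { 1 => 1.0 }])
puts centroid.values.map { |x| x.round(4) }.inspect # [0.7071, 0.7071]
```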
.inverse_class_freqs(db) ⇒ Object
Compute inverse class frequency (ICF) for every term.
    # File 'lib/linguist/classifier.rb', line 363
    def self.inverse_class_freqs(db)
      icf = Array.new(db['vocabulary'].size, 0)
      db['samples'].each_value do |samples|
        terms = Set.new
        samples.each do |sample|
          terms |= sample.keys
        end
        terms.each do |idx|
          icf[idx] += 1
        end
      end
      icf.map! do |val|
        Math.log(db['samples'].size.to_f / val.to_f) + 1
      end
      icf
    end
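Read from the listing, the weight of a term is ln(N / c) + 1, where N is the number of languages (classes) and c is the number of languages in which the term occurs at least once. The sketch below reproduces that arithmetic on toy data; the method name and inputs are hypothetical.

```ruby
require 'set'

# Sketch: inverse class frequency per vocabulary index.
def inverse_class_freqs_sketch(vocab_size, samples_by_language)
  class_counts = Array.new(vocab_size, 0)
  samples_by_language.each_value do |samples|
    # Collect every index that occurs in any sample of this language.
    terms = Set.new
    samples.each { |sample| terms.merge(sample.keys) }
    terms.each { |idx| class_counts[idx] += 1 }
  end
  n = samples_by_language.size.to_f
  class_counts.map { |count| Math.log(n / count) + 1 }
end

samples = {
  'Ruby'   => [{ 0 => 1, 1 => 2 }],
  'Python' => [{ 0 => 4 }]
}
icf = inverse_class_freqs_sketch(2, samples)
puts icf.map { |x| x.round(3) }.inspect # [1.0, 1.693]
```

A term shared by every language gets the minimum weight 1.0, while a term unique to one language gets the maximum, ln(N) + 1.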
.l2_norm(vec) ⇒ Object
    # File 'lib/linguist/classifier.rb', line 296
    def self.l2_norm(vec)
      norm = vec.values.inject(0.0) { |sum, x| sum + x**2 }
      Math.sqrt(norm)
    end
.l2_normalize!(vec) ⇒ Object
    # File 'lib/linguist/classifier.rb', line 301
    def self.l2_normalize!(vec)
      norm = l2_norm(vec)
      vec.transform_values! do |value|
        value.to_f / norm
      end
      nil
    end
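The two helpers above operate on sparse vectors stored as `{index => value}` hashes. A small worked example, using hypothetical non-mutating equivalents so that the input stays intact:

```ruby
# Sketch: Euclidean length of a sparse {index => value} vector.
def l2_norm_sketch(vec)
  Math.sqrt(vec.values.sum { |x| x**2 })
end

# Sketch: copy of the vector scaled to unit length.
def l2_normalized(vec)
  norm = l2_norm_sketch(vec)
  vec.transform_values { |x| x.to_f / norm }
end

vec = { 0 => 3, 4 => 4 }
puts l2_norm_sketch(vec)                 # 5.0
puts l2_normalized(vec).values.inspect   # [0.6, 0.8]
```

A vector whose values are all zero has norm 0.0, so the division would produce NaN entries; an empty vector is left unchanged because there is nothing to transform.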
.normalize_samples!(db) ⇒ Object
    # File 'lib/linguist/classifier.rb', line 380
    def self.normalize_samples!(db)
      icf = db['icf']
      db['samples'].each_value do |samples|
        samples.each do |sample|
          sample.each do |idx, freq|
            tf = 1.0 + Math.log(freq)
            sample[idx] = tf * icf[idx]
          end
          l2_normalize! sample
        end
      end
    end
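The listing applies a sublinear term frequency, tf = 1 + ln(freq), multiplies it by the term's ICF, and then scales the sample to unit length. Because of the logarithm, a term seen 10 times weighs roughly 3.3 times a term seen once, not 10 times. A sketch on one sample, with a hypothetical method name and made-up ICF values:

```ruby
# Sketch: TF-ICF weighting of one sample, followed by L2 normalization.
# sample is {index => raw count}; icf is an Array indexed by vocabulary index.
def tf_icf_weighted(sample, icf)
  weighted = sample.to_h do |idx, freq|
    [idx, (1.0 + Math.log(freq)) * icf[idx]]
  end
  norm = Math.sqrt(weighted.values.sum { |x| x**2 })
  weighted.transform_values { |x| x / norm }
end

weights = tf_icf_weighted({ 0 => 1, 1 => 1 }, [1.0, 2.0])
puts weights.values.map { |x| x.round(3) }.inspect # [0.447, 0.894]
```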
.similarity(a, b) ⇒ Object
    # File 'lib/linguist/classifier.rb', line 309
    def self.similarity(a, b)
      sum = 0.0
      a.each_key do |idx|
        if b.key? idx
          sum += a[idx] * b[idx]
        end
      end
      sum
    end
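The method above is a sparse dot product: only indices present in both vectors contribute. When both inputs are already unit length, as the sample vector and centroids are in `#classify`, the dot product equals their cosine similarity. A sketch with a hypothetical helper name:

```ruby
# Sketch: dot product of two sparse {index => value} vectors.
def sparse_dot(a, b)
  a.sum(0.0) { |idx, value| value * b.fetch(idx, 0.0) }
end

a = { 0 => 0.6, 1 => 0.8 }  # unit length
b = { 1 => 1.0 }            # unit length, overlaps with a only at index 1
puts sparse_dot(a, b)       # 0.8
```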
.sort_vocab!(db) ⇒ Object
Sort vocabulary lexicographically.
    # File 'lib/linguist/classifier.rb', line 341
    def self.sort_vocab!(db)
      new_indices = Hash.new { |h,k| h[k] = h.length }
      db['vocabulary'].sort_by { |x| x[0] }.each do |term, idx|
        db['vocabulary'][term] = new_indices[idx]
      end
      new_indices.default_proc = nil

      db['samples'].transform_values! do |samples|
        samples.map do |sample|
          new_sample = {}
          sample.each do |idx, freq|
            new_idx = new_indices[idx]
            if not new_idx.nil?
              new_sample[new_idx] = freq
            end
          end
          new_sample
        end
      end
    end
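Besides sorting, the listing rewrites every sample so that its keys use the new indices, and it drops entries whose old index no longer has a vocabulary term (for example, after frequency filtering). The sketch below shows the same remapping on one language's samples; the helper is hypothetical and returns new objects instead of mutating its arguments.

```ruby
# Sketch: assign indices in lexicographic term order and remap samples.
def reindex_sorted(vocabulary, samples)
  old_to_new = {}
  sorted_vocab = {}
  vocabulary.sort_by { |term, _| term }.each_with_index do |(term, old_idx), new_idx|
    old_to_new[old_idx] = new_idx
    sorted_vocab[term] = new_idx
  end
  new_samples = samples.map do |sample|
    sample.each_with_object({}) do |(old_idx, freq), out|
      new_idx = old_to_new[old_idx]
      # Indices with no surviving vocabulary term are dropped.
      out[new_idx] = freq unless new_idx.nil?
    end
  end
  [sorted_vocab, new_samples]
end

# Index 1 has no term here, as if it had been filtered out earlier.
vocab, samples = reindex_sorted({ 'end' => 0, 'def' => 2 }, [{ 0 => 2, 1 => 7, 2 => 5 }])
puts vocab.keys.inspect       # ["def", "end"]
puts samples.first.size       # 2
```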
.to_vocabulary_index_termfreq(vocab, tokens) ⇒ Object
    # File 'lib/linguist/classifier.rb', line 276
    def self.to_vocabulary_index_termfreq(vocab, tokens)
      counts = Hash.new(0)
      tokens.each do |key|
        idx = vocab[key]
        counts[idx] += 1
      end
      counts
    end
.to_vocabulary_index_termfreq_gaps(vocab, tokens) ⇒ Object
    # File 'lib/linguist/classifier.rb', line 285
    def self.to_vocabulary_index_termfreq_gaps(vocab, tokens)
      counts = Hash.new(0)
      tokens.each do |key|
        if vocab.key? key
          idx = vocab[key]
          counts[idx] += 1
        end
      end
      counts
    end
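The two listings differ only in how they treat tokens missing from the vocabulary: the `_gaps` variant skips them, which suits classification against a fixed vocabulary, while the plain variant looks every token up unconditionally. A sketch of the skipping behaviour, using a hypothetical helper name:

```ruby
# Sketch: count token occurrences by vocabulary index, ignoring unknown tokens.
def termfreq_known_only(vocab, tokens)
  counts = Hash.new(0)
  tokens.each do |token|
    counts[vocab[token]] += 1 if vocab.key?(token)
  end
  counts
end

vocab = { 'def' => 0, 'end' => 1 }
counts = termfreq_known_only(vocab, %w[def hello end def])
puts counts[0]    # 2
puts counts[1]    # 1
puts counts.size  # 2
```

The token `hello` is not in the vocabulary, so it contributes nothing to the result.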
.train!(db, language, data) ⇒ Object
Public: Train classifier that data is a certain language.
db - Hash classifier database object.
language - String language of data.
data - String contents of file or Array of tokens.
Examples
Classifier.train!(db, 'Ruby', "def hello; end")
Returns nil.
Set LINGUIST_DEBUG=1, =2 or =3 to print internal statistics.
    # File 'lib/linguist/classifier.rb', line 44
    def self.train!(db, language, data)
      tokens = data
      tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)

      db['vocabulary'] ||= {}
      # Set hash to autoincremented index value
      if db['vocabulary'].default_proc.nil?
        db['vocabulary'].default_proc = proc do |hash, key|
          hash[key] = hash.length
        end
      end

      db['samples'] ||= {}
      db['samples'][language] ||= []

      termfreq = to_vocabulary_index_termfreq(db['vocabulary'], tokens)
      db['samples'][language] << termfreq

      nil
    end
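During training the vocabulary hash assigns the next free index to any term it has not seen, through its `default_proc`. The sketch below isolates that mechanism with already-tokenized input; the helper names are hypothetical, and tokenization (done by `Tokenizer.tokenize` in the listing) is left out.

```ruby
# Sketch: a vocabulary that assigns the next free index to unseen terms.
def autoincrementing_vocabulary
  Hash.new { |hash, key| hash[key] = hash.length }
end

# Sketch: {index => count} for one sample, growing the vocabulary as needed.
def term_frequencies(vocab, tokens)
  counts = Hash.new(0)
  tokens.each { |token| counts[vocab[token]] += 1 }
  counts
end

vocab = autoincrementing_vocabulary
first = term_frequencies(vocab, %w[def hello end])
second = term_frequencies(vocab, %w[def main end end])
puts vocab.keys.inspect     # ["def", "hello", "end", "main"]
puts second.values.inspect  # [1, 1, 2]
```

Setting `default_proc` back to `nil`, as `.finalize_train!` does, freezes the index assignment: unknown terms then return `nil` and are no longer added.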
Instance Method Details
#classify(tokens, languages) ⇒ Object
Internal: Guess language of data.
tokens - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to.
Returns sorted Array of result pairs. Each pair contains the String language name and a Float score between 0.0 and 1.0.
    # File 'lib/linguist/classifier.rb', line 123
    def classify(tokens, languages)
      return [] if tokens.nil? || languages.empty?
      tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)

      debug_dump_tokens(tokens) if verbosity >= 3

      vec = Classifier.to_vocabulary_index_termfreq_gaps(@vocabulary, tokens)
      vec.each do |idx, freq|
        tf = 1.0 + Math.log(freq)
        vec[idx] = tf * @icf[idx]
      end
      return [] if vec.empty?
      Classifier.l2_normalize!(vec)

      scores = {}
      languages.each do |language|
        centroid = @centroids[language]
        score = Classifier.similarity(vec, centroid)
        if score > 0.0
          scores[language] = score
        end
      end
      scores = scores.sort_by { |x| -x[1] }
      debug_dump_all_tokens(tokens, scores) if verbosity >= 2
      debug_dump_scores(scores) if verbosity >= 1
      scores
    end
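Taken together, the listing weights the known tokens with TF-ICF, normalizes the vector, scores it against each candidate centroid, and returns the positive scores in descending order. The following standalone sketch walks through those steps on a hand-made database; the method name, vocabulary, ICF values and centroids are all invented for illustration, and tokenization and debug output are omitted.

```ruby
# Sketch: rank languages by similarity between a token vector and centroids.
def classify_sketch(vocabulary, icf, centroids, tokens)
  # Count known tokens by vocabulary index; unknown tokens are skipped.
  counts = Hash.new(0)
  tokens.each { |t| counts[vocabulary[t]] += 1 if vocabulary.key?(t) }
  return [] if counts.empty?

  # TF-ICF weighting, then scaling to unit length.
  vec = counts.to_h { |idx, freq| [idx, (1.0 + Math.log(freq)) * icf[idx]] }
  norm = Math.sqrt(vec.values.sum { |x| x**2 })
  vec.transform_values! { |x| x / norm }

  # Sparse dot product against each centroid; keep positive scores only.
  scores = centroids.map do |language, centroid|
    [language, vec.sum(0.0) { |idx, w| w * centroid.fetch(idx, 0.0) }]
  end
  scores.select { |_, score| score > 0.0 }.sort_by { |_, score| -score }
end

vocabulary = { 'def' => 0, 'end' => 1, 'import' => 2 }
icf = [1.0, 1.7, 1.7]
centroids = {
  'Python' => { 0 => 0.6, 2 => 0.8 },
  'Ruby'   => { 0 => 0.6, 1 => 0.8 }
}
ranking = classify_sketch(vocabulary, icf, centroids, %w[def hello end])
puts ranking.map(&:first).inspect # ["Ruby", "Python"]
```

Input that shares no token with the vocabulary yields an empty result, matching the early `return [] if vec.empty?` in the listing.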