Class: Linguist::Classifier

Inherits:
Object
Defined in:
lib/linguist/classifier.rb

Overview

Language content classifier.

Constant Summary

CLASSIFIER_CONSIDER_BYTES = 50 * 1024

Maximum number of bytes to consider for classification. This is only used at evaluation time. During training, full content of samples is used.


Constructor Details

#initialize(db = {}) ⇒ Classifier

Internal: Initialize a Classifier.

db - Hash classifier database with 'vocabulary', 'centroids' and 'icf' entries (defaults to {}).



# File 'lib/linguist/classifier.rb', line 110

def initialize(db = {})
  @vocabulary = db['vocabulary']
  @centroids  = db['centroids']
  @icf = db['icf']
end

Class Method Details

.call(blob, possible_languages) ⇒ Object

Public: Use the classifier to detect language of the blob.

blob               - An object that quacks like a blob.
possible_languages - Array of Language objects.

Examples

Classifier.call(FileBlob.new("path/to/file"), [
  Language["Ruby"], Language["Python"]
])

Returns an Array of Language objects, most probable first.



# File 'lib/linguist/classifier.rb', line 24

def self.call(blob, possible_languages)
  language_names = possible_languages.map(&:name)
  classify(Samples.cache, blob.data[0...CLASSIFIER_CONSIDER_BYTES], language_names).map do |name, _|
    Language[name] # Return the actual Language objects
  end
end
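
The slice in .call keeps only the leading part of the blob's data. A minimal sketch of that truncation follows. The constant name CONSIDER_BYTES and the payloads are stand-ins made up for illustration, not part of Linguist.

```ruby
# Stand-in for CLASSIFIER_CONSIDER_BYTES (same value as documented above).
CONSIDER_BYTES = 50 * 1024

# Made-up payload that is 100 characters longer than the limit.
data = "a" * (CONSIDER_BYTES + 100)

# The exclusive range 0...N keeps indices 0 through N - 1.
head = data[0...CONSIDER_BYTES]
puts head.length

# String#[] counts characters, while String#byteslice counts bytes.
utf8 = "\u00e9" * 4               # 4 characters, 8 bytes in UTF-8
puts utf8[0...2].bytesize         # size in bytes of the first two characters
puts utf8.byteslice(0, 2).length  # number of characters in the first two bytes
```

Whether blob.data is a binary or a multibyte-encoded String is not stated here; for binary data the two slicing forms agree.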

.classify(db, tokens, languages = nil) ⇒ Object

Public: Guess language of data.

db        - Hash of classifier tokens database.
tokens    - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to (defaults to every language that has a centroid in db).

Examples

Classifier.classify(db, "def hello; end")
# => [['Ruby', 0.90], ['Python', 0.2], ...]

Returns an Array of result pairs sorted by descending score. Each pair contains the String language name and a Float score between 0.0 and 1.0. Languages whose score is 0.0 are omitted.



# File 'lib/linguist/classifier.rb', line 104

def self.classify(db, tokens, languages = nil)
  languages ||= db['centroids'].keys
  new(db).classify(tokens, languages)
end

.filter_vocab_by_freq!(db, min_freq) ⇒ Object

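Filter vocabulary by minimum document frequency. Terms that occur in fewer than min_freq training samples are removed from db['vocabulary']. Returns nil.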



# File 'lib/linguist/classifier.rb', line 320

def self.filter_vocab_by_freq!(db, min_freq)
  vocabulary = db['vocabulary']

  # Get document frequencies
  docfreq = Array.new(vocabulary.size, 0)
  db['samples'].each_value do |samples|
    samples.each do |sample|
      sample.each_key do |idx|
        docfreq[idx] += 1
      end
    end
  end

  vocabulary.select! do |_, idx|
    docfreq[idx] >= min_freq
  end

  nil
end
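
As a rough illustration of the document-frequency filter, the sketch below repeats the same two steps on a toy database. The terms, counts and threshold are made up; only the data layout (term to index, language to Array of index-to-count Hashes) follows the code above.

```ruby
# Toy database with made-up values.
db = {
  'vocabulary' => { 'def' => 0, 'end' => 1, 'lambda' => 2 },
  'samples' => {
    'Ruby'   => [{ 0 => 2, 1 => 2 }, { 0 => 1, 1 => 1 }],
    'Python' => [{ 0 => 1, 2 => 1 }]
  }
}

# Document frequency: in how many samples does each term index occur?
docfreq = Array.new(db['vocabulary'].size, 0)
db['samples'].each_value do |samples|
  samples.each { |sample| sample.each_key { |idx| docfreq[idx] += 1 } }
end

# Keep only the terms seen in at least min_freq samples.
min_freq = 2
kept = db['vocabulary'].select { |_, idx| docfreq[idx] >= min_freq }

p docfreq
p kept
```

The filter touches only the vocabulary. Sample entries for removed indices stay in place until .sort_vocab! rebuilds the samples and drops them.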

.finalize_train!(db) ⇒ Object

Public: Finalize training.

db - Hash classifier database object

Examples

Classifier.finalize_train!(db)

Returns nil.

This method must be called after the last .train! call.



# File 'lib/linguist/classifier.rb', line 75

def self.finalize_train!(db)
  db['vocabulary'] ||= {}

  # Unset hash autoincrement
  db['vocabulary'].default_proc = nil

  db['samples'] ||= {} # Hash of language => samples, as in train!; an Array has no each_value
  filter_vocab_by_freq! db, MIN_DOCUMENT_FREQUENCY
  sort_vocab! db
  db['icf'] = inverse_class_freqs db
  normalize_samples! db
  db['centroids'] = get_centroids db
  db.delete 'samples'
  nil
end

.get_centroids(db) ⇒ Object

Compute one centroid per language: the mean of that language's sample vectors, L2-normalized. Returns a Hash of language name to centroid Hash.



# File 'lib/linguist/classifier.rb', line 393

def self.get_centroids(db)
  centroids = {}
  db['samples'].each do |language, samples|
    centroid = Hash.new(0.0)
    samples.each do |sample|
      sample.each do |idx, val|
        centroid[idx] += val
      end
    end
    centroid.each_key do |idx|
      centroid[idx] = centroid[idx] / samples.length
    end
    l2_normalize! centroid
    centroids[language] = centroid
  end
  centroids
end
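
A small worked example of the centroid computation, using two made-up unit vectors for a single language. It mirrors the steps above (sum, divide by the sample count, rescale to unit length) in standalone form.

```ruby
# Two made-up, already normalized sample vectors (index => weight).
samples = [{ 0 => 1.0 }, { 1 => 1.0 }]

# Sum the vectors component by component.
centroid = Hash.new(0.0)
samples.each { |sample| sample.each { |idx, val| centroid[idx] += val } }

# Mean of the samples: each component becomes 0.5.
centroid.transform_values! { |val| val / samples.length }

# Rescale to unit length so that a dot product with it is a cosine.
norm = Math.sqrt(centroid.values.sum { |x| x**2 })
centroid.transform_values! { |val| val / norm }

p centroid
```

Both components end up at 1/sqrt(2), about 0.7071. Since the final step rescales to unit length, the division by the sample count does not affect the result.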

.inverse_class_freqs(db) ⇒ Object

Compute inverse class frequency (ICF) for every term.



# File 'lib/linguist/classifier.rb', line 363

def self.inverse_class_freqs(db)
  icf = Array.new(db['vocabulary'].size, 0)
  db['samples'].each_value do |samples|
    terms = Set.new
    samples.each do |sample|
      terms |= sample.keys
    end
    terms.each do |idx|
      icf[idx] += 1
    end
  end
  icf.map! do |val|
    Math.log(db['samples'].size.to_f / val.to_f) + 1
  end
  icf
end
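
The formula above is ICF = ln(N / cf) + 1, where N is the number of languages and cf is the number of languages in which the term occurs. A minimal sketch with made-up samples, using uniq where the original uses a Set:

```ruby
# Made-up samples: index 0 occurs in both languages, index 1 only in Ruby.
samples = {
  'Ruby'   => [{ 0 => 1, 1 => 3 }, { 0 => 2 }],
  'Python' => [{ 0 => 1 }]
}
vocab_size = 2

# Count, for every term index, the languages that contain it at least once.
class_count = Array.new(vocab_size, 0)
samples.each_value do |docs|
  docs.flat_map(&:keys).uniq.each { |idx| class_count[idx] += 1 }
end

# ICF = ln(N / cf) + 1, so a term found in every language scores exactly 1.0.
icf = class_count.map { |cf| Math.log(samples.size.to_f / cf) + 1 }

p class_count
p icf
```

A term shared by all languages gets the minimum weight of 1.0, while a term unique to one of the two languages gets ln(2) + 1.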

.l2_norm(vec) ⇒ Object

Compute the L2 (Euclidean) norm of a sparse vector stored as a Hash of index to value.



# File 'lib/linguist/classifier.rb', line 296

def self.l2_norm(vec)
  norm = vec.values.inject(0.0) { |sum, x| sum + x**2 }
  Math.sqrt(norm)
end

.l2_normalize!(vec) ⇒ Object

Scale a sparse vector in place to unit L2 norm. Returns nil.



# File 'lib/linguist/classifier.rb', line 301

def self.l2_normalize!(vec)
  norm = l2_norm(vec)
  vec.transform_values! do |value|
    value.to_f / norm
  end
  nil
end
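
A quick standalone check of the two norm helpers on a made-up 3-4-5 vector. The sketch uses the non-mutating transform_values so the input stays visible.

```ruby
# Sparse vector (index => value) with made-up entries.
vec = { 0 => 3.0, 5 => 4.0 }

# L2 norm: square root of the sum of squares.
norm = Math.sqrt(vec.values.sum { |x| x**2 })

# Dividing every component by the norm yields a unit-length vector.
unit = vec.transform_values { |value| value / norm }
unit_norm = Math.sqrt(unit.values.sum { |x| x**2 })

p norm
p unit
p unit_norm
```

An empty Hash is left unchanged, because there is nothing to divide. A non-empty vector of zeros would produce NaN components, since 0.0 / 0.0 is NaN in Ruby. The instance method #classify returns early when the vector is empty.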

.normalize_samples!(db) ⇒ Object

Weight every sample in place: each raw count becomes (1 + ln(count)) multiplied by the term's ICF, and the sample is then L2-normalized.



# File 'lib/linguist/classifier.rb', line 380

def self.normalize_samples!(db)
  icf = db['icf']
  db['samples'].each_value do |samples|
    samples.each do |sample|
      sample.each do |idx, freq|
        tf = 1.0 + Math.log(freq)
        sample[idx] = tf * icf[idx]
      end
      l2_normalize! sample
    end
  end
end
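
The weighting applied to each sample can be traced by hand. The sketch below uses made-up ICF values and counts, and builds new Hashes where the original mutates in place.

```ruby
icf    = [1.0, 2.0]          # made-up ICF values, indexed by term
sample = { 0 => 1, 1 => 3 }  # raw term counts for one sample

# Sublinear term frequency (1 + ln(count)) scaled by the term's ICF.
weighted = sample.map { |idx, freq| [idx, (1.0 + Math.log(freq)) * icf[idx]] }.to_h

# Rescale the weighted sample to unit length.
norm = Math.sqrt(weighted.values.sum { |x| x**2 })
unit = weighted.transform_values { |value| value / norm }

p weighted
p unit
```

A count of 1 gives a term frequency of exactly 1.0 because ln(1) is 0, and larger counts grow only logarithmically, which keeps one very frequent token from dominating the vector.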

.similarity(a, b) ⇒ Object

Compute the dot product of two sparse vectors over the indices they share. For L2-normalized vectors this is the cosine similarity.



# File 'lib/linguist/classifier.rb', line 309

def self.similarity(a, b)
  sum = 0.0
  a.each_key do |idx|
    if b.key? idx
      sum += a[idx] * b[idx]
    end
  end
  sum
end
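
A compact standalone equivalent of the sparse dot product, with made-up vectors. The helper name sparse_dot is hypothetical.

```ruby
# Dot product over the indices present in both Hashes (hypothetical helper).
def sparse_dot(a, b)
  a.sum(0.0) { |idx, val| b.key?(idx) ? val * b[idx] : 0.0 }
end

a = { 0 => 0.6, 1 => 0.8 }  # unit length
b = { 1 => 1.0 }            # shares index 1 with a
c = { 2 => 1.0 }            # shares nothing with a

puts sparse_dot(a, b)
puts sparse_dot(a, c)
puts sparse_dot(a, a)
```

Vectors with no index in common score 0.0, and a unit vector compared with itself scores 1.0 up to floating-point rounding. Since the term weights are never negative, scores for normalized vectors fall between 0.0 and 1.0.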

.sort_vocab!(db) ⇒ Object

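Sort vocabulary lexicographically and re-index the samples to match. Sample entries whose index is no longer in the vocabulary are dropped.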



# File 'lib/linguist/classifier.rb', line 341

def self.sort_vocab!(db)
  new_indices = Hash.new { |h,k| h[k] = h.length }
  db['vocabulary'].sort_by { |x| x[0] }.each do |term, idx|
    db['vocabulary'][term] = new_indices[idx]
  end
  new_indices.default_proc = nil

  db['samples'].transform_values! do |samples|
    samples.map do |sample|
      new_sample = {}
      sample.each do |idx, freq|
        new_idx = new_indices[idx]
        if not new_idx.nil?
          new_sample[new_idx] = freq
        end
      end
      new_sample
    end
  end
end
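
The re-indexing can be followed on a toy vocabulary. This sketch uses made-up terms and builds an explicit old-to-new index map instead of the autoincrementing Hash used above.

```ruby
# Made-up vocabulary whose indices are not in lexicographic order.
vocab = { 'end' => 0, 'zip' => 1, 'def' => 2 }

# Map each old index to its position in the sorted term list.
remap = {}
vocab.sort_by { |term, _| term }.each_with_index do |(_, old_idx), new_idx|
  remap[old_idx] = new_idx
end

# The sorted vocabulary: term => new index.
sorted_vocab = vocab.keys.sort.each_with_index.to_h

# Re-index one sample; index 7 is not in the vocabulary and is dropped.
sample = { 0 => 4, 2 => 1, 7 => 9 }
new_sample = {}
sample.each do |old_idx, freq|
  new_sample[remap[old_idx]] = freq if remap.key?(old_idx)
end

p remap
p sorted_vocab
p new_sample
```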

.to_vocabulary_index_termfreq(vocab, tokens) ⇒ Object

Count tokens by vocabulary index. Every token is looked up with vocab[token], so with the autoincrementing vocabulary set up by .train!, unseen tokens are assigned new indices.



# File 'lib/linguist/classifier.rb', line 276

def self.to_vocabulary_index_termfreq(vocab, tokens)
  counts = Hash.new(0)
  tokens.each do |key|
    idx = vocab[key]
    counts[idx] += 1
  end
  counts
end

.to_vocabulary_index_termfreq_gaps(vocab, tokens) ⇒ Object

Count tokens by vocabulary index, skipping tokens that are not in the vocabulary.



# File 'lib/linguist/classifier.rb', line 285

def self.to_vocabulary_index_termfreq_gaps(vocab, tokens)
  counts = Hash.new(0)
  tokens.each do |key|
    if vocab.key? key
      idx = vocab[key]
      counts[idx] += 1
    end
  end
  counts
end
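
The two counting helpers differ only in how they treat unknown tokens. The sketch below, with made-up tokens, contrasts a training-time lookup against an autoincrementing vocabulary with an evaluation-time lookup that skips gaps.

```ruby
# Training time: the vocabulary assigns the next free index to unseen terms.
vocab = Hash.new { |hash, key| hash[key] = hash.length }
train_counts = Hash.new(0)
%w[def hello end def].each { |token| train_counts[vocab[token]] += 1 }

# Evaluation time: the default proc is removed and unknown tokens are skipped.
vocab.default_proc = nil
eval_counts = Hash.new(0)
%w[def puts end].each do |token|
  eval_counts[vocab[token]] += 1 if vocab.key?(token)
end

p vocab
p train_counts
p eval_counts
```

Without the key? guard, looking up an unknown token in a vocabulary that has no default proc returns nil, so the count would be stored under a nil key.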

.train!(db, language, data) ⇒ Object

Public: Train the classifier with data known to be in a given language.

db       - Hash classifier database object.
language - String language of data.
data     - String contents of file or Array of tokens.

Examples

Classifier.train!(db, 'Ruby', "def hello; end")

Returns nil.

Set LINGUIST_DEBUG=1, =2 or =3 to print internal statistics.



# File 'lib/linguist/classifier.rb', line 44

def self.train!(db, language, data)
  tokens = data
  tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)

  db['vocabulary'] ||= {}
  # Set hash to autoincremented index value
  if db['vocabulary'].default_proc.nil?
    db['vocabulary'].default_proc = proc do |hash, key|
      hash[key] = hash.length
    end
  end

  db['samples'] ||= {}
  db['samples'][language] ||= []

  termfreq = to_vocabulary_index_termfreq(db['vocabulary'], tokens)
  db['samples'][language] << termfreq

  nil
end

Instance Method Details

#classify(tokens, languages) ⇒ Object

Internal: Guess language of data.

tokens    - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to.

Returns an Array of result pairs sorted by descending score. Each pair contains the String language name and a Float score between 0.0 and 1.0. Languages whose score is 0.0 are omitted.



# File 'lib/linguist/classifier.rb', line 123

def classify(tokens, languages)
  return [] if tokens.nil? || languages.empty?
  tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)

  debug_dump_tokens(tokens) if verbosity >= 3

  vec = Classifier.to_vocabulary_index_termfreq_gaps(@vocabulary, tokens)
  vec.each do |idx, freq|
    tf = 1.0 + Math.log(freq)
    vec[idx] = tf * @icf[idx]
  end
  return [] if vec.empty?
  Classifier.l2_normalize!(vec)

  scores = {}
  languages.each do |language|
    centroid = @centroids[language]
    score = Classifier.similarity(vec, centroid)
    if score > 0.0
      scores[language] = score
    end
  end
  scores = scores.sort_by { |x| -x[1] }
  debug_dump_all_tokens(tokens, scores) if verbosity >= 2
  debug_dump_scores(scores) if verbosity >= 1
  scores
end
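
To see how training, finalization and classification fit together, here is a self-contained sketch of the same idea on made-up snippets. It is not the Linguist implementation: the whitespace tokenizer, the helper names and the training data are assumptions, and the vocabulary filtering and sorting steps are left out.

```ruby
# Whitespace tokenizer: an assumption standing in for Linguist's Tokenizer.
def tokenize(text)
  text.split
end

# Apply tf = 1 + ln(count), scale by ICF, then rescale to unit length.
def weigh_and_normalize(counts, icf)
  vec = counts.map { |idx, freq| [idx, (1.0 + Math.log(freq)) * icf[idx]] }.to_h
  norm = Math.sqrt(vec.values.sum { |x| x * x })
  vec.transform_values { |v| v / norm }
end

# Made-up training snippets.
training = {
  'Ruby'   => ['def hello puts end', 'def bye end'],
  'Python' => ['def hello print', 'import os print']
}

# Training: autoincrementing vocabulary plus per-sample term counts.
vocab = Hash.new { |hash, key| hash[key] = hash.length }
samples = training.transform_values do |docs|
  docs.map do |doc|
    tokenize(doc).each_with_object(Hash.new(0)) { |tok, counts| counts[vocab[tok]] += 1 }
  end
end
vocab.default_proc = nil # stop assigning indices to new terms

# ICF per term: ln(number of languages / languages containing the term) + 1.
class_count = Array.new(vocab.size, 0)
samples.each_value do |docs|
  docs.flat_map(&:keys).uniq.each { |idx| class_count[idx] += 1 }
end
icf = class_count.map { |cf| Math.log(samples.size.to_f / cf) + 1 }

# Centroids: sum of the unit sample vectors, rescaled to unit length.
# Dividing by the sample count first would not change the normalized result.
centroids = samples.transform_values do |docs|
  sum = Hash.new(0.0)
  docs.each { |doc| weigh_and_normalize(doc, icf).each { |idx, v| sum[idx] += v } }
  norm = Math.sqrt(sum.values.sum { |x| x * x })
  sum.transform_values { |v| v / norm }
end

# Classification: weight the known tokens, then rank languages by dot product.
def classify(text, vocab, icf, centroids)
  counts = Hash.new(0)
  tokenize(text).each { |tok| counts[vocab[tok]] += 1 if vocab.key?(tok) }
  return [] if counts.empty?

  vec = weigh_and_normalize(counts, icf)
  scores = centroids.map do |language, centroid|
    [language, vec.sum(0.0) { |idx, v| v * centroid.fetch(idx, 0.0) }]
  end
  scores.select { |_, score| score > 0.0 }.sort_by { |_, score| -score }
end

ranking = classify('def greet puts end', vocab, icf, centroids)
ranking.each { |language, score| puts format('%-6s %.3f', language, score) }
```

The query shares def, puts and end with the Ruby snippets but only def with the Python ones, so Ruby ranks first. The unknown token greet is ignored, and a query made up entirely of unknown tokens yields an empty Array.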