Module: OpenTox::Algorithm::Similarity

Defined in:
lib/utils.rb

Overview

Similarity calculations

Class Method Summary

Class Method Details

.cosine(fingerprints_a, fingerprints_b, weights = nil) ⇒ Float

Cosine similarity

Parameters:

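  • fingerprints_a (Hash, Array)

    key-value properties (or plain values) of first compound

  • fingerprints_b (Hash, Array)

    key-value properties (or plain values) of second compound

  • weights (defaults to: nil)

    optional; not used in the method body shown below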

Returns:

  • (Float)

    cosine of the angle between the vectors induced by the keys present in both a and b; 0.0 if those vectors are empty (for hashes, when fewer than two keys are shared)



# File 'lib/utils.rb', line 291

def self.cosine(fingerprints_a,fingerprints_b,weights=nil)

  # start with empty vectors, so that unsupported argument types
  # fall through to 0.0 instead of raising NoMethodError on nil
  a = []; b = []

  # fingerprints are hashes: compare only the features present in both
  if fingerprints_a.class == Hash && fingerprints_b.class == Hash
    common_features = fingerprints_a.keys & fingerprints_b.keys
    if common_features.size > 1
      common_features.each do |p|
        a << fingerprints_a[p]
        b << fingerprints_b[p]
      end
    end

  # fingerprints are arrays: compare position by position
  elsif fingerprints_a.class == Array && fingerprints_b.class == Array
    a = fingerprints_a
    b = fingerprints_b
  end

  (a.size > 0 && b.size > 0) ? self.cosine_num(a.to_gv, b.to_gv) : 0.0

end
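
The hash branch above can be illustrated without GSL. The following is a minimal sketch in plain Ruby, not the library's implementation: cosine_sketch is a hypothetical name, the feature keys are made-up sample data, and the zero-norm guard is an addition that the original method does not have. It also skips the 12-component truncation performed by cosine_num.

```ruby
# Hypothetical standalone helper; not part of OpenTox.
def cosine_sketch(fp_a, fp_b)
  common = fp_a.keys & fp_b.keys
  # Mirror the documented method: fewer than two shared keys yields 0.0.
  return 0.0 if common.size < 2

  a = common.map { |k| fp_a[k].to_f }
  b = common.map { |k| fp_b[k].to_f }

  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })

  # Assumption: treat zero vectors as "no similarity" instead of dividing by zero.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Made-up sample fingerprints; only "C=O" and "c1ccccc1" are shared.
fp_a = { "C=O" => 1.0, "c1ccccc1" => 2.0, "N" => 3.0 }
fp_b = { "C=O" => 2.0, "c1ccccc1" => 4.0, "Cl" => 1.0 }

puts cosine_sketch(fp_a, fp_b)
```

Because only the shared keys enter the vectors, the two sample compounds compare as [1, 2] against [2, 4], which point in the same direction; features present in only one compound have no influence on the result.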

.cosine_num(a, b) ⇒ Float

Cosine similarity

Parameters:

  • a (GSL::Vector)
  • b (GSL::Vector)

Returns:

  • (Float)

    cosine of angle enclosed between a and b



# File 'lib/utils.rb', line 319

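def self.cosine_num(a, b)
  # if both vectors have more than 12 components, only the first 12 are compared
  if a.size>12 && b.size>12
    a = a[0..11]
    b = b[0..11]
  end
  # dot product divided by the product of the Euclidean norms;
  # a zero norm is not guarded against
  a.dot(b) / (a.norm * b.norm)
end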
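
One consequence of the truncation step is easy to miss, so here is a small sketch of it. It uses plain Ruby arrays as a stand-in for GSL::Vector; cosine_num_sketch is a hypothetical name and the input values are made up for illustration.

```ruby
# Hypothetical plain-Ruby stand-in for cosine_num (the original takes GSL::Vector).
def cosine_num_sketch(a, b)
  # Same rule as the method above: keep only the first 12 components
  # when both inputs are longer than 12.
  if a.size > 12 && b.size > 12
    a = a[0..11]
    b = b[0..11]
  end
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Two 14-component vectors that differ only at positions 12 and 13.
a = Array.new(12, 1.0) + [5.0, 0.0]
b = Array.new(12, 1.0) + [0.0, 5.0]

puts cosine_num_sketch(a, b)                   # differences beyond index 11 are ignored
puts cosine_num_sketch([1.0, 0.0], [0.0, 1.0]) # short vectors are compared in full
```

With inputs like these, the trailing components never reach the dot product, so the two vectors are scored as if they were identical.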

.outliers(params) ⇒ Object

Outlier detection based on Mahalanobis distances. Multivariate detection on X, univariate detection on y. Uses an existing RinRuby instance, if possible.

Parameters:

  • params (Hash)

    keys query_matrix, data_matrix, and acts are required; r and p_outlier are optional

Returns:

  • (Array)

    indices identifying outliers (an index may occur several times; this is intended)



# File 'lib/utils.rb', line 333

def self.outliers(params)
  outlier_array = []
  data_matrix = params[:data_matrix]
  query_matrix = params[:query_matrix]
  acts = params[:acts]
  begin
    LOGGER.debug "Outliers (p=#{params[:p_outlier] || 0.999})..." # same default as the threshold set in R below
    r = ( params[:r] || RinRuby.new(false,false) )
    r.eval "suppressPackageStartupMessages(library(\"robustbase\"))"
    r.eval "outlier_threshold = #{params[:p_outlier] || 0.999}"
    nr_cases, nr_features = data_matrix.to_a.size, data_matrix.to_a[0].size
    r.odx = data_matrix.to_a.flatten
    r.q = query_matrix.to_a.flatten
    r.y = acts.to_a.flatten
    r.eval "odx = matrix(odx, #{nr_cases}, #{nr_features}, byrow=T)"
    r.eval 'odx = rbind(q,odx)' # query is nr 0 (1) in ruby (R)
    r.eval 'mah = covMcd(odx)$mah' # run MCD alg
    r.eval "mah = pchisq(mah,#{nr_features})"
    r.eval 'outlier_array = which(mah>outlier_threshold)'  # multivariate outliers using robust mahalanobis
    outlier_array = r.outlier_array.to_a.collect{|v| v-2 }  # translate to ruby index (-1 for q, -1 due to ruby)
    r.eval 'fqu = matrix(summary(y))[2]'
    r.eval 'tqu = matrix(summary(y))[5]'
    r.eval 'outlier_array = which(y>(tqu+1.5*IQR(y)))'     # univariate outliers due to Tukey (http://goo.gl/mwzNH)
    outlier_array += r.outlier_array.to_a.collect{|v| v-1 } # translate to ruby index (-1 due to ruby)
    r.eval 'outlier_array = which(y<(fqu-1.5*IQR(y)))'
    outlier_array += r.outlier_array.to_a.collect{|v| v-1 }
  rescue Exception => e
    LOGGER.debug "#{e.class}: #{e.message}"
    #LOGGER.debug "Backtrace:\n\t#{e.backtrace.join("\n\t")}"
  end
  outlier_array
end
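
The univariate part of the method (Tukey's fences on y) can be sketched in plain Ruby without an R session. This is an illustration under stated assumptions, not the library's code: quantile and tukey_outlier_indices are hypothetical names, the quartiles use linear interpolation between order statistics (R's default quantile type 7), and the activities are made-up sample data. Unlike the original, the sketch returns indices in input order and reports each index once.

```ruby
# Hypothetical helper: linearly interpolated quantile of an already sorted array.
def quantile(sorted, p)
  h  = (sorted.size - 1) * p
  lo = h.floor
  hi = h.ceil
  sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo])
end

# Hypothetical helper: 0-based indices of values outside Tukey's fences.
def tukey_outlier_indices(y)
  return [] if y.empty?

  sorted = y.sort
  q1  = quantile(sorted, 0.25)
  q3  = quantile(sorted, 0.75)
  iqr = q3 - q1

  upper = q3 + 1.5 * iqr
  lower = q1 - 1.5 * iqr

  y.each_index.select { |i| y[i] > upper || y[i] < lower }
end

# Made-up activities; the last value lies far above the rest.
acts = [1.0, 1.2, 0.9, 1.1, 1.0, 9.5]

p tukey_outlier_indices(acts)
```

The indices are 0-based from the start, so the "-1 due to ruby" translation in the method above is not needed here. The multivariate step (robust Mahalanobis distances via covMcd) is deliberately left out, since it depends on the robustbase package in R.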

.tanimoto(fingerprints_a, fingerprints_b, weights = nil, params = nil) ⇒ Float

Tanimoto similarity

Parameters:

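  • fingerprints_a (Hash, Array)

    fingerprints of first compound

  • fingerprints_b (Hash, Array)

    fingerprints of second compound

  • weights (defaults to: nil)

    optional; not used in the method body shown below

  • params (defaults to: nil)

    optional; not used in the method body shown below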

Returns:

  • (Float)

    (Weighted) tanimoto similarity



# File 'lib/utils.rb', line 258

def self.tanimoto(fingerprints_a,fingerprints_b,weights=nil,params=nil)

  common_p_sum = 0.0
  all_p_sum = 0.0

  # fingerprints are hashes
  if fingerprints_a.class == Hash && fingerprints_b.class == Hash
    common_features = fingerprints_a.keys & fingerprints_b.keys
    all_features = (fingerprints_a.keys + fingerprints_b.keys).uniq
    if common_features.size > 0
      common_features.each{ |f| common_p_sum += [ fingerprints_a[f], fingerprints_b[f] ].min }
      all_features.each{ |f| all_p_sum += [ fingerprints_a[f],fingerprints_b[f] ].compact.max } # compact, since one fp may be empty at that pos
    end

  # fingerprints are arrays
  elsif fingerprints_a.class == Array && fingerprints_b.class == Array
    size = [ fingerprints_a.size, fingerprints_b.size ].min
    LOGGER.warn "fingerprints don't have equal size" if fingerprints_a.size != fingerprints_b.size
    (0...size).each { |idx|
      common_p_sum += [ fingerprints_a[idx], fingerprints_b[idx] ].min
      all_p_sum += [ fingerprints_a[idx], fingerprints_b[idx] ].max
    }
  end

  (all_p_sum > 0.0) ? (common_p_sum/all_p_sum) : 0.0

end
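
For hash fingerprints, the computation above reduces to the sum of per-feature minima over the shared features, divided by the sum of per-feature maxima over all features. A minimal standalone sketch follows; tanimoto_sketch is a hypothetical name and the fingerprints are made-up sample data, so treat it as an illustration and not as the library's implementation.

```ruby
# Hypothetical standalone helper; not part of OpenTox.
def tanimoto_sketch(fp_a, fp_b)
  common = fp_a.keys & fp_b.keys
  # Mirror the documented method: no shared features means similarity 0.0.
  return 0.0 if common.empty?

  all = fp_a.keys | fp_b.keys

  # Shared features contribute their smaller value to the numerator.
  common_sum = common.sum { |f| [fp_a[f], fp_b[f]].min.to_f }
  # Every feature contributes its larger value to the denominator;
  # compact drops the nil of the fingerprint that lacks the feature.
  all_sum = all.sum { |f| [fp_a[f], fp_b[f]].compact.max.to_f }

  all_sum > 0.0 ? common_sum / all_sum : 0.0
end

# Made-up sample fingerprints.
fp_a = { "C=O" => 1.0, "c1ccccc1" => 2.0, "N" => 3.0 }
fp_b = { "C=O" => 2.0, "c1ccccc1" => 1.0, "Cl" => 2.0 }

puts tanimoto_sketch(fp_a, fp_b)
```

In this sample the shared features contribute 1 + 1 to the numerator, and the four distinct features contribute 2 + 2 + 3 + 2 to the denominator, giving 2/9. Unlike the cosine method, features present in only one compound lower the score.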