Module: OpenTox::Algorithm::Similarity

Defined in:
lib/utils.rb

Overview

Similarity calculations

Class Method Details

.cosine(fingerprints_a, fingerprints_b, weights = nil) ⇒ Float

Cosine similarity

Parameters:

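  • fingerprints_a (Hash, Array)

    fingerprints of the first compound: key-value properties (Hash) or a list of numeric values (Array)

  • fingerprints_b (Hash, Array)

    fingerprints of the second compound, of the same type as fingerprints_a

  • weights (defaults to: nil)

    not used by the source listed for this method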

Returns:

  • (Float)

    cosine of the angle between the vectors induced by the keys present in both a and b; 0.0 if those vectors are empty (for hashes, when fewer than two keys are shared)



# File 'lib/utils.rb', line 550

def self.cosine(fingerprints_a,fingerprints_b,weights=nil)

  # fingerprints are hashes
  if fingerprints_a.class == Hash && fingerprints_b.class == Hash
    a = []; b = []
    common_features = fingerprints_a.keys & fingerprints_b.keys
    if common_features.size > 1
      common_features.each do |p|
        a << fingerprints_a[p]
        b << fingerprints_b[p]
      end
    end

  # fingerprints are arrays
  elsif fingerprints_a.class == Array && fingerprints_b.class == Array
    a = fingerprints_a
    b = fingerprints_b
  end

  (a.size > 0 && b.size > 0) ? self.cosine_num(a.to_gv, b.to_gv) : 0.0

end
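
The Hash branch above can be restated as a small stand-alone sketch. The helper name `common_feature_vectors` and the example feature keys are hypothetical and not part of lib/utils.rb; the logic mirrors the listing, including the rule that a single shared key is ignored.

```ruby
# Hypothetical helper, not part of lib/utils.rb: mirrors the Hash branch of
# .cosine, which keeps only the values whose keys occur in both fingerprints.
def common_feature_vectors(fingerprints_a, fingerprints_b)
  a = []
  b = []
  common = fingerprints_a.keys & fingerprints_b.keys
  # A single shared key is ignored, matching the `size > 1` check in the source.
  if common.size > 1
    common.each do |key|
      a << fingerprints_a[key]
      b << fingerprints_b[key]
    end
  end
  [a, b]
end

# Example keys are made up for illustration.
fp_a = { "C-C" => 1.0, "C-N" => 2.0, "C=O" => 3.0 }
fp_b = { "C-N" => 4.0, "C=O" => 5.0, "N-H" => 6.0 }
a, b = common_feature_vectors(fp_a, fp_b)
# a holds the values of fp_a for the shared keys, b those of fp_b.
```

In the library, the two resulting arrays are converted with `to_gv` and passed to `.cosine_num`; when they are empty, `.cosine` returns 0.0 instead.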

.cosine_num(a, b) ⇒ Float

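Cosine similarity of two numeric vectors. If both vectors have more than 12 elements, only the first 12 elements of each are used.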

Parameters:

  • a (GSL::Vector)
  • b (GSL::Vector)

Returns:

  • (Float)

    cosine of angle enclosed between a and b



# File 'lib/utils.rb', line 578

def self.cosine_num(a, b)
  if a.size>12 && b.size>12
    a = a[0..11]
    b = b[0..11]
  end
  a.dot(b) / (a.norm * b.norm)
end
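
For readers without GSL at hand, the same computation can be sketched on plain Ruby arrays. The name `cosine_plain` is hypothetical, the sketch assumes equal-length numeric arrays and Ruby 2.4 or later (for `Array#sum`), and it is an illustration rather than the library's implementation.

```ruby
# Plain-Ruby sketch of the computation in .cosine_num; the library itself
# works on GSL::Vector objects (a.dot, a.norm). Hypothetical helper name.
def cosine_plain(a, b)
  # Same truncation as the source: applied only when BOTH exceed 12 elements.
  if a.size > 12 && b.size > 12
    a = a[0..11]
    b = b[0..11]
  end
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

orthogonal = cosine_plain([1.0, 0.0], [0.0, 1.0]) # perpendicular vectors
parallel   = cosine_plain([1.0, 2.0], [2.0, 4.0]) # same direction
```

In this sketch an all-zero vector makes the denominator 0.0, so the floating-point result is NaN; callers may want to guard against that case.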

.outliers(params) ⇒ Object

Outlier detection based on Mahalanobis distances. Multivariate detection on X, univariate detection on y. Uses an existing RinRuby instance, if possible.

Parameters:

  • params (Hash)

    keys query_matrix, data_matrix, acts are required; r, p_outlier are optional

Returns:

  • (Array)

    indices identifying outliers (an index may occur several times; this is intended)



# File 'lib/utils.rb', line 592

def self.outliers(params)
  outlier_array = []
  data_matrix = params[:data_matrix]
  query_matrix = params[:query_matrix]
  acts = params[:acts]
  begin
    LOGGER.debug "Outliers (p=#{params[:p_outlier] || 0.9999})..."
    r = ( params[:r] || RinRuby.new(false,false) )
    r.eval "suppressPackageStartupMessages(library(\"robustbase\"))"
    r.eval "outlier_threshold = #{params[:p_outlier] || 0.999}"
    nr_cases, nr_features = data_matrix.to_a.size, data_matrix.to_a[0].size
    r.odx = data_matrix.to_a.flatten
    r.q = query_matrix.to_a.flatten
    r.y = acts.to_a.flatten
    r.eval "odx = matrix(odx, #{nr_cases}, #{nr_features}, byrow=T)"
    r.eval 'odx = rbind(q,odx)' # query is nr 0 (1) in ruby (R)
    r.eval 'mah = covMcd(odx)$mah' # run MCD alg
    r.eval "mah = pchisq(mah,#{nr_features})"
    r.eval 'outlier_array = which(mah>outlier_threshold)'  # multivariate outliers using robust mahalanobis
    outlier_array = r.outlier_array.to_a.collect{|v| v-2 }  # translate to ruby index (-1 for q, -1 due to ruby)
    r.eval 'fqu = matrix(summary(y))[2]'
    r.eval 'tqu = matrix(summary(y))[5]'
    r.eval 'outlier_array = which(y>(tqu+1.5*IQR(y)))'     # univariate outliers due to Tukey (http://goo.gl/mwzNH)
    outlier_array += r.outlier_array.to_a.collect{|v| v-1 } # translate to ruby index (-1 due to ruby)
    r.eval 'outlier_array = which(y<(fqu-1.5*IQR(y)))'
    outlier_array += r.outlier_array.to_a.collect{|v| v-1 }
  rescue Exception => e
    LOGGER.debug "#{e.class}: #{e.message}"
    #LOGGER.debug "Backtrace:\n\t#{e.backtrace.join("\n\t")}"
  end
  outlier_array
end
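
The univariate step of the listing (Tukey's fences on y) can be illustrated without R. The following is a minimal sketch under stated assumptions: the helper names `quantile` and `tukey_outlier_indices` are hypothetical, y is non-empty, and quartiles use linear interpolation (R's default quantile type 7), so values may differ slightly from what `summary(y)` reports in R.

```ruby
# Hypothetical plain-Ruby sketch of the univariate (Tukey) step of .outliers.
# Assumes linear-interpolation quantiles (R's default, "type 7").
def quantile(sorted, p)
  h = (sorted.size - 1) * p
  lo = h.floor
  hi = [lo + 1, sorted.size - 1].min
  sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo])
end

def tukey_outlier_indices(y)
  sorted = y.sort
  q1 = quantile(sorted, 0.25)
  q3 = quantile(sorted, 0.75)
  iqr = q3 - q1
  upper = q3 + 1.5 * iqr
  lower = q1 - 1.5 * iqr
  # Zero-based indices; high outliers first, then low ones, as in the source.
  high = y.each_index.select { |i| y[i] > upper }
  low  = y.each_index.select { |i| y[i] < lower }
  high + low
end

acts = [1.0, 2.0, 3.0, 4.0, 100.0]
flagged = tukey_outlier_indices(acts) # only the value 100.0 lies outside the fences
```

The sketch returns zero-based indices directly, whereas the listing subtracts 1 from R's one-based indices (and 2 for the multivariate step, because the query row is prepended to the data matrix).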

.tanimoto(fingerprints_a, fingerprints_b, weights = nil, params = nil) ⇒ Float

Tanimoto similarity

Parameters:

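  • fingerprints_a (Hash, Array)

    fingerprints of the first compound

  • fingerprints_b (Hash, Array)

    fingerprints of the second compound

  • weights (defaults to: nil)

    not used by the source listed for this method

  • params (defaults to: nil)

    not used by the source listed for this method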

Returns:

  • (Float)

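    (Weighted) Tanimoto similarity: the sum of the per-feature minima divided by the sum of the per-feature maxima; 0.0 if that denominator is not positive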



# File 'lib/utils.rb', line 517

def self.tanimoto(fingerprints_a,fingerprints_b,weights=nil,params=nil)

  common_p_sum = 0.0
  all_p_sum = 0.0

  # fingerprints are hashes
  if fingerprints_a.class == Hash && fingerprints_b.class == Hash
    common_features = fingerprints_a.keys & fingerprints_b.keys
    all_features = (fingerprints_a.keys + fingerprints_b.keys).uniq
    if common_features.size > 0
      common_features.each{ |f| common_p_sum += [ fingerprints_a[f], fingerprints_b[f] ].min }
      all_features.each{ |f| all_p_sum += [ fingerprints_a[f],fingerprints_b[f] ].compact.max } # compact, since one fp may be empty at that pos
    end

  # fingerprints are arrays
  elsif fingerprints_a.class == Array && fingerprints_b.class == Array
    size = [ fingerprints_a.size, fingerprints_b.size ].min
    LOGGER.warn "fingerprints don't have equal size" if fingerprints_a.size != fingerprints_b.size
    (0...size).each { |idx|
      common_p_sum += [ fingerprints_a[idx], fingerprints_b[idx] ].min
      all_p_sum += [ fingerprints_a[idx], fingerprints_b[idx] ].max
    }
  end

  (all_p_sum > 0.0) ? (common_p_sum/all_p_sum) : 0.0

end
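
A small worked example may make the Hash branch easier to follow. The helper `tanimoto_hash` and the feature keys are hypothetical; the sketch restates the listing's logic in plain Ruby (assuming Ruby 2.4 or later for `Array#sum`) and omits the logging and the Array branch.

```ruby
# Plain-Ruby restatement of the Hash branch of .tanimoto, for illustration.
# Hypothetical helper name; not part of lib/utils.rb.
def tanimoto_hash(fp_a, fp_b)
  common = fp_a.keys & fp_b.keys
  return 0.0 if common.empty?
  all = (fp_a.keys + fp_b.keys).uniq
  common_sum = common.sum(0.0) { |f| [fp_a[f], fp_b[f]].min }
  # compact drops the nil of a feature that is missing from one fingerprint
  all_sum = all.sum(0.0) { |f| [fp_a[f], fp_b[f]].compact.max }
  all_sum > 0.0 ? common_sum / all_sum : 0.0
end

fp_a = { "C" => 1, "N" => 2 }
fp_b = { "N" => 1, "O" => 3 }
# common features: min(2, 1) = 1; all features: 1 + 2 + 3 = 6
similarity = tanimoto_hash(fp_a, fp_b)
```

Here the only shared feature contributes 1 to the numerator while all three features contribute to the denominator, so the similarity is 1/6.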