Class: Nimbus::ClassificationTree

Inherits:
Tree
  • Object
Defined in:
lib/nimbus/classification_tree.rb

Overview

Tree object representing a random classification tree.

A tree is generated by following these steps:

  • 1: Calculate loss function for the individuals in the node (first node contains all the individuals).

  • 2: Take a random sample of the SNPs (size m << total count of SNPs)

  • 3: Compute the loss function (default: gini index) for the split of the sample based on value of every SNP.

  • 4: If the SNP with the minimum loss function also reduces the general loss of the node, split the individuals sample into two nodes, based on the average value for that SNP: [0,1][2] or [0][1,2]

  • 5: Repeat from 1 for every node until:

    • a) The individuals count in that node is < minimum size OR

    • b) None of the SNP splits has a loss function smaller than the node loss function

  • 6: When a node stops, label the node with the majority class in the node.
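
Steps 1, 3 and 4 can be illustrated with a minimal sketch. It assumes the standard Gini impurity (1 minus the sum of squared class frequencies) and a size-weighted average for the split loss; the helpers `gini_index` and `split_loss` below are hypothetical stand-ins written for this example, not the Nimbus::LossFunctions methods, which may differ in detail.

```ruby
# Hypothetical helper (not part of Nimbus): Gini impurity of a list of class labels.
def gini_index(labels)
  return 0.0 if labels.empty?
  counts = labels.each_with_object(Hash.new(0)) { |label, acc| acc[label] += 1 }
  1.0 - counts.values.sum { |count| (count.to_f / labels.size)**2 }
end

# Hypothetical helper: size-weighted Gini of a two-way split (step 3).
def split_loss(left, right)
  total = left.size + right.size
  (left.size * gini_index(left) + right.size * gini_index(right)) / total
end

node  = %w[case case case control control control]
left  = %w[case case case]
right = %w[control control control]

node_loss = gini_index(node)        # loss of the unsplit node (step 1)
loss      = split_loss(left, right) # loss of the candidate split (step 3)

# Step 4: the split is only kept when it improves on the node itself.
puts "split accepted" if loss < node_loss
```

Here the node is perfectly mixed and both children are pure, so the candidate split lowers the loss and would be accepted.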

Constant Summary

Constants inherited from Tree

Tree::NODE_SPLIT_01_2, Tree::NODE_SPLIT_0_12

Instance Attribute Summary

Attributes inherited from Tree

#generalization_error, #id_to_fenotype, #importances, #individuals, #node_min_size, #predictions, #snp_sample_size, #snp_total_count, #structure, #used_snps

Instance Method Summary

Methods inherited from Tree

traverse

Constructor Details

#initialize(options) ⇒ ClassificationTree

Initialize Tree object with the configuration (as in Nimbus::Configuration.tree) options received.



# File 'lib/nimbus/classification_tree.rb', line 21

def initialize(options)
  @classes = options[:classes]
  super
end

Instance Attribute Details

#classes ⇒ Object

Returns the value of attribute classes.



# File 'lib/nimbus/classification_tree.rb', line 18

def classes
  @classes
end

Instance Method Details

#build_node(individuals_ids, y_hat) ⇒ Object

Creates a node by taking a random sample of the SNPs and computing the loss function for every split by SNP of that sample.

  • If SNP_min is the SNP with the smallest loss function and that loss is < the loss function of the node, it splits the individuals sample in two:

(the average of the 0,1,2 values for SNP_min across the individuals is computed, and they are split into [<=avg] and [>avg]), then it builds these 2 new nodes.

  • Otherwise every individual in the node gets labeled with the majority fenotype class of the node.
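
The average-based split described above can be sketched as follows. This follows the prose description only; `split_by_average` and the `genotypes` hash (individual id => SNP value) are hypothetical names for this example, and the library's own split method may handle edge cases differently.

```ruby
# Hypothetical sketch: genotypes maps an individual id to its value (0, 1 or 2)
# for one SNP. Individuals are divided into [<= avg] and [> avg].
def split_by_average(genotypes)
  avg = genotypes.values.sum.to_f / genotypes.size
  low, high = genotypes.keys.partition { |id| genotypes[id] <= avg }
  [low, high, avg]
end

# Average below 1: the split separates value 0 from values 1 and 2.
low, high, avg = split_by_average({ 1 => 0, 2 => 0, 3 => 1, 4 => 2 })
puts "avg=#{avg} low=#{low.inspect} high=#{high.inspect}"

# Average of at least 1: the split separates values 0 and 1 from value 2.
low2, high2, avg2 = split_by_average({ 1 => 2, 2 => 1, 3 => 2, 4 => 0 })
puts "avg=#{avg2} low=#{low2.inspect} high=#{high2.inspect}"
```

Because SNP values are restricted to 0, 1 and 2, comparing against the average can only produce the two groupings named by the inherited constants: values 0 and 1 against 2, or 0 against 1 and 2.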



# File 'lib/nimbus/classification_tree.rb', line 40

def build_node(individuals_ids, y_hat)
  # General loss function value for the node
  individuals_count = individuals_ids.size
  return label_node(y_hat, individuals_ids) if individuals_count < @node_min_size
  node_loss_function = Nimbus::LossFunctions.gini_index individuals_ids, @id_to_fenotype, @classes

  # Finding the SNP that minimizes loss function
  snps = snps_random_sample
  min_loss, min_SNP, split, split_type, ginis = node_loss_function, nil, nil, nil, nil

  snps.each do |snp|
    individuals_split_by_snp_value, node_split_type = split_by_snp_avegare_value individuals_ids, snp
    y_hat_0 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[0], @id_to_fenotype, @classes)
    y_hat_1 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[1], @id_to_fenotype, @classes)

    gini_0 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[0], @id_to_fenotype, @classes
    gini_1 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[1], @id_to_fenotype, @classes
    loss_snp = (individuals_split_by_snp_value[0].size * gini_0 +
                individuals_split_by_snp_value[1].size * gini_1) / individuals_count

    min_loss, min_SNP, split, split_type, ginis = loss_snp, snp, individuals_split_by_snp_value, node_split_type, [y_hat_0, y_hat_1] if loss_snp < min_loss
  end
  return build_branch(min_SNP, split, split_type, ginis, y_hat) if min_loss < node_loss_function
  return label_node(y_hat, individuals_ids)
end

#estimate_importances(oob_ids) ⇒ Object

Estimation of importance for every SNP.

The importance of any SNP in the tree is calculated using the OOB sample. For every SNP, every individual in the sample is pushed down the tree, but with the value of that SNP swapped for that of another individual in the sample.

That way the difference between the generalization error and the error frequency with the SNP value modified can be estimated for any given SNP.

This method computes importance estimations for every SNP used in the tree (for any other SNP it would be 0).
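
As an illustration of the idea, here is a minimal, self-contained sketch of permutation importance on toy data. Everything in it is an assumption made for the example: the `predict` lambda stands in for a fitted tree, and `error_rate` and `permutation_importance` are hypothetical helpers, not Nimbus methods.

```ruby
# Hypothetical stand-in for a fitted tree: it only looks at the first SNP.
predict = ->(snps) { snps[0] == 2 ? "case" : "control" }

# Toy OOB sample: SNP values plus the real class of each individual.
oob = [
  { snps: [2, 0], label: "case" },
  { snps: [0, 1], label: "control" },
  { snps: [2, 2], label: "case" },
  { snps: [1, 0], label: "control" }
]

# Fraction of individuals whose prediction differs from their real class.
def error_rate(samples, predict)
  samples.count { |s| predict.call(s[:snps]) != s[:label] }.to_f / samples.size
end

# Importance of one SNP: error after permuting its values among the OOB
# individuals, minus the unpermuted OOB error.
def permutation_importance(samples, predict, snp_index, rng)
  donors = samples.shuffle(random: rng)
  permuted = samples.each_with_index.map do |s, i|
    snps = s[:snps].dup
    snps[snp_index] = donors[i][:snps][snp_index]
    { snps: snps, label: s[:label] }
  end
  (error_rate(permuted, predict) - error_rate(samples, predict)).round(5)
end

rng    = Random.new(42)
used   = permutation_importance(oob, predict, 0, rng) # SNP the predictor relies on
unused = permutation_importance(oob, predict, 1, rng) # SNP the predictor ignores
puts "used=#{used} unused=#{unused}"
```

Permuting a SNP that the predictor never reads cannot change any prediction, so its importance is 0, which matches the note above about SNPs not used in the tree.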



# File 'lib/nimbus/classification_tree.rb', line 90

def estimate_importances(oob_ids)
  return nil if (@generalization_error.nil? && generalization_error_from_oob(oob_ids).nil?)
  oob_individuals_count = oob_ids.size
  @importances = {}
  @used_snps.uniq.each do |current_snp|
    shuffled_ids = oob_ids.shuffle
    permutated_snp_errors = 0.0
    oob_ids.each_with_index {|oobi, index|
      permutated_prediction = traverse_with_permutation @structure, individuals[oobi].snp_list, current_snp, individuals[shuffled_ids[index]].snp_list
      permutated_snp_errors += 1 unless @id_to_fenotype[oobi] == permutated_prediction
    }
    @importances[current_snp] = ((permutated_snp_errors / oob_individuals_count) - @generalization_error).round(5)
  end
  @importances
end

#generalization_error_from_oob(oob_ids) ⇒ Object

Compute generalization error for the tree.

By traversing the ‘out of bag’ (OOB) sample (those individuals of the training set not used to build this tree) through the tree and comparing each prediction with the real fenotype class of the individual, it is possible to calculate the error frequency, an unbiased estimate of the generalization error for the tree.
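
The following sketch shows where an OOB sample comes from and how an error frequency is computed over it. It assumes the training sample is a bootstrap draw (sampling with replacement); the ids, `labels`, `predict` and `oob_error` are all hypothetical and exist only for this example.

```ruby
all_ids = (1..10).to_a
rng     = Random.new(7)

# Bootstrap sample: draw with replacement, so some ids are never picked.
bag     = Array.new(all_ids.size) { all_ids.sample(random: rng) }
oob_ids = all_ids - bag

# Hypothetical real classes, and a toy predictor that errs on multiples of 4.
labels  = all_ids.map { |id| [id, id.even? ? "case" : "control"] }.to_h
predict = ->(id) { (id % 4).zero? ? "control" : labels[id] }

# Error frequency over the given ids; nil when there is nothing to evaluate.
def oob_error(oob_ids, labels, predict)
  return nil if oob_ids.empty?
  oob_ids.count { |id| predict.call(id) != labels[id] }.to_f / oob_ids.size
end

puts "OOB ids: #{oob_ids.inspect}, error: #{oob_error(oob_ids, labels, predict).inspect}"
```

The sketch returns nil for an empty OOB set, a guard added here so the example never divides by zero.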



# File 'lib/nimbus/classification_tree.rb', line 72

def generalization_error_from_oob(oob_ids)
  return nil if (@structure.nil? || @individuals.nil? || @id_to_fenotype.nil?)
  oob_errors = 0.0
  oob_ids.each do |oobi|
    oob_errors += 1 unless @id_to_fenotype[oobi] == Tree.traverse(@structure, individuals[oobi].snp_list)
  end
  @generalization_error = oob_errors / oob_ids.size
end

#seed(all_individuals, individuals_sample, ids_fenotypes) ⇒ Object

Creates the structure of the tree, as a hash of SNP splits and values.

It just initializes the needed variables and then defines the first node of the tree. The rest of the tree structure is computed recursively, building every node by calling build_node.



# File 'lib/nimbus/classification_tree.rb', line 30

def seed(all_individuals, individuals_sample, ids_fenotypes)
  super
  @structure = build_node individuals_sample, Nimbus::LossFunctions.majority_class(individuals_sample, @id_to_fenotype, @classes)
end