Class: BioDSL::Taxonomy::Index
- Inherits:
-
Object
- Object
- BioDSL::Taxonomy::Index
- Defined in:
- lib/BioDSL/taxonomy.rb
Overview
Class for creating and databasing an index of a taxonomic tree. This is done in two steps. 1) A temporary tree is creating using the taxonomic strings from the sequence names in a FASTA file. 2) A simplistic tree is constructed from the temporary tree allowing this to be saved to files. The resulting index consists of the following files:
* taxonomy_tax_index.dat - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
kmer.
Defined Under Namespace
Classes: TaxNode
Instance Attribute Summary collapse
-
#node_id ⇒ Object
(also: #size)
readonly
Returns the value of attribute node_id.
Instance Method Summary collapse
-
#add(entry) ⇒ Object
Method to add a Sequence entry to the taxonomic tree.
-
#get_node(id) ⇒ Object
Testing method to get a node given an id.
-
#initialize(options) ⇒ Index
constructor
Constructor Index object.
-
#save ⇒ Object
Remap and save taxonomic tree to index files.
-
#tree_union(node = @tree) ⇒ Object
Method that traverses the tax tree and populate all parent nodes with the union of all kmers from the patents children.
Constructor Details
#initialize(options) ⇒ Index
Constructor Index object.
55 56 57 58 59 60 61 62 63 64 65 |
# File 'lib/BioDSL/taxonomy.rb', line 55 def initialize() @options = # Option hash @seq_id = 0 # Sequence id @node_id = 0 # Node id @tree = TaxNode.new(nil, :r, 'root', nil, @node_id) # Root node @node_id += 1 %i(kmer_size step_size output_dir prefix).each do |option| fail TaxonomyError, "missing #{option} option" unless @options[option] end end |
Instance Attribute Details
#node_id ⇒ Object (readonly) Also known as: size
Returns the value of attribute node_id.
51 52 53 |
# File 'lib/BioDSL/taxonomy.rb', line 51 def node_id @node_id end |
Instance Method Details
#add(entry) ⇒ Object
Method to add a Sequence entry to the taxonomic tree. The sequence name contain a taxonomic string.
Example entry: seq_name: K#Bacteria;P#Proteobacteria;C#Gammaproteobacteria; \
O#Vibrionales;F#Vibrionaceae;G#Vibrio;S#Vibrio
seq: UCCUACGGGAGGCAGCAGUGGGGAAUAUUGCACAAUGGGCGCAAGCCUGA \
UGCAGCCAUGCCGCGUGUAUGAAGGCCUUCGGGUUGUAACUC ...
The sequence is reduced to a list of oligos of a given size and a given step size, e.g. 8 and 1, respectively:
UCCUACGG
CCUACGGG
CUACGGGA
UACGGGAG
ACGGGAGG
...
Each oligo is encoded as an kmer (integer) by encoding two bits per nucleotide:
A = 00 U = 01 C = 10 G = 11
E.g. UCCUACGG = 0110100100101111 = 26927
For each node in the tree a set is kept containing information of all observed oligos for that particular node. Thus all child nodes contain a subset of oligos compared to the parent node.
99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 |
# File 'lib/BioDSL/taxonomy.rb', line 99 def add(entry) node = @tree old_name = false tax_levels = entry.seq_name.split(';') if tax_levels.size != TAX_LEVELS.size - 1 fail TaxonomyError, "Wrong number of tax levels in #{entry.seq_name}" end tax_levels.each_with_index do |tax_level, i| level, name = tax_level.split('#') if level.downcase.to_sym != TAX_LEVELS[i + 1] fail TaxonomyError, "Unexpected tax id in #{entry.seq_name}" end if name if i > 0 && !old_name fail TaxonomyError, "Gapped tax level info in #{entry.seq_name}" end if (child = node[name]) else child = TaxNode.new(node, level.downcase.to_sym, name, @seq_id, @node_id) @node_id += 1 end if leaf?(tax_levels, i) kmers = entry.to_kmers(kmer_size: @options[:kmer_size], step_size: @options[:step_size]) child.kmers |= Set.new(kmers) end node[name] = child node = node[name] end old_name = name end @seq_id += 1 self end |
#get_node(id) ⇒ Object
Testing method to get a node given an id. Returns nil if node wasn’t found.
155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 |
# File 'lib/BioDSL/taxonomy.rb', line 155 def get_node(id) queue = [@tree] until queue.empty? node = queue.shift return node if node.node_id == id node.children.each_value do |child| queue.unshift(child) unless child.nil? end end nil end |
#save ⇒ Object
Remap and save taxonomic tree to index files.
146 147 148 149 150 151 |
# File 'lib/BioDSL/taxonomy.rb', line 146 def save tree_union(@tree) save_kmer_index save_tax_index end |
#tree_union(node = @tree) ⇒ Object
Method that traverses the tax tree and populate all parent nodes with the union of all kmers from the patents children.
173 174 175 176 177 178 179 180 181 182 183 184 |
# File 'lib/BioDSL/taxonomy.rb', line 173 def tree_union(node = @tree) node.children.each_value { |child| tree_union(child) } node.children.each_value do |child| if node.kmers.nil? && child.kmers.nil? elsif node.kmers.nil? node.kmers = child.kmers else node.kmers |= child.kmers if child.kmers end end end |