Class: BioDSL::Taxonomy::Index

Inherits:
Object
  • Object
show all
Defined in:
lib/BioDSL/taxonomy.rb

Overview

Class for creating and databasing an index of a taxonomic tree. This is done in two steps. 1) A temporary tree is creating using the taxonomic strings from the sequence names in a FASTA file. 2) A simplistic tree is constructed from the temporary tree allowing this to be saved to files. The resulting index consists of the following files:

* taxonomy_tax_index.dat  - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
                            kmer.

Defined Under Namespace

Classes: TaxNode

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ Index

Constructor Index object.



55
56
57
58
59
60
61
62
63
64
65
# File 'lib/BioDSL/taxonomy.rb', line 55

def initialize(options)
  @options   = options                                       # Option hash
  @seq_id    = 0                                             # Sequence id
  @node_id   = 0                                             # Node id
  @tree      = TaxNode.new(nil, :r, 'root', nil, @node_id)   # Root node
  @node_id  += 1

  %i(kmer_size step_size output_dir prefix).each do |option|
    fail TaxonomyError, "missing #{option} option" unless @options[option]
  end
end

Instance Attribute Details

#node_idObject (readonly) Also known as: size

Returns the value of attribute node_id.



51
52
53
# File 'lib/BioDSL/taxonomy.rb', line 51

def node_id
  @node_id
end

Instance Method Details

#add(entry) ⇒ Object

Method to add a Sequence entry to the taxonomic tree. The sequence name contain a taxonomic string.

Example entry: seq_name: K#Bacteria;P#Proteobacteria;C#Gammaproteobacteria; \

O#Vibrionales;F#Vibrionaceae;G#Vibrio;S#Vibrio

seq: UCCUACGGGAGGCAGCAGUGGGGAAUAUUGCACAAUGGGCGCAAGCCUGA \

UGCAGCCAUGCCGCGUGUAUGAAGGCCUUCGGGUUGUAACUC ...

The sequence is reduced to a list of oligos of a given size and a given step size, e.g. 8 and 1, respectively:

UCCUACGG

CCUACGGG
 CUACGGGA
  UACGGGAG
   ACGGGAGG
    ...

Each oligo is encoded as an kmer (integer) by encoding two bits per nucleotide:

A = 00 U = 01 C = 10 G = 11

E.g. UCCUACGG = 0110100100101111 = 26927

For each node in the tree a set is kept containing information of all observed oligos for that particular node. Thus all child nodes contain a subset of oligos compared to the parent node.



99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
# File 'lib/BioDSL/taxonomy.rb', line 99

def add(entry)
  node       = @tree
  old_name   = false
  tax_levels = entry.seq_name.split(';')

  if tax_levels.size != TAX_LEVELS.size - 1
    fail TaxonomyError, "Wrong number of tax levels in #{entry.seq_name}"
  end

  tax_levels.each_with_index do |tax_level, i|
    level, name = tax_level.split('#')

    if level.downcase.to_sym != TAX_LEVELS[i + 1]
      fail TaxonomyError, "Unexpected tax id in #{entry.seq_name}"
    end

    if name
      if i > 0 && !old_name
        fail TaxonomyError, "Gapped tax level info in #{entry.seq_name}"
      end

      if (child = node[name])
      else
        child = TaxNode.new(node, level.downcase.to_sym, name, @seq_id,
                            @node_id)
        @node_id += 1
      end

      if leaf?(tax_levels, i)
        kmers = entry.to_kmers(kmer_size: @options[:kmer_size],
                               step_size: @options[:step_size])
        child.kmers |= Set.new(kmers)
      end

      node[name] = child
      node = node[name]
    end

    old_name = name
  end

  @seq_id += 1

  self
end

#get_node(id) ⇒ Object

Testing method to get a node given an id. Returns nil if node wasn’t found.



155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
# File 'lib/BioDSL/taxonomy.rb', line 155

def get_node(id)
  queue = [@tree]

  until queue.empty?
    node = queue.shift

    return node if node.node_id == id

    node.children.each_value do |child|
      queue.unshift(child) unless child.nil?
    end
  end

  nil
end

#saveObject

Remap and save taxonomic tree to index files.



146
147
148
149
150
151
# File 'lib/BioDSL/taxonomy.rb', line 146

def save
  tree_union(@tree)

  save_kmer_index
  save_tax_index
end

#tree_union(node = @tree) ⇒ Object

Method that traverses the tax tree and populate all parent nodes with the union of all kmers from the patents children.



173
174
175
176
177
178
179
180
181
182
183
184
# File 'lib/BioDSL/taxonomy.rb', line 173

def tree_union(node = @tree)
  node.children.each_value { |child| tree_union(child) }

  node.children.each_value do |child|
    if node.kmers.nil? && child.kmers.nil?
    elsif node.kmers.nil?
      node.kmers = child.kmers
    else
      node.kmers |= child.kmers if child.kmers
    end
  end
end