Class: BioDSL::IndexTaxonomy
- Inherits:
-
Object
- Object
- BioDSL::IndexTaxonomy
- Defined in:
- lib/BioDSL/commands/index_taxonomy.rb
Overview
Create taxonomy index from sequences in the stream.
index_taxonomy
is used to create a taxonomy index to allow subsequent taxonomic classification with classify_seq
. The records with taxnomic information must contain :SEQ_NAME and :SEQ keys where the :SEQ_NAME value must be formatted with an initial ID number followed by a space and then the taxonomy string progressing from kingdom to species level. Only the following leves are accepted:
* K - kingdom
* P - phylum
* C - class
* O - order
* F - family
* G - genus
* S - species
Truncated taxonomic strings are allowed, e.g. a string that stops at family level. Below is an example of a full taxonomic string:
32 K#Bacteria;P#Actinobacteria;C#Actinobacteria;O#Acidimicrobiales; \
F#Acidimicrobiaceae;G#Ferrimicrobium;S#Ferrimicrobium acidiphilum
The resulting index consists of the following files (here using the default “taxonomy” as prefix) which are saved to a specified output_dir
:
* taxonomy_tax_index.dat - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
kmer.
The index is constructed by breaking the sequences into kmers of a given kmer_size and using a given step_size:
Example FASTA entry:
>2 K#Bacteria;P#Proteobacteria;C#Gammaproteobacteria;O#Vibrionales; \
F#Vibrionaceae;G#Vibrio;S#Vibrio
UCCUACGGGAGGCAGCAGUGGGGAAUAUUGCACAAUGGGCGCAAGCCUGAUGCAGCCAUGCCGCGUGUAUGA
This sequence is broken down to a list of oligos using the default kmer_size and step_size of 8 and 1, respectively:
UCCUACGG
CCUACGGG
CUACGGGA
UACGGGAG
ACGGGAGG
...
Oligos containing ambiguity codes are skipped. Each oligo is encoded as an kmer (integer) by encoding two bits per nucletoide:
* A = 00
* U = 01
* C = 10
* G = 11
E.g. UCCUACGG = 0110100100101111 = 26927
For each node in the tree a vector is kept containing information of all observed oligos for that particular node. Thus all child nodes contain a subset of oligos compared to the parent node. Finally, the tree is saved to files.
It should be noted that the speed and accuarcy of the classification is strongly dependent on the size and quality of the taxonomic database used (RDP, GreenGenes or Silva) and for a particular amplicon it is strongly recommended to create a slice from the database aligment matching the amplicon.
Usage
index_taxonomy(<output_dir: <dir>>[, kmer_size: <uint>
[, step_size: <uint>[, prefix: <string>
[, force: <bool>]]]])
Options
* output_dir: <dir> - Output directory to contain index files.
* kmer_size: <uint> - Size of kmer to use (default=8).
* step_size: <uint> - Size of steps (default=1).
* prefix: <string> - Prefix to use with index file names
(default="taxonomy").
* force: <bool> - Force overwrite existing index files.
Examples
BD.new.
read_fasta(input: "RDP_11_Bacteria.fna").
index_taxonomy(output_dir: "RDP_11").
run
Constant Summary collapse
- STATS =
%i(records_in records_out sequences_in sequences_out residues_in residues_out)
Instance Method Summary collapse
-
#initialize(options) ⇒ IndexTaxonomy
constructor
Constructor for IndexTaxonomy.
-
#lmb ⇒ Proc
Return command lambda for index_taxonomy.
Constructor Details
#initialize(options) ⇒ IndexTaxonomy
Constructor for IndexTaxonomy.
134 135 136 137 138 139 140 141 142 143 |
# File 'lib/BioDSL/commands/index_taxonomy.rb', line 134 def initialize() @options = defaults create_output_dir check_output_files @index = BioDSL::Taxonomy::Index.new() end |
Instance Method Details
#lmb ⇒ Proc
Return command lambda for index_taxonomy.
148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 |
# File 'lib/BioDSL/commands/index_taxonomy.rb', line 148 def lmb lambda do |input, output, status| status_init(status, STATS) input.each do |record| @status[:records_in] += 1 add_to_index(record) if record[:SEQ_NAME] && record[:SEQ] output << record @status[:records_out] += 1 end @index.save end end |