Class: BioDSL::ClassifySeq
- Inherits:
-
Object
- Object
- BioDSL::ClassifySeq
- Defined in:
- lib/BioDSL/commands/classify_seq.rb
Overview
Classify sequences in the stream.
classify_seq
searches sequences in the stream against a pre-indexed (using index_taxonomy
) database. The database consists a taxonomic tree index and indices for each taxonomic level saved in the following files (here using the prefix “taxonomy”):
* taxonomy_tax_index.dat - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level
and kmer.
Each sequence is broken down into unique kmers of a given kmer_size overlapping with a given step_size - see index_taxonomy
. Now, for each taxonomic level, starting from species all nodes for each kmer is looked up in the database. The nodes containing most kmers are considered hits. If there are no hits at a taxonomic level, we move to the next level. Hits are sorted according to how many kmers matched this particular node and a consensus taxonomy string is determined. Hits are also filtered with the following options:
* hits_max - Include maximally this number of hits in the consensus.
* best_only - Include only the best scoring hits in the consensus.
That is if a hit consists of 344 kmers out of 345
possible, only hits with 344 kmers are included.
* coverage - Filter hits based on kmer coverage. If a hit contains
fewer kmers than the total amount of kmers x coverage
it will be filtered.
* consensus - For a number of hits accept consensus at a given level
if within this percentage.
The output of classify_seq
are sequence type records with the additional keys:
* TAXONOMY_HITS - The number of hits used in the consensus.
* TAXONOMY - The taxonomy string.
The consensus is determined from a list of taxonomic strings, i.e. the TAXONOMIC_HITS, and is composed of a consensus for each taxonomic level. E.g. for the kingdom level if 60% of the taxonomic strings indicate ‘Bacteria’ and the consensus is 50% then the consensus for the kingdom level will be reported as ‘Bacteria(60)’. If the name at any level consists of multiple words they are treated independently. E.g if we have three taxonomic strings at the species level with the names:
* Escherichia coli K-12
* Escherichia coli sp. AC3432
* Escherichia coli sp. AC1232
The corresponding consensus for that level will be reported as ‘Escherichia coli sp.(100/100/66)’. The forth word in the last two taxonomy strings (AC3432 and AC1232) have a consensus below 50% and are ignored.
Usage
classify_seq(<dir: <dir>>[, prefix: <string>[, kmer_size: <uint>
[, step_size: <uint>[, hits_max: <uint>[, consensus:
<float>[, coverage: <float>[, best_only: <bool>]]]]]]])
Options
-
dir: <dir> - Directory containing taxonomy files.
-
prefix: <string> - Taxonomy files prefix (default=“taxonomy”).
-
kmer_size: <uint> - Kmer size (default=8).
-
step_size: <uint> - Step size (default=1).
-
hits_max: <uint> - Maximum hits to include in consensus (default=50).
-
consensus: <float> - Consensus cutoff (default=0.51).
-
coverage: <float> - Coverate cutoff (default=0.9).
-
best_only: <bool> - Only use best hits for consensus (default=true).
Examples
To classify a bunch of OTU sequences in the file otus.fna
we do:
BD.new.
read_fasta(input: "otus.fna").
classify_seq(dir: "RDP11_3").
write_table(keys: [:SEQ_NAME, :TAXONOMY_HITS, :TAXONOMY]).
run
OTU_0 1 K#Bacteria(100);P#Proteobacteria(100);C#Gammaproteobacteria...
OTU_1 1 K#Bacteria(100);P#Proteobacteria(100);C#Gammaproteobacteria...
OTU_2 1 K#Bacteria(100);P#Proteobacteria(100);C#Gammaproteobacteria...
OTU_3 1 K#Bacteria(100);P#Proteobacteria(100);C#Gammaproteobacteria...
OTU_4 2 K#Bacteria(100);P#Fusobacteria(100);C#Fusobacteriia(100);O#...
Constant Summary collapse
- STATS =
%i(records_in records_out sequences_in sequences_out residues_in residues_out)
Instance Method Summary collapse
-
#initialize(options) ⇒ ClassifySeq
constructor
Constructor for the ClassifySeq class.
-
#lmb ⇒ Proc
Return a lambda for the ClassifySeq command.
Constructor Details
#initialize(options) ⇒ ClassifySeq
Constructor for the ClassifySeq class.
131 132 133 134 135 136 |
# File 'lib/BioDSL/commands/classify_seq.rb', line 131 def initialize() @options = defaults end |
Instance Method Details
#lmb ⇒ Proc
Return a lambda for the ClassifySeq command.
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 |
# File 'lib/BioDSL/commands/classify_seq.rb', line 141 def lmb lambda do |input, output, status| status_init(status, STATS) @status[:sequences_in] = 0 search = BioDSL::Taxonomy::Search.new(@options) input.each_with_index do |record, i| @status[:records_in] += 1 classify_seq(record, i, search) if record.key? :SEQ output << record @status[:records_out] += 1 end end end |