Class: BioDSL::ClassifySeq

Inherits:
Object
  • Object
show all
Defined in:
lib/BioDSL/commands/classify_seq.rb

Overview

Classify sequences in the stream.

classify_seq searches sequences in the stream against a pre-indexed (using index_taxonomy) database. The database consists a taxonomic tree index and indices for each taxonomic level saved in the following files (here using the prefix “taxonomy”):

* taxonomy_tax_index.dat  - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level
                            and kmer.

Each sequence is broken down into unique kmers of a given kmer_size overlapping with a given step_size - see index_taxonomy. Now, for each taxonomic level, starting from species all nodes for each kmer is looked up in the database. The nodes containing most kmers are considered hits. If there are no hits at a taxonomic level, we move to the next level. Hits are sorted according to how many kmers matched this particular node and a consensus taxonomy string is determined. Hits are also filtered with the following options:

* hits_max  - Include maximally this number of hits in the consensus.
* best_only - Include only the best scoring hits in the consensus.
              That is if a hit consists of 344 kmers out of 345
              possible, only hits with 344 kmers are included.
* coverage  - Filter hits based on kmer coverage. If a hit contains
              fewer kmers than the total amount of kmers x coverage
              it will be filtered.
* consensus - For a number of hits accept consensus at a given level
              if within this percentage.

The output of classify_seq are sequence type records with the additional keys:

* TAXONOMY_HITS - The number of hits used in the consensus.
* TAXONOMY      - The taxonomy string.

The consensus is determined from a list of taxonomic strings, i.e. the TAXONOMIC_HITS, and is composed of a consensus for each taxonomic level. E.g. for the kingdom level if 60% of the taxonomic strings indicate 'Bacteria' and the consensus is 50% then the consensus for the kingdom level will be reported as 'Bacteria(60)'. If the name at any level consists of multiple words they are treated independently. E.g if we have three taxonomic strings at the species level with the names:

*  Escherichia coli K-12
*  Escherichia coli sp. AC3432
*  Escherichia coli sp. AC1232

The corresponding consensus for that level will be reported as 'Escherichia coli sp.(100/100/66)'. The forth word in the last two taxonomy strings (AC3432 and AC1232) have a consensus below 50% and are ignored.

Usage

classify_seq(<dir: <dir>>[, prefix: <string>[, kmer_size: <uint>
             [, step_size: <uint>[, hits_max: <uint>[, consensus:
             <float>[, coverage: <float>[, best_only: <bool>]]]]]]])

Options

  • dir: <dir> - Directory containing taxonomy files.

  • prefix: <string> - Taxonomy files prefix (default=“taxonomy”).

  • kmer_size: <uint> - Kmer size (default=8).

  • step_size: <uint> - Step size (default=1).

  • hits_max: <uint> - Maximum hits to include in consensus (default=50).

  • consensus: <float> - Consensus cutoff (default=0.51).

  • coverage: <float> - Coverate cutoff (default=0.9).

  • best_only: <bool> - Only use best hits for consensus (default=true).

Examples

To classify a bunch of OTU sequences in the file otus.fna we do:

BD.new.
read_fasta(input: "otus.fna").
classify_seq(dir: "RDP11_3").
write_table(keys: [:SEQ_NAME, :TAXONOMY_HITS, :TAXONOMY]).
run

OTU_0  1 K#Bacteria(100);P#Proteobacteria(100);C#Gammaproteobacteria...
OTU_1  1 K#Bacteria(100);P#Proteobacteria(100);C#Gammaproteobacteria...
OTU_2  1 K#Bacteria(100);P#Proteobacteria(100);C#Gammaproteobacteria...
OTU_3  1 K#Bacteria(100);P#Proteobacteria(100);C#Gammaproteobacteria...
OTU_4  2 K#Bacteria(100);P#Fusobacteria(100);C#Fusobacteriia(100);O#...

Constant Summary

STATS =
i(records_in records_out sequences_in sequences_out residues_in
residues_out)

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ ClassifySeq

Constructor for the ClassifySeq class.

Parameters:

  • options (Hash)

    Options hash.

Options Hash (options):

  • :dir (String)

    Directory path with indexes.

  • :prefix (String)

    Index prefix.

  • :kmer_size (Integer)

    Kmer size.

  • :step_size (Integer)

    Step size.

  • :hits_max (Integer)

    Max hits to report per sequence.

  • :consensus (Float)

    Taxonomy string consensus percent.

  • :coverage (Float)

    Kmer coverage filter percent.

  • :best_only (Boolean)

    Flag to report best hit only.



131
132
133
134
135
136
# File 'lib/BioDSL/commands/classify_seq.rb', line 131

def initialize(options)
  @options = options

  check_options
  defaults
end

Instance Method Details

#lmbProc

Return a lambda for the ClassifySeq command.

Returns:

  • (Proc)

    Returns the command lambda.



141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
# File 'lib/BioDSL/commands/classify_seq.rb', line 141

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    @status[:sequences_in] = 0

    search = BioDSL::Taxonomy::Search.new(@options)

    input.each_with_index do |record, i|
      @status[:records_in] += 1

      classify_seq(record, i, search) if record.key? :SEQ

      output << record
      @status[:records_out] += 1
    end
  end
end