Class: BioDSL::IndexTaxonomy

Inherits:
Object
  • Object
show all
Defined in:
lib/BioDSL/commands/index_taxonomy.rb

Overview

Create taxonomy index from sequences in the stream.

index_taxonomy is used to create a taxonomy index to allow subsequent taxonomic classification with classify_seq. The records with taxnomic information must contain :SEQ_NAME and :SEQ keys where the :SEQ_NAME value must be formatted with an initial ID number followed by a space and then the taxonomy string progressing from kingdom to species level. Only the following leves are accepted:

* K - kingdom
* P - phylum
* C - class
* O - order
* F - family
* G - genus
* S - species

Truncated taxonomic strings are allowed, e.g. a string that stops at family level. Below is an example of a full taxonomic string:

32 K#Bacteria;P#Actinobacteria;C#Actinobacteria;O#Acidimicrobiales; \
   F#Acidimicrobiaceae;G#Ferrimicrobium;S#Ferrimicrobium acidiphilum

The resulting index consists of the following files (here using the default “taxonomy” as prefix) which are saved to a specified output_dir:

* taxonomy_tax_index.dat  - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
                            kmer.

The index is constructed by breaking the sequences into kmers of a given kmer_size and using a given step_size:

Example FASTA entry:

>2 K#Bacteria;P#Proteobacteria;C#Gammaproteobacteria;O#Vibrionales; \
   F#Vibrionaceae;G#Vibrio;S#Vibrio
UCCUACGGGAGGCAGCAGUGGGGAAUAUUGCACAAUGGGCGCAAGCCUGAUGCAGCCAUGCCGCGUGUAUGA

This sequence is broken down to a list of oligos using the default kmer_size and step_size of 8 and 1, respectively:

UCCUACGG
 CCUACGGG
  CUACGGGA
   UACGGGAG
    ACGGGAGG
     ...

Oligos containing ambiguity codes are skipped. Each oligo is encoded as an kmer (integer) by encoding two bits per nucletoide:

* A = 00
* U = 01
* C = 10
* G = 11

E.g. UCCUACGG = 0110100100101111 = 26927

For each node in the tree a vector is kept containing information of all observed oligos for that particular node. Thus all child nodes contain a subset of oligos compared to the parent node. Finally, the tree is saved to files.

It should be noted that the speed and accuarcy of the classification is strongly dependent on the size and quality of the taxonomic database used (RDP, GreenGenes or Silva) and for a particular amplicon it is strongly recommended to create a slice from the database aligment matching the amplicon.

Usage

index_taxonomy(<output_dir: <dir>>[, kmer_size: <uint>
               [, step_size: <uint>[, prefix: <string>
               [, force: <bool>]]]])

Options

* output_dir: <dir> - Output directory to contain index files.
* kmer_size: <uint> - Size of kmer to use (default=8).
* step_size: <uint> - Size of steps (default=1).
* prefix: <string>  - Prefix to use with index file names
                      (default="taxonomy").
* force: <bool>     - Force overwrite existing index files.

Examples

BD.new.
read_fasta(input: "RDP_11_Bacteria.fna").
index_taxonomy(output_dir: "RDP_11").
run

Constant Summary collapse

STATS =
%i(records_in records_out sequences_in sequences_out residues_in
residues_out)

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ IndexTaxonomy

Constructor for IndexTaxonomy.

Parameters:

  • options (Hash)

    Options hash.

Options Hash (options):

  • :output_dir (String)

    Path to output directory.

  • :prefix (String)

    Database file name prefix.

  • :kmer_size (Integer)

    Kmer size to use for indexing.

  • :step_size (Integer)

    Step size to use for indexing.

  • :force (Boolean)

    Flag for force-overwriting output files.



134
135
136
137
138
139
140
141
142
143
# File 'lib/BioDSL/commands/index_taxonomy.rb', line 134

def initialize(options)
  @options = options

  defaults
  check_options
  create_output_dir
  check_output_files

  @index = BioDSL::Taxonomy::Index.new(options)
end

Instance Method Details

#lmbProc

Return command lambda for index_taxonomy.

Returns:

  • (Proc)

    Command lambda.



148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
# File 'lib/BioDSL/commands/index_taxonomy.rb', line 148

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    input.each do |record|
      @status[:records_in] += 1

      add_to_index(record) if record[:SEQ_NAME] && record[:SEQ]

      output << record
      @status[:records_out] += 1
    end

    @index.save
  end
end