Class: BioDSL::Taxonomy::Search

Inherits:

Object

Object
BioDSL::Taxonomy::Search

Defined in:: lib/BioDSL/taxonomy.rb

Overview

Class for searching sequences in a taxonomic database. The database consists a taxonomic tree index and indices for each taxonomic level saved in the following files:

* taxonomy_tax_index.dat  - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
                            kmer.

Defined Under Namespace

Classes: Node, Result, TaxPath

Constant Summary collapse

MAX_COUNT =

200_000

MAX_HITS = Max num of shared oligos between two sequences.

2_000

BYTES_IN_INT =

BYTES_IN_HIT =

2 * BYTES_IN_INT

Instance Method Summary collapse

#execute(entry) ⇒ Object

Method to execute a search for a given sequence entry.
#initialize(options) ⇒ Search constructor

Constructor for initializing a Search object.

Constructor Details

#initialize(options) ⇒ `Search`

Constructor for initializing a Search object.

# File 'lib/BioDSL/taxonomy.rb', line 307

def initialize(options)
  @options    = options

  symbols = %i(kmer_size step_size dir prefix consensus coverage hits_max)

  symbols.each do |opt|
    fail TaxonomyError, "missing #{opt} option" unless @options[opt]
  end

  @count_ary  = BioDSL::CAry.new(MAX_COUNT, BYTES_IN_INT)
  @hit_ary    = BioDSL::CAry.new(MAX_HITS, BYTES_IN_HIT)
  @tax_index  = load_tax_index
  @kmer_index = load_kmer_index
end

Instance Method Details

#execute(entry) ⇒ `Object`

Method to execute a search for a given sequence entry. First the sequence is broken down into unique kmers of a given kmer_size overlapping with a given step_size. See Taxonomy::Index.add. Now, for each taxonomic level, starting from species all nodes for each kmer is looked up in the database. The nodes containing most kmers are considered hits. If there are no hits at a taxonomic level, we move to the next level. Hits are sorted according to how many kmers matched this particular node and a consensus taxonomy string is determined. Hits are also filtered with the following options:

* hits_max  - Include maximally this number of hits in the consensus.
* best_only - Include only the best scoring hits in the consensus.
              That is if a hit consists of 344 kmers out of 345
              possible, only hits with 344 kmers are included.
* coverage  - Filter hits based on kmer coverage. If a hit contains
              fewer kmers than the total amount of kmers x coverage
              it will be filtered.
* consensus - For a number of hits accept consensus at a given level
              if within this percentage.

# File 'lib/BioDSL/taxonomy.rb', line 341

def execute(entry)
  kmers = entry.to_kmers(kmer_size: @options[:kmer_size],
                         step_size: @options[:step_size])

  puts "DEBUG Q: #{entry.seq_name}" if BioDSL.debug

  TAX_LEVELS.reverse.each do |level|
    kmers_lookup(kmers, level)

    hit_count = hits_select_C(@count_ary.ary, @count_ary.count,
                              @hit_ary.ary, kmers.size,
                              (@options[:best_only] ? 1 : 0),
                              @options[:coverage])
    hit_count = @options[:hits_max] if @options[:hits_max] < hit_count

    if hit_count == 0
      puts "DEBUG no hits @ #{level}" if BioDSL.debug
    else
      puts "DEBUG hit(s) @ #{level}"  if BioDSL.debug
      taxpaths = []

      (0...hit_count).each do |i|
        start = BYTES_IN_HIT * i
        stop  = BYTES_IN_HIT * i + BYTES_IN_HIT

        node_id, count = @hit_ary.ary[start...stop].unpack('II')

        taxpath = TaxPath.new(node_id, count, kmers.size, @tax_index)

        if BioDSL.debug
          seq_id = @tax_index[node_id].seq_id
          puts "DEBUG S_ID: #{seq_id} KMERS: [#{count}/#{kmers.size}] \
                #{taxpath}"
        end

        taxpaths << taxpath
      end

      return Result.new(hit_count, compile_consensus(taxpaths, hit_count).
             tr('_', ' '))
    end
  end

  Result.new(0, 'Unclassified')
end