Class: BioDSL::Taxonomy::Search

Inherits:
Object
  • Object
show all
Defined in:
lib/BioDSL/taxonomy.rb

Overview

Class for searching sequences in a taxonomic database. The database consists a taxonomic tree index and indices for each taxonomic level saved in the following files:

* taxonomy_tax_index.dat  - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
                            kmer.

Defined Under Namespace

Classes: Node, Result, TaxPath

Constant Summary collapse

MAX_COUNT =
200_000
MAX_HITS =

Max num of shared oligos between two sequences.

2_000
BYTES_IN_INT =
4
BYTES_IN_HIT =
2 * BYTES_IN_INT

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ Search

Constructor for initializing a Search object.



307
308
309
310
311
312
313
314
315
316
317
318
319
320
# File 'lib/BioDSL/taxonomy.rb', line 307

def initialize(options)
  @options    = options

  symbols = %i(kmer_size step_size dir prefix consensus coverage hits_max)

  symbols.each do |opt|
    fail TaxonomyError, "missing #{opt} option" unless @options[opt]
  end

  @count_ary  = BioDSL::CAry.new(MAX_COUNT, BYTES_IN_INT)
  @hit_ary    = BioDSL::CAry.new(MAX_HITS, BYTES_IN_HIT)
  @tax_index  = load_tax_index
  @kmer_index = load_kmer_index
end

Instance Method Details

#execute(entry) ⇒ Object

Method to execute a search for a given sequence entry. First the sequence is broken down into unique kmers of a given kmer_size overlapping with a given step_size. See Taxonomy::Index.add. Now, for each taxonomic level, starting from species all nodes for each kmer is looked up in the database. The nodes containing most kmers are considered hits. If there are no hits at a taxonomic level, we move to the next level. Hits are sorted according to how many kmers matched this particular node and a consensus taxonomy string is determined. Hits are also filtered with the following options:

* hits_max  - Include maximally this number of hits in the consensus.
* best_only - Include only the best scoring hits in the consensus.
              That is if a hit consists of 344 kmers out of 345
              possible, only hits with 344 kmers are included.
* coverage  - Filter hits based on kmer coverage. If a hit contains
              fewer kmers than the total amount of kmers x coverage
              it will be filtered.
* consensus - For a number of hits accept consensus at a given level
              if within this percentage.


341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
# File 'lib/BioDSL/taxonomy.rb', line 341

def execute(entry)
  kmers = entry.to_kmers(kmer_size: @options[:kmer_size],
                         step_size: @options[:step_size])

  puts "DEBUG Q: #{entry.seq_name}" if BioDSL.debug

  TAX_LEVELS.reverse.each do |level|
    kmers_lookup(kmers, level)

    hit_count = hits_select_C(@count_ary.ary, @count_ary.count,
                              @hit_ary.ary, kmers.size,
                              (@options[:best_only] ? 1 : 0),
                              @options[:coverage])
    hit_count = @options[:hits_max] if @options[:hits_max] < hit_count

    if hit_count == 0
      puts "DEBUG no hits @ #{level}" if BioDSL.debug
    else
      puts "DEBUG hit(s) @ #{level}"  if BioDSL.debug
      taxpaths = []

      (0...hit_count).each do |i|
        start = BYTES_IN_HIT * i
        stop  = BYTES_IN_HIT * i + BYTES_IN_HIT

        node_id, count = @hit_ary.ary[start...stop].unpack('II')

        taxpath = TaxPath.new(node_id, count, kmers.size, @tax_index)

        if BioDSL.debug
          seq_id = @tax_index[node_id].seq_id
          puts "DEBUG S_ID: #{seq_id} KMERS: [#{count}/#{kmers.size}] \
                #{taxpath}"
        end

        taxpaths << taxpath
      end

      return Result.new(hit_count, compile_consensus(taxpaths, hit_count).
             tr('_', ' '))
    end
  end

  Result.new(0, 'Unclassified')
end