Class: BioDSL::Taxonomy::Search
- Inherits:
-
Object
- Object
- BioDSL::Taxonomy::Search
- Defined in:
- lib/BioDSL/taxonomy.rb
Overview
Class for searching sequences in a taxonomic database. The database consists a taxonomic tree index and indices for each taxonomic level saved in the following files:
* taxonomy_tax_index.dat - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
kmer.
Defined Under Namespace
Classes: Node, Result, TaxPath
Constant Summary collapse
- MAX_COUNT =
200_000
- MAX_HITS =
Max num of shared oligos between two sequences.
2_000
- BYTES_IN_INT =
4
- BYTES_IN_HIT =
2 * BYTES_IN_INT
Instance Method Summary collapse
-
#execute(entry) ⇒ Object
Method to execute a search for a given sequence entry.
-
#initialize(options) ⇒ Search
constructor
Constructor for initializing a Search object.
Constructor Details
#initialize(options) ⇒ Search
Constructor for initializing a Search object.
307 308 309 310 311 312 313 314 315 316 317 318 319 320 |
# File 'lib/BioDSL/taxonomy.rb', line 307 def initialize() @options = symbols = %i(kmer_size step_size dir prefix consensus coverage hits_max) symbols.each do |opt| fail TaxonomyError, "missing #{opt} option" unless @options[opt] end @count_ary = BioDSL::CAry.new(MAX_COUNT, BYTES_IN_INT) @hit_ary = BioDSL::CAry.new(MAX_HITS, BYTES_IN_HIT) @tax_index = load_tax_index @kmer_index = load_kmer_index end |
Instance Method Details
#execute(entry) ⇒ Object
Method to execute a search for a given sequence entry. First the sequence is broken down into unique kmers of a given kmer_size overlapping with a given step_size. See Taxonomy::Index.add. Now, for each taxonomic level, starting from species all nodes for each kmer is looked up in the database. The nodes containing most kmers are considered hits. If there are no hits at a taxonomic level, we move to the next level. Hits are sorted according to how many kmers matched this particular node and a consensus taxonomy string is determined. Hits are also filtered with the following options:
* hits_max - Include maximally this number of hits in the consensus.
* best_only - Include only the best scoring hits in the consensus.
That is if a hit consists of 344 kmers out of 345
possible, only hits with 344 kmers are included.
* coverage - Filter hits based on kmer coverage. If a hit contains
fewer kmers than the total amount of kmers x coverage
it will be filtered.
* consensus - For a number of hits accept consensus at a given level
if within this percentage.
341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 |
# File 'lib/BioDSL/taxonomy.rb', line 341 def execute(entry) kmers = entry.to_kmers(kmer_size: @options[:kmer_size], step_size: @options[:step_size]) puts "DEBUG Q: #{entry.seq_name}" if BioDSL.debug TAX_LEVELS.reverse.each do |level| kmers_lookup(kmers, level) hit_count = hits_select_C(@count_ary.ary, @count_ary.count, @hit_ary.ary, kmers.size, (@options[:best_only] ? 1 : 0), @options[:coverage]) hit_count = @options[:hits_max] if @options[:hits_max] < hit_count if hit_count == 0 puts "DEBUG no hits @ #{level}" if BioDSL.debug else puts "DEBUG hit(s) @ #{level}" if BioDSL.debug taxpaths = [] (0...hit_count).each do |i| start = BYTES_IN_HIT * i stop = BYTES_IN_HIT * i + BYTES_IN_HIT node_id, count = @hit_ary.ary[start...stop].unpack('II') taxpath = TaxPath.new(node_id, count, kmers.size, @tax_index) if BioDSL.debug seq_id = @tax_index[node_id].seq_id puts "DEBUG S_ID: #{seq_id} KMERS: [#{count}/#{kmers.size}] \ #{taxpath}" end taxpaths << taxpath end return Result.new(hit_count, compile_consensus(taxpaths, hit_count). tr('_', ' ')) end end Result.new(0, 'Unclassified') end |