Class: GeneValidator::BlastUtils
- Inherits:
-
Object
- Object
- GeneValidator::BlastUtils
- Extended by:
- Forwardable
- Defined in:
- lib/genevalidator/blast.rb
Overview
Contains methods that run BLAST and methods that analyse sequences
Constant Summary collapse
- EVALUE =
1e-5
Class Method Summary collapse
-
.guess_sequence_type(sequence_string) ⇒ Object
Strips all non-letter characters.
- .guess_sequence_type_from_input_file(file = ) ⇒ Object
-
.parse_next(iterator) ⇒ Object
Parses the next query from the blast xml output query Params:
iterator
: blast xml iterator for hitstype
: the type of the sequence: :nucleotide or :protein Outputs: Array ofSequence
objects corresponding to the list of hits. -
.run_blast_on_input_file ⇒ Object
Runs BLAST on an input file Params:
blast_type
: blast command in String format (e.g ‘blastx’ or ‘blastp’)opt
: Hash made of :input_fasta_file :blast_xml_file, :db, :num_threadsgapopen
: gapopen blast parametergapextend
: gapextend blast parameternr_hits
: max number of hits Output: XML file. -
.type_of_sequences(fasta_format_string) ⇒ Object
Method copied from sequenceserver/sequencehelpers.rb Splits input at putative fasta definition lines (like “>adsfadsf”); then guesses sequence type for each sequence.
Class Method Details
.guess_sequence_type(sequence_string) ⇒ Object
Strips all non-letter characters. guestimates sequence based on that. If less than 10 useable characters… returns nil If more than 90% ACGTU returns :nucleotide. else returns :protein Params: sequence_string
: String to validate Output: nil, :nucleotide or :protein
105 106 107 108 109 110 111 112 |
# File 'lib/genevalidator/blast.rb', line 105 def guess_sequence_type(sequence_string) # removing non-letter and ambiguous characters cleaned_sequence = sequence_string.gsub(/[^A-Z]|[NX]/i, '') return nil if cleaned_sequence.length < 10 # conservative type = Bio::Sequence.new(cleaned_sequence).guess(0.9) type == Bio::Sequence::NA ? :nucleotide : :protein end |
.guess_sequence_type_from_input_file(file = ) ⇒ Object
116 117 118 119 120 121 |
# File 'lib/genevalidator/blast.rb', line 116 def guess_sequence_type_from_input_file(file = opt[:input_fasta_file]) lines = File.foreach(file).first(10) seqs = '' lines.each { |l| seqs += l.chomp unless l[0] == '>' } guess_sequence_type(seqs) end |
.parse_next(iterator) ⇒ Object
Parses the next query from the blast xml output query Params: iterator
: blast xml iterator for hits type
: the type of the sequence: :nucleotide or :protein Outputs: Array of Sequence
objects corresponding to the list of hits
53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 |
# File 'lib/genevalidator/blast.rb', line 53 def parse_next(iterator) iter = iterator.next # parse blast the xml output and get the hits # hits obtained are proteins! (we use only blastp and blastx) hits = [] iter.each do |hit| seq = Query.new seq.length_protein = hit.len.to_i seq.type = :protein seq.identifier = hit.hit_id seq.definition = hit.hit_def seq.accession_no = hit.accession seq.hsp_list = hit.hsps.map { |hsp| Hsp.new(xml_input: hsp) } hits << seq end hits rescue StopIteration nil end |
.run_blast_on_input_file ⇒ Object
Runs BLAST on an input file Params: blast_type
: blast command in String format (e.g ‘blastx’ or ‘blastp’) opt
: Hash made of :input_fasta_file :blast_xml_file, :db, :num_threads gapopen
: gapopen blast parameter gapextend
: gapextend blast parameter nr_hits
: max number of hits Output: XML file
28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
# File 'lib/genevalidator/blast.rb', line 28 def run_blast_on_input_file remote = opt[:db].match?(/remote/) ? true : false print_blast_info_text(remote) log_file = File.join(dirs[:tmp_dir], 'blast_cmd_output.txt') `#{blast_cmd(opt, config, remote)} > #{log_file} 2>&1` return unless File.zero?(opt[:blast_xml_file]) warn 'Blast failed to run on the input file.' if remote warn 'You are using BLAST with a remote database. Please' warn 'ensure that you have internet access and try again.' else warn 'Please ensure that the BLAST database exists and try again.' end exit 1 end |
.type_of_sequences(fasta_format_string) ⇒ Object
Method copied from sequenceserver/sequencehelpers.rb Splits input at putative fasta definition lines (like “>adsfadsf”); then guesses sequence type for each sequence. If not enough sequence to determine, returns nil. If 2 kinds of sequence mixed together, raises ArgumentError Otherwise, returns :nucleotide or :protein Params: sequence_string
: String to validate Output: nil, :nucleotide or :protein
86 87 88 89 90 91 92 93 94 95 |
# File 'lib/genevalidator/blast.rb', line 86 def type_of_sequences(fasta_format_string) # the first sequence does not need to have a fasta definition line sequences = fasta_format_string.split(/^>.*$/).delete_if(&:empty?) # get all sequence types sequence_types = sequences.collect { |seq| guess_sequence_type(seq) } .uniq.compact return nil if sequence_types.empty? sequence_types.first if sequence_types.length == 1 end |