Class: GeneValidator::BlastUtils

Inherits:
Object
  • Object
show all
Extended by:
Forwardable
Defined in:
lib/genevalidator/blast.rb

Overview

Contains methods that run BLAST and methods that analyse sequences

Constant Summary collapse

EVALUE =
1e-5

Class Method Summary collapse

Class Method Details

.guess_sequence_type(sequence_string) ⇒ Object

Strips all non-letter characters. guestimates sequence based on that. If less than 10 useable characters… returns nil If more than 90% ACGTU returns :nucleotide. else returns :protein Params: sequence_string: String to validate Output: nil, :nucleotide or :protein



105
106
107
108
109
110
111
112
# File 'lib/genevalidator/blast.rb', line 105

def guess_sequence_type(sequence_string)
  # removing non-letter and ambiguous characters
  cleaned_sequence = sequence_string.gsub(/[^A-Z]|[NX]/i, '')
  return nil if cleaned_sequence.length < 10 # conservative

  type = Bio::Sequence.new(cleaned_sequence).guess(0.9)
  type == Bio::Sequence::NA ? :nucleotide : :protein
end

.guess_sequence_type_from_input_file(file = ) ⇒ Object



116
117
118
119
120
121
# File 'lib/genevalidator/blast.rb', line 116

def guess_sequence_type_from_input_file(file = opt[:input_fasta_file])
  lines = File.foreach(file).first(10)
  seqs = ''
  lines.each { |l| seqs += l.chomp unless l[0] == '>' }
  guess_sequence_type(seqs)
end

.parse_next(iterator) ⇒ Object

Parses the next query from the blast xml output query Params: iterator: blast xml iterator for hits type: the type of the sequence: :nucleotide or :protein Outputs: Array of Sequence objects corresponding to the list of hits



53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
# File 'lib/genevalidator/blast.rb', line 53

def parse_next(iterator)
  iter = iterator.next

  # parse blast the xml output and get the hits
  # hits obtained are proteins! (we use only blastp and blastx)
  hits = []
  iter.each do |hit|
    seq                = Query.new
    seq.length_protein = hit.len.to_i
    seq.type           = :protein
    seq.identifier     = hit.hit_id
    seq.definition     = hit.hit_def
    seq.accession_no   = hit.accession
    seq.hsp_list       = hit.hsps.map { |hsp| Hsp.new(xml_input: hsp) }

    hits << seq
  end
  hits
rescue StopIteration
  nil
end

.run_blast_on_input_fileObject

Runs BLAST on an input file Params: blast_type: blast command in String format (e.g ‘blastx’ or ‘blastp’) opt: Hash made of :input_fasta_file :blast_xml_file, :db, :num_threads gapopen: gapopen blast parameter gapextend: gapextend blast parameter nr_hits: max number of hits Output: XML file



28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
# File 'lib/genevalidator/blast.rb', line 28

def run_blast_on_input_file
  remote = opt[:db].match?(/remote/) ? true : false
  print_blast_info_text(remote)

  log_file = File.join(dirs[:tmp_dir], 'blast_cmd_output.txt')
  `#{blast_cmd(opt, config, remote)} > #{log_file} 2>&1`

  return unless File.zero?(opt[:blast_xml_file])
  warn 'Blast failed to run on the input file.'
  if remote
    warn 'You are using BLAST with a remote database. Please'
    warn 'ensure that you have internet access and try again.'
  else
    warn 'Please ensure that the BLAST database exists and try again.'
  end
  exit 1
end

.type_of_sequences(fasta_format_string) ⇒ Object

Method copied from sequenceserver/sequencehelpers.rb Splits input at putative fasta definition lines (like “>adsfadsf”); then guesses sequence type for each sequence. If not enough sequence to determine, returns nil. If 2 kinds of sequence mixed together, raises ArgumentError Otherwise, returns :nucleotide or :protein Params: sequence_string: String to validate Output: nil, :nucleotide or :protein



86
87
88
89
90
91
92
93
94
95
# File 'lib/genevalidator/blast.rb', line 86

def type_of_sequences(fasta_format_string)
  # the first sequence does not need to have a fasta definition line
  sequences = fasta_format_string.split(/^>.*$/).delete_if(&:empty?)
  # get all sequence types
  sequence_types = sequences.collect { |seq| guess_sequence_type(seq) }
                            .uniq.compact

  return nil if sequence_types.empty?
  sequence_types.first if sequence_types.length == 1
end