Class: GeneValidator::AlignmentValidation

Inherits: ValidationTest

Inheritance: Object > ValidationTest > GeneValidator::AlignmentValidation

Defined in: lib/genevalidator/validation_alignment.rb

Overview

This class contains the methods necessary for validations based on a multiple alignment of the prediction and its best hits.

Instance Attribute Summary

#filename ⇒ Object readonly

Returns the value of attribute filename.
#index_file_name ⇒ Object readonly

Returns the value of attribute index_file_name.
#multiple_alignment ⇒ Object readonly

Returns the value of attribute multiple_alignment.
#raw_seq_file ⇒ Object readonly

Returns the value of attribute raw_seq_file.
#raw_seq_file_load ⇒ Object readonly

Returns the value of attribute raw_seq_file_load.

Attributes inherited from ValidationTest

#cli_name, #description, #header, #hits, #prediction, #running_time, #short_header, #type, #validation_report

Instance Method Summary

#array_to_ranges(ar) ⇒ Object

Converts an array of integers into an array of ranges.
#consensus_validation(prediction_raw, consensus) ⇒ Object

Returns the fraction of consensus residues from the multiple alignment that also appear in the prediction. Params: prediction_raw: String corresponding to the prediction sequence; consensus: String corresponding to the consensus of the aligned hits. Output: Float with the score.
#extra_sequence_validation(prediction_raw, sm) ⇒ Object

Returns the fraction of alignment positions where the prediction has a residue and the statistical model has a gap (extra sequence). Params: prediction_raw: String corresponding to the prediction sequence; sm: String corresponding to the statistical model. Output: Float with the score.
#gap_validation(prediction_raw, sm) ⇒ Object

Returns the fraction of alignment positions where the prediction has a gap and the statistical model does not. Params: prediction_raw: String corresponding to the prediction sequence; sm: String corresponding to the statistical model. Output: Float with the score.
#get_consensus(ma = @multiple_alignment) ⇒ Object

Returns the consensus regions among a set of multiple aligned sequences, i.e. positions where all sequences have the same element. Params: ma: Array of Strings corresponding to the multiple aligned sequences. Output: String with the consensus regions.
#get_sm_pssm(ma = @multiple_alignment, threshold = 0.7) ⇒ Object

Builds a statistical model from a set of multiple aligned sequences, based on a PSSM (Position-Specific Scoring Matrix). Params: ma: Array of Strings corresponding to the multiple aligned sequences; threshold: minimum fraction of sequences that must share the most common residue at a position for that residue to enter the model. Output: String representing the statistical model, and an Array with the frequency of the most common residue at each position.
#initialize(type, prediction, hits, filename, raw_seq_file, index_file_name, raw_seq_file_load, db, num_threads) ⇒ AlignmentValidation constructor

Initializes the object. Params: type: type of the predicted sequence (:nucleotide or :protein); prediction: a Sequence object representing the BLAST query; hits: an array of Sequence objects (representing BLAST hits); filename: name of the FASTA file; raw_seq_file: name of the FASTA file with raw sequences; index_file_name: name of the FASTA index file; raw_seq_file_load: String, loaded content of the index file; db: database passed on when a hit sequence is fetched by accession number; num_threads: number of threads passed to MAFFT.
#isalpha(str) ⇒ Object

Returns true if the string contains only letters and false otherwise.
#multiple_align_mafft(prediction = @prediction, hits = @hits) ⇒ Object

Builds the multiple alignment between all the hits and the prediction using the MAFFT tool, and stores it in @multiple_alignment. Params: prediction: a Sequence object representing the BLAST query; hits: an array of Sequence objects (usually representing BLAST hits). Output: Array of Strings corresponding to the multiple aligned sequences; the prediction is the last sequence in the array.
#plot_alignment(freq, output = "#{@filename}_ma.json", ma = @multiple_alignment) ⇒ Object

Generates a JSON file containing the data used for plotting lines for the multiple hits alignment, the prediction and the statistical model. Params: freq: Array of residue frequencies from the statistical model; output: filename of the JSON file; ma: Array of Strings with the multiple aligned hits and prediction.
#remove_isolated_residues(seq, len = 2) ⇒ Object

Removes isolated residues inside long gaps from a given sequence. Params: seq: String, sequence of residues; len: Integer, maximum length of an isolated run of residues to be removed. Output: String, the new sequence.
#run(n = 10) ⇒ Object

Finds gaps/extra regions based on the multiple alignment of the first n hits. Output: AlignmentValidationOutput object.

Constructor Details

#initialize(type, prediction, hits, filename, raw_seq_file, index_file_name, raw_seq_file_load, db, num_threads) ⇒ `AlignmentValidation`

Initializes the object.

Params:
type: type of the predicted sequence (:nucleotide or :protein)
prediction: a Sequence object representing the BLAST query
hits: an array of Sequence objects (representing BLAST hits)
filename: name of the FASTA file
raw_seq_file: name of the FASTA file with raw sequences
index_file_name: name of the FASTA index file
raw_seq_file_load: String, loaded content of the index file
db: database passed on when a hit sequence is fetched by accession number
num_threads: number of threads passed to MAFFT (--thread)

# File 'lib/genevalidator/validation_alignment.rb', line 101

def initialize(type, prediction, hits, filename, raw_seq_file,
               index_file_name, raw_seq_file_load, db, num_threads)
  super
  @short_header       = 'MA'
  @header             = 'Missing/Extra sequences'
  @description        = 'Finds missing and extra sequences in the' \
                        ' prediction, based on the multiple alignment of' \
                        ' the best hits. Also counts the percentage of' \
                        ' the conserved regions that appear in the' \
                        ' prediction.'
  @filename           = filename
  @raw_seq_file       = raw_seq_file
  @index_file_name    = index_file_name
  @raw_seq_file_load  = raw_seq_file_load
  @db                 = db
  @multiple_alignment = []
  @cli_name           = 'align'
  @num_threads        = num_threads
end

Instance Attribute Details

#filename ⇒ `Object` (readonly)

Returns the value of attribute filename.




# File 'lib/genevalidator/validation_alignment.rb', line 84

def filename
  @filename
end

#index_file_name ⇒ `Object` (readonly)

Returns the value of attribute index_file_name.




# File 'lib/genevalidator/validation_alignment.rb', line 87

def index_file_name
  @index_file_name
end

#multiple_alignment ⇒ `Object` (readonly)

Returns the value of attribute multiple_alignment.




# File 'lib/genevalidator/validation_alignment.rb', line 85

def multiple_alignment
  @multiple_alignment
end

#raw_seq_file ⇒ `Object` (readonly)

Returns the value of attribute raw_seq_file.




# File 'lib/genevalidator/validation_alignment.rb', line 86

def raw_seq_file
  @raw_seq_file
end

#raw_seq_file_load ⇒ `Object` (readonly)

Returns the value of attribute raw_seq_file_load.




# File 'lib/genevalidator/validation_alignment.rb', line 88

def raw_seq_file_load
  @raw_seq_file_load
end

Instance Method Details

#array_to_ranges(ar) ⇒ `Object`

Converts an array of integers into an array of ranges, one range per run of consecutive values.

# File 'lib/genevalidator/validation_alignment.rb', line 405

def array_to_ranges(ar)
  prev = ar[0]

  ranges = ar.slice_before { |e|
    prev, prev2 = e, prev
    prev2 + 1 != e
  }.map { |a| a[0]..a[-1] }

  ranges
end
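To illustrate the grouping, here is a minimal sketch that produces the same kind of output with `Enumerable#slice_when` (Ruby 2.2+) instead of the `slice_before` idiom used in the source. The name `ranges_from_sorted` is hypothetical, and the sketch assumes the input is sorted in ascending order.

```ruby
# Hypothetical helper, not part of GeneValidator: groups a sorted array of
# integers into ranges of consecutive values.
def ranges_from_sorted(indices)
  indices
    .slice_when { |a, b| b != a + 1 } # start a new chunk at every jump
    .map { |chunk| chunk.first..chunk.last }
end

p ranges_from_sorted([1, 2, 3, 7, 8, 10]) # prints [1..3, 7..8, 10..10]
```

A single isolated index becomes a one-element range such as `10..10`, and an empty array yields an empty result.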

#consensus_validation(prediction_raw, consensus) ⇒ `Object`

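Returns the fraction of consensus residues from the multiple alignment that also appear in the prediction.

Params:
prediction_raw: String corresponding to the prediction sequence
consensus: String corresponding to the consensus of the aligned hits

Output:
Float with the score, between 0 and 1 (the code returns 1 when the two strings differ in length or when there are no conserved residues)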

# File 'lib/genevalidator/validation_alignment.rb', line 322

def consensus_validation(prediction_raw, consensus)
  return 1 if prediction_raw.length != consensus.length
  # no of conserved residues among the hits
  no_conserved_residues = consensus.length - consensus.scan(/[\?-]/).length

  return 1 if no_conserved_residues == 0

  # no of conserved residues from the hits that appear in the prediction
  no_conserved_pred = consensus.split(//).each_index.select { |j| consensus[j] != '-' && consensus[j] != '?' && consensus[j] == prediction_raw[j] }.length

  no_conserved_pred / (no_conserved_residues + 0.0)
end
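The scoring rule can be checked by hand on a short example. The sketch below restates it in a compact form; `conserved_fraction` is a hypothetical name, and the input strings are made up for illustration.

```ruby
# Sketch of the scoring rule above; `conserved_fraction` is a hypothetical name.
def conserved_fraction(prediction, consensus)
  return 1 if prediction.length != consensus.length

  # Positions where the consensus holds a residue ('?' and '-' are skipped).
  conserved = (0...consensus.length).reject do |j|
    ['?', '-'].include?(consensus[j])
  end
  return 1 if conserved.empty?

  matched = conserved.count { |j| consensus[j] == prediction[j] }
  matched / conserved.length.to_f
end

# Four conserved positions (A, C, G, T); the prediction matches three of them.
puts conserved_fraction('ACDEGA', 'AC?-GT') # prints 0.75
```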

#extra_sequence_validation(prediction_raw, sm) ⇒ `Object`

Returns the fraction of alignment positions where the prediction has a residue and the statistical model has a gap (extra sequence).

Params:
prediction_raw: String corresponding to the prediction sequence
sm: String corresponding to the statistical model

Output:
Float with the score, between 0 and 1 (the code returns 1 when the two strings differ in length)

# File 'lib/genevalidator/validation_alignment.rb', line 303

def extra_sequence_validation(prediction_raw, sm)
  return 1 if prediction_raw.length != sm.length
  # find residues that are in the prediction
  # but not in the statistical model
  no_insertions = 0
  (0..sm.length - 1).each do |i|
    no_insertions += 1 if prediction_raw[i] != '-' && sm[i] == '-'
  end
  no_insertions / (sm.length + 0.0)
end

#gap_validation(prediction_raw, sm) ⇒ `Object`

Returns the fraction of alignment positions where the prediction has a gap and the statistical model does not.

Params:
prediction_raw: String corresponding to the prediction sequence
sm: String corresponding to the statistical model

Output:
Float with the score, between 0 and 1 (the code returns 1 when the two strings differ in length)

# File 'lib/genevalidator/validation_alignment.rb', line 284

def gap_validation(prediction_raw, sm)
  return 1 if prediction_raw.length != sm.length
  # find gaps in the prediction and
  # not in the statistical model
  no_gaps = 0
  (0..sm.length - 1).each do |i|
    no_gaps += 1 if prediction_raw[i] == '-' && sm[i] != '-'
  end
  no_gaps / (sm.length + 0.0)
end
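This method and #extra_sequence_validation apply mirror-image column tests to the same pair of strings. As a rough illustration, both scores can be computed with one hypothetical helper, `column_fraction`, that takes the column test as a block. The sequences are invented for the example.

```ruby
# Hypothetical helper: fraction of alignment columns for which the block
# returns true, given the prediction and model symbols of that column.
def column_fraction(prediction, model)
  return 1 if prediction.length != model.length

  hits = prediction.chars.zip(model.chars).count { |pred, mod| yield(pred, mod) }
  hits / model.length.to_f
end

prediction = 'MK--LVWA'
model      = 'MKTALV-A'

# Gap in the prediction where the model has a residue (columns 2 and 3).
gaps  = column_fraction(prediction, model) { |pred, mod| pred == '-' && mod != '-' }
# Residue in the prediction where the model has a gap (column 6).
extra = column_fraction(prediction, model) { |pred, mod| pred != '-' && mod == '-' }

puts gaps  # prints 0.25
puts extra # prints 0.125
```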

#get_consensus(ma = @multiple_alignment) ⇒ `Object`

Returns the consensus regions among a set of multiple aligned sequences, i.e. positions where all sequences have the same element.

Params:
ma: Array of Strings corresponding to the multiple aligned sequences

Output:
String with the consensus regions

# File 'lib/genevalidator/validation_alignment.rb', line 271

def get_consensus(ma = @multiple_alignment)
  align = Bio::Alignment.new(ma)
  align.consensus
end
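The method itself delegates to BioRuby's `Bio::Alignment#consensus`. The sketch below is not that implementation; it is a plain-Ruby illustration of the rule stated in the description (keep a symbol only where every sequence agrees), under the assumption that disagreeing columns are marked with '?'. The name `strict_consensus` is hypothetical.

```ruby
# Hypothetical illustration of a strict consensus: a column keeps its symbol
# only when all sequences agree; every other column becomes '?'.
def strict_consensus(sequences)
  width = sequences.first.length
  (0...width).map do |i|
    column = sequences.map { |seq| seq[i] }.uniq
    column.length == 1 ? column.first : '?'
  end.join
end

puts strict_consensus(%w[ACG-T ACT-T ACGAT]) # prints AC??T
```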

#get_sm_pssm(ma = @multiple_alignment, threshold = 0.7) ⇒ `Object`

Builds a statistical model from a set of multiple aligned sequences, based on a PSSM (Position-Specific Scoring Matrix).

Params:
ma: Array of Strings corresponding to the multiple aligned sequences
threshold: minimum fraction of sequences that must share the most common residue at a position for that residue to enter the model; positions below the threshold are written as '?'

Output:
A two-element Array: the String representing the statistical model, and an Array with the frequency of the most common residue at each position (0 where the most common symbol is a gap)

# File 'lib/genevalidator/validation_alignment.rb', line 345

def get_sm_pssm(ma = @multiple_alignment, threshold = 0.7)
  sm = ''
  freq = []
  (0..ma[0].length - 1).each do |i|
    freqs = Hash.new(0)
    ma.map { |seq| seq[i] }.each { |res| freqs[res] += 1 }
    # get the residue with the highest frequency
    max_freq = freqs.map { |_res, n| n }.max
    residue = (freqs.map { |res, n| n == max_freq ? res : [] }.flatten)[0]

    if residue == '-'
      freq.push(0)
    else
      freq.push(max_freq / (ma.length + 0.0))
    end

    if max_freq / (ma.length + 0.0) >= threshold
      sm << residue
    else
      sm << '?'
    end
  end
  [sm, freq]
end
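A small worked example makes the two outputs easier to read. The sketch below restates the per-column rule in compact form; `majority_model` is a hypothetical name and the four-sequence alignment is invented for illustration.

```ruby
# Sketch of the per-column majority rule; `majority_model` is a hypothetical name.
def majority_model(alignment, threshold = 0.7)
  model = +''
  freqs = []
  (0...alignment.first.length).each do |i|
    # Count how often each symbol occurs in column i.
    counts = alignment.each_with_object(Hash.new(0)) { |seq, h| h[seq[i]] += 1 }
    residue, count = counts.max_by { |_res, n| n }
    share = count / alignment.length.to_f

    freqs << (residue == '-' ? 0.0 : share)      # gap majorities are reported as 0
    model << (share >= threshold ? residue : '?') # weak majorities become '?'
  end
  [model, freqs]
end

model, freqs = majority_model(%w[ACD- ACD- AC-- AGKF])
puts model    # prints AC?-
p freqs       # prints [1.0, 0.75, 0.5, 0.0]
```

In the third column 'D' is the most common residue but is shared by only half of the sequences, so the model records '?'. In the last column the gap is the majority symbol, so the model keeps '-' while the reported frequency is 0.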

#isalpha(str) ⇒ `Object`

Returns true if the string contains only letters and false otherwise.




# File 'lib/genevalidator/validation_alignment.rb', line 399

def isalpha(str)
  !str.match(/[^A-Za-z]/)
end

#multiple_align_mafft(prediction = @prediction, hits = @hits) ⇒ `Object`

Builds the multiple alignment between all the hits and the prediction using the MAFFT tool, and stores it in @multiple_alignment. The `mafft` executable is invoked by name.

Params:
prediction: a Sequence object representing the BLAST query
hits: an array of Sequence objects (usually representing BLAST hits)

Output:
Array of Strings corresponding to the multiple aligned sequences; the prediction is the last sequence in the array

# File 'lib/genevalidator/validation_alignment.rb', line 240

def multiple_align_mafft(prediction = @prediction, hits = @hits)
  fail Exception unless prediction.is_a?(Sequence) && hits[0].is_a?(Sequence)

  options = ['--maxiterate', '1000', '--localpair', '--anysymbol',
             '--quiet', '--thread', "#{@num_threads}"]
  mafft = Bio::MAFFT.new('mafft', options)
  sequences = hits.map(&:raw_sequence)
  sequences.push(prediction.protein_translation)

  report = mafft.query_align(sequences)
  # Accesses the actual alignment.
  align = report.alignment

  align.each_with_index do |s, _i|
    @multiple_alignment.push(s.to_s)
  end

  @multiple_alignment
rescue Exception
  raise NoMafftInstallationError
end

#plot_alignment(freq, output = "#{@filename}_ma.json", ma = @multiple_alignment) ⇒ `Object`

Generates a JSON file containing the data used for plotting lines for the multiple hits alignment, the prediction and the statistical model.

Params:
freq: Array of residue frequencies from the statistical model
output: filename of the JSON file
ma: Array of Strings with the multiple aligned hits and prediction

Output:
Plot object describing the generated file

# File 'lib/genevalidator/validation_alignment.rb', line 422

def plot_alignment(freq, output = "#{@filename}_ma.json", ma = @multiple_alignment)
  # get indices of consensus in the multiple alignment
  consensus = get_consensus(@multiple_alignment[0..@multiple_alignment.length - 2])
  consensus_idxs = consensus.split(//).each_index.select { |j| isalpha(consensus[j]) }
  consensus_ranges = array_to_ranges(consensus_idxs)

  consensus_all = get_consensus(@multiple_alignment)
  consensus_all_idxs = consensus_all.split(//).each_index.select { |j| isalpha(consensus_all[j]) }
  consensus_all_ranges = array_to_ranges(consensus_all_idxs)

  match_alignment = ma[0..ma.length - 2].each_with_index.map { |seq, _j| seq.split(//).each_index.select { |j| isalpha(seq[j]) } }
  match_alignment_ranges = []
  match_alignment.each { |arr| match_alignment_ranges << array_to_ranges(arr) }

  query_alignment = ma[ma.length - 1].split(//).each_index.select { |j| isalpha(ma[ma.length - 1][j]) }
  query_alignment_ranges = array_to_ranges(query_alignment)

  len = ma[0].length

  f = File.open(output, 'w')
  f.write((
  # plot statistical model
  freq.each_with_index.map { |f, j| { 'y' => ma.length, 'start' => j, 'stop' => j + 1, 'color' => 'orange', 'height' => f } } +
  # hits
  match_alignment_ranges.each_with_index.map { |ranges, j| ranges.map { |range| { 'y' => ma.length - j - 1, 'start' => range.first, 'stop' => range.last, 'color' => 'red', 'height' => -1 } } }.flatten +
  ma[0..ma.length - 2].each_with_index.map { |_seq, j|
    consensus_ranges.map { |range| { 'y' => j + 1, 'start' => range.first, 'stop' => range.last, 'color' => 'yellow', 'height' => -1 } }
  }.flatten +
  # plot prediction
  [{ 'y' => 0, 'start' => 0, 'stop' => len, 'color' => 'gray', 'height' => -1 }] +
  query_alignment_ranges.map { |range| { 'y' => 0, 'start' => range.first, 'stop' => range.last, 'color' => 'red', 'height' => -1 } }.flatten +

  # plot consensus
  consensus_all_ranges.map { |range| { 'y' => 0, 'start' => range.first, 'stop' => range.last, 'color' => 'yellow', 'height' => -1 } }.flatten).to_json)

  f.close

  yAxisValues = 'Prediction'
  (1..ma.length - 1).each do |i|
    yAxisValues << ", hit&nbsp;#{i}"
  end

  yAxisValues << ', Statistical Model'

  Plot.new(output.scan(%r{([^/]+)$})[0][0],
           :align,
           'Missing/Extra sequences Validation: Multiple Align. & Statistical model of hits',
           'Conserved Region, Yellow',
           'Offset in the Alignment',
           '',
           ma.length + 1,
           yAxisValues)
end
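Each entry written to the JSON file is a hash with the keys 'y', 'start', 'stop', 'color' and 'height'. As a rough sketch of that record shape, the hypothetical helper `bars_for` below turns a list of ranges into such records for one row of the plot. In the source, hit and prediction records use a 'height' of -1, while statistical-model records carry the frequency instead.

```ruby
require 'json'

# Hypothetical helper: one record per range, in the shape written by
# plot_alignment for hit and prediction rows.
def bars_for(ranges, row, color)
  ranges.map do |range|
    { 'y' => row, 'start' => range.first, 'stop' => range.last,
      'color' => color, 'height' => -1 }
  end
end

bars = bars_for([0..2, 5..5], 1, 'red')
puts JSON.pretty_generate(bars)
```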

#remove_isolated_residues(seq, len = 2) ⇒ `Object`

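Removes isolated residues inside long gaps from a given sequence by replacing them with '-', then fills short isolated gaps between residues with '?'. The sequence is modified in place.

Params:
seq: String, sequence of residues
len: Integer, maximum length of an isolated run of residues to be removed

Output:
String, the new sequence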

# File 'lib/genevalidator/validation_alignment.rb', line 378

def remove_isolated_residues(seq, len = 2)
  gap_starts = seq.to_enum(:scan, /(-\w{1,#{len}}-)/i).map { |_m| $`.size + 1 }
  # remove isolated residues
  gap_starts.each do |i|
    (i..i + len - 1).each do |j|
      seq[j] = '-' if isalpha(seq[j])
    end
  end
  # remove isolated gaps
  res_starts = seq.to_enum(:scan, /([?\w]-{1,2}[?\w])/i).map { |_m| $`.size + 1 }
  res_starts.each do |i|
    (i..i + len - 1).each do |j|
      seq[j] = '?' if seq[j] == '-'
    end
  end
  seq
end
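The two passes are easier to follow on a concrete string. The sketch below re-expresses them with an explicit offset scan instead of the `to_enum(:scan, ...)` idiom, and works on a copy rather than modifying its argument. The names `match_offsets` and `clean_isolated` are hypothetical, and the input is invented for illustration.

```ruby
# Hypothetical helper: start offsets of all non-overlapping matches of +re+.
def match_offsets(str, re)
  offsets = []
  pos = 0
  while (m = re.match(str, pos))
    offsets << m.begin(0)
    pos = m.end(0)
  end
  offsets
end

# Sketch of the two passes; returns a cleaned copy of +seq+.
def clean_isolated(seq, len = 2)
  seq = seq.dup
  # Pass 1: short residue runs enclosed by gaps are turned into gaps.
  match_offsets(seq, /-\w{1,#{len}}-/).each do |start|
    (start + 1..start + len).each do |j|
      seq[j] = '-' if seq[j] =~ /[A-Za-z]/
    end
  end
  # Pass 2: one- or two-symbol gaps enclosed by residues are marked '?'.
  match_offsets(seq, /[?\w]-{1,2}[?\w]/).each do |start|
    (start + 1..start + len).each do |j|
      seq[j] = '?' if seq[j] == '-'
    end
  end
  seq
end

puts clean_isolated('MKV---A---LLT-PQR') # prints MKV-------LLT?PQR
```

The lone 'A' inside the long gap is absorbed into the gap, and the single gap between 'T' and 'P' is replaced with '?'.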

#run(n = 10) ⇒ `Object`

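Finds gaps/extra regions based on the multiple alignment of the first n hits (n is capped at 50).

Output:
AlignmentValidationOutput object, or a ValidationReport describing the problem when one of the handled errors occurs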

# File 'lib/genevalidator/validation_alignment.rb', line 126

def run(n = 10)
  n = 50 if n > 50

  fail NotEnoughHitsError unless hits.length >= n
  fail Exception unless prediction.is_a?(Sequence) &&
                        hits[0].is_a?(Sequence)
  start = Time.new
  # get the first n hits
  less_hits    = @hits[0..[n - 1, @hits.length].min]
  useless_hits = []

  # get raw sequences for less_hits
  less_hits.map do |hit|
    # get gene by accession number
    next unless hit.raw_sequence.nil?

    hit.get_sequence_from_index_file(@raw_seq_file, @index_file_name,
                                     hit.identifier, @raw_seq_file_load)

    if hit.raw_sequence.nil? || hit.raw_sequence.empty?
      seq_type = (hit.type == :protein) ? 'protein' : 'nucleotide'
      hit.get_sequence_by_accession_no(hit.accession_no, seq_type, @db)
    end

    # a single guarded check avoids calling empty? on a nil sequence
    useless_hits.push(hit) if hit.raw_sequence.nil? || hit.raw_sequence.empty?
  end

  useless_hits.each { |hit| less_hits.delete(hit) }

  fail NoInternetError if less_hits.length == 0
  # in case of nucleotide prediction sequence translate into protein
  # translate with the reading frame of all hits considered for alignment
  reading_frames = less_hits.map(&:reading_frame).uniq
  fail ReadingFrameError if reading_frames.length != 1

  if @type == :nucleotide
    s = Bio::Sequence::NA.new(prediction.raw_sequence)
    prediction.protein_translation = s.translate(reading_frames[0])
  end

  # multiple align sequences from less_hits with the prediction
  # the prediction is the last sequence in the vector
  multiple_align_mafft(prediction, less_hits)

  out = get_sm_pssm(@multiple_alignment[0..@multiple_alignment.length - 2])
  sm = out[0]
  freq = out[1]

  # remove isolated residues from the predicted sequence
  index          = @multiple_alignment.length - 1
  prediction_raw = remove_isolated_residues(@multiple_alignment[index])
  # remove isolated residues from the statistical model
  sm = remove_isolated_residues(sm)

  a1 = get_consensus(@multiple_alignment[0..@multiple_alignment.length - 2])

  plot1     = plot_alignment(freq)
  gaps      = gap_validation(prediction_raw, sm)
  extra_seq = extra_sequence_validation(prediction_raw, sm)
  consensus = consensus_validation(prediction_raw, a1)

  @validation_report = AlignmentValidationOutput.new(@short_header, @header,
                                                     @description, gaps,
                                                     extra_seq, consensus)
  @validation_report.plot_files.push(plot1)
  @validation_report.running_time = Time.now - start
  @validation_report

rescue NotEnoughHitsError
  @validation_report = ValidationReport.new('Not enough evidence',
                                            :warning, @short_header,
                                            @header, @description,
                                            @approach, @explanation,
                                            @conclusion)
rescue NoMafftInstallationError
  @validation_report = ValidationReport.new('Mafft error', :error,
                                            @short_header, @header,
                                            @description, @approach,
                                            @explanation, @conclusion)
  @validation_report.errors.push NoMafftInstallationError
rescue NoInternetError
  @validation_report = ValidationReport.new('Internet error', :error,
                                            @short_header, @header,
                                            @description, @approach,
                                            @explanation, @conclusion)
  @validation_report.errors.push NoInternetError
rescue ReadingFrameError
  @validation_report = ValidationReport.new('Multiple reading frames',
                                            :error, @short_header,
                                            @header, @description,
                                            @approach, @explanation,
                                            @conclusion)
  @validation_report.errors.push 'Multiple reading frames Error'
rescue Exception
  @validation_report = ValidationReport.new('Unexpected error', :error,
                                            @short_header, @header,
                                            @description, @approach,
                                            @explanation, @conclusion)
  @validation_report.errors.push 'Unexpected Error'
end
