Module: PasvLib::Alignment

Defined in:
lib/pasv_lib/alignment.rb

Constant Summary collapse

GAP_CHARS =

If you need to check if a residue is a gap, use this Set.

Set.new %w[- .]

Instance Method Summary collapse

Instance Method Details

#adjust_scoring_matrix(scoring_matrix) ⇒ Object

If the overall min of the scoring matrix is < 0, then this scales it so that the overall min becomes zero.

Parameters:

  • scoring_matrix

    The scoring matrix to rescale. E.g., Blosum::BLOSUM62.

Returns:

  • A new hash scaled to zero, or a deep copy of the old one if the overall min >= 0.



14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# File 'lib/pasv_lib/alignment.rb', line 14

def adjust_scoring_matrix scoring_matrix
  overal_min = scoring_matrix.values.map(&:values).flatten.min

  scaling_value = overal_min < 0 ? overal_min.abs : 0

  # We want a deep copy to prevent things from getting weird while using the new hash later.
  new_matrix = {}
  scoring_matrix.each do |residue, scores|
    new_matrix[residue] = {}
    scores.each do |other_residue, score|
      new_matrix[residue][other_residue] = score + scaling_value
    end
  end

  new_matrix
end

#alignment_columns(seqs) ⇒ Array<Array<String>>

Get the columns of an aligned set of sequences.

Any spaces in the alignment are ignored (e.g., like those spaces NCBI puts in their alignments sometimes). Gap characters are ‘.’ or ‘-’

Examples:

klass.alignment_columns ["AA-", "A-A"] #=> [["A", "A"], ["A", "-"], ["-", "A"]]

Parameters:

  • seqs (Array<String>)

    an array of sequences, normally aligned sequences

Returns:

  • (Array<Array<String>>)

    an array of alignment columns (which are arrays of single residue strings)

Raises:

  • PasvLib::Error if seqs.empty?

  • PasvLib::Error if no comparisons can be made, e.g., all gaps

  • PasvLib::Error if not all sequences are the same length



46
47
48
49
50
51
52
53
54
55
56
57
# File 'lib/pasv_lib/alignment.rb', line 46

def alignment_columns seqs
  seqs_no_spaces = seqs.map { |seq| seq.tr " ", "" }
  len            = seqs_no_spaces.first.length

  seqs_no_spaces.map do |seq|
    unless seq.length == len
      raise PasvLib::Error, "Aligned seqs must be the same length"
    end

    seq.chars
  end.transpose
end

#gap?(residue) ⇒ Boolean

Returns:

  • (Boolean)


59
60
61
# File 'lib/pasv_lib/alignment.rb', line 59

def gap? residue
  GAP_CHARS.include? residue
end

#geometric_index(aln) ⇒ Float

A wrapper for #geometric_score that takes the average of the by sequence and by residue scores for an alignment.

Parameters:

  • aln (Array<String>)

    aligned sequenecs

Returns:

  • (Float)

    a score between 0 and 1, with 1 being very homogeneous and 1 being very heterogeneous.



98
99
100
101
102
103
# File 'lib/pasv_lib/alignment.rb', line 98

def geometric_index aln
  by_seq_score     = geometric_score aln, "sequence"
  by_residue_score = geometric_score aln, "residue"

  (by_seq_score + by_residue_score) / 2.0
end

#geometric_score(aln, by) ⇒ Float

Note:

The original Anvi’o code uses a clever bitshifting scheme to avoid storing array of arrays. It may also speed up the XOR part as you’re doing fewer XORs, but I’m not positive about that.

Calculate the geometric index for an alignment.

Basically, you change all residues to 0 and all gaps to 1. Then you take the permutations of the sequences and then the residues and XOR each of the permutations. Then you add it up and take averages and you’ll get the score. That is a pretty bad explanation, so see merenlab.org/2016/11/08/pangenomics-v2/#geometric-homogeneity-index for more information.

Parameters:

  • aln (Array<String>)

    aligned sequenecs

  • by (String)

    either “sequence” or “residue”. Controls whether to do the calculation by sequence, or by residue.

Returns:

  • (Float)

    a score between 0 and 1, with 1 being very homogeneous and 1 being very heterogeneous.



73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
# File 'lib/pasv_lib/alignment.rb', line 73

def geometric_score aln, by
  binary_aln = to_binary_matrix aln

  if by == "residue"
    binary_aln = binary_aln.transpose
  end

  num_rows                 = binary_aln.length
  max_differences_per_row  = binary_aln.first.length
  num_comparisions_per_row = num_rows - 1

  diff_score = binary_aln.permutation(2).map do |(row1, row2)|
    row1.zip(row2).map do |elem1, elem2|
      elem1 ^ elem2
    end.sum / max_differences_per_row.to_f
  end.sum / num_comparisions_per_row.to_f / num_rows

  1 - diff_score
end

#similarity_score(seqs, scoring_matrix = Blosum::BLOSUM62) ⇒ Object

Note:

Technically you could get a negative score if you don’t have enough high scoring residue pairs to keep the total score above zero. If this is the case, you’re alignment probably isn’t very good. Alternatively, you could use #adjust_scoring_matrix to avoid this issue.…

Returns the similarity score for the alignment.

For each colunm, each pair of residues are scored using the similarity matrix to get a points accrued / max_points. Then all column scores are averaged.

Raises:

  • PasvLib::Error if any residue is not present in the scoring matrix.

  • PasvLib::Error if seqs is empty

  • PasvLib::Error if max_points is zero. Could happen if one of the seqs has all gaps, or no omparisons could be made.



114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
# File 'lib/pasv_lib/alignment.rb', line 114

def similarity_score seqs, scoring_matrix = Blosum::BLOSUM62
  raise PasvLib::Error if seqs.empty?
  return 1.0 if seqs.count == 1

  aln_cols = alignment_columns seqs

  actual_points = 0
  max_points    = 0

  aln_cols.each do |residues|
    residues.map(&:upcase).combination(2).each do |r1, r2|
      unless gap?(r1) || gap?(r2)
        # Check that scoring matrix has the residues.
        [r1, r2].each do |res|
          unless scoring_matrix.has_key? res
            raise PasvLib::Error, "Residue '#{res}' is missing from the " \
                                "scoring matrix."
          end
        end

        r1_max_score = scoring_matrix[r1].values.max
        r2_max_score = scoring_matrix[r2].values.max
        pair_max     = [r1_max_score, r2_max_score].max

        # TODO check that residues exist in the scoring matrix.
        actual_points += scoring_matrix[r1][r2]
        max_points    += pair_max
      end
    end
  end

  if max_points.zero?
    raise PasvLib::Error, "Something went wrong and max_points was 0.  " \
                          "Maybe one of your sequences is all gaps?"

  end

  actual_points / max_points.to_f
end

#to_binary_matrix(aln) ⇒ Array<Array<integer>>

Convert an aligment to a binary matrix where 1 represents gaps and 0 represents residues.

Parameters:

  • aln (Array<String>)

    an array of (aligned) sequences

Returns:

  • (Array<Array<integer>>)

    an array of arrays. Each row is a sequence.



159
160
161
162
163
164
165
166
# File 'lib/pasv_lib/alignment.rb', line 159

def to_binary_matrix aln
  aln.map do |seq|
    seq.chars.map do |char|
      # Gaps turn to 1, residues turn to 0.
      gap?(char) ? 1 : 0
    end
  end
end