Module: PasvLib::Alignment
- Defined in:
- lib/pasv_lib/alignment.rb
Constant Summary collapse
- GAP_CHARS =
If you need to check if a residue is a gap, use this Set.
Set.new %w[- .]
Instance Method Summary collapse
-
#adjust_scoring_matrix(scoring_matrix) ⇒ Object
If the overall min of the scoring matrix is < 0, then this scales it so that the overall min becomes zero.
-
#alignment_columns(seqs) ⇒ Array<Array<String>>
Get the columns of an aligned set of sequences.
- #gap?(residue) ⇒ Boolean
-
#geometric_index(aln) ⇒ Float
A wrapper for #geometric_score that takes the average of the by sequence and by residue scores for an alignment.
-
#geometric_score(aln, by) ⇒ Float
Calculate the geometric index for an alignment.
-
#similarity_score(seqs, scoring_matrix = Blosum::BLOSUM62) ⇒ Object
Returns the similarity score for the alignment.
-
#to_binary_matrix(aln) ⇒ Array<Array<integer>>
Convert an aligment to a binary matrix where 1 represents gaps and 0 represents residues.
Instance Method Details
#adjust_scoring_matrix(scoring_matrix) ⇒ Object
If the overall min of the scoring matrix is < 0, then this scales it so that the overall min becomes zero.
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
# File 'lib/pasv_lib/alignment.rb', line 14 def adjust_scoring_matrix scoring_matrix overal_min = scoring_matrix.values.map(&:values).flatten.min scaling_value = overal_min < 0 ? overal_min.abs : 0 # We want a deep copy to prevent things from getting weird while using the new hash later. new_matrix = {} scoring_matrix.each do |residue, scores| new_matrix[residue] = {} scores.each do |other_residue, score| new_matrix[residue][other_residue] = score + scaling_value end end new_matrix end |
#alignment_columns(seqs) ⇒ Array<Array<String>>
Get the columns of an aligned set of sequences.
Any spaces in the alignment are ignored (e.g., like those spaces NCBI puts in their alignments sometimes). Gap characters are ‘.’ or ‘-’
46 47 48 49 50 51 52 53 54 55 56 57 |
# File 'lib/pasv_lib/alignment.rb', line 46 def alignment_columns seqs seqs_no_spaces = seqs.map { |seq| seq.tr " ", "" } len = seqs_no_spaces.first.length seqs_no_spaces.map do |seq| unless seq.length == len raise PasvLib::Error, "Aligned seqs must be the same length" end seq.chars end.transpose end |
#gap?(residue) ⇒ Boolean
59 60 61 |
# File 'lib/pasv_lib/alignment.rb', line 59 def gap? residue GAP_CHARS.include? residue end |
#geometric_index(aln) ⇒ Float
A wrapper for #geometric_score that takes the average of the by sequence and by residue scores for an alignment.
98 99 100 101 102 103 |
# File 'lib/pasv_lib/alignment.rb', line 98 def geometric_index aln by_seq_score = geometric_score aln, "sequence" by_residue_score = geometric_score aln, "residue" (by_seq_score + by_residue_score) / 2.0 end |
#geometric_score(aln, by) ⇒ Float
The original Anvi’o code uses a clever bitshifting scheme to avoid storing array of arrays. It may also speed up the XOR part as you’re doing fewer XORs, but I’m not positive about that.
Calculate the geometric index for an alignment.
Basically, you change all residues to 0 and all gaps to 1. Then you take the permutations of the sequences and then the residues and XOR each of the permutations. Then you add it up and take averages and you’ll get the score. That is a pretty bad explanation, so see merenlab.org/2016/11/08/pangenomics-v2/#geometric-homogeneity-index for more information.
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 |
# File 'lib/pasv_lib/alignment.rb', line 73 def geometric_score aln, by binary_aln = to_binary_matrix aln if by == "residue" binary_aln = binary_aln.transpose end num_rows = binary_aln.length max_differences_per_row = binary_aln.first.length num_comparisions_per_row = num_rows - 1 diff_score = binary_aln.permutation(2).map do |(row1, row2)| row1.zip(row2).map do |elem1, elem2| elem1 ^ elem2 end.sum / max_differences_per_row.to_f end.sum / num_comparisions_per_row.to_f / num_rows 1 - diff_score end |
#similarity_score(seqs, scoring_matrix = Blosum::BLOSUM62) ⇒ Object
Technically you could get a negative score if you don’t have enough high scoring residue pairs to keep the total score above zero. If this is the case, you’re alignment probably isn’t very good. Alternatively, you could use #adjust_scoring_matrix to avoid this issue.…
Returns the similarity score for the alignment.
For each colunm, each pair of residues are scored using the similarity matrix to get a points accrued / max_points. Then all column scores are averaged.
114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 |
# File 'lib/pasv_lib/alignment.rb', line 114 def similarity_score seqs, scoring_matrix = Blosum::BLOSUM62 raise PasvLib::Error if seqs.empty? return 1.0 if seqs.count == 1 aln_cols = alignment_columns seqs actual_points = 0 max_points = 0 aln_cols.each do |residues| residues.map(&:upcase).combination(2).each do |r1, r2| unless gap?(r1) || gap?(r2) # Check that scoring matrix has the residues. [r1, r2].each do |res| unless scoring_matrix.has_key? res raise PasvLib::Error, "Residue '#{res}' is missing from the " \ "scoring matrix." end end r1_max_score = scoring_matrix[r1].values.max r2_max_score = scoring_matrix[r2].values.max pair_max = [r1_max_score, r2_max_score].max # TODO check that residues exist in the scoring matrix. actual_points += scoring_matrix[r1][r2] max_points += pair_max end end end if max_points.zero? raise PasvLib::Error, "Something went wrong and max_points was 0. " \ "Maybe one of your sequences is all gaps?" end actual_points / max_points.to_f end |
#to_binary_matrix(aln) ⇒ Array<Array<integer>>
Convert an aligment to a binary matrix where 1 represents gaps and 0 represents residues.
159 160 161 162 163 164 165 166 |
# File 'lib/pasv_lib/alignment.rb', line 159 def to_binary_matrix aln aln.map do |seq| seq.chars.map do |char| # Gaps turn to 1, residues turn to 0. gap?(char) ? 1 : 0 end end end |