Class: ViralSeq::SeqHash

Inherits:
Object
  • Object
show all
Defined in:
lib/viral_seq/seq_hash.rb,
lib/viral_seq/hivdr.rb

Overview

ViralSeq::SeqHash class for operation on multiple sequences.

Examples:

read a FASTA sequence file of HIV PR sequences, make alignment, perform the QC location check, filter sequences with stop codons and APOBEC3g/f hypermutations, calculate pairwise diversity, calculate minority cut-off based on Poisson model, and examine for drug resistance mutations.

my_pr_seqhash = ViralSeq::SeqHash.fa('my_pr_fasta_file.fasta')
  # new ViralSeq::SeqHash object from a FASTA file
aligned_pr_seqhash = my_pr_seqhash.align
  # align with MUSCLE
filtered_seqhash = aligned_pr_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
  # filter nt sequences with the reference coordinates
filtered_seqhash = aligned_pr_seqhash.stop_codon[:without_stop_codon]
  # return a new ViralSeq::SeqHash object without stop codons
filtered_seqhash = filtered_seqhash.a3g[1]
  # further filter out sequences with A3G hypermutations
filtered_seqhash.pi
  # return pairwise diveristy π
cut_off = filtered_seqhash.pm
  # return cut-off for minority variants based on Poisson model
filtered_seqhash.sdrm_hiv_pr(cut_off)
  # examine for drug resistance mutations for PR region.

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(dna_hash = {}, aa_hash = {}, qc_hash = {}, title = "", file = "") ⇒ SeqHash

initialize a ViralSeq::SeqHash object



25
26
27
28
29
30
31
# File 'lib/viral_seq/seq_hash.rb', line 25

def initialize (dna_hash = {}, aa_hash = {}, qc_hash = {}, title = "", file = "")
  @dna_hash = dna_hash
  @aa_hash = aa_hash
  @qc_hash = qc_hash
  @title = title
  @file = file
end

Instance Attribute Details

#aa_hashHash

Returns Hash object for :name => :amino_acid_sequence_string pairs.

Returns:

  • (Hash)

    Hash object for :name => :amino_acid_sequence_string pairs



37
38
39
# File 'lib/viral_seq/seq_hash.rb', line 37

def aa_hash
  @aa_hash
end

#dna_hashHash

Returns Hash object for :name => :sequence_string pairs.

Returns:

  • (Hash)

    Hash object for :name => :sequence_string pairs



34
35
36
# File 'lib/viral_seq/seq_hash.rb', line 34

def dna_hash
  @dna_hash
end

#fileString

Returns the file that is used to initialize SeqHash object, if it exists.

Returns:

  • (String)

    the file that is used to initialize SeqHash object, if it exists



47
48
49
# File 'lib/viral_seq/seq_hash.rb', line 47

def file
  @file
end

#qc_hashHash

Returns Hash object for :name => :qc_score_string pairs.

Returns:

  • (Hash)

    Hash object for :name => :qc_score_string pairs



40
41
42
# File 'lib/viral_seq/seq_hash.rb', line 40

def qc_hash
  @qc_hash
end

#titleString

default as the file basename if SeqHash object is initialized using ::fa or ::fq

Returns:

  • (String)

    the title of the SeqHash object.



44
45
46
# File 'lib/viral_seq/seq_hash.rb', line 44

def title
  @title
end

Class Method Details

.new_from_aa_fasta(infile) ⇒ ViralSeq::SeqHash Also known as: aa_fa

initialize a new ViralSeq::SeqHash object from a FASTA format sequence file of amino acid sequences

Parameters:

  • infile (String)

    path to the FASTA format sequence file of aa sequences

Returns:



82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
# File 'lib/viral_seq/seq_hash.rb', line 82

def self.new_from_aa_fasta(infile)
  f=File.open(infile,"r")
  return_hash = {}
  name = ""
  while line = f.gets do
    line.tr!("\u0000","")
    next if line == "\n"
    next if line =~ /^\=/
    if line =~ /^\>/
      name = line.chomp
      return_hash[name] = ""
    else
      return_hash[name] += line.chomp.upcase
    end
  end
  f.close
  seq_hash = ViralSeq::SeqHash.new
  seq_hash.aa_hash = return_hash
  seq_hash.title = File.basename(infile,".*")
  seq_hash.file = infile
  return seq_hash
end

.new_from_array(seq_array, master_tag = 'seq') ⇒ ViralSeq::SeqHash Also known as: array

initialize a ViralSeq::SeqHash object with an array of sequence strings

Parameters:

  • master_tag (String) (defaults to: 'seq')

    master tag to put in the sequence names

Returns:



148
149
150
151
152
153
154
155
156
157
158
159
# File 'lib/viral_seq/seq_hash.rb', line 148

def self.new_from_array(seq_array,master_tag = 'seq')
  n = 1
  hash = {}
  seq_array.each do |seq|
    hash[master_tag + "_" + n.to_s] = seq
    n += 1
  end
  seq_hash = ViralSeq::SeqHash.new
  seq_hash.dna_hash = hash
  seq_hash.title = master_tag
  return seq_hash
end

.new_from_fasta(infile) ⇒ ViralSeq::SeqHash Also known as: fa

initialize a new ViralSeq::SeqHash object from a FASTA format sequence file

Examples:

new ViralSeq::SeqHash object from a FASTA file

ViralSeq::SeqHash.fa('my_fasta_file.fasta')

Parameters:

  • infile (String)

    path to the FASTA format sequence file

Returns:



55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
# File 'lib/viral_seq/seq_hash.rb', line 55

def self.new_from_fasta(infile)
  f=File.open(infile,"r")
  return_hash = {}
  name = ""
  while line = f.gets do
    line.tr!("\u0000","")
    next if line == "\n"
    next if line =~ /^\=/
    if line =~ /^\>/
      name = line.chomp
      return_hash[name] = ""
    else
      return_hash[name] += line.chomp.upcase
    end
  end
  f.close
  seq_hash = ViralSeq::SeqHash.new
  seq_hash.dna_hash = return_hash
  seq_hash.title = File.basename(infile,".*")
  seq_hash.file = infile
  return seq_hash
end

.new_from_fastq(fastq_file) ⇒ ViralSeq::SeqHash Also known as: fq

initialize a new ViralSeq::SeqHash object from a FASTQ format sequence file

Examples:

new ViralSeq::SeqHash object from a FASTQ file

ViralSeq::SeqHash.fq('my_fastq_file.fastq')

Parameters:

  • fastq_file (String)

    path to the FASTA format sequence file

Returns:



111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
# File 'lib/viral_seq/seq_hash.rb', line 111

def self.new_from_fastq(fastq_file)
  count = 0
  sequence_a = []
  quality_a = []
  count_seq = 0

  File.open(fastq_file,'r') do |file|
    file.readlines.collect do |line|
      count +=1
      count_m = count % 4
      if count_m == 1
        line.tr!('@','>')
        sequence_a << line.chomp
        quality_a << line.chomp
        count_seq += 1
      elsif count_m == 2
        sequence_a << line.chomp
      elsif count_m == 0
        quality_a << line.chomp
      end
    end
  end
  sequence_hash = Hash[sequence_a.each_slice(2).to_a]
  quality_hash = Hash[quality_a.each_slice(2).to_a]

  seq_hash = ViralSeq::SeqHash.new
  seq_hash.dna_hash = sequence_hash
  seq_hash.qc_hash = quality_hash
  seq_hash.title = File.basename(fastq_file,".*")
  seq_hash.file = fastq_file
  return seq_hash
end

Instance Method Details

#+(sh2) ⇒ ViralSeq::SeqHash

combine SeqHash objects

Parameters:

Returns:



180
181
182
183
184
185
186
187
188
# File 'lib/viral_seq/seq_hash.rb', line 180

def +(sh2)
  new_seqhash = ViralSeq::SeqHash.new
  new_seqhash.dna_hash = self.dna_hash.merge(sh2.dna_hash)
  new_seqhash.aa_hash = self.aa_hash.merge(sh2.aa_hash)
  new_seqhash.qc_hash = self.qc_hash.merge(sh2.qc_hash)
  new_seqhash.title = self.title + "_with_" + sh2.title
  new_seqhash.file = self.file + "," + sh2.file
  return new_seqhash
end

#a3g_hypermutHash Also known as: a3g

function to determine if the sequences have APOBEC3g/f hypermutation.

# APOBEC3G/F pattern: GRD -> ARD
# control pattern: G[YN|RC] -> A[YN|RC]
# use the sample consensus to determine potential a3g sites
# Two criteria to identify hypermutation
# 1. Fisher's exact test on the frequencies of G to A mutation at A3G positions vs. non-A3G positions
# 2. Poisson distribution of G to A mutations at A3G positions, outliers sequences
# note:  criteria 2 only applies on a sequence file containing more than 20 sequences,
#        b/c Poisson model does not do well on small sample size.

Examples:

identify apobec3gf mutations from a sequence fasta file

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_a3g_sequence1.fasta')
hypermut = my_seqhash.a3g
hypermut[:a3g_seq].dna_hash.keys
=> [">Seq7", ">Seq14"]
hypermut[:filtered_seq].dna_hash.keys
=> [">Seq1", ">Seq2", ">Seq5"]
hypermut[:stats]
=> [[">Seq7", 23, 68, 1, 54, 18.26, 4.308329383112348e-06], [">Seq14", 45, 68, 9, 54, 3.97, 5.2143571971582974e-08]]

identify apobec3gf mutations from another sequence fasta file

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_a3g_sequence2.fasta')
hypermut = my_seqhash.a3g
hypermut[:stats]
=> [[">CTAACACTCA_134_a3g-sample2", 4, 35, 0, 51, Infinity, 0.02465676660128911], [">ATAGTGCCCA_60_a3g-sample2", 4, 35, 1, 51, 5.83, 0.1534487353839561]]
# notice sequence ">ATAGTGCCCA_60_a3g-sample2" has a p value at 0.15, greater than 0.05,
# but it is still called as hypermutation sequence b/c it's Poisson outlier sequence.

Returns:

  • (Hash)

    three paris. :a3g_seq: a ViralSeq:SeqHash object for sequences with hypermutations :filtered_seq : a ViralSeq:SeqHash object for sequences without hypermutations :stats : a two-demensional array ‘[[a,b], [c,d]]` for statistic_info, including the following information,

    # sequence tag
    # G to A mutation numbers at potential a3g positions
    # total potential a3g G positions
    # G to A mutation numbers at non a3g positions
    # total non a3g G positions
    # a3g G to A mutation rate / non-a3g G to A mutation rate
    # Fishers Exact P-value
    

See Also:



442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
# File 'lib/viral_seq/seq_hash.rb', line 442

def a3g_hypermut
  # mut_hash number of apobec3g/f mutations per sequence
  mut_hash = {}
  hm_hash = {}
  out_hash = {}

  # total G->A mutations at apobec3g/f positions.
  total = 0

  # make consensus sequence for the input sequence hash
  ref = self.consensus

  # obtain apobec3g positions and control positions
  apobec = apobec3gf(ref)
  mut = apobec[0]
  control = apobec[1]

  self.dna_hash.each do |k,v|
    a = 0 # muts
    b = 0 # potential mut sites
    c = 0 # control muts
    d = 0 # potenrial controls
    mut.each do |n|
      next if v[n] == "-"
      if v[n] == "A"
        a += 1
        b += 1
      else
        b += 1
      end
    end
    mut_hash[k] = a
    total += a

    control.each do |n|
      next if v[n] == "-"
      if v[n] == "A"
        c += 1
        d += 1
      else
        d += 1
      end
    end
    rr = (a/b.to_f)/(c/d.to_f)

    t1 = b - a
    t2 = d - c

    fet = ViralSeq::Rubystats::FishersExactTest.new
    fisher = fet.calculate(t1,t2,a,c)
    perc = fisher[:twotail]
    info = [k, a, b, c, d, rr.round(2), perc]
    out_hash[k] = info
    if perc < 0.05
      hm_hash[k] = info
    end
  end

  if self.dna_hash.size > 20
    rate = total.to_f/(self.dna_hash.size)
    count_mut = mut_hash.values.count_freq
    maxi_count = count_mut.values.max
    poisson_hash = ViralSeq::Math::PoissonDist.new(rate,maxi_count).poisson_hash
    cut_off = 0
    poisson_hash.each do |k,v|
      cal = self.dna_hash.size * v
      obs = count_mut[k]
      if obs >= 20 * cal
        cut_off = k
        break
      elsif k == maxi_count
        cut_off = maxi_count
      end
    end
    mut_hash.each do |k,v|
      if v > cut_off
        hm_hash[k] = out_hash[k]
      end
    end
  end
  hm_seq_hash = ViralSeq::SeqHash.new
  hm_hash.each do |k,_v|
    hm_seq_hash.dna_hash[k] = self.dna_hash[k]
  end
  hm_seq_hash.title = self.title + "_hypermut"
  hm_seq_hash.file = self.file
  filtered_seq_hash = self.sub(self.dna_hash.keys - hm_hash.keys)
  return { a3g_seq: hm_seq_hash,
           filtered_seq: filtered_seq_hash,
           stats: hm_hash.values
          }
end

#align(path_to_muscle = false) ⇒ SeqHash

align the @dna_hash sequences, return a new ViralSeq::SeqHash object with aligned @dna_hash using MUSCLE

Parameters:

  • path_to_muscle (String) (defaults to: false)

    , path to MUSCLE excutable. if not provided (as default), it will use RubyGem::MuscleBio

Returns:

  • (SeqHash)

    new SeqHash object of the aligned @dna_hash, the title has “_aligned”



578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
# File 'lib/viral_seq/seq_hash.rb', line 578

def align(path_to_muscle = false)
  seq_hash = self.dna_hash
  if self.file.size > 0
    temp_dir = File.dirname(self.file)
  else
    temp_dir=File.dirname($0)
  end

  temp_file = temp_dir + "/_temp_muscle_in"
  temp_aln = temp_dir + "/_temp_muscle_aln"
  File.open(temp_file, 'w'){|f| seq_hash.each {|k,v| f.puts k; f.puts v}}
  if path_to_muscle
    unless ViralSeq.check_muscle?(path_to_muscle)
      File.unlink(temp_file)
      return nil
    end
    print `#{path_to_muscle} -in #{temp_file} -out #{temp_aln} -quiet`
  else
    MuscleBio.run("muscle -in #{temp_file} -out #{temp_aln} -quiet")
  end
  out_seq_hash = ViralSeq::SeqHash.fa(temp_aln)
  out_seq_hash.title = self.title + "_aligned"
  out_seq_hash.file = self.file
  File.unlink(temp_file)
  File.unlink(temp_aln)
  return out_seq_hash
end

#collapse(cutoff = 1) ⇒ ViralSeq::SeqHash

Collapse sequences by difference cut-offs. Suggesting aligning before using this function.

Parameters:

  • cutoff (Integer) (defaults to: 1)

    nt base differences. collapse sequences within [cutoff] differences

Returns:



883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
# File 'lib/viral_seq/seq_hash.rb', line 883

def collapse(cutoff=1)
  seq_array = self.dna_hash.values
  new_seq_freq = {}
  seq_freq = seq_array.count_freq
  if seq_freq.size == 1
    new_seq_freq = seq_freq
  else
    uniq_seq = seq_freq.keys
    unique_seq_pair = uniq_seq.combination(2)
    dupli_seq = []
    unique_seq_pair.each do |pair|
      seq1 = pair[0]
      seq2 = pair[1]
      diff = seq1.compare_with(seq2)
      if diff <= cutoff
        freq1 = seq_freq[seq1]
        freq2 = seq_freq[seq2]
        freq1 >= freq2 ? dupli_seq << seq2 : dupli_seq << seq1
      end
    end

    seq_freq.each do |seq,freq|
      unless dupli_seq.include?(seq)
        new_seq_freq[seq] = freq
      end
    end
  end
  seqhash = ViralSeq::SeqHash.new
  n = 1
  new_seq_freq.each do |seq,freq|
    name = ">seq_" + n.to_s + '_' + freq.to_s
    seqhash.dna_hash[name] = seq
    n += 1
  end
  return seqhash
end

#consensus(cutoff = 0.5) ⇒ String

create one consensus sequence from @dna_hash with an optional majority cut-off for mixed bases.

Examples:

consensus sequence from an array of sequences.

seq_array = %w{ ATTTTTTTTT
                AATTTTTTTT
                AAATTTTTTT
                AAAATTTTTT
                AAAAATTTTT
                AAAAAATTTT
                AAAAAAATTT
                AAAAAAAATT
                AAAAAAAAAT
                AAAAAAAAAA }
my_seqhash = ViralSeq::SeqHash.array(seq_array)
my_seqhash.consensus
=> 'AAAAAWTTTT'
my_seqhash.consensus(0.7)
=> 'AAAANNNTTT'

Parameters:

  • cutoff (Float) (defaults to: 0.5)

    majority cut-off for calling consensus bases. defult at (0.5), position with 15% “A” and 85% “G” will be called as “G” with 20% cut-off and as “R” with 10% cut-off. Using (0) will return use simply majority rule (no cutoff)

Returns:

  • (String)

    consensus sequence



373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
# File 'lib/viral_seq/seq_hash.rb', line 373

def consensus(cutoff = 0.5)
  seq_array = self.dna_hash.values
  seq_length = seq_array[0].size
  seq_size = seq_array.size
  consensus_seq = ""
  (0..(seq_length - 1)).each do |position|
    all_base = []
    seq_array.each do |seq|
      all_base << seq[position]
    end
    base_count = all_base.count_freq
    max_base_list = []

    if cutoff.zero?
      max_count = base_count.values.max
      max_base_hash = base_count.select {|_k,v| v == max_count}
      max_base_list = max_base_hash.keys
    else
      base_count.each do |k,v|
        if v/seq_size.to_f >= cutoff
          max_base_list << k
        end
      end
    end

    consensus_seq += call_consensus_base(max_base_list)
  end
  return consensus_seq
end

#error_table(ref = self.consensus, head = true) ⇒ Array

return an table of frequencies of nucleotides at each position.

Examples:

error table for an array of sequences

array = %w{ AACCGGTT
            AGCCGGTT
            AACTGCTT
            AACCGTTA
            AACCGGTA }
my_seqhash = ViralSeq::SeqHash.array(array)
my_seqhash.error_table.each {|r| puts r.join(',')}
  position,consensus,total_seq_number,A,C,G,T
  1,A,5,-,,,
  2,A,5,-,,0.2,
  3,C,5,,-,,
  4,C,5,,-,,0.2
  5,G,5,,,-,
  6,G,5,,0.2,-,0.2
  7,T,5,,,,-
  8,T,5,0.4,,,-

Parameters:

  • ref (String) (defaults to: self.consensus)

    a reference sequence to compare with, default as the sample consensus sequence

  • head (Boolean) (defaults to: true)

    if the head of table is included.

Returns:

  • (Array)

    a two-dimension array of the frequency table, including the following info:

    position on the sequence (starting from 1)
    consensus nucleotide
    total sequence numbers
    percentage of A, shows "-" if agrees with consensus
    percentage of C, shows "-" if agrees with consensus
    percentage of G, shows "-" if agrees with consensus
    percentage of T, shows "-" if agrees with consensus
    


1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
# File 'lib/viral_seq/seq_hash.rb', line 1099

def error_table(ref = self.consensus, head = true)

  table = []
  if head
    table << %w{
      position
      consensus
      total_seq_number
      A
      C
      G
      T
    }
  end
  ref_size = ref.size

  (0..(ref_size - 1)).each do |position|
    ref_base = ref[position]
    nts = []

    self.dna_hash.each do |_k,v|
      nts << v[position]
    end

    freq = nts.count_freq
    freq2 = {}

    freq.each do |nt,c|
      if nt == ref_base
        freq2[nt] = '-'
      else
        freq2[nt] = (c/(self.size).to_f)
      end
    end

    table << [(position + 1),ref_base,self.size,freq2['A'],freq2['C'],freq2['G'],freq2['T']]
  end

  return table

end

#filter_similar_pid(cutoff = 10) ⇒ ViralSeq::SeqHash

Remove squences with residual offspring Primer IDs.

Compare PID with sequences which have identical sequences.
PIDs differ by 1 base will be recognized. If PID1 is x time (cutoff) greater than PID2, PID2 will be disgarded.
  each sequence tag starting with ">" and the Primer ID sequence
  followed by the number of Primer ID appeared in the raw sequence
  the information sections in the tags are separated by underscore "_"
  example sequence tag: >AGGCGTAGA_32_sample1_RT

Parameters:

  • cutoff (Integer) (defaults to: 10)

    the fold cut-off to remove the potential residual offspring Primer IDs

Returns:

  • (ViralSeq::SeqHash)

    a new SeqHash object without sqeuences containing residual offspring Primer ID



821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
# File 'lib/viral_seq/seq_hash.rb', line 821

def filter_similar_pid(cutoff = 10)
  seq = self.dna_hash.dup
  uni_seq = seq.values.uniq
  uni_seq_pid = {}
  uni_seq.each do |k|
    seq.each do |name,s|
      name = name[1..-1]
      if k == s
        if uni_seq_pid[k]
          uni_seq_pid[k] << [name.split("_")[0],name.split("_")[1]]
        else
          uni_seq_pid[k] = []
          uni_seq_pid[k] << [name.split("_")[0],name.split("_")[1]]
        end
      end
    end
  end

  dup_pid = []
  uni_seq_pid.values.each do |v|
    next if v.size == 1
    pid_hash = Hash[v]
    list = pid_hash.keys
    list2 = Array.new(list)
    pairs = []

    list.each do |k|
      list2.delete(k)
      list2.each do |k1|
        pairs << [k,k1]
      end
    end

    pairs.each do |p|
      pid1 = p[0]
      pid2 = p[1]
      if pid1.compare_with(pid2) <= 1
        n1 = pid_hash[pid1].to_i
        n2 = pid_hash[pid2].to_i
        if n1 >= cutoff * n2
          dup_pid << pid2
        elsif n2 >= cutoff * n1
          dup_pid << pid1
        end
      end
    end
  end

  new_seq = {}
  seq.each do |name,s|
    pid = name.split("_")[0][1..-1]
    unless dup_pid.include?(pid)
      new_seq[name] = s
    end
  end
  self.sub(new_seq.keys)
end

#gap_strip(option = :nt) ⇒ ViralSeq::SeqHash

gap strip from a sequence alignment, all positions that contains gaps (‘-’) will be removed

Examples:

gap strip for an array of sequences

array = ["AACCGGTT", "A-CCGGTT", "AAC-GGTT", "AACCG-TT", "AACCGGT-"]
array = %w{ AACCGGTT
            A-CCGGTT
            AAC-GGTT
            AACCG-TT
            AACCGGT- }
my_seqhash = ViralSeq::SeqHash.array(array)
puts my_seqhash.gap_strip.dna_hash.values
  ACGT
  ACGT
  ACGT
  ACGT
  ACGT

Parameters:

  • option (Symbol) (defaults to: :nt)

    sequence options for ‘:nt` or `:aa`

Returns:

  • (ViralSeq::SeqHash)

    a new SeqHash object containing nt or aa sequences without gaps



938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
# File 'lib/viral_seq/seq_hash.rb', line 938

def gap_strip(option = :nt)
  if option == :nt
    sequence_alignment = self.dna_hash
  elsif option == :aa
    sequence_alignment = self.aa_hash
  else
    raise "Option `#{option}` not recognized"
  end

  new_seq = {}
  seq_size = sequence_alignment.values[0].size
  seq_matrix = {}
  (0..(seq_size - 1)).each do |p|
    seq_matrix[p] = []
    sequence_alignment.values.each do |s|
      seq_matrix[p] << s[p]
    end
  end

  seq_matrix.delete_if do |_p, list|
    list.include?("-")
  end

  sequence_alignment.each do |n,s|
    new_s = ""
    seq_matrix.keys.each {|p| new_s += s[p]}
    new_seq[n] = new_s
  end
  new_seq_hash = ViralSeq::SeqHash.new
  if option == :nt
    new_seq_hash.dna_hash = new_seq
    new_seq_hash.aa_hash = self.aa_hash
  elsif option == :aa
    new_seq_hash.dna_hash = self.dna_hash
    new_seq_hash.aa_hash = new_seq
  end
  new_seq_hash.qc_hash = self.qc_hash
  new_seq_hash.title = self.title + "_strip"
  new_seq_hash.file = self.file
  return new_seq_hash
end

#gap_strip_ends(option = :nt) ⇒ ViralSeq::SeqHash

gap strip from a sequence alignment at both ends, only positions at the ends that contains gaps (‘-’) will be removed.

Examples:

gap strip for an array of sequences only at the ends

array = %w{ AACCGGTT
            A-CCGGTT
            AAC-GGTT
            AACCG-TT
            AACCGGT- }
my_seqhash = ViralSeq::SeqHash.array(array)
puts my_seqhash.gap_strip_ends.dna_hash.values
  AACCGGT
  A-CCGGT
  AAC-GGT
  AACCG-T
  AACCGGT

Parameters:

  • option (Symbol) (defaults to: :nt)

    sequence options for ‘:nt` or `:aa`

Returns:

  • (ViralSeq::SeqHash)

    a new SeqHash object containing nt or aa sequences without gaps at the ends



997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
# File 'lib/viral_seq/seq_hash.rb', line 997

def gap_strip_ends(option = :nt)
  if option == :nt
    sequence_alignment = self.dna_hash
  elsif option == :aa
    sequence_alignment = self.aa_hash
  else
    raise "Option #{option} not recognized"
  end
  new_seq = {}
  seq_size = sequence_alignment.values[0].size
  seq_matrix = {}
  (0..(seq_size - 1)).each do |p|
    seq_matrix[p] = []
    sequence_alignment.values.each do |s|
      seq_matrix[p] << s[p]
    end
  end
  n1 = 0
  n2 = 0
  seq_matrix.each do |_p, list|
    if list.include?("-")
      n1 += 1
    else
      break
    end
  end

  seq_matrix.keys.reverse.each do |p|
    list = seq_matrix[p]
    if list.include?("-")
      n2 += 1
    else
      break
    end
  end

  sequence_alignment.each do |n,s|
    new_s = s[n1..(- n2 - 1)]
    new_seq[n] = new_s
  end
  new_seq_hash = ViralSeq::SeqHash.new
  if option == :nt
    new_seq_hash.dna_hash = new_seq
    new_seq_hash.aa_hash = self.aa_hash
  elsif option == :aa
    new_seq_hash.dna_hash = self.dna_hash
    new_seq_hash.aa_hash = new_seq
  end
  new_seq_hash.qc_hash = self.qc_hash
  new_seq_hash.title = self.title + "_strip"
  new_seq_hash.file = self.file
  return new_seq_hash
end

#hiv_seq_qc(start_nt, end_nt, indel = true, ref_option = :HXB2, path_to_muscle = false) ⇒ ViralSeq::SeqHash

quality check for HIV sequences based on ViralSeq::Sequence#locator, check if sequences are in the target range

Examples:

QC for sequences in a FASTA files

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_seq.fasta')
filtered_seqhash = my_seqhash.hiv_seq_qc([4384,4386], 4750..4752, false, :HXB2)
my_seqhash.dna_hash.size
=> 6
filtered_seqhash.dna_hash.size
=> 4

Parameters:

  • start_nt (Integer, Range, Array)

    start nt position(s) on the refernce genome, can be single number (Integer) or a range of Integers (Range), or an Array

  • end_nt (Integer, Range, Array)

    end nt position(s) on the refernce genome,can be single number (Integer) or a range of Integers (Range), or an Array

  • indel (Boolean) (defaults to: true)

    allow indels or not, ‘ture` or `false`

  • ref_option (Symbol) (defaults to: :HXB2)

    , name of reference genomes, options are ‘:HXB2`, `:NL43`, `:MAC239`

  • path_to_muscle (String) (defaults to: false)

    , path to the muscle executable, if not provided, use MuscleBio to run Muscle

Returns:

  • (ViralSeq::SeqHash)

    a new ViralSeq::SeqHash object with only the sequences that meet the QC criterias



737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
# File 'lib/viral_seq/seq_hash.rb', line 737

def hiv_seq_qc(start_nt, end_nt, indel=true, ref_option = :HXB2, path_to_muscle = false)
  start_nt = start_nt..start_nt if start_nt.is_a?(Integer)
  end_nt = end_nt..end_nt if end_nt.is_a?(Integer)
  seq_hash = self.dna_hash.dup
  seq_hash_unique = seq_hash.values.uniq
  seq_hash_unique_pass = []

  seq_hash_unique.each do |seq|
    loc = ViralSeq::Sequence.new('', seq).locator(ref_option, path_to_muscle)
    next unless loc # if locator tool fails, skip this seq.
    if start_nt.include?(loc[0]) && end_nt.include?(loc[1])
      if indel
        seq_hash_unique_pass << seq
      elsif loc[3] == false
        seq_hash_unique_pass << seq
      end
    end
  end
  seq_pass = []
  seq_hash_unique_pass.each do |seq|
    seq_hash.each do |seq_name, orginal_seq|
      if orginal_seq == seq
        seq_pass << seq_name
        seq_hash.delete(seq_name)
      end
    end
  end
  self.sub(seq_pass)
end

#mutation(error_rate = 0.01) ⇒ ViralSeq::SeqHash

mutate @dna_hash based on the error_rate

Parameters:

  • error_rate (Float) (defaults to: 0.01)

    error rate used to mutate sequences.

Returns:



1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
# File 'lib/viral_seq/seq_hash.rb', line 1056

def mutation(error_rate = 0.01)
  new_seqhash = ViralSeq::SeqHash.new
  dna = {}
  self.dna_hash.each do |name, seq|
    dna[name + '_mut-' + error_rate.to_s] = seq.mutation(error_rate)
  end
  new_seqhash.dna_hash = dna
  new_seqhash.title = self.title + "_mut-" + error_rate.to_s
  new_seqhash.file = self.file
  return new_seqhash
end

#nucleotide_piFloat Also known as: pi

Function to calculate nucleotide diversity π, for nt sequence only

Examples:

calculate π

sequences = %w{ AAGGCCTT ATGGCCTT AAGGCGTT AAGGCCTT AACGCCTT AAGGCCAT }
my_seqhash = ViralSeq::SeqHash.array(sequences)
my_seqhash.pi
=> 0.16667

Returns:

  • (Float)

    nucleotide diversity π

See Also:



656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
# File 'lib/viral_seq/seq_hash.rb', line 656

def nucleotide_pi
  sequences = self.dna_hash.values
  seq_length = sequences[0].size - 1
  nt_position_hash = {}
  (0..seq_length).each do |n|
    nt_position_hash[n] = []
    sequences.each do |s|
      nt_position_hash[n] << s[n]
    end
  end
  diver = 0
  com = 0
  nt_position_hash.each do |_p,nt|
    nt.delete_if {|n| n =~ /[^A|^C|^G|^T]/}
    next if nt.size == 1
    nt_count = nt.count_freq
    combination = (nt.size)*(nt.size - 1)/2
    com += combination
    a = nt_count["A"]
    c = nt_count["C"]
    t = nt_count["T"]
    g = nt_count["G"]
    div = a*c + a*t + a*g + c*t + c*g + t*g
    diver += div
  end
  pi = (diver/com.to_f).round(5)
  return pi
end

#poisson_minority_cutoff(error_rate = 0.0001, fold_cutoff = 20) ⇒ Integer Also known as: pm

Define Poission cut-off for minority variants.

Examples:

obtain Poisson minority cut-off from the example sequence FASTA file.

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_sequence_for_poisson.fasta')
my_seqhash.pm
=> 2 # means that mutations appear at least 2 times are very likely to be a true mutation instead of random methods errors.

Parameters:

  • error_rate (Float) (defaults to: 0.0001)

    estimated sequencing error rate

  • fold_cutoff (Integer) (defaults to: 20)

    a fold cut-off to determine poisson minority cut-off. default = 20. i.e. <5% mutations from random methods error.

Returns:

  • (Integer)

    a cut-off for minority variants (>=).

See Also:



547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
# File 'lib/viral_seq/seq_hash.rb', line 547

def poisson_minority_cutoff(error_rate = 0.0001, fold_cutoff = 20)
  sequences = self.dna_hash.values
  if sequences.size == 0
    return 0
  else
    cut_off = 1
    l = sequences[0].size
    rate = sequences.size * error_rate
    count_mut = variant_for_poisson(sequences)
    max_count = count_mut.keys.max
    poisson_hash = ViralSeq::Math::PoissonDist.new(rate, max_count).poisson_hash

    poisson_hash.each do |k,v|
      cal = l * v
      obs = count_mut[k] ? count_mut[k] : 0
      if obs >= fold_cutoff * cal
        cut_off = k
        break
      end
    end
    return cut_off
  end
end

#random_select(n = 100) ⇒ ViralSeq::SeqHash

randomly select n number of sequences from the orginal SeqHash object

Parameters:

  • n (Integer) (defaults to: 100)

    number of sequences to randomly select

Returns:



1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
# File 'lib/viral_seq/seq_hash.rb', line 1145

def random_select(n = 100)
  new_sh = ViralSeq::SeqHash.new
  dna_hash = self.dna_hash
  aa_hash = self.aa_hash
  qc_hash = self.qc_hash

  keys = dna_hash.keys.sample(n)

  keys.each do |k|
    new_sh.dna_hash[k] = dna_hash[k]
    new_sh.aa_hash[k] = aa_hash[k]
    new_sh.qc_hash[k] = qc_hash[k]
  end
  new_sh.title = self.title + "_" + n.to_s
  return new_sh
end

#sdrm_hiv_in(cutoff = 0) ⇒ Array

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV IN region.

works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
IN codon 53-174

Parameters:

  • cutoff (Integer) (defaults to: 0)

    cut-off for minimal abundance of a mutation to be called as valid mutation, can be obtained using ViralSeq::SeqHash#poisson_minority_cutoff function

Returns:

  • (Array)

    three elements ‘[point_mutation_list, linkage_list, report_list]`

    # point_mutation_list: two demensional array for the following information,

    # [region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,label]
    

    # linkage_list: two demensional array for the following information,

    # [region,tcs_number,linkage,count,%,CI_low,CI_high,label]
    

    # report_list: two demensional array for the following information,

    # [position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*]
    


368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
# File 'lib/viral_seq/hivdr.rb', line 368

def sdrm_hiv_in(cutoff = 0)
  sequences = self.dna_hash
  region = "IN"
  rf_label = 2
  start_codon_number = 53
  n_seq = sequences.size
  mut = {}
  mut_com = []
  aa = {}
  point_mutation_list = []
  sequences.each do |name,seq|
    s = ViralSeq::Sequence.new(name,seq)
    s.translate(rf_label)
    aa[name] = s.aa_string
    record = s.sdrm(:hiv_in, start_codon_number)
    mut_com << record
    record.each do |position,mutation|
      if mut[position]
        mut[position][1] << mutation[1]
      else
        mut[position] = [mutation[0],[]]
        mut[position][1] << mutation[1]
      end
    end
  end

  mut.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      label = number < cutoff ? "*" : ""
      point_mutation_list << [region, n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
    end
  end
  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = v < cutoff ? "*" : ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  aa_start = start_codon_number

  aa_size = aa.values[0].size - 1

  (0..aa_size).to_a.each do |p|
    aas = []
    aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]
end

#sdrm_hiv_pr(cutoff = 0) ⇒ Array

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV PR region.

works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
PR codon 1-99
RT codon 34-122 (HXB2 2650-2914) and 152-236(3001-3257)
IN codon 53-174 (HXB2 4384-4751)

Examples:

identify SDRMs from a FASTA sequence file of HIV PR sequences obtained after MPID-DR sequencing

my_seqhash = ViralSeq::SeqHash.fa('spec/sample_files/sample_dr_sequences/pr.fasta')
p_cut_off = my_seqhash.pm
pr_sdrm = my_seqhash.sdrm_hiv_pr(p_cut_off)
puts "region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,label"; pr_sdrm[0].each {|n| puts n.join(',')}
=> region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,label
=> PR,396,30,D,N,247,0.62374,0.57398,0.67163,
=> PR,396,50,I,V,1,0.00253,6.0e-05,0.01399,*
=> PR,396,88,N,D,246,0.62121,0.57141,0.66919,

puts "region,tcs_number,linkage,count,%,CI_low,CI_high,label"; pr_sdrm[1].each {|n| puts n.join(',')}
=> region,tcs_number,linkage,count,%,CI_low,CI_high,label
=> PR,396,D30N+N88D,245,0.61869,0.56884,0.66674,
=> PR,396,WT,149,0.37626,0.32837,0.42602,
=> PR,396,D30N,1,0.00253,6.0e-05,0.01399,*
=> PR,396,D30N+I50V+N88D,1,0.00253,6.0e-05,0.01399,*

puts "position,codon,tcs_number," + ViralSeq::AMINO_ACID_LIST.join(","); pr_sdrm[2].each {|n|puts n.join(",")}
=> position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*
=> PR,1,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,2,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,3,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,4,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,5,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,6,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0
=> PR,7,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,8,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,9,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,10,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,11,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,12,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.8788,62.1212,0.0,0.0,0.0,0.0
=> PR,13,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38.1313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,61.8687,0.0,0.0,0.0
=> PR,14,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,15,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.3737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.6263,0.0,0.0,0.0
=> PR,16,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,17,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,18,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.4949,0.5051,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,19,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,20,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,21,396,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,22,396,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,23,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,24,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,25,396,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,26,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,27,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,28,396,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,29,396,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,30,396,0.0,0.0,37.6263,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.3737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,31,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,32,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,33,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0
=> PR,34,396,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,35,396,0.0,0.0,62.1212,37.6263,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,36,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0
=> PR,37,396,0.0,0.0,37.8788,61.8687,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,38,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,39,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.4949,0.0,0.0,0.5051,0.0,0.0,0.0,0.0,0.0
=> PR,40,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,41,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,62.1212,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,42,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0
=> PR,43,396,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,44,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,45,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,46,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,47,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,48,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,49,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,50,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.0
=> PR,51,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,52,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,53,396,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,54,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,55,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,56,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,57,396,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,99.4949,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,58,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,59,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0
=> PR,60,396,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,61,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,62,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,63,396,0.0,0.0,0.0,0.0,0.0,0.0,0.2525,0.0,0.0,37.8788,0.0,0.0,61.8687,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,64,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.1212,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,65,396,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,66,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,67,396,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,68,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,69,396,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,70,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,71,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.1212,37.8788,0.0,0.0,0.0
=> PR,72,396,0.0,0.0,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62.1212,0.0,0.0,0.0,0.0
=> PR,73,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,74,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,75,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,76,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,77,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,78,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,79,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,80,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,81,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,82,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0
=> PR,83,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.4949,0.0,0.0,0.0,0.5051,0.0,0.0,0.0,0.0,0.0
=> PR,84,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,85,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,86,396,0.0,0.0,0.0,0.5051,0.0,99.4949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,87,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,88,396,0.0,0.0,62.1212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.8788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,89,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,90,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,91,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,92,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,93,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,94,396,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,95,396,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,96,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0
=> PR,97,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
=> PR,98,396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.7475,0.0,0.0,0.0,0.2525,0.0,0.0,0.0,0.0,0.0
=> PR,99,396,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

Parameters:

  • cutoff (Integer) (defaults to: 0)

    cut-off for minimal abundance of a mutation to be called as valid mutation, can be obtained using ViralSeq::SeqHash#poisson_minority_cutoff function

Returns:

  • (Array)

    three elements ‘[point_mutation_list, linkage_list, report_list]`

    # point_mutation_list: two demensional array for the following information,

    # [region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,label]
    

    # linkage_list: two demensional array for the following information,

    # [region,tcs_number,linkage,count,%,CI_low,CI_high,label]
    

    # report_list: two demensional array for the following information,

    # [position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*]
    


139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
# File 'lib/viral_seq/hivdr.rb', line 139

def sdrm_hiv_pr(cutoff = 0)
  sequences = self.dna_hash
  region = "PR"
  rf_label = 0
  start_codon_number = 1
  n_seq = sequences.size
  mut = {}
  mut_com = []
  aa = {}
  point_mutation_list = []
  sequences.each do |name,seq|
    s = ViralSeq::Sequence.new(name,seq)
    s.translate(rf_label)
    aa[name] = s.aa_string
    record = s.sdrm(:hiv_pr)
    mut_com << record
    record.each do |position,mutation|
      if mut[position]
        mut[position][1] << mutation[1]
      else
        mut[position] = [mutation[0],[]]
        mut[position][1] << mutation[1]
      end
    end
  end
  mut.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      label = number < cutoff ? "*" : ""
      point_mutation_list << [region, n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
    end
  end
  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = v < cutoff ? "*" : ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  aa_start = start_codon_number

  aa_size = aa.values[0].size - 1

  (0..aa_size).to_a.each do |p|
    aas = []
    aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]
end

#sdrm_hiv_rt(cutoff = 0) ⇒ Array

functions to identify SDRMs from a ViralSeq::SeqHash object at HIV RT region.

works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
RT codon 34-122, 152-236, two regions are linked

Parameters:

  • cutoff (Integer) (defaults to: 0)

    cut-off for minimal abundance of a mutation to be called as valid mutation, can be obtained using ViralSeq::SeqHash#poisson_minority_cutoff function

Returns:

  • (Array)

    three elements ‘[point_mutation_list, linkage_list, report_list]`

    # point_mutation_list: two demensional array for the following information,

    # [region,tcs_number,position,wildtype,mutation,count,%,CI_low,CI_high,label]
    

    # linkage_list: two demensional array for the following information,

    # [region,tcs_number,linkage,count,%,CI_low,CI_high,label]
    

    # report_list: two demensional array for the following information,

    # [position,codon,tcs_number,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*]
    


232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
# File 'lib/viral_seq/hivdr.rb', line 232

def sdrm_hiv_rt(cutoff = 0)
  sequences = self.dna_hash
  region = "RT"
  rf_label = 1
  start_codon_number = 34
  gap = "AGACTTCAGGAAGTATACTGCATTTACCATACCTAGTATAAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGCTTCCAC"

  n_seq = sequences.size
  mut_nrti = {}
  mut_nnrti = {}
  mut_com = []
  r1_aa = {}
  r2_aa = {}
  point_mutation_list = []
  sequences.each do |name,seq|
    r1 = seq[0,267]
    r2 = seq[267..-1]
    seq = r1 + gap + r2
    s = ViralSeq::Sequence.new(name,seq)
    s.translate(rf_label)

    r1_aa[name] = s.aa_string[0,89]
    r2_aa[name] = s.aa_string[-85..-1]
    nrti = s.sdrm(:nrti, start_codon_number)
    nnrti = s.sdrm(:nnrti, start_codon_number)
    mut_com << (nrti.merge(nnrti))

    nrti.each do |position,mutation|
      if mut_nrti[position]
        mut_nrti[position][1] << mutation[1]
      else
        mut_nrti[position] = [mutation[0],[]]
        mut_nrti[position][1] << mutation[1]
      end
    end
    nnrti.each do |position,mutation|
      if mut_nnrti[position]
        mut_nnrti[position][1] << mutation[1]
      else
        mut_nnrti[position] = [mutation[0],[]]
        mut_nnrti[position][1] << mutation[1]
      end
    end
  end

  mut_nrti.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      label = number < cutoff ? "*" : ""
      point_mutation_list << ["NRTI", n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
    end
  end

  mut_nnrti.each do |position,mutation|
    wt = mutation[0]
    mut_list = mutation[1]
    count_mut_list = mut_list.count_freq
    count_mut_list.each do |m,number|
      ci = ViralSeq::Math::BinomCI.new(number, n_seq)
      label = number < cutoff ? "*" : ""
      point_mutation_list << ["NNRTI", n_seq, position, wt, m, number, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
    end
  end

  point_mutation_list.sort_by! {|record| record[2]}

  link = mut_com.count_freq
  link2 = {}
  link.each do |k,v|
    pattern = []
    if k.size == 0
      pattern = ['WT']
    else
      k.each do |p,m|
        pattern << (m[0] + p.to_s + m[1])
      end
    end
    link2[pattern.join("+")] = v
  end
  linkage_list = []
  link2.sort_by{|_key,value|value}.reverse.to_h.each do |k,v|
    ci = ViralSeq::Math::BinomCI.new(v, n_seq)
    label = v < cutoff ? "*" : ""
    linkage_list << [region, n_seq, k, v, ci.mean.round(5), ci.lower.round(5), ci.upper.round(5), label]
  end

  report_list = []

  div_aa = {}
  r1_aa_start = 34
  r2_aa_start = 152

  r1_aa_size = r1_aa.values[0].size - 1
  r2_aa_size = r2_aa.values[0].size - 1

  (0..r1_aa_size).to_a.each do |p|
    aas = []
    r1_aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[r1_aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    r1_aa_start += 1
  end

  (0..r2_aa_size).to_a.each do |p|
    aas = []
    r2_aa.values.each do |r1|
      aas << r1[p]
    end
    count_aas = aas.count_freq
    div_aa[r2_aa_start] = count_aas.sort_by{|_k,v|v}.reverse.to_h
    r2_aa_start += 1
  end

  div_aa.each do |k,v|
    record = [region, k, n_seq]
    ViralSeq::AMINO_ACID_LIST.each do |amino_acid|
      aa_count = v[amino_acid]
      record << (aa_count.to_f/n_seq*100).round(4)
    end
    report_list << record
  end

  return [point_mutation_list, linkage_list, report_list]
end

#sequence_locator(ref_option = :HXB2) ⇒ Array Also known as: loc

sequence locator for SeqHash object, resembling HIV Sequence Locator from LANL

Parameters:

  • ref_option (Symbol) (defaults to: :HXB2)

    , name of reference genomes, options are ‘:HXB2`, `:NL43`, `:MAC239`

Returns:

  • (Array)

    two dimensional array ‘[[],[],[],…]` for each sequence, including the following information:

    title of the SeqHash object (String)

    sequence taxa (String)

    start_location (Integer)

    end_location (Integer)

    percentage_of_similarity_to_reference_sequence (Float)

    containing_indel? (Boolean)

    direction (‘forward’ or ‘reverse’)

    aligned_input_sequence (String)

    aligned_reference_sequence (String)

See Also:



789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
# File 'lib/viral_seq/seq_hash.rb', line 789

def sequence_locator(ref_option = :HXB2)
  out_array = []
  dna_seq = self.dna_hash
  title = self.title

  uniq_dna = dna_seq.uniq_hash

  uniq_dna.each do |seq,names|
    s = ViralSeq::Sequence.new('',seq)
    loc1 = s.locator(ref_option)
    s.rc!
    loc2 = s.locator(ref_option)
    loc1[2] >= loc2[2] ? (direction = :+; loc = loc1): (direction = :-; loc = loc2)

    names.each do |name|
      out_array << ([title, name, ref_option.to_s, direction.to_s] + loc)
    end
  end
  return out_array
end

#shannons_entropy(option = :nt) ⇒ Hash

calculate Shannon’s entropy, Euler’s number as the base of logarithm

Examples:

caculate entropy from the example file

sequence_file = 'spec/sample_files/sample_sequence_alignment_for_entropy.fasta'
sequence_hash = ViralSeq::SeqHash.aa_fa(sequence_file)
entropy_hash = sequence_hash.shannons_entropy(:aa)
entropy_hash[3]
=> 0.0
entropy_hash[14].round(3)
=> 0.639
# This example is the sample input of LANL Entropy-One
# https://www.hiv.lanl.gov/content/sequence/ENTROPY/entropy_one.html?sample_input=1

Parameters:

  • option (Symbol) (defaults to: :nt)

    the sequence type ‘:nt` or `:aa`

Returns:

  • (Hash)

    entropy score at each position in the alignment :position => :entropy , # position starts at 1.

See Also:



622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
# File 'lib/viral_seq/seq_hash.rb', line 622

def shannons_entropy(option = :nt)
  sequences = if option == :aa
                self.aa_hash.values
              else
                self.dna_hash.values
              end
  entropy_hash = {}
  seq_l = sequences[0].size
  (0..(seq_l - 1)).each do |position|
    element = []
    sequences.each do |seq|
      element << seq[position]
    end
    entropy = 0
    element.delete('*')
    element_size = element.size
    element.count_freq.each do |_k,v|
      p = v/element_size.to_f
      entropy += (-p * ::Math.log(p))
    end
    entropy_hash[(position + 1)] = entropy
  end
  return entropy_hash
end

#sizeInteger

the size of nt sequence hash of the SeqHash object

Returns:

  • (Integer)

    size of nt sequence hash of the SeqHash object



172
173
174
# File 'lib/viral_seq/seq_hash.rb', line 172

def size
  self.dna_hash.size
end

#stop_codon(codon_position = 0) ⇒ Hash

screen for sequences with stop codons.

Examples:

given a hash of sequences, return a sub-hash with sequences only contains stop codons

my_seqhash = ViralSeq::SeqHash.fa('my_fasta_file.fasta')
my_seqhash.dna_hash
=> {">seq1"=>"ATAAGAACG", ">seq2"=>"ATATGAACG", ">seq3"=>"ATGAGAACG", ">seq4"=>"TATTAGACG", ">seq5"=>"CGCTGAACG"}
stop_codon_seqhash = my_seqhash.stop_codon[:with_stop_codon]
stop_codon_seqhash.dna_hash
=> {">seq2"=>"ATATGAACG", ">seq4"=>"TATTAGACG", ">seq5"=>"CGCTGAACG"}
stop_codon_seqhash.aa_hash
=> {">seq2"=>"I*T", ">seq4"=>"Y*T", ">seq5"=>"R*T"}
stop_codon_seqhash.title
=> "my_fasta_file_stop"
filtered_seqhash = my_seqhash.stop_codon[:without_stop_codon]
filtered_seqhash.aa_hash
{">seq1"=>"IRT", ">seq3"=>"MRT"}

Parameters:

  • codon_position (Integer) (defaults to: 0)

    option ‘0`, `1` or `2`, indicating 1st, 2nd, 3rd reading frames

Returns:

  • (Hash)

    of two SeqHash objects seqHash, without_stop_codon: seqHash,

    # :with_stop_codon : ViralSeq::SeqHash object with stop codons # :without_stop_codon: ViralSeq::SeqHash object without stop codons



335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
# File 'lib/viral_seq/seq_hash.rb', line 335

def stop_codon(codon_position = 0)
  self.translate(codon_position)
  keys = []
  aa_seqs = self.aa_hash
  aa_seqs.uniq_hash.each do |seq,array_of_name|
    keys += array_of_name if seq.include?('*')
  end
  seqhash1 = self.sub(keys)
  seqhash1.title = self.title + "_stop"
  keys2 = aa_seqs.keys - keys
  seqhash2 = self.sub(keys2)
  return {
    with_stop_codon: seqhash1,
    without_stop_codon: seqhash2
  }
end

#sub(keys) ⇒ SeqHash

given an Array of sequence tags, return a sub ViralSeq::SeqHash object with the sequence tags

Parameters:

  • keys (Array)

    array of sequence tags

Returns:

  • (SeqHash)

    new SeqHash object with sequences of the input keys



295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
# File 'lib/viral_seq/seq_hash.rb', line 295

def sub(keys)
  h1 = {}
  h2 = {}
  h3 = {}

  keys.each do |k|
    dna = self.dna_hash[k]
    next unless dna
    h1[k] = dna
    aa = self.aa_hash[k]
    h2[k] = aa
    qc = self.qc_hash[k]
    h3[k] = qc
  end
  title = self.title
  file = self.file
  ViralSeq::SeqHash.new(h1,h2,h3,title,file)
end

#tn93Hash

TN93 distance functionl, tabulate pairwise comparison of sequence pairs in a sequence alignment, nt sequence only

Examples:

calculate TN93 distribution

sequences = %w{ AAGGCCTT ATGGCCTT AAGGCGTT AAGGCCTT AACGCCTT AAGGCCAT }
my_seqhash = ViralSeq::SeqHash.array(sequences)
my_seqhash.tn93
=> {0=>1, 1=>8, 2=>6}

Returns:

  • (Hash)

    pairwise distance table in Hash object {:diff => :freq, … } # Note: :diff in different positions (Integer), not percentage.



697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
# File 'lib/viral_seq/seq_hash.rb', line 697

def tn93
  sequences = self.dna_hash.values
  diff = []
  seq_hash = sequences.count_freq
  seq_hash.values.each do |v|
    comb = v * (v - 1) / 2
    comb.times {diff << 0}
  end

  seq_hash.keys.combination(2).to_a.each do |pair|
    s1 = pair[0]
    s2 = pair[1]
    diff_temp = s1.compare_with(s2)
    comb = seq_hash[s1] * seq_hash[s2]
    comb.times {diff << diff_temp}
  end

  count_diff = diff.count_freq
  out_hash = Hash.new(0)
  Hash[count_diff.sort_by{|k,_v|k}].each do |k,v|
    out_hash[k] = v
  end
  return out_hash
end

#to_rsphylipString

generate sequences in relaxed sequencial phylip format from a ViralSeq::SeqHash object

Examples:

convert fasta format to relaxed sequencial phylip format

# my_fasta_file.fasta
#   >seq1
#   ATAAGAACG
#   >seq2
#   ATATGAACG
#   >seq3
#   ATGAGAACG
my_seqhash = ViralSeq::SeqHash.fa(my_fasta_file.fasta)
puts my_seqhash.to_rsphylip
  #  3 9
  # seq1       ATAAGAACG
  # seq2       ATATGAACG
  # seq3       ATGAGAACG

Returns:

  • (String)

    relaxed sequencial phylip format in a String object



220
221
222
223
224
225
226
227
228
229
230
231
# File 'lib/viral_seq/seq_hash.rb', line 220

def to_rsphylip
  seqs = self.dna_hash
  outline = "\s" + seqs.size.to_s + "\s" + seqs.values[0].size.to_s + "\n"
  names = seqs.keys
  names.collect!{|n| n.tr(">", "")}
  max_name_l = names.max.size
  max_name_l > 10 ? name_block_l = max_name_l : name_block_l = 10
  seqs.each do |k,v|
    outline += k + "\s" * (name_block_l - k.size + 2) + v.scan(/.{1,10}/).join("\s") + "\n"
  end
  return outline
end

#translate(codon_position = 0) ⇒ NilClass

translate the DNA sequences in @dna_hash to amino acid sequences. generate value for @aa_hash

Examples:

translate dna sequences from a FASTA format sequence file

# my_fasta_file.fasta
#   >seq1
#   ATAAGAACG
#   >seq2
#   ATATGAACG
#   >seq3
#   ATGAGAACG
my_seqhash = ViralSeq::SeqHash.fa(my_fasta_file.fasta)
my_seqhash.translate
my_seqhash.aa_sequence
=> {">seq1"=>"IRT", ">seq2"=>"I*T", ">seq3"=>"MRT"}

Parameters:

  • codon_position (Integer) (defaults to: 0)

    option ‘0`, `1` or `2`, indicating 1st, 2nd, 3rd reading frames

Returns:

  • (NilClass)


249
250
251
252
253
254
255
256
257
258
259
260
# File 'lib/viral_seq/seq_hash.rb', line 249

def translate(codon_position = 0)
  seqs = self.dna_hash
  @aa_hash = {}
  seqs.uniq_hash.each do |seq, array_of_name|
    s = ViralSeq::Sequence.new('name', seq)
    s.translate(codon_position)
    array_of_name.each do |name|
      @aa_hash[name] = s.aa_string
    end
  end
  return nil
end

#trim(start_nt, end_nt, ref_option = :HXB2, path_to_muscle = false) ⇒ ViralSeq::SeqHash

trim dna sequences based on the provided reference coordinates.

Parameters:

  • start_nt (Integer, Range, Array)

    start nt position(s) on the refernce genome, can be single number (Integer) or a range of Integers (Range), or an Array

  • end_nt (Integer, Range, Array)

    end nt position(s) on the refernce genome,can be single number (Integer) or a range of Integers (Range), or an Array

  • ref_option (Symbol) (defaults to: :HXB2)

    , name of reference genomes, options are ‘:HXB2`, `:NL43`, `:MAC239`

  • path_to_muscle (String) (defaults to: false)

    , path to the muscle executable, if not provided, use MuscleBio to run Muscle

Returns:



1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
# File 'lib/viral_seq/seq_hash.rb', line 1169

def trim(start_nt, end_nt, ref_option = :HXB2, path_to_muscle = false)
  seq_hash = self.dna_hash.dup
  seq_hash_unique = seq_hash.uniq_hash
  trimmed_seq_hash = {}
  seq_hash_unique.each do |seq, names|
    trimmed_seq = ViralSeq::Sequence.new('', seq).sequence_clip(start_nt, end_nt, ref_option, path_to_muscle).dna
    names.each do |name|
      trimmed_seq_hash[name] = trimmed_seq
    end
  end
  return_seq_hash = self.dup
  return_seq_hash.dna_hash = trimmed_seq_hash
  return return_seq_hash
end

#uniq_dna_hash(tag = "sequence") ⇒ ViralSeq::SeqHash Also known as: uniq

collapse @dna_hash to unique sequence hash. sequences will be named as (tag + “_” + order(Integer) + “_” + counts(Integer))

Examples:

dna_hash = {'>seq1' => 'AAAA','>seq2' => 'AAAA', '>seq3' => 'AAAA', '>seq4' => 'CCCC', '>seq5' => 'CCCC', '>seq6' => 'TTTT'} }
a_seq_hash = ViralSeq::SeqHash.new
a_seq_hash.dna_hash = dna_hash
uniq_sequence = a_seq_hash.uniq_dna_hash('master')
=> {">master_1_3"=>"AAAA", ">master_2_2"=>"CCCC", ">master_3_1"=>"TTTT"}

Parameters:

  • tag (defaults to: "sequence")

    # the master tag for unique sequences,

Returns:



273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
# File 'lib/viral_seq/seq_hash.rb', line 273

def uniq_dna_hash(tag = "sequence")
  seqs = self.dna_hash
  uni = seqs.values.count_freq
  new_seq = {}
  n = 1
  uni.each do |s,c|
    name = ">" + tag + "_" + n.to_s + "_" + c.to_s
    new_seq[name] = s
    n += 1
  end
  seq_hash = ViralSeq::SeqHash.new(new_seq)
  seq_hash.title = self.title + "_uniq"
  seq_hash.file = self.file
  return seq_hash
end

#write_nt_fa(file) ⇒ NilClass

write the nt sequences to a FASTA format file

Parameters:

  • file (String)

    path to the FASTA output file

Returns:

  • (NilClass)


194
195
196
197
198
199
200
201
# File 'lib/viral_seq/seq_hash.rb', line 194

def write_nt_fa(file)
  File.open(file, 'w') do |f|
    self.dna_hash.each do |k,v|
      f.puts k
      f.puts v
    end
  end
end