Class: Bio::Util::Gngm

Inherits:
Object
  • Object
show all
Defined in:
lib/bio/util/bio-gngm.rb

Overview

A Bio::Util::Gngm object represents a single region on a reference genome that is to be examined using the NGM technique described in Austin et al (2011) bar.utoronto.ca/ngm/description.html and onlinelibrary.wiley.com/doi/10.1111/j.1365-313X.2011.04619.x/abstract;jsessionid=F73E2DA628523B26205297CEE95526DA.d02t04 Austin et al (2011) Next-generation mapping of Arabidopsis genes Plant Journal 67(4):7125-725 .

Bio::Util::Gngm provides methods for finding SNPs, small INDELS and larger INDELS, creating histograms of polymorphism frequency, creating and clustering density curves, creating signal plots and finding peaks. The ratio of reference-agreeing and reference-differing reads can be specified.

Background

The basic concept of the technique is that density curves of polymorphism frequency across the region of interest are plotted and analysed. Each curve is called a thread, as it represents a polymorphism that was called with a statistic within a certain user-specified range, eg if a SNP was called with 50% non-reference bases from sequence reads (say all A), and 50% reference reads (all T) then a discordant chastity statistic (ChD) of 0.5 would be calculated and assigned to that SNP. Depending on the width and slide of the windows the user had specified, the frequency of SNPs with ChD in the specified range would be drawn in the same density curve. In the figure below each different coloured curve represents the frequency of SNPs with similar ChD.

Each of these density curves is called a thread. Threads are clustered into groups called bands and the bands containing the expected and control polymorphisms extracted. In the figure below, the control band is 0.5, the expected mutation in 1.0. Typically and in the Austin et al (2011) description of NGM the control band is the heterophasic band that represents natural variation, the thing taken to be the baseline. For a simple SNP, numerically the discordant chastity is expected to be 0.5. Conversely the expected band is the homophasic band that represents the selected for SNP region. Normally the discordant chastity is expected to be 1.0.

The points where the signal from the control and expected band converge most is a likely candidate region for the causative mutation, so here at about the 1.6 millionth nucleotide.

Example

require 'bio-gngm'

g = Bio::Util::Gngm.new(:file => "aln.sorted.bam", 
             :format => :bam, 
             :fasta => "reference.fasta",
             :start => 100,
             :stop => 200,
             :write_pileup => "my_pileup_file.pileup",
             :write_vcf => "my_vcf_file.vcf",
             :ignore_file => "my_known_snps.txt" 
             :samtools => {
                           :q => 20,
                           :Q => 50
             },
             :min_non_ref_freq => 0.5,
             :min_non_ref => 3,
             :start => 1,
             :stop => 100000,
             :chromosome => "Chr1",
             :variant_call => {
               :indels => false,  
               :min_depth => 6, 
               :max_depth => 250, 
               :mapping_quality => 20.0, 
               :min_non_ref_count => 2, 
               :ignore_reference_n => true,
               :min_snp_quality => 20,
               :min_consensus_quality => 20,
               :substitutions => ["C:T","G:A"] 
               }

  )

  g.snp_positions
  g.collect_threads(:start => 0.2, :stop => 1.0, :slide => 0.01, :size => 0.1 )
  [0.25, 0.5, 1.0].each do |kernel_adjust| # loop through different kernel values
    [4, 9, 11].each do | k |  # loop through different cluster numbers 

      #cluster
      g.calculate_clusters(:k => k, :adjust => kernel_adjust, :control_chd => 0.7, :expected_chd => 0.5)
      #draw thread and bands
      filename = "#{name}_#{k}_#{kernel_adjust}_all_threads.png"
      g.draw_threads(filename)

      filename = "#{name}_#{k}_#{kernel_adjust}_clustered_bands.png"
      g.draw_bands(filename, :add_lines => [100,30000,675432])

      #draw signal
      filename = "#{name}_#{k}_#{kernel_adjust}_signal.png"
      g.draw_signal(filename)

      #auto-guess peaks
      filename = "#{name}_#{k}_#{kernel_adjust}_peaks.png"
      g.draw_peaks(filename)
    end
  end
  g.close #close BAM file

Polymorphisms and statistics

Bio::Util::Gngm will allow you to look for polymorphisms that are SNPs, INDELS (as insertions uniquely, deletions uniquely or both) and longer insertions or deletions based on the insert size on paired-end read alignments. Each has a different statistic attached to it.

SNPs

Simple Single Nucleotide Polymorphisms are called and its ChD statistic calculated as described in Austin et al (2011).

Short INDELS

These are called via SAMtools/BCFtools so are limited to the INDELs that can be called that way. The implementation at the moment only considers positions with one INDEL, sites with more than one potential INDEL (ie multiple alleles) are disregarded as a position at all. See the Bio::DB::Vcf extensions in this package for a description of what constitutes an INDEL. The Vcf attribute Bio::DB::Vcf#non_ref_allele_freq is used as the statistic in this case.

Insertion Size

Paired-end alignments have an expected distance between the paired reads (called insert size, or isize). Groups of reads in one position with larger or smaller than expected isize can indicate large deletions or insertions. Due to the details of read preparation the actual isize varies around a mean value with an expected proportion of 50% of reads having isize above the mean, and 50% below. To create density curves of insertion size frequency a moves along the window of user-specified size is moved along the reference genome in user-specified steps and all alignments in that window are examined. The Bio::DB::Sam#isize attribute is inspected for all alignments passing user-specified quality and the proportion of reads in that window that have an insert size > the expected insert size is used as the statistic in this case. Proportions approaching 1 indicate that the sequenced organism has a deletion in that section relative to the reference. Proportions approaching 0 indicate an insertion in that section relative to the reference. Proportions around 0.5 indicate random variation of insert size, IE no INDEL. Seems to be a good idea to keep the window size similar to the read + isize. Useful in conjunction with assessing unmapped mates.

Unmapped Mate Pairs / Paired Ends.

Paired-end alignments where one mate finds a mapping but the other doesnt, can indicate an insertion/deletion larger than the insert size of the reads used (IE one read disappeared into the deleted section). This method uses a statistic based on proportion of mapped/unmapped reads in a window. Proportions of reads that are mapped but the mate is unmapped should be about 0.5 in a window over an insertion/deletion (since the reads can go in either direction..). With no insertion deletion, the proportion should be closer to 0.

Input types

A sorted BAM file is used as the source of alignments. Pileup is not used nor likely to be as it is a deprecated function within SAMtools. With the BAM file you will need the reference FASTA and the BAM index (.bai).

Workflow

  1. Create Bio::Util::Gngm object for a specific region in the reference genome

  2. Polymorphisms are found

  3. Density curves (threads) are calculated

  4. Clustering density threads into bands is done

  5. Signal is compared between band of interest and control

  6. Figures are printed

Prerequisites

  • Ruby 1.9.3 or greater (if you have an earlier version, try RVM for installing different versions of Ruby alongside your system install and switching nicely between them)

  • R 2.11.1 or greater

The following ruby-gems are required

  • rinruby >= 2.0.2

  • bio-samtools >= 0.5.0

The following R packages are required

  • ggplot2

  • peaks

Acknowledgements

Thanks very much indeed to Ryan Austin, who invented NGM in the first place and was very forthcoming with R code, around which this implementation is based.

Using bio-gngm

require 'bio-gngm'

API

Constant Summary collapse

ERROR_MARGIN =

Ruby 1.9.3 has a rounding error in the Range#step function such that some decimal places are rounded off to 0.00000000000000…1 above their place. So this constant is used to identify windows within a short distance and prevent any rounding errors. Hopefully I should be able to remove this in later versions.

0.000001

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ Gngm

Returns a new Bio::Util::Gngm object.

g = Bio::Util::Gngm.new(:file => “aln.sort.bam”,

           :format => :bam,
           :samtools => {:q => 20, :Q => 50}, 
           :fasta => "reference.fa"
           :start => 100,
           :stop => 200,
           :write_pileup => "my_pileup_file.pileup",
           :write_vcf => "my_vcf_file.vcf",
           :ignore_file => "my_known_snps.txt"

)

Required parameters and defaults:

  • :file => nil -the path to the bam file containing the alignments, a .bai index must be present. A pileup file, or tab-delimited text file can be used.

  • :format => :bam -either :bam, :pileup, :txt (pileup expected to be 10 col format from samtools -vcf)

  • :chromosome => "nil" -sequence id to look at

  • :start => nil -start position on that sequence

  • :stop => nil -stop position on that sequence

  • :fasta => nil -the path to the FASTA formatted reference sequence

  • :write_pileup => false -the path to a file. SNPs will be written in pileup to this file (indels not output)

  • :write_vcf => false -the path to a file. SNPs will be written in VCF to this file (indels not output)

  • :ignore_file => false -file of SNPs in format “reference sequence id t position t mapping line nucleotide identity t reference line nucleotide identity”. All SNPs in this file will be ignored

  • :samtools => {:q => 20, :Q => 50} -options for samtools, see bio-samtools documentation for further details.

Optional parameters and defaults:

Most of these are parameters for specific methods and can be over-ridden when particular methods are called

  • :variant_call => {:indels => false,

  • :min_depth => 2,

  • :max_depth => 10000000,

  • :min_snp_quality => 20,

  • :mapping_quality => 10.0,

  • :min_non_ref_count => 2,

  • :ignore_reference_n => true,

  • :min_consensus_quality => 20,

  • :min_snp_quality => 20 }.

  • For Pileup files from old samtools pileup -vcf <tt>:min_consensus_quality can be applied

  • :threads => {:start => 0.2, :stop => 1.0, :slide => 0.01, :size => 0.1 } -options for thread windows

  • :insert_size_opts => {:ref_window_size => 200, :ref_window_slide => 50, :isize => 150} -options for insert size calculations

  • :histo_bin_width => 250000 -bin width for histograms of SNP frequency

  • :graphics => {:width => 1000, :height => 500, :draw_legend => false, :add_boxes => nil} -graphics output options, :draw_legend draws a legend plot for band figures only

  • :peaks => {:sigma => 3.0, :threshold => 10.0, :background => false, :iterations => 13, :markov => false, :window => 3, :range => 10000} -parameters for automated peak calling, parameters relate to R package Peaks. :range is the width of the box to draw on the peak plot



393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
# File 'lib/bio/util/bio-gngm.rb', line 393

def initialize(options)
  @file = nil
  @snp_positions = nil
  @threads = nil
  @densities = nil
  @clusters = nil
  @control_band = nil
  @expected_band = nil
  @signal = nil
  @peak_indices = nil
  @peak_y_values = nil
  @density_max_y = nil #the maximum y value needed to plot the entire set density plots of threads and maintain a consistent scale for plots
  @colours = %w[#A6CEE3 #1F78B4 #B2DF8A #33A02C #FB9A99 #E31A1C #FDBF6F #FF7F00 #CAB2D6 #6A3D9A #FFFF99 #B15928]
  @thread_colours = {}
  @known_variants = nil #a list of variants to keep track of
  @opts = {
    :file => nil,
    :format => :bam,
    :fasta => nil,
    :samtools => {:q => 20, :Q => 50},
    :indels => false,
    :write_pileup => false,
    :write_vcf => false,
    :ignore_file => false,
    :insert_size_opts => {:ref_window_size => 200, :ref_window_slide => 50, :isize => 150},
    :variant_call => { :indels => false,
                       :min_depth => 2, 
                       :max_depth => 10000000, 
                       :mapping_quality => 10.0, 
                       :min_non_ref_count => 2, 
                       :ignore_reference_n => true, 
                       :shore_map => false, 
                       :snp_file => :false, 
                       :min_consensus_quality => 20, 
                       :min_snp_quality => 20},
    ## some options are designed to be equivalent to vcfutils.pl from bvftools options when using vcf
    ##:min_depth (-d)
    ##:max_depth (-D)
    ##:mapping_quality (-Q) minimum RMS mappinq quality for SNPs (mq in info fields)
    ##:min_non_ref_count (-a) minimum num of alt bases ... the sum of the last two numbers in DP4 in info fields
    ##doesnt do anything with window filtering or pv values... 
    :histo_bin_width => 250000,
    :graphics => {:width => 1000, :height => 500, :draw_legend => false, :add_boxes => nil},
    :adjust => 1, 
    :control_chd => 0.5, 
    :expected_chd => 1.0,
    :threads => {:start => 0.2, :stop => 1.0, :slide => 0.01, :size => 0.1 },
    :peaks => {:sigma => 3.0, :threshold => 10.0, :background => false, :iterations => 13, :markov => false, :window => 3, :range => 10000} ##range is the width of the box to draw on the peak plot
  }
  @opts.merge!(options)
  @opts[:samtools][:r] = "#{options[:chromosome]}:#{options[:start]}-#{options[:stop]}"
  @pileup_outfile, @vcf_outfile = nil,nil
  if @opts[:variant_call][:indels] and (@opts[:write_pileup] or @opts[:write_vcf])
    $stderr.puts "Cannot yet output VCF/Pileup when generating INDELs. Turning output off."
    @opts[:write_pileup] = false
    @opts[:write_vcf] = false
  end
  if @opts[:write_pileup]
    @pileup_outfile = File.open(@opts[:write_pileup], "w")
  end
  if @opts[:write_vcf]
    @vcf_outfile = File.open(@opts[:write_vcf], "w")
  end
  
  @known_snps = Hash.new
  if @opts[:ignore_file]
    File.open(@opts[:ignore_file], "r").each do |line|
      col = line.chomp.split(/\t/)
      if @known_snps[col[0]]
        @known_snps[col[0]][col[1].to_i] = 1
      else
        @known_snps[col[0]] = Hash.new
        @known_snps[col[0]][col[1].to_i] = 1
      end
    end
  end
  open_file
end

Instance Attribute Details

#fileObject

Returns the value of attribute file.



343
344
345
# File 'lib/bio/util/bio-gngm.rb', line 343

def file
  @file
end

Instance Method Details

#calculate_clusters(opts = {}) ⇒ Object

Calculates the k-means clusters of density curves (groups threads into bands), [density curve y values] ]</tt> Calculates the clusters using the R function kmeans() Recalculates @densities as it does with Bio::Util::Gngm#calculate_densities, so clustering can be done without having to explicitly call Bio::Util::Gngm#calculate_densities. Clusters are recalulated every time regardless of whether its been done before contains anything or not so is useful for trying out different values for the parameters. When clusters are calculated the expected and control bands are compared with the Bio::Util::Gngm#calculate_signal method and the @signal array populated. Resets the instance variables @control_band, @expected_band, @signal, @peak_indices, @peak_y_values and @clusters

Options and defaults

  • :k => 9, -the number of clusters for the R kmeans function

  • :seed => false -set this to a number to make the randomized clustering reproducible

  • :control_chd => 0.5 -the value of the control thread/window

  • :expected_chd => 1.0 -the value of the expected thread/window

  • :adjust => 1.0 -the kernel adjustment parameter for the R density function

  • :pseudo => false - force the densities into a single thread cluster when the number of distinct threads with SNPs is < the value of k. This is only useful in a situation where the spread of the statistic is very limited. EG for using mapped/unmapped mate pairs then almost all windows will have proportion 1.0 but a tiny number will be close to 0.5 with few other values considered.



916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
# File 'lib/bio/util/bio-gngm.rb', line 916

def calculate_clusters( opts={} )
  options = {:k => 9, :seed => false, :adjust => 1, :control_chd => 0.5, :expected_chd => 1.0, :pseudo => false}
  options = options.merge(opts)
  if options[:pseudo]
    put_threads_into_individual_clusters(options)
    return
  end
  r = new_r
  names = []
  name = "a"
  @control_band = nil #needs resetting as we are working with new clusters
  @expected_band = nil #needs resetting as we are working with new clusters
  @signal = nil #needs resetting as we are working with new clusters
  @peak_indices = nil #needs resetting as we are working with new cluster
  @peak_y_values = nil #needs resetting as we are working with new cluster
  self.calculate_densities(options[:adjust]).each do |d|
    density_array = d.last
    r.assign name, density_array ##although windows go in in numeric order, r wont allow numbers as names in data frames so we need a proxy
    names << "#{name}=#{name}"
    name = name.next
  end
  data_frame_command = "data = data.frame(" + names.join(",") + ")"
  r.eval data_frame_command
  r.eval "set.seed(#{options[:seed]})" if options[:seed]
  r.eval "k = kmeans(cor(data),#{options[:k]},nstart=1000)"
  @clusters = r.pull "k$cluster" ##clusters are returned in the order in densities
  r.quit
  ##now set the cluster colours.. 
  colours = %w[#A6CEE3 #1F78B4 #B2DF8A #33A02C #FB9A99 #E31A1C #FDBF6F #FF7F00 #CAB2D6 #6A3D9A #FFFF99 #B15928]
  ci = 0
  col_nums = {} ##hash of cluster numbers and colours
  @clusters.each_index do |i|
    if not col_nums[@clusters[i]]
      col_nums[@clusters[i]] = colours[ci]
      ci += 1
      ci = 0 if ci > 11
    end
   @thread_colours[self.densities[i].first] = col_nums[@clusters[i]]
  end
  @control_band = get_band(options[:control_chd])
  @expected_band = get_band(options[:expected_chd])
  calculate_signal
end

#calculate_densities(adjust = 1) ⇒ Object

Sets and returns the array of arrays [window, [density curve x values], [density curve y values] ] Calculates the density curve using the R function density() Always sets @densities regardless of whether it contains anything or not so is useful for trying out adjustment values. Ignores threads with fewer than 2 polymorphisms since density can’t be computed with so few polymorphisms.

Options and defaults

  • adjust = 1, -the kernel adjustment parameter for the R density function



831
832
833
834
835
836
837
838
839
840
841
842
843
844
# File 'lib/bio/util/bio-gngm.rb', line 831

def calculate_densities(adjust=1)
  r = new_r
  densities = []
  self.threads.each do |t|
    next if t.last.length < 2 ##length of density array is smaller or == threads, since too small windows are ignored...
    r.curr_win = t.last
    r.eval "d = density(curr_win,n=240,kernel=\"gaussian\", from=#{@snp_positions.first[0]}, to=#{@snp_positions.last[0]}, adjust=#{adjust})"
    densities << [t.first, r.pull("d$x"), r.pull("d$y")]
  end
  r.quit
  @densities = densities
  calculate_density_max_y ##need to re-do every time we get new densities
  densities
end

#calculate_signalObject

Returns an array of values representing the ratio of average of the expected threads/windows to the control threads/windows. Sets @signal, the signal curve.



1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
# File 'lib/bio/util/bio-gngm.rb', line 1121

def calculate_signal
   r = new_r
    name = "a"
    control_names = []
    expected_names = []
    self.densities.each do |d|
      if @control_band.include?(d.first)
        density_array = d.last
        r.assign name, density_array ##although windows go in in numeric order, r wont allow numbers as names in data frames so we need a proxy
        control_names << "#{name}=#{name}"
      elsif @expected_band.include?(d.first)
        density_array = d.last
        r.assign name, density_array
        expected_names << "#{name}=#{name}"
      end
      name = name.next
    end
    data_frame_command = "control = data.frame(" + control_names.join(",") + ")"
    r.eval data_frame_command
    r.eval "control_mean = apply(control, 1, function(ecks) mean((as.numeric(ecks))) )"
    data_frame_command = "expected = data.frame(" + expected_names.join(",") + ")"
    r.eval data_frame_command
    r.eval "expected_mean = apply(expected, 1, function(ecks) mean((as.numeric(ecks))) )"
    r.eval "signal = expected_mean / control_mean"
    signal = r.pull "signal"
    r.quit
    @signal = signal
end

#closeObject

for BAM files calls Bio::DB::Sam#close to close the connections to input files safely



494
495
496
497
498
# File 'lib/bio/util/bio-gngm.rb', line 494

def close
  case @opts[:format]
  when :bam then @file.close
  end
end

#clusters(opts = {}) ⇒ Object

Returns the array instance variable @clusters. The R function kmeans() is used to calculate the clusters based on a correlation matrix of the density curves. If @clusters is nil when called this method will run the Bio::Util::Gngm#calculate_clusters method and set @clusters With this method you cannot recalculate the clusters after they have been done once.

Options and defaults

  • :k => 9, -the number of clusters for the R kmeans function

  • :seed => false -set this to a number to make the randomized clustering reproducible

  • :control_chd => 0.5 -the value of the control thread/window

  • :expected_chd => 1.0 -the value of the expected thread/window

  • :adjust => 1.0 -the kernel adjustment parameter for the R density function



900
901
902
# File 'lib/bio/util/bio-gngm.rb', line 900

def clusters(opts={})
  @clusters ||= calculate_clusters(opts={})
end

#collect_threads(options = {}) ⇒ Object

Resets contents of instance variable @threads and returns an array of arrays [[window 1, snp position 1, snp position 2 ... snp position n],[window 2, snp position 1, snp position 2 ... snp position n] ]. Always sets @threads regardless of whether it contains anything or not so is useful for trying out different window sizes etc

Options and defaults:

  • :start => 0.2 -first window

  • :stop => 1.0 -last window

  • :slide => 0.01 -distance between windows

  • :size => 0.1 -window width

Raises:

  • (RuntimeError)


748
749
750
751
752
753
754
755
756
757
758
# File 'lib/bio/util/bio-gngm.rb', line 748

def collect_threads(options={})
  opts = @opts[:threads].merge(options)
  opts[:slide] = 0.000001 if opts[:slide] < 0.000001 ##to allow for the rounding error in the step function... 
  raise RuntimeError, "snp positions have not been calculated yet" if not @snp_positions
  start,stop,slide,size = opts[:start].to_f, opts[:stop].to_f, opts[:slide].to_f, opts[:size].to_f
  arr = []
  (start..stop).step(slide) do |win|
    arr << [win, @snp_positions.select {|x| x.last >= win and x.last < win + size }.collect {|y| y.first} ]
  end
  @threads = arr
end

#densities(adjust = 1) ⇒ Object

Returns the instance variable @densities array of arrays [window, [density curve x values], [density curve y values] ]. The R function density() is used to calculate the values. If @densities is nil when called this method will run the Bio::Util::Gngm#calculate_densities method and set @densities With this method you cannot recalculate the densities after they have been done once.

Options and defaults

  • adjust = 1, -the kernel adjustment parameter for the R density function



821
822
823
# File 'lib/bio/util/bio-gngm.rb', line 821

def densities(adjust=1)
  @densities ||= calculate_densities(adjust)
end

#draw_bands(file = "myfile.png", optsa = {}) ⇒ Object

Draws the clustered bands that correspond to the expected and control window in a single PNG file file

Options and defaults

  • :add_lines => nil -if an array of positions is provided eg +[100,345] , vertical lines will be drawn at these positions. Useful for indicating feature positions on the plot

  • :width => 1000 -width of the PNG in pixels

  • :height => 500 -height of the PNG in pixels

Raises:

  • (RuntimeError)


854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
# File 'lib/bio/util/bio-gngm.rb', line 854

def draw_bands(file="myfile.png", optsa={})
  opts = @opts[:graphics].merge(optsa)
  pp optsa
  raise RuntimeError, "Can't draw threads until clustering is done" unless @clusters
  #uses R's standard plot functions.
  ##same as draw_threads, but skips threads that aren't on the bands lists
  ## 
  r = new_r 
  r.eval "png('#{file}', width=#{opts[:width]}, height=#{opts[:height]})"
  plot_open = false
  self.densities.each do |t|
      if @control_band.include?(t[0]) or @expected_band.include?(t[0])
        r.dx = t[1]
        r.dy = t[2]
        r.curr_win = t.last
        #r.eval "d = density(curr_win,n=240,kernel=\"gaussian\", from=#{@snp_positions.first[0]}, to=#{@snp_positions.last[0]})"
        if plot_open
          r.eval "lines(dx, dy, col=\"#{@thread_colours[t.first]}\")"
        else
          r.eval "plot(dx, dy, type=\"l\", col=\"#{@thread_colours[t.first]}\",ylim=c(0,#{density_max_y}), main='#{file}',xlab='position', ylab='density')"
          plot_open = true
        end
      end
  end
  label1 = "Control band: " + @control_band.min.to_s + " < ChD < " + @control_band.max.to_s
  label2 = "Expected band: " + @expected_band.min.to_s + " < ChD < " + @expected_band.max.to_s
  r.eval "legend('top', c('#{label1}','#{label2}'), lty=c(1,1),lwd=c(2.5,2.5),col=c('#{@thread_colours[@control_band.first]}','#{@thread_colours[@expected_band.first]}'))"
  if opts[:add_lines] and opts[:add_lines].instance_of?(Array)
    opts[:add_lines].each do |pos|
      r.eval "abline(v=#{pos})"
    end
  end
  r.eval "dev.off()"
  r.quit
end

#draw_hit_count(file = "myfile.png", opts = ) ⇒ Object

Draws a barplot of the number of polymorphisms in each thread/window in a single PNG file file



1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
# File 'lib/bio/util/bio-gngm.rb', line 1100

def draw_hit_count(file="myfile.png",opts=@opts[:graphics])
  r = new_r
  wins = []
  hits = []
  self.threads.each do |thread|
    wins << thread.first
    if thread.last.empty?
      hits << 0.01 ##pseudovalue gets around the case where a thread has no hits... which messes up barplot in R
    else
      hits << thread.last.length
    end
  end
  r.wins = wins
  r.hits = hits
  r.eval "png('#{file}', width=#{opts[:width]}, height=#{opts[:height]})"
  r.eval "barplot(hits, names.arg=c(wins), xlab='window', log='y', ylab='number of hits', main='Number of Polymorphisms #{file}', col=rgb(r=0,g=1,b=1, alpha=0.3), na.rm = TRUE)"
  r.eval "dev.off()"
end

#draw_peaks(file = "myfile.png", opts = ) ⇒ Object

Draws the peaks calculated from the signal curve by the R function Peaks in Bio::Util::Gngm#calculate_peaks. Adds boxes of width :range to each peak and annotates the limits. Options are set in the global options hash :peaks. and relate to the Peaks function in R



1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
# File 'lib/bio/util/bio-gngm.rb', line 1043

def draw_peaks(file="myfile.png",opts=@opts[:graphics])
  opts_a = @opts[:peaks]
  opts_a.merge!(opts)
  opts = opts_a ##sigh ... 
  #opts[:background] = opts[:background].to_s.upcase
  #opts[:markov] = opts[:markov].to_s.upcase  
  self.get_peaks(opts)
  r = new_r
  #r.eval "suppressMessages ( library('Peaks') )"
  r.signal = self.signal
  r.x_vals = self.densities[0][1]
  r.eval "png('#{file}', width=#{opts[:width]}, height=#{opts[:height]})"
  #r.eval "spec = SpectrumSearch(signal,#{opts[:sigma]},threshold=#{opts[:threshold]},background=#{opts[:background]},iterations=#{opts[:iterations]},markov=#{opts[:markov]},window=#{opts[:window]})"
  #peak_positions = r.pull "spec$pos"
  #y = r.pull "spec$y"
  r.y = @peak_y_values
  r.pos = @peak_indices
  r.eval "plot(x_vals,y, type=\"l\", xlab='position', ylab='Peaks', main='#{file}' )"
  @peak_indices.each do |peak|
    r.eval "rect(x_vals[#{peak}]-(#{opts[:range]/2}), 0, x_vals[#{peak}]+#{opts[:range]/2}, max(y), col=rgb(r=0,g=1,b=0, alpha=0.3) )"
    r.eval "text(x_vals[#{peak}]-(#{opts[:range]/2}),max(y) + 0.05, floor(x_vals[#{peak}]-(#{opts[:range]/2})) )"
    r.eval "text(x_vals[#{peak}]+(#{opts[:range]/2}), max(y) + 0.05, floor(x_vals[#{peak}]+(#{opts[:range]/2})) )"
  end
  r.eval "dev.off()"
  r.quit
end

#draw_signal(file = "myfile.png", opts = ) ⇒ Object

Draws the contents of the @signal instance variable in a single PNG file file



1020
1021
1022
1023
1024
1025
1026
1027
1028
# File 'lib/bio/util/bio-gngm.rb', line 1020

def draw_signal(file="myfile.png", opts=@opts[:graphics]) #data.frame(bubs=data$bubbles_found,conf=data$bubbles_confirmed)
  r = new_r    
  x_vals = self.densities[0][1]
  r.eval "png('#{file}', width=#{opts[:width]}, height=#{opts[:height]})"
  r.x_vals = x_vals
  r.signal =  self.signal
  r.eval "plot(x_vals,signal, type=\"l\", xlab='position', ylab='ratio of signals (expected / control ~ homo / hetero)', main='#{file}' )"
  r.eval "dev.off()"
end

#draw_threads(file = "myfile.png", options = {}) ⇒ Object

Draws the threads in a single PNG file file

Options and defaults

  • :draw_legend => nil -if a filename is provided a legend will be drawn in a second plot

  • :width => 1000 -width of the PNG in pixels

  • :height => 500 -height of the PNG in pixels

Raises:

  • (RuntimeError)


784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
# File 'lib/bio/util/bio-gngm.rb', line 784

def draw_threads(file="myfile.png", options={})
  opts = @opts[:graphics].merge(options)
  #uses R's standard plot functions.. needed because ggplot can die unexpectedly...
  raise RuntimeError, "Can't draw threads until clustering is done" unless @clusters
  r = new_r
  r.eval "png('#{file}', width=#{opts[:width]}, height=#{opts[:height]})"
  plot_open = false
  self.densities.each do |t|
   r.curr_win = t.last
   r.dx = t[1]
   r.dy = t[2]
    if plot_open
      r.eval "lines(dx,dy, col=\"#{@thread_colours[t.first]}\", xlab='position', ylab='density')"
    else
      r.eval "plot(dx,dy, type=\"l\", col=\"#{@thread_colours[t.first]}\",ylim=c(0,#{density_max_y}), main='#{file}',xlab='position', ylab='density')"
      plot_open = true
    end
  end
  r.eval "dev.off()"
  if opts[:draw_legend]
    r.eval "png('#{opts[:draw_legend]}', width=#{opts[:width]}, height=#{opts[:height]})"
    colours = @thread_colours.each.sort.collect {|x| x.last}.join("','")
    names =  @thread_colours.each.sort.collect {|x| x.first}.join("','")
    r.eval "plot(1,xlab="",ylab="",axes=FALSE)"
    r.eval "legend('top', c('#{names}'), lty=c(1),lwd=c(1),col=c('#{colours}'), ncol=4)"
    r.eval "dev.off()"
  end
  r.quit
end

#frequency_histogram(file = "myfile.png", bin_width = , opts = ) ⇒ Object

Draws a histogram of polymorphism frequencies across the reference genome section defined in Bio::Util::Gngm#initialize with bin width bin_width and writes it to a PNG file file



713
714
715
716
717
718
719
720
721
722
723
724
# File 'lib/bio/util/bio-gngm.rb', line 713

def frequency_histogram(file="myfile.png", bin_width=@opts[:histo_bin_width], opts=@opts[:graphics])
  posns = self.snp_positions.collect {|a| a.first}
  r = new_r
  r.eval "suppressMessages ( library(ggplot2) )" #setup R environment... 
  r.posns = posns
  r.eval "data = data.frame(position=posns)"
  r.eval "png('#{file}', width=#{opts[:width]}, height=#{opts[:height]})"
  graph_cmd = "qplot(position,data=data, geom='histogram', binwidth = #{bin_width}, alpha=I(1/3), main='#{file}', color='red')"
  r.eval(graph_cmd)
  r.eval "dev.off()"
  r.quit
end

#get_band(window = 1.0) ⇒ Object

Raises:

  • (RuntimeError)


1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
# File 'lib/bio/util/bio-gngm.rb', line 1000

def get_band(window=1.0)
  ##because of the weird step rounding error we need to find the internal name of the window.. so find it from the list from the name the user
  ##expects it to be, may give more than one passing window so keep only first one..
  windows = find_window(window)
  raise RuntimeError, "Couldnt find window #{window}, or window has no data to calculate: \n windows are #{self.densities.collect {|d| d.first} }" if windows.empty? ##if we have a window that is close enough to the specified window
  idx = find_index(windows.first)
  #find out which cluster the window is in
  cluster = self.clusters[idx]
  ##get the other windows in the same cluster, ie the band...
  band = []
  self.clusters.each_index do |i|
    if self.clusters[i] == cluster
      band << self.densities[i].first
    end
  end
  band
end

#get_insert_size_frequency(options = {}) ⇒ Object

Returns array of arrays [[window start position, proportion of alignments > insert size]]. Does this by taking successive windows across reference and collects the proportion of the reads in that window that have an insert size > the expected insert size. Proportions approaching 1 indicate that the sequenced organism has a deletion in that section, proportions approaching 0 indicate an insertion in that section, proportions around 0.5 indicate random variation of insert size, IE no indel.

Each section should be approximately the size of the insertion you expect to find and should increment in as small steps as possible.

Options and defaults:

  • :ref_window_size => 200 width of window in which to calculate proportions

  • :ref_window_slide => 50 number of bases to move window in each step

  • :isize => 150 expected insert size

Sets the instance variable @snp_positions. Only gets positions the first time it is called, in subsequent calls pre-computed positions and statistics are returned, so changing parameters has no effect



682
683
684
685
686
687
688
# File 'lib/bio/util/bio-gngm.rb', line 682

def get_insert_size_frequency(options={})
  opts = @opts[:insert_size_opts].merge(options)
  return @snp_positions if @snp_positions
  case
  when @file.instance_of?(Bio::DB::Sam) then get_insert_size_frequency_from_bam(opts)
  end
end

#get_peaks(opts = ) ⇒ Object

private Calculates the position of peaks in the signal curve



1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
# File 'lib/bio/util/bio-gngm.rb', line 1072

def get_peaks(opts=@opts[:peaks])
  opts[:background] = opts[:background].to_s.upcase
  opts[:markov] = opts[:markov].to_s.upcase  
  r = new_r
  r.eval "suppressMessages ( library('Peaks') )"
  r.signal = self.signal
  r.x_vals = self.densities[0][1]
  r.eval "spec = SpectrumSearch(signal,#{opts[:sigma]},threshold=#{opts[:threshold]},background=#{opts[:background]},iterations=#{opts[:iterations]},markov=#{opts[:markov]},window=#{opts[:window]})"
  @peak_indices = r.pull "spec$pos"
  if @peak_indices.instance_of?(Fixnum)
    @peak_indices = [@peak_indices]
  end
  @peak_y_values = r.pull "spec$y"
  r.quit
end

#get_unmapped_mate_frequency(options = {}) ⇒ Object

Returns array of arrays [[window start position, proportion of reads with unmapped mates]]. Does this by taking successive windows across reference and counting the reads with unmapped mates Proportions approaching 0.5 indicate that the sequenced organism has an insertion in that section, proportions approaching 0 indicate nothing different in that section.

Each section should be approximately the size of the insertion you expect to find and should increment in as small steps as possible.

Options and defaults:

  • :ref_window_size => 200 width of window in which to calculate proportions

  • :ref_window_slide => 50 number of bases to move window in each step

Sets the instance variable @snp_positions. Only gets positions the first time it is called, in subsequent calls pre-computed positions and statistics are returned, so changing parameters has no effect



701
702
703
704
705
706
707
# File 'lib/bio/util/bio-gngm.rb', line 701

def get_unmapped_mate_frequency(options={})
  opts = @opts[:insert_size_opts].merge(options)
  return @snp_positions if @snp_positions
  case
  when @file.instance_of?(Bio::DB::Sam) then get_unmapped_mate_frequency_from_bam(opts)
  end
end

#hit_countObject

Returns an array of polymorphisms in each thread/window <tt>[[window, polymorphism count] ]. Useful for sparse polymorphism counts or over small regions where small polymorphism counts can cause artificially large peaks in density curves.



1090
1091
1092
1093
1094
1095
1096
# File 'lib/bio/util/bio-gngm.rb', line 1090

def hit_count
  arr = []
  self.threads.each do |thread|
    arr << [thread.first, thread.last.length]
  end
  arr
end

#is_allowed_substitution?(ref, alt, opts) ⇒ Boolean

Returns:

  • (Boolean)


531
532
533
534
535
536
# File 'lib/bio/util/bio-gngm.rb', line 531

def is_allowed_substitution?(ref,alt,opts)
  if opts[:substitutions].instance_of?(Array)
    return false unless opts[:substitutions].include?("#{ref}:#{alt}")
  end
  true
end

#keep_known_variants(file = nil) ⇒ Object

Deletes everything from self.snp_positions not mentioned by position in self.known_variants. Directly modifies self.snp_positions



1190
1191
1192
1193
1194
1195
# File 'lib/bio/util/bio-gngm.rb', line 1190

def keep_known_variants(file=nil)
  raise "file of known variants not provided and @known_variants is nil" if @known_variants.nil? and file.nil?
  @known_variants = parse_known_variants(file) if @known_variants.nil? and file
  @snp_positions.each do |snp|
  end
end

#peaksObject

Returns the positions of the peaks in the signal curve calculated by Bio::Util::Gngm#get_peaks as an array



1036
1037
1038
# File 'lib/bio/util/bio-gngm.rb', line 1036

def peaks
  @peak_indices.collect {|x| self.densities[0][1][x].to_f.floor} 
end

#signalObject



1151
1152
1153
# File 'lib/bio/util/bio-gngm.rb', line 1151

def signal
  @signal ||= calculate_signal
end

#snp_positions(optsa = {}) ⇒ Object

Returns array of arrays [[position, statistic]] for polymorphisms passing filters in optsa Default options are those in the :variant_call global options hash which can be over ridden in the method call

Options and defaults:

  • :indels => false -call small insertions AND deletions instead of simple SNPs

  • :deletions_only => false -call just deletions instead of simple SNPs

  • :insertions_only => false -call small insertions instead of simple SNPs

  • :min_depth => 2 -minimum quality passing depth of coverage at a position for a SNP call

  • :max_depth => 10000000 -maximum quality passing depth of coverage at a position for a SNP call

  • :mapping_quality => 10.0 -minimum mapping quality required for a read to be used in depth calculation

  • :min_non_ref_count => 2 -minimum number of reads not matching the reference for SNP to be called

  • :ignore_reference_n => true -ignore positions where the reference is N or n

When INDEL calling only one of :indels should be used. If false, SNPs are called.

calculates or returns the value of the instance variable @snp_positions. Only gets positions the first time it is called, in subsequent calls pre-computed positions and statistics are returned, so changing parameters has no effect.



516
517
518
519
520
521
522
523
524
# File 'lib/bio/util/bio-gngm.rb', line 516

def snp_positions(optsa={})
  opts = @opts[:variant_call].merge(optsa)
  return @snp_positions if @snp_positions
  case @opts[:format]
  when :bam then get_snp_positions_from_bam(opts)
  when :text then get_snp_positions_from_text(opts)
  when :pileup then get_snp_positions_from_pileup(opts)
  end
end

#snp_positions=(arr) ⇒ Object

allows the user to assign SNP positions



527
528
529
# File 'lib/bio/util/bio-gngm.rb', line 527

def snp_positions=(arr)
  @snp_positions = arr
end

#threads(opts = ) ⇒ Object



735
736
737
# File 'lib/bio/util/bio-gngm.rb', line 735

def threads(opts=@opts[:threads]) 
  @threads ||= collect_threads(opts)
end