bio-samtools

The original project samtools-ruby belongs to Ricardo H. Ramirez @ https://github.com/homonecloco/samtools-ruby

Introduction

Documentation and code come from that project and we'll adapt it for a better integration in BioRuby.

Binder of samtools for ruby, on the top of FFI.

This project was born from the need to add support of BAM files to the gee_fu genome browser.

Installation

Add this line to your application's Gemfile:

gem 'bio-samtools'

And then execute:

bundle

Or install it yourself as:

$ gem install bio-samtools

Getting started

Creating a new SAM object

A SAM object represents the alignments in the BAM file, and is very straightforward to create, you will need a sorted BAM file, to access the alignments and a reference sequence in FASTA format to use the reference sequence. The object can be created and opened as follows:

require 'bio-samtools'

bam = Bio::DB::Sam.new(:bam=>"my_sorted.bam", :fasta=>'ref.fasta')
bam.open

Getting Reference Sequence

Retrieving the reference can only be done if the reference has been loaded, which isn't done automatically in order to save memory. Reference need only be loaded once, and is accessed using reference name, start, end in 1-based co-ordinates. A standard Ruby String object is returned.

bam.load_reference 
sequence_fragment = bam.fetch_reference("Chr1", 1, 500)

Getting Alignments

Alignments can be obtained one at a time by looping over a specified region using the fetch() function.

bam.load_reference 
bam.fetch("1",3000,4000).each do |alignment|
    #do something with the alignment...
end

Tutorial

Creating a BAM file

Often, the output from a next-generation sequence alignment tool will be a file in the SAM format.

Typically, we'd create a compressed, indexed binary version of the SAM file, which would allow us to operate on it in a quicker and more efficient manner, being able to randomly access various parts of the alignment. We'd use the view to do this. This step would involve takeing our sam file, sorting it and indexing it.

#create the sam object
sam = Bio::DB::Sam.new(:bam => 'my.sam', :fasta => 'ref.fasta')

#create a bam file from the sam file
sam.view(:b=>true, :S=>true, :o=>'bam.bam')

#create a new sam object from the bam file
unsortedBam = Bio::DB::Sam.new(:bam => 'bam.bam', :fasta => 'ref.fasta')

#the bam file might not be sorted (necessary for samtools), so sort it
unsortedBam.sort(:prefix=>'sortedBam')

#create a new sam object
bam = Bio::DB::Sam.new(:bam => 'sortedBam.bam', :fasta => 'ref.fasta')
#create a new index
bam.index()

#creates index file sortedBam.bam.bai

Working with BAM files

Creating a new SAM object

A SAM object represents the alignments in the BAM file. BAM files (and hence SAM objects here) are what most of SAMtools methods operate on and are very straightforward to create. You will need a sorted and indexed BAM file, to access the alignments and a reference sequence in FASTA format to use the reference sequence. Let's revisit the last few lines of code from the code above.

bam = Bio::DB::Sam.new(:bam => 'sortedBam.bam', :fasta => 'ref.fasta')
bam.index()

Creating the new Bio::DB::Sam (named 'bam' in this case) only to be done once for multiple operations on it, access to the alignments is random so you don't need to loop over the entries in the file.

Getting Reference Sequence

The reference is accessed using reference name, start, end in 1-based co-ordinates. A standard Ruby String object is returned.

sequence_fragment = bam.fetch_reference("Chr1", 1, 100)
puts sequence_fragment
=> cctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta

A reference sequence can be returned as a Bio::Sequence::NA object buy the use of :as_bio => true

sequence_fragment = bam.fetch_reference("Chr1", 1, 100, :as_bio => true)

The printed output from this would be a fasta-formatted string

puts sequence_fragment

=> >Chr1:1-100
=> cctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta

Concatenating BAM files

BAM files may be concatenated using the cat command. The sequence dictionary of each input BAM must be identical, although the cat method does not check this.

#create an array of BAM files to cat
bam_files = [bam1, bam2]
cat_file = "maps_cated.bam" #the outfile
#cat the files
@sam.cat(:out=>cat_file, :bams=>bam_files)
#create a new Bio::DB::Sam object from the new cat file
cat_bam = Bio::DB::Sam.new(:fasta => "ref.fasta", :bam => cat_file)

Removing duplicate reads

The remove_duplicates method removes potential PCR duplicates: if multiple read pairs have identical external coordinates it only retain the pair with highest mapping quality. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).


unduped = "dupes_rmdup.bam" #an outfile for the removed duplicates bam
#remove single-end duplicates
bam.remove_duplicates(:s=>true, :out=>unduped)
#create new Bio::DB::Sam object
unduped_bam = Bio::DB::Sam.new(:fasta => "ref.fasta", :bam => unduped)

Alignment Objects

The individual alignments represent a single read and are returned as Bio::DB::Alignment objects. These have numerous methods of their own, using require 'pp' will allow you to check the attributes contained in each object. Here is an example alignment object. Remember @ represents a Ruby instance variable and can be accessed as any other method. Thus the @is_mapped attribute of an object a is accessed a.is_mapped

require 'pp'
pp an_alignment_object ##some Bio::DB::Alignment object
#<Bio::DB::Alignment:0x101113f80
@al=#<Bio::DB::SAM::Tools::Bam1T:0x101116a50>,
@calend=4067,
@cigar="76M",
@failed_quality=false,
@first_in_pair=false,
@flag=163,
@is_duplicate=false,
@is_mapped=true,
@is_paired=true,
@isize=180,
@mapq=60,
@mate_strand=false,
@mate_unmapped=false,
@mpos=4096,
@mrnm="=",
@pos=3992,
@primary=true,
@qlen=76,
@qname="HWI-EAS396_0001:7:115:17904:15958#0",
@qual="IIIIIIIIIIIIHHIHGIHIDGGGG...",
@query_strand=true,
@query_unmapped=false,
@rname="1",
@second_in_pair=true,
@seq="ACAGTCCAGTCAAAGTACAAATCGAG...",
@tags=
    {"MD"=>#<Bio::DB::Tag:0x101114ed0 @tag="MD", @type="Z", @value="76">,
     "XO"=>#<Bio::DB::Tag:0x1011155d8 @tag="XO", @type="i", @value="0">,
     "AM"=>#<Bio::DB::Tag:0x101116280 @tag="AM", @type="i", @value="37">,
     "X0"=>#<Bio::DB::Tag:0x101115fb0 @tag="X0", @type="i", @value="1">,
     "X1"=>#<Bio::DB::Tag:0x101115c68 @tag="X1", @type="i", @value="0">,
     "XG"=>#<Bio::DB::Tag:0x101115240 @tag="XG", @type="i", @value="0">,
     "SM"=>#<Bio::DB::Tag:0x1011162f8 @tag="SM", @type="i", @value="37">,
     "XT"=>#<Bio::DB::Tag:0x1011162a8 @tag="XT", @type="A", @value="U">,
     "NM"=>#<Bio::DB::Tag:0x101116348 @tag="NM", @type="i", @value="0">,
     "XM"=>#<Bio::DB::Tag:0x101115948 @tag="XM", @type="i", @value="0">}>

Getting Alignments

Alignments can be obtained one at a time by looping over a specified region using the fetch() function.

bam.fetch("Chr1",3000,4000).each do |alignment|
    #do something with the alignment...
end

A separate method fetch_with_function() allows you to pass a block (or a Proc object) to the function for efficient calculation. This example takes an alignment object and returns an array of sequences which exactly match the reference.

#an array to hold the matching sequences
exact_matches = []

matches = Proc.new do |a|
    #get the length of each read
    len = a.seq.length
    #get the cigar string
    cigar = a.cigar
    #create a cigar string which represents a full-length match
    cstr = len.to_s << "M"
    if cigar == cstr
        #add the current sequence to the array if it qualifies
        exact_matches << a.seq
    end
end

bam.fetch_with_function("Chr1", 100, 500, &matches) 

puts exact_matches

Alignment stats

The SAMtools flagstat method is implemented in bio-samtools to quickly examine the number of reads mapped to the reference. This includes the number of paired and singleton reads mapped and also the number of paired-reads that map to different chromosomes/contigs.

bam.flag_stats()

An example output would be

34672 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
33196 + 0 mapped (95.74%:nan%)
34672 + 0 paired in sequencing
17335 + 0 read1
17337 + 0 read2
31392 + 0 properly paired (90.54%:nan%)
31728 + 0 with itself and mate mapped
1468 + 0 singletons (4.23%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Getting Coverage Information

Per Base Coverage

It is easy to get the total depth of reads at a given position, the chromosome_coverage function is used. This differs from the previous functions in that a start position and length (rather than end position) are passed to the function. An array of coverages is returned, the first position in the array gives the depth of coverage at the given start position in the genome, the last position in the array gives the depth of coverage at the given start position plus the length given

coverages = bam.chromosome_coverage("Chr1", 3000, 1000)  #=> [16,16,25,25...]

Average Coverage In A Region

Similarly, average (arithmetic mean) of coverage can be retrieved with the average_coverage method.

coverages = bam.average_coverage("Chr1", 3000, 1000)  #=> 20.287

Coverage from a BED file

It is possible to count the number of nucleotides mapped to a given region of a BAM file by providing a BED formatted file and using the bedcov method. The output is the BED file with an extra column providing the number of nucleotides mapped to that region.

bed_file =  "test.bed"
bam.bedcov(:bed=>bed_file)

=> chr_1    1   30  6
=> chr_1    40  45  8

Alternatively, the depth method can be used to get per-position depth information (any unmapped positions will be ignored).

bed_file =  "test.bed"
@sam.depth(:b=>bed_file)

=> chr_1    25  1
=> chr_1    26  1
=> chr_1    27  1
=> chr_1    28  1
=> chr_1    29  1
=> chr_1    30  1
=> chr_1    41  1
=> chr_1    42  1
=> chr_1    43  2
=> chr_1    44  2
=> chr_1    45  2

Getting Pileup Information

Pileup format represents the coverage of reads over a single base in the reference. Getting a Pileup over a region is very easy. Note that this is done with mpileup and NOT the now deprecated SAMtools pileup function. Calling the mpileup method creates an iterator that yields a Pileup object for each base.

bam.mpileup do |pileup|
    puts pileup.consensus #gives the consensus base from the reads for that position
end 

Caching pileups

A pileup can be cached, so if you want to execute several operations on the same set of regions, mpilup won't be executed several times. Whenever you finish using a region, call mpileup_clear_cache to free the cache. The argument 'Region' is required, as it will be the key for the underlying hash. We assume that the options (other than the region) are constant. If they are not, the cache mechanism may not be consistent.

#create an mpileup
reg = Bio::DB::Fasta::Region.new
reg.entry = "Chr1"
reg.start = 1
reg.end = 334

bam.mpileup_cached(:r=>reg,:g => false, :min_cov => 1, :min_per =>0.2) do |pileup|
    puts pileup.consensus
end
bam.mpileup_clear_cache(reg)

Pileup options

The mpileup function takes a range of parameters to allow SAMtools level filtering of reads and alignments. They are specified as key => value pairs eg

bam.mpileup(:r => "Chr1:1000-2000", :Q => 50) do |pileup|
    ##only pileups on Chr1 between positions 1000-2000 are considered, 
    ##bases with Quality Score < 50 are excluded
    ...
end 

Not all the options SAMtools allows you to pass to mpileup will return a Pileup object, The table below lists the SAMtools flags supported and the symbols you can use to call them in the mpileup command.

SAMtools optionsdescriptionshort symbollong symboldefaultexample
rlimit retrieval to a region:r:regionall positions:r => "Chr1:1000-2000"
6assume Illumina scaled quality scores:six:illumina_qualsfalse:six => true
Acount anomalous read pairs scores:A:count_anomalousfalse:A => true
Bdisable BAQ computation:B:no_baqfalse:no_baq => true
Cparameter for adjusting mapQ:C:adjust_mapq0:C => 25
dmax per-BAM depth to avoid excessive memory usage:d:max_per_bam_depth250:d => 123
Eextended BAQ for higher sensitivity but lower specificity:E:extended_baqfalse:E => true
Gexclude read groups listed in FILE:G:exclude_reads_filefalse:G => my_file.txt
llist of positions (chr pos) or regions (BED):l:list_of_positionsfalse:l => my_posns.bed
Mcap mapping quality at value:M:mapping_quality_cap60:M => 40
Rignore RG tags:R:ignore_rgfalse:R => true
qskip alignments with mapping quality smaller than value:q:min_mapping_quality0:q => 30
Qskip bases with base quality smaller than value:Q:imin_base_quality13:Q => 30

Coverage Plots

You can create images that represent read coverage over binned regions of the reference sequence. The output format is svg. A number of parameters can be changed to alter the style of the plot. In the examples below the bin size and fill_color have been used to create plots with different colours and bar widths.

The following lines of code...

bam.plot_coverage("Chr1", 201, 2000, :bin=>20, :svg => "out2.svg", :fill_color => '#F1A1B1')
bam.plot_coverage("Chr1", 201, 2000, :bin=>50, :svg => "out.svg", :fill_color => '#99CCFF')
bam.plot_coverage("Chr1", 201, 1000, :bin=>250, :svg => "out3.svg", :fill_color => '#33AD5C', :stroke => '#33AD5C')

Coverage plot 1 Coverage plot 2 Coverage plot 2

The plot_coverage method will also return the raw svg code, for further use. Simply leave out a file name and assign the method to a variable.

svg = bam.plot_coverage("Chr1", 201, 2000, :bin=>50, :fill_color => '#99CCFF')

VCF methods

For enhanced snp calling, we've included a VCF class which reflects each non-metadata line of a VCF file. The VCF class returns the eight fixed fields present in VCF files, namely chromosome, position, ID, reference base, alt bases, alt quality score, filter and info along with the genotype fields, format and samples. This information allows the comparison of variants and their genotypes across any number of samples. The following code takes a number of VCF objects and examines them for homozygous alt (1/1) SNPs

vcfs = []
vcfs << vcf1 = Bio::DB::Vcf.new("20 14370   rs6054257   G   A   29  0   NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:-1,-1") #from a 3.3 vcf file
vcfs << vcf2 = Bio::DB::Vcf.new("19 111 .   A   C   9.6 .   .   GT:HQ   0|0:10,10   0/0:10,10   0/1:3,3") #from a 4.0 vcf file
vcfs << vcf3 = Bio::DB::Vcf.new("20 14380   rs6054257   G   A   29  PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,") #from a 4.0 vcf file

vcfs.each do |vcf|
    vcf.samples.each do |sample|
        genotype = sample[1]['GT']
        if genotype == '1/1' or genotype == '1|1'
            print vcf.chrom, " "
            puts vcf.pos
        end
    end
end

=> 20 14370
=> 20 14380

Other methods not covered

The SAMtools methods faidx, fixmate, tview, reheader, calmd, targetcut and phase are all included in the current bio-samtools release.

Tests

The easiest way to run the built-in unit tests is to change to the bio-samtools source directory and running 'rake test'

Each test file tests different aspects of the code.

Dependencies

FAQ

  • I want to use Ruby 1.x, what can I do?

We try to ensure backwards compatibility with old rubies. However we only officially support current versions of https://www.ruby-lang.org/en/downloads/. The code should work however the testing suites used in earlier versions are not currently supported and don't work in modern rubies. This decision ensures compatibility with maintained versions of Ruby.

Contributing to bio-samtools

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
  • Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it
  • Fork the project
  • Start a feature/bugfix branch
  • Commit and push until you are happy with your contribution
  • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.

TODO

  1. Filter to the fetching algorithm (give a condition that has to be satisfied to add the alignment to the list)

To whom do I complain?

Try [email protected] and [email protected]

Important Notes

Important Notes for developers

Remember that you must compile and install samtools for you host system. In order to do that there are two possible solutions:

  • download, compile and install the library in bioruby-samtools-your_clone/lib/bio/db/sam/external/samtools and bioruby-samtools-your_clone/lib/bio/db/sam/external/bcftools by yourself
  • in your bioruby-samtools-your_clone create the Rakefile typing cd ext; ruby mkrf_conf.rb; rake -f Rakefile

The latest I think is the easiest way, cause you are replicating the automatic process.

For testing just run rake test. Tests must be improved.

Travis integration

If you are integrating this library into another tool and testing it with travis, add the follwing in .travis.yml:

addons:
  apt:
    packages:
    - zlib1g-dev
    - libncurses5-dev
    - libtinfo-dev

Copyright (c) 2011 Raoul J.P. Bonnal. See LICENSE.txt for further details.