bio-alignment

Alignment handler for multiple sequence alignments (MSA).

This alignment handler makes no assumptions about the underlying sequence object. Support for any nucleotide, amino acid and codon sequences that are lists. Any list with payload can be used (e.g. nucleotide quality score, codon annotation). The only requirement is that the list is iterable and can be indexed.

This work is based on Pjotr's experience designing the BioScala Alignment handler and BioRuby's PAML support. Read the Bio::BioAlignment design document for Ruby.

Note: this software is under active development.

Developers

Codon alignment example

To use the library, load aligned sequences into the Alignment matrix. Here we write an amino acid alignment from a codon aligmment (note codon gaps are represented by '---')

  require 'bio-alignment'
  require 'bigbio' # Fasta reader and writer

  include Bio::BioAlignment
  aln = Alignment.new
  fasta = FastaReader.new('codon-alignment.fa')
  fasta.each do | rec |
    aln.sequences << CodonSequence.new(rec.id, rec.seq)
  end
  # write a matching amino acid alignment
  fasta = FastaWriter.new('aa-aln.fa')
  aln.rows.each do | row |
    fasta.write(row.id, row.to_aa.to_s)
  end

Now add some state - you can define your own row state

  # tell the row to handle state
  aln[0].extend(State)
  # mark the first row for deletion
  aln[0].state = MyStateDeleteObject.new
  if aln.rows[0].state.deleted?
    # do something
  end

Accessing columns

BioAlignment has a module for handling columns in an alignment. As long as the contained sequence objects have the [] and length methods, they can lazily be iterated by column. To get a column and iterate it

  column = aln.columns[3]
  column.each do |element|
    p element
  end

Now add some state - you can define your own column state

  aln.columns[3].state = MyStateDeleteObject.new
  if aln.columns[3].state.deleted?
    # do something
  end

BioRuby Sequence objects

BioAlignment supports adding BioRuby's Bio::Sequence objects:

  require 'bio'  # BioRuby
  require 'bio-alignment'
  require 'bio-alignment/bioruby' # make Bio::Sequence enumerable
  include Bio::BioAlignment

  aln = Alignment.new
  aln.sequences << Bio::Sequence::NA.new("atgcatgcaaaa")
  aln.sequences << Bio::Sequence::NA.new("atg---tcaaaa")

and we can transform BioAlignment into BioRuby's Bio::Alignment and use BioRuby functions

  bioruby_aln = aln.to_bioruby_alignment
  bioruby_aln.consensus_iupac

Pal2nal

A protein (amino acid) to nucleotide alignment would first load the sequences

  aln1 = Alignment.new
  fasta1 = FastaWriter.new('aa-aln.fa')
  aln1.rows.each do | row |
    fasta1.write(row.id, row.to_aa.to_s)
  end
  aln2 = Alignment.new
  fasta2 = FastaReader.new('nt.fa')
  fasta2.each do | rec |
    aln2.sequences << Sequence.new(rec.id, rec.seq)
  end

Writing a (simple) version of pal2nal would be something like

  fasta3 = FastaWriter.new('nt-aln.fa')
  aln.each_with_index do | aaseq, i |
    ntseq = aln2.sequences[i]
    aaseq.id.should == ntseq.id
    codonseq = CodonSequence.new(ntseq.id, ntseq.seq)
    codon_pos = 0
    result = []
    aaseq.each do | aa |
      result <<
        if aa.gap?
          '---'
        else
          codon_pos += 1
          codonseq[codon_pos-1].to_s
        end
    end
    fasta3.write(aaseq.id, result.join(''))
  end

With amino acid aa_aln and nucleotide nt_aln loaded, the library version of pal2nal includes validation

  aln = aa_aln.pal2nal(nt_aln, :codon_table => 3, :do_validate => true)

resulting in the codon alignment.

Phylogeny

BioAlignment has support for attaching a phylogentic tree to an alignment, and traversing the tree.

Alignment marking/masking/editing

One of the primary reasons for creating BioAlignment is to provide easy ways of editing alignments using a functional style of programming. Primitives are provided which take out much of the plumbing, such as maintaining row/column/element state, and allow copy-on-edit (so no conflicts arise in concurrent code). For example, to walk an alignment by row, and update the row state, you can mark all rows for deletion which contain many gaps

  include MarkRows
  mark_rows { |rowstate,row|  # for every row/sequence
    num = row.count { |e| e.gap? }
    if (num.to_f/row.length) > 0.5
      rowstate.delete!  # mark row for deletion
    end
    rowstate   # returns the updated row state
  }

next, return a (deep) copy of the original alignment with the rows that are not marked for deletion

  aln2 = aln.rows_where { |row| !row.state.deleted? }

The general idea is that there are many potential ways of selecting rows, and changing some state. The 'mark_rows' function/iterator takes care of the plumbing. All the programmer needs to do is to set the criterion, in this case a gap percentage, and tell the library what state has to change. In this example we only access one row, but you can also access the other rows. You won't be surprised that marking columns looks much the same

  include MarkColumns
  mark_columns { |colstate,col|  # for every column
    num = col.count { |e| e.gap? }
    if (num.to_f/col.length) > 0.5
      colstate.delete! 
    end
    colstate
  }

''count'' is one of the universal functions that counts elements in a row, column, or alignment.

Next to modifying the state of rows and columns, you can also access the state of alignment elements (i.e. codons, amino acids, nucleotide acids). For example, here we mask every element that has a masked state

  aln = masked_aln.update_each_element { |e| (e.state.masked? ?  Element.new("X"):e)}

and, here we remove every marked element by turning it into a gap

  aln = marked_aln.update_each_element { |e| (e.state.marked? ? Element.new("-"):e)}

''update_each_element'' visits every element in the MSA, and replaces the old with the new.

It is important to note that, instead of directly editing alignments in place, this module always makes it a two step process. First items are masked/marked through the state of the rows/columns/elements, next the alignment is rewritten using this state. The advantage of using an intermediate state is that the state can be queried for creating (for example) nice output/graphics, using both the original and changed alignments. For example, it is really easy to create a nice output showing which columns were deleted in the original alignment, or which amino acids were masked. Still, methods are available, which hide the two step process, as seen in the next example.

BioAlignment supports many alignment editing features, which are listed here. An edit feature is added at runtime(!) Example:

  require 'bio-alignment/edit/del_bridges'

  aln.extend DelBridges         # mix the module into the object 
  aln2 = aln.del_bridges        # execute the alignment editor

where aln2 is a copy of aln with bridging columns deleted.

See also

The API documentation is online. For more code examples see ./spec/*.rb and ./features/*.

Cite

If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475

Copyright (c) 2012 Pjotr Prins. See LICENSE.txt for further details.

Biogems.info

This exciting Ruby Biogem is published on http://biogems.info/