Genfrag version 0.0.0.2

by Pjotr Prins and Trevor Wennblom
http://genfrag.rubyforge.org
http://rubyforge.org/projects/genfrag/
http://github.com/trevor/genfrag/

DESCRIPTION:

This is a development release. Some features are functional at this time.

Genfrag allows for rapid in-silico searching of fragments cut by different restriction enzymes in large nucleotide acid databases, followed by matching specificity adapters which allow a further data reduction when looking for differential expression of genes and markers.

USAGE:

This works

genfrag index -f example.fasta --re5 BstYI --re3 MseI
genfrag search -f example.fasta --re5 BstYI --re3 MseI --adapter5 tt

REQUIREMENTS:

  • bio

Optional

  • sqlite3-ruby

INSTALL:

  • sudo gem install genfrag

EXAMPLES:

Index command

Search command

Example 1

Return all sequences from the file ‘example.fasta.tdf’ that are referenced by the index ‘example.fasta_bstyi_msei_index.tdf’

genfrag search -f example.fasta --re5 BstYI --re3 MseI -v

Only one entry from output is shown below.

---
- sequence
  gattgcaacaatcgctttggaggatgtaattgtgcaattggccaatgcacaaatcgacaatgtccttgttttgctgctaatcgtgaatgcgatccagatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttttaattggggtgcatttacatgggactctcttaaaaagaatgagtatctcggagaatatactggagaactgatcactcatgatgaagctaatgagcgtgggagaatagaagatcggattggttcttcctacctctttaccttgaatgatca
- sequence size
  380
- fragment - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat
- fragment with adapters - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

The first cut is made using RE5 (restriction enzyme with first match in reference to 5’) BstYI. BstYI has the cut patten

5' - r^g a t c y - 3'
3' - y c t a g^r - 5'

The first 96bp of the sequence are removed when BstYI makes its cut, starting the strand fragment. The primary strand fragment begins with ‘gatctttgtc’, four bases are lost from the complement strand due to the cut pattern of BstYI, therefore ‘gatc’ from the primary strand has no hydrogen bonds with the complement strand. These missing nucleotides are represented with a period (‘.’).

The second cut is made using RE3 (restriction enzyme with first match in reference to 3’) MseI. MseI has the cut pattern

5' - t^t a a - 3'
3' - a a t^t - 5'

This leaves a final fragment of 136bp. The way MseI cuts will leave the complement strand two nucleotides longer than the primary strand. This is represented on the primary stand with two periods.

Example 2

This demonstrates using an adapter.

genfrag search -f example.fasta --re5 BstYI --re3 MseI -v --adapter5 t

---
- sequence
  gattgcaacaatcgctttggaggatgtaattgtgcaattggccaatgcacaaatcgacaatgtccttgttttgctgctaatcgtgaatgcgatccagatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttttaattggggtgcatttacatgggactctcttaaaaagaatgagtatctcggagaatatactggagaactgatcactcatgatgaagctaatgagcgtgggagaatagaagatcggattggttcttcctacctctttaccttgaatgatca
- sequence size
  380
- fragment - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat
- fragment with adapters - primary strand
  +++++ttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

The adapter can be considered an extension to the restriction enzyme. When searching for a specified adapter, anything that the restriction enzyme would need to make its match is first ignored before comparing the adapter to the sequence.

It was shown previously that BstYI has the cut patten

5' - r^g a t c y - 3'
3' - y c t a g^r - 5'

The ‘y’ symbol indicates a nucleotide of ‘t’ or ‘c’.[Footnote 1] Adapter5 is defined as the nucleotide ‘t’ in this example. 5 nucleotides from the restriction enzyme are matched (‘gatct’) as indicated by the plus (‘+’) symbols, then the 1 nucleotide from the adapter is matched (‘t’).

Note that in this current version of Genfrag only the primary strand has the plus symbols applied. In a future version the complement strand would have a plus symbol in place of the initial ‘a’.

Example 3

The previous example with a longer adapter.

genfrag search -f example.fasta --re5 BstYI --re3 MseI -v --adapter5 ttgtcg

---
- sequence
  gattgcaacaatcgctttggaggatgtaattgtgcaattggccaatgcacaaatcgacaatgtccttgttttgctgctaatcgtgaatgcgatccagatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttttaattggggtgcatttacatgggactctcttaaaaagaatgagtatctcggagaatatactggagaactgatcactcatgatgaagctaatgagcgtgggagaatagaagatcggattggttcttcctacctctttaccttgaatgatca
- sequence size
  380
- fragment - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat
- fragment with adapters - primary strand
  +++++ttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

Example 4

This demonstrates Adapter3.

genfrag search -f example.fasta --re5 BstYI --re3 MseI -v --adapter3 aacca

---
- sequence
  gattgcaacaatcgctttggaggatgtaattgtgcaattggccaatgcacaaatcgacaatgtccttgttttgctgctaatcgtgaatgcgatccagatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttttaattggggtgcatttacatgggactctcttaaaaagaatgagtatctcggagaatatactggagaactgatcactcatgatgaagctaatgagcgtgggagaatagaagatcggattggttcttcctacctctttaccttgaatgatca
- sequence size
  380
- fragment - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat
- fragment with adapters - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggtt+..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

It was shown previously that MseI has the cut patten

5' - t^t a a - 3'
3' - a a t^t - 5'

Looking at primary strand fragment, the ending nucleotide remaining that has also been used by the restriction enzyme to match is ‘t’. When the Adapter3 filter is made, the restriction enzyme match will replace the ‘t’ with a plus symbol.

An end of the primary strand is

5' - atggattcatggtt+.. - 3'

If that end is reversed and complemented, ‘aaca’ is the initial four nucleotides that match.

Note that in this current version of Genfrag only the primary strand has the plus symbols applied. In a future version the complement strand would have a plus symbol in place of the final ‘aat’.

Example 5

The previous example with Adapter3 using alternate notation.

genfrag search -f example.fasta --re5 BstYI --re3 MseI -v --adapter3 _tggtt

---
- sequence
  gattgcaacaatcgctttggaggatgtaattgtgcaattggccaatgcacaaatcgacaatgtccttgttttgctgctaatcgtgaatgcgatccagatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttttaattggggtgcatttacatgggactctcttaaaaagaatgagtatctcggagaatatactggagaactgatcactcatgatgaagctaatgagcgtgggagaatagaagatcggattggttcttcctacctctttaccttgaatgatca
- sequence size
  380
- fragment - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat
- fragment with adapters - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggtt+..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

If Adapter3 is supplied with an initial underscore (‘_’) in the sequence, the sequence is matched directly without a reverse complement.

Example 6

Using two adapters together.

genfrag search -f example.fasta --re5 BstYI --re3 MseI -v --adapter5 ttgtcg --adapter3 aacca

---
- fragment with adapters - primary strand
  +++++ttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggtt+..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

Note that in this current version of Genfrag only the primary strand has the plus symbols applied. In a future version the complement strand would have a plus symbol in place of the initial ‘a’ and the final ‘aat’.

Example 7

Using an adapter and specifying an adapter sequence.

You may have particular adapter sequences that you have used. These can be specified with ‘adapter5-sequence’ or ‘adapter3-sequence’. Note that ‘adapter3-sequence’ will be reversed when applied to the primary strand. Any change to the sequence caused by the adapter sequence will be noted with an equals (‘=’) symbol.

genfrag search -f example.fasta --re5 BstYI --re3 MseI -v --adapter5 ttgtcg --adapter3 aacca --adapter5-sequence NXNXNXNX --adapter3-sequence NZNZNZNZ

---
- sequence
  gattgcaacaatcgctttggaggatgtaattgtgcaattggccaatgcacaaatcgacaatgtccttgttttgctgctaatcgtgaatgcgatccagatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttttaattggggtgcatttacatgggactctcttaaaaagaatgagtatctcggagaatatactggagaactgatcactcatgatgaagctaatgagcgtgggagaatagaagatcggattggttcttcctacctctttaccttgaatgatca
- sequence size
  380
- fragment - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat
- fragment with adapters - primary strand
  NXNXNXNXttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttZNZNZNZN
- fragment with adapters - complement strand
  ===....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat=====

Example 8

The previous example but with short adapter sequences.

genfrag search -f example.fasta --re5 BstYI --re3 MseI --adapter5 ttgtcg --adapter3 aacca --adapter5-sequence X --adapter3-sequence Z

---
- fragment with adapters - primary strand
  ====XttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttZ==
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

Example 9

Using an adapter and specifying an adapter size, these can be specified with ‘adapter5-size’ or ‘adapter3-size’.

You may know the size of your adapter, but not have a particular sequence in mind. The unknown nucleotides will be represented with a question mark character (‘?’).

genfrag search -f example.fasta --re5 BstYI --re3 MseI --adapter5 ttgtcg --adapter3 aacca --adapter5-size 6 --adapter3-size 8

---
- fragment with adapters - primary strand
  ????????ttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggtt??????
- fragment with adapters - complement strand
  ===....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat===

Example 10

The previous example but with short adapter sizes.

  genfrag search -f example.fasta --re5 BstYI --re3 MseI --adapter5 ttgtcg --adapter3 aacca --adapter5-size 1 --adapter3-size 2

---
- fragment with adapters - primary strand
  ====?ttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggtt??=
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

Example 11

An exact fragmentation length can be searched for with the ‘seqsize’ argument.

genfrag search -f example.fasta --re5 BstYI --re3 MseI -s 136

---
- fragment with adapters - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

Example 12

The previous example with multiple fragment result sizes accepted. Different sizes can be separated by commas.

genfrag search -f example.fasta --re5 BstYI --re3 MseI -s 136,166

---
- fragment with adapters - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaattcctccttcaaaccaataaaaagattctcattggaaagtctgatgttcatggatggggtgcatttacatgggactctct..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttaaggaggaagtttggttatttttctaagagtaacctttcagactacaagtacctaccccacgtaaatgtaccctgagagaat
---
- fragment with adapters - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

Example 13

The previous example with a sequence size range accepted. Since you may only have an approximate idea of the fragment size you are expecting, you may give a range by using a plus symbol (‘+’) to indicate a tolerance to a size.

genfrag search -f example.fasta --re5 BstYI --re3 MseI -s 144+10,166

---
- fragment with adapters - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaattcctccttcaaaccaataaaaagattctcattggaaagtctgatgttcatggatggggtgcatttacatgggactctct..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttaaggaggaagtttggttatttttctaagagtaacctttcagactacaagtacctaccccacgtaaatgtaccctgagagaat
---
- fragment with adapters - primary strand
  gatctttgtcggagttgtcctcttagctgtggagatggcactcttggtgagacaccagtgcaaatccaatgcaagaacatgcaataataaaaagattctcattggaaagtctgatgttcatggattcatggttt..
- fragment with adapters - complement strand
  ....aaacagcctcaacaggagaatcgacacctctaccgtgagaaccactctgtggtcacgtttaggttacgttcttgtacgttattatttttctaagagtaacctttcagactacaagtacctaagtaccaaaat

Footnotes

[1]:

require 'rubygems'
require 'bio'
puts Bio::Sequence::NA.new('y').to_re   # => (?-mix:[tcy])

LICENSE:

Copyright © 2009 Pjotr Prins and Trevor Wennblom

Distributed under the same terms as the Ruby License - see LICENSE.txt