bio-gff3

GFF3 parser, aimed at parsing big data GFF3 to return sequences of any type, including assembled mRNA, protein and CDS sequences.

Features:

  • Take GFF3 (genome browser) information of any type, and assemble sequences, e.g. mRNA and CDS

  • Options for low memory use and caching of records

  • Support for external FASTA input files

  • Use of multi-cores (NYI)

Currently the output is a FASTA file.

You can use this plugin in two ways. First as a standalone program, second as a plugin library to BioRuby.

Note: a really fast GFF3 parser is in the works at github.com/pjotrp/biolib_hpc/tree/master/modules/gff3

Install and run gff3-fetch

After installing ruby 1.9, or later, you can use rubygems

gem install bio-gff3

Then, fetch mRNA and CDS information from GFF3 files and output to FASTA:

gff3-fetch mrna test/data/gff/test.gff3
gff3-fetch cds test/data/gff/test.gff3

Development

To use the library

require 'bio-gff3'

For coding examples see ./bin/gff3-fetch and the ./spec/*rb

You can run RSpecs with something like

rspec -I ../bioruby/lib/ spec/*.rb

(supposing you are referring a bioruby source repository)

This implementation depends on BioRuby’s basic GFF3 parser, with the possible advantage that the plugin can assemble sequences, is faster and does not consume all memory. The Gff3 specs are based on the output of the Wormbase genome browser.

See also

gff3-fetch --help

For a write-up see thebird.nl/bioruby/BioRuby_GFF3.html


Copyright © 2010,2011 Pjotr Prins <[email protected]>

Fetch and assemble GFF3 types (e.g. ORF, mRNA, CDS) + print in FASTA format. 

  gff3-fetch [options] type [filename.fa] filename.gff3

  --translate      : output as amino acid sequence 
  --validate       : validate GFF3 file by translating
  --fix            : check 3-frame translation and fix, if possible 
  --fix-wormbase   : fix 3-frame translation on ORFs named 'gene1'
  --no-assemble    : output each record as a sequence 
  --phase          : output records using phase (useful w. no-assemble CDS to AA) 

type is any valid type in the GFF3 definition. For example:

  mRNA             : assemble mRNA
  CDS              : assemble CDS 
  exon             : list all exons
  gene|ORF         : list gene ORFs 
  other            : use any type from GFF3 definition, e.g. 'Terminate'

and the following performance options:

–parser bioruby : use BioRuby GFF3 parser (slow)

  --parser line    : use GFF3 line parser (faster, default)
  --block          : parse GFF3 by block (optimistic) -- NYI
  --cache full     : load all in RAM (fast, default)
  --cache none     : do not load anything in memory (slow)
  --cache lru      : use least recently used cache (limit RAM use, fast) -- NYI
  --max-cpus num   : use num threads -- NYI
  --emboss         : use EMBOSS translation (fast) -- NYI

Where (NYI == Not Yet Implemented):

Multiple GFF3 files can be used. With external FASTA files, always the last
one before the GFF3 filename is matched.

Note that above switches are only partially implemented at this stage. Full
feature support is projected Feb. 2011.

Examples:

  Assemble mRNA and CDS information from test.gff3 (which includes sequence information)

    gff3-fetch mRNA test/data/gff/test.gff3
    gff3-fetch CDS test/data/gff/test.gff3

  Find CDS records from external FASTA file, adding phase and translate to protein sequence

    gff3-fetch --no-assemble --phase --translate CDS test/data/gff/MhA1_Contig1133.fa test/data/gff/MhA1_Contig1133.gff3

  Find mRNA from external FASTA file, without loading everything in RAM

    gff3-fetch --cache none mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3   
    gff3-fetch --cache none mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3   

  Validate GFF3 file using EMBOSS translation and validation

    gff3-fetch --cache none --validate --emboss mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3   

  Find GENEID predicted terminal exons

    gff3-fetch terminal chromosome1.fa geneid.gff3

  Fine tuning output - show errors only

    gff3-fetch mRNA test/data/gff/test.gff3 --trace ERROR

  Fine tuning outpt - show messages matching regex 

    gff3-fetch mRNA test/data/gff/test.gff3 --trace '=msg =~ /component/'

  Fine tuning output - write log messages to file.log

    gff3-fetch mRNA test/data/gff/test.gff3 --trace ERROR --logger file.log

For more information on output, see the bioruby-logger plugin.

Performance

time gff3-fetch cds m_hapla.WS217.dna.fa m_hapla.WS217.gff3 2> /dev/null > test.fa

Digesting parser:

Cache              real     user     sys  version     RAM
------------------------------------------------------------
full,bioruby       12m41    12m28    0m09 (0.8.0)
full,line          12m13    12m06    0m07 (0.8.5)
full,line,lazy     11m51    11m43    0m07 (0.8.6)     6,600M

none,bioruby      504m     477m     26m50 (0.8.0)
none,line         297m     267m     28m36 (0.8.5)       
none,line,lazy    132m     106m     26m01 (0.8.6)       650M

lru,bioruby       533m     510m     22m47 (0.8.5)
lru,line          353m     326m     26m44 (0.8.5)  1K
lru,line          305m     281m     22m30 (0.8.5) 10K
lru,line,lazy     182m     161m     21m10 (0.8.6) 10K
lru,line,lazy      75m      75m      0m17 (0.8.6) 50K   730M
------------------------------------------------------------

Block parser:

Cache              real     user     sys  gff3 version
------------------------------------------------------------
in preparation see also biolib/HPC:
  https://github.com/pjotrp/biolib_hpc/tree/master/modules/gff3
------------------------------------------------------------

where

 52M m_hapla.WS217.dna.fa
456M m_hapla.WS217.gff3

ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-linux] on an 8 CPU, 2.6 GHz (6MB cache), 16 GB RAM machine.

Cite

If you use this software, please cite 

  http://dx.doi.org/10.1093/bioinformatics/btq475

Copyright © 2010,2011 Pjotr Prins <[email protected]>