bio-phyta

PhyTA is a BioRuby program specifically designed for identifying and removing contaminant sequences from other (non-target) species in Expressed Sequence Tag (EST) data. PhyTA assigns a higher taxonomic rank to EST sequences based on their BLAST annotation, performs taxonomy-based sequence sorting and constructs a contamination-free sub-library. It consists of the following tools:

phyta-assign

Is in charge of the higher taxonomic rank sequence annotation.

phyta-split

Identifies putative contaminant sequences based on the higher taxonomic rank annotation and user-specified criteria.

phyta-extract

Constructs two sub-libraries: a “clean” sub-library that consists of annotated sequences from the target species and a “contaminant” one that includes putative contaminant sequences.

phyta-setup-taxonomy-db

Facilitates setting up a local copy of the NCBI taxonomy database.

A detailed description of each tool’s function is provided below.

All PhyTA scripts are written in Ruby 1.8.7 and are delivered as a Ruby gem. PhyTA has been tested with MRI 1.8.7 and 1.9.2.

To install PhyTA simply type:

gem install bio-phyta

PhyTA requires Ruby 1.8.7 or higher and a MySQL database. See the “Installation” section for more information.

phyta-assign

phyta-assign parses the NCBI BLASTplus XML format output, assigns a higher taxonomic rank to ESTs based on the BLAST annotation and stores attributes of the BLASTplus hits and the taxonomy assignments in tabular form as a CSV file. To generate input for phyta-assign, a large set of query sequences is compared to an NCBI database using the standard stand-alone or network-client BLASTplus programs.

An example of a BLASTplus command for generating input for phyta-assign is:

blastx -query Corticium_candelabrum.fasta -db BLASTDB/nr -evalue 0.0001 -max_target_seqs 3 -out Corticium_candelabrum_blast5.xml -outfmt 5

where the blast5.xml file can be used as an input for phyta-assign.

The output of phyta-assign contains:

  1. Query sequence ID, followed by this information for each of its three top BLAST hits:

  2. accession number

  3. subject gi number (sgi)

  4. e-value

  5. species name

  6. subject annotation

  7. subject score

  8. higher rank (e.g. Kingdom) taxonomy information

An example output file could look like this:

AW3C1;ACR38454;238014838;3.34982954962278e-19;Zea mays;unknown;78.5665508561758;Viridiplantae
AW3C1;XP_002489117;253761439;1.33094019753946e-18;Sorghum bicolor;hypothetical protein SORBIDRAFT_0057s002150;76.6405529765891;Viridiplantae
AW3C1;XP_002488963;253760039;1.23820662046332e-15;Sorghum bicolor;hypothetical protein SORBIDRAFT_1150s002010;66.6253640027379;Viridiplantae
AW5C3;XP_001629010;156372369;1.85315736381546e-09;Nematostella vectensis;predicted protein;66.2401644268205;Metazoa

As you can see, the first three entries are the three best BLAST hits for the query sequence AW3C1. They are all assigned to the Kingdom Viridiplantae. The second query sequence, AW5C3, has only one hit, from the species Nematostella vectensis, which is assigned to the Kingdom Metazoa.
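The table layout described above can be read with a few lines of Ruby. The following is a minimal sketch and not part of PhyTA: the Hit struct, its field names and the parse_hits helper are hypothetical names chosen for illustration. It assumes that fields are separated by semicolons, as in the example, and that no field itself contains a semicolon.

```ruby
# Rows copied from the example output above.
SAMPLE = <<ROWS
AW3C1;ACR38454;238014838;3.34982954962278e-19;Zea mays;unknown;78.5665508561758;Viridiplantae
AW3C1;XP_002489117;253761439;1.33094019753946e-18;Sorghum bicolor;hypothetical protein SORBIDRAFT_0057s002150;76.6405529765891;Viridiplantae
AW3C1;XP_002488963;253760039;1.23820662046332e-15;Sorghum bicolor;hypothetical protein SORBIDRAFT_1150s002010;66.6253640027379;Viridiplantae
AW5C3;XP_001629010;156372369;1.85315736381546e-09;Nematostella vectensis;predicted protein;66.2401644268205;Metazoa
ROWS

# Hypothetical field names, one per column of the table.
Hit = Struct.new(:query_id, :accession, :gi, :evalue,
                 :species, :annotation, :score, :taxon)

# Turn the text of a phyta-assign table into a list of Hit records.
def parse_hits(text)
  lines = text.split("\n").map { |line| line.strip }
  lines.reject { |line| line.empty? }.map do |line|
    f = line.split(';', -1)
    Hit.new(f[0], f[1], f[2], Float(f[3]), f[4], f[5], Float(f[6]), f[7])
  end
end

# Group the rows by query sequence and list the assigned taxa per query.
hits_by_query = parse_hits(SAMPLE).group_by { |hit| hit.query_id }
hits_by_query.each do |query_id, hits|
  puts "#{query_id}: #{hits.map { |hit| hit.taxon }.join(', ')}"
end
```

Grouping by the first column collects the (up to three) hits that belong to one query sequence, which is the unit that phyta-split later decides on.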

The higher rank taxonomy is assigned based on species name acquired from the hit gi number and NCBI taxonomy information. The default list of the higher rank taxonomic groups used by phyta-assign is provided in the built-in Taxonomy filter I. The default taxonomy list and instructions for creating a custom filter are provided below in the section “Custom filters”.

Usage

Phyta-assign takes the following command line arguments:

--input-file, -i

The output of the BLASTplus alignment in XML format

--output-file, -o

The name of the output table in CSV format

--database-server, -d

Optional: The address of the MySQL database server (default: localhost)

--database-user, -u

Optional: The name of the database user (default: root)

--database-password, -p

Optional: The password of the database user (default: no password)

--database-name, -n

Optional: The name of the NCBI taxonomy database (default: kingdom_assignment_taxonomy)

--filter, -f

A file in YAML format containing a list of the higher rank taxonomic groups. The default filter information and instructions for creating your own filters can be found in the section “Custom filters”.

--help, -h

Show a help message

Here is an example of how phyta-assign is used from the command line:

phyta-assign -i Corticium_candelabrum_blast5.xml -o Corticium_candelabrum_blast5_annotated.csv -d localhost -u root -p password -n kingdom_assignment_taxonomy -f default_filter.yaml

phyta-split

Phyta-split takes the CSV file generated by phyta-assign as input, performs taxonomy-based sorting of the annotated ESTs and outputs two new files in CSV format. One file contains annotations for all ESTs that are deemed to belong to the target species. The second file contains annotations for those sequences whose top three hits all come from taxa defined as contaminants by the phyta-split taxonomy filter.

Usage

--input-file, -i

The output of phyta-assign in CSV format

--output-clean, -c

The name of the clean output table in CSV format (default: [name_of_input_file]_clean.csv)

--output-contaminated, -d

The name of the contaminated output table in CSV format (default: [name_of_input_file]_contaminated.csv)

--filter, -f

A file in YAML format containing a list of taxa to be considered contaminants (default: the built-in filter capturing Bacteria, Archaea, Viruses and NONE, i.e. unidentified species)

--help, -h

Show a help message

Here is an example of how phyta-split can be used from the command line. Note that no custom filter is used, so only the taxa “Bacteria”, “Archaea”, “Viruses” and “NONE” will be considered contaminants.

phyta-split -i Corticium_candelabrum_blast5_annotated.csv -c Corticium_candelabrum_clean.csv -d Corticium_candelabrum_contaminated.csv

Rules

Sequences are included into the “clean” target-species sub-library annotation when at least one of their three top BLAST hits does not match any taxa in the phyta-split contamination filter. The default filter provided with the program contains the following taxonomic groups: Bacteria, Archaea, Viruses and NONE, which represents unknown sequences.
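The rule above can be written down in a few lines of Ruby. This is only a sketch of the rule as stated in this section, not the code phyta-split actually runs; the names DEFAULT_CONTAMINANTS and clean? are hypothetical.

```ruby
# Taxa named by the default phyta-split filter, as listed above.
DEFAULT_CONTAMINANTS = ['Bacteria', 'Archaea', 'Viruses', 'NONE']

# hit_taxa: the higher-rank taxa of a query's top BLAST hits (up to three).
# Returns true when at least one hit falls outside the contaminant list.
def clean?(hit_taxa, contaminants = DEFAULT_CONTAMINANTS)
  hit_taxa.any? { |taxon| !contaminants.include?(taxon) }
end

# One non-contaminant hit is enough to keep the sequence.
puts clean?(['Bacteria', 'Metazoa', 'Bacteria'])
# All hits match the filter, so the sequence is a putative contaminant.
puts clean?(['Bacteria', 'Archaea', 'NONE'])
```

The rule is deliberately conservative: a sequence is only set aside as a contaminant when none of its top hits points to a taxon outside the filter.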

Custom filters

Custom filters for phyta-assign and phyta-split can be provided in YAML format. This file can be passed to the corresponding tools as a command line parameter. In order to write a custom filter, it is not necessary to learn the YAML syntax. Here’s what an example filter looks like:

# Filter file for PhyTA 0.9
--- 
- Bacteria
- Archaea
- Viridiplantae
- Rhodophyta
- Glaucocystophyceae
- Alveolata
- Cryptophyta
- stramenopiles
- Amoebozoa
- Apusozoa
- Euglenozoa
- Fornicata
- Haptophyceae
- Heterolobosea
- Jakobida
- Katablepharidophyta
- Malawimonadidae
- Nucleariidae
- Oxymonadida
- Parabasalia
- Rhizaria
- unclassified eukaryotes
- Fungi
- Metazoa
- Choanoflagellida
- Opisthokonta incertae sedis
- Viruses

The line starting with a # is a comment and is entirely optional. Just copy the text above into a new plain text file and modify it to your liking. Just make sure to copy the line with the three dashes over as well. You can also download this filter directly from github.com/PalMuc/bio-phyta/raw/master/misc/default_filter.yaml (right click the link and select “save as”).
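Before passing a custom filter to one of the tools, it can be useful to check that the file really is a plain list of taxon names. The following sketch uses Ruby’s standard yaml library for that; the load_filter helper and the shortened filter text are illustrative assumptions and not part of PhyTA.

```ruby
require 'yaml'

# A shortened filter in the same format as the example above.
FILTER_TEXT = <<FILTER
# Custom filter (shortened for illustration)
---
- Bacteria
- Archaea
- Viruses
FILTER

# Parse the filter and check that it is a flat list of taxon names.
def load_filter(text)
  taxa = YAML.load(text)
  unless taxa.is_a?(Array) && taxa.all? { |taxon| taxon.is_a?(String) }
    raise ArgumentError, 'filter must be a YAML list of taxon names'
  end
  taxa
end

puts load_filter(FILTER_TEXT).inspect
```

To check a file on disk, read it first, for example with load_filter(File.read('my_filter.yaml')), where my_filter.yaml is a placeholder name.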

phyta-extract

Constructs two sub-libraries: a “clean” sub-library that consists of annotated sequences from the target species and a “contaminant” one that includes putative contaminant sequences.

The output files will be written in FASTA format.
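Conceptually, this step selects FASTA records by the query IDs listed in the two CSV tables. The sketch below illustrates the idea in plain Ruby on in-memory strings. It is not the implementation used by phyta-extract: all helper names are hypothetical, the sequences and the EXAMPLE1 row are invented for illustration, and it assumes that the first word of each FASTA header equals the query ID in the first CSV column.

```ruby
# Collect the distinct query IDs (first column) of a semicolon-separated table.
def query_ids(csv_text)
  rows = csv_text.split("\n").reject { |row| row.strip.empty? }
  rows.map { |row| row.split(';').first }.uniq
end

# Split FASTA text into [id, record] pairs; the id is the first word after '>'.
def fasta_records(fasta_text)
  chunks = fasta_text.split(/^>/).reject { |chunk| chunk.strip.empty? }
  chunks.map do |chunk|
    header = chunk.split("\n").first.to_s
    [header.split.first, '>' + chunk]
  end
end

# Return the clean and the contaminated sub-library as FASTA text.
def extract(fasta_text, clean_csv, contaminated_csv)
  clean_ids = query_ids(clean_csv)
  contaminated_ids = query_ids(contaminated_csv)
  clean = []
  contaminated = []
  fasta_records(fasta_text).each do |id, record|
    clean << record if clean_ids.include?(id)
    contaminated << record if contaminated_ids.include?(id)
  end
  [clean.join, contaminated.join]
end

# Invented toy input; only the IDs AW3C1 and AW5C3 come from the example above.
fasta = ">AW3C1\nATGGCC\n>AW5C3 sample\nTTGACA\n>EXAMPLE1\nGGGCCC\n"
clean_csv = "AW3C1;ACR38454;238014838\nAW5C3;XP_001629010;156372369\n"
contaminated_csv = "EXAMPLE1;placeholder;0\n"

clean_fasta, contaminated_fasta = extract(fasta, clean_csv, contaminated_csv)
puts clean_fasta
puts contaminated_fasta
```

Sequences that appear in neither table are simply left out in this sketch; how phyta-extract treats them is not covered here.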

Usage

--fasta, -f

The file containing the sequences in FASTA format

--input-clean, -c

The name of the clean sequence table in CSV format

--input-contaminated, -d

The name of the contaminated sequence table in CSV format

--output-clean, -o

The name of the FASTA file where clean sequences will be written to

--output-contaminated, -p

The name of the FASTA file where contaminated sequences will be written to

--help, -h

Show a help message

Here is an example of how phyta-extract can be used from the command line.

phyta-extract -f Corticium_candelabrum.fasta -c Corticium_candelabrum_clean.csv -d Corticium_candelabrum_contaminated.csv -o Corticium_candelabrum_clean.fasta -p Corticium_candelabrum_contaminated.fasta
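The three tools are meant to be run one after the other, with each step consuming the files written by the previous one. The following Ruby sketch assembles the three command lines from a common file name prefix, using only the options shown in the examples of this document. The phyta_pipeline helper and its file naming scheme are assumptions made for illustration; the sketch also assumes that phyta-assign can rely on its default database settings and built-in filter when the corresponding options are omitted. Nothing is executed, the commands are only printed.

```ruby
require 'shellwords'

# Build the argument lists for the three steps. Nothing is executed here.
def phyta_pipeline(blast_xml, fasta, prefix)
  annotated  = "#{prefix}_annotated.csv"
  clean_csv  = "#{prefix}_clean.csv"
  contam_csv = "#{prefix}_contaminated.csv"
  [
    ['phyta-assign', '-i', blast_xml, '-o', annotated],
    ['phyta-split', '-i', annotated, '-c', clean_csv, '-d', contam_csv],
    ['phyta-extract', '-f', fasta, '-c', clean_csv, '-d', contam_csv,
     '-o', "#{prefix}_clean.fasta", '-p', "#{prefix}_contaminated.fasta"]
  ]
end

commands = phyta_pipeline('Corticium_candelabrum_blast5.xml',
                          'Corticium_candelabrum.fasta',
                          'Corticium_candelabrum')

# Print each step as a shell-safe command line.
commands.each { |argv| puts Shellwords.join(argv) }
```

To actually run the steps, each argument list could be passed to Kernel#system, for example system(*argv), stopping at the first step that fails.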

Installation

Prerequisites

In order to install this gem you need to have several programs installed:

  • Ruby, either version 1.8.7 or 1.9.2. Unfortunately, JRuby is not supported at the moment.

  • Git

  • cURL

  • MySQL

In the following, the installation procedure is given for Mac OS X and Ubuntu Linux 10.10. The commands for Ubuntu have also been tested on Debian Squeeze, although there you should use aptitude instead of apt-get.

Please note that in order to use the sudo command, your user account must be allowed to acquire root user privileges. If this is not the case, please ask your administrator.

Installing Git

An installer for Mac OS X can be obtained from the official website (git-scm.com/). For any Linux distribution it is recommended that you use your system’s package manager to install Git. Look for a package called git or git-core. For Ubuntu 10.10 the command is:

sudo apt-get install git

Installing cURL

Mac OS X comes with cURL by default. On a Linux system, cURL can be obtained via the system’s package manager. For Ubuntu 10.10 the command is:

sudo apt-get install curl

Installing Ruby

You can find out what version of Ruby comes with your system by typing the following from the command line:

ruby -v

The output of that command looks like this:

ruby 1.8.7 (2011-06-30 patchlevel 352) [i686-darwin10.8.0]

If you have ruby 1.8.7 or higher, you’re all set. If Ruby is not available on your system or if you have an older version, you should install the most recent version of Ruby.

The easiest way to install the most recent version of Ruby is via the Ruby Version Manager (RVM) by Wayne E. Seguin.

Before you install RVM, make sure you have git and curl installed on your system.

RVM can be installed by calling:

bash < <(curl -s https://raw.github.com/wayneeseguin/rvm/master/binscripts/rvm-installer)

This will install RVM to .rvm in your home folder and print several instructions specific to your platform on how to finish the installation. Please pay close attention to the “dependencies” section and look for the part where it says something like this:

# For Ruby (MRI & ree)  you should install the following OS dependencies:
ruby: /usr/bin/apt-get install build-essential bison openssl libreadline6 libreadline6-dev curl git-core zlib1g zlib1g-dev libssl-dev libyaml-dev libsqlite3-0 libsqlite3-dev sqlite3 libxml2-dev libxslt-dev autoconf libc6-dev ncurses-dev

It is advisable that you install all of these prerequisites. Please do not copy the commands from this file; use the ones printed by the RVM installer on your system instead. If installing any of these packages gives you an error, consider updating your packages by using your system’s update manager.

Next, you have to make sure that RVM is loaded when you start a new shell. Look for the part where it says: “You must now complete the install by loading RVM in new shells.”

On Ubuntu 10.10 you can edit your .bashrc by calling:

gedit .bashrc

On Mac OS X, you can type:

open -a TextEdit .bash_profile

At the very end of this file add the following line:

[[ -s "$HOME/.rvm/scripts/rvm" ]] && source "$HOME/.rvm/scripts/rvm"  # This loads RVM into a shell session.

Now save the file, close your editor and close your shell. Start a new shell and type:

type rvm | head -1

If you see something like “rvm is a function” the installation was successful. If you run into problems, read the documentation.

The following command is not part of the installation procedure!

You can always delete RVM and start from scratch by typing:

rvm implode

Please note that this will delete all versions of Ruby you installed with RVM as well as all of the gems you installed. It will not reverse the changes you made to your shell’s load configuration.

Now you can install Ruby by calling:

rvm install 1.9.2

Please note that everything RVM installs is placed in the folder .rvm in your home directory. Therefore, it is not necessary to use sudo when calling rvm.

In order to use Ruby 1.9.2 instead of your system’s Ruby version you must type

rvm use 1.9.2

every time you open a new shell. You can check which version you are currently using with:

ruby -v

If you want to switch back to the version of Ruby that came with your system, type:

rvm use system

In order to use Ruby 1.9.2 as the default Ruby implementation on your system you can type:

rvm --default use 1.9.2

Now Ruby 1.9.2 will be called when you type ruby in a new shell.

Installing MySQL

PhyTA uses a MySQL database in order to store information from the NCBI taxonomy database efficiently.

The database does not have to be hosted on the system that is running PhyTA, but it is advantageous for performance reasons.

The correct installation procedure for MySQL varies widely among different platforms. For many systems (like Mac OS X) binaries can be obtained from the official website. In the following, the setup under Ubuntu 10.10 is explained.

sudo apt-get install mysql-server libmysqlclient-dev

That should conclude the database setup under Ubuntu.

On Mac OS X, you can install the MySQL preference pane and start the server from there. The MySQL binaries are at /usr/local/mysql/bin/. In order to be able to execute the following examples without having to prefix this path every time, you can add aliases to your bash configuration:

open -a TextEdit .bash_profile

Now add the following lines at the end:

alias mysql=/usr/local/mysql/bin/mysql
alias mysqladmin=/usr/local/mysql/bin/mysqladmin
export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:/usr/local/mysql/lib/"

Refer to the ReadMe file that comes with the MySQL installer if you are using tcsh instead of bash.

Starting the database

Usually the MySQL setup creates an administrator account named “root” with an empty password. If your administrator name is different or you have set a password, you must adjust the commands in the next section accordingly. You can now start MySQL by typing

sudo service mysql start

phyta-setup-taxonomy-db

First, you need to set up an empty database for the NCBI taxonomy data. This can be achieved by typing:

mysql -u root -ppassword -e "CREATE DATABASE kingdom_assignment_taxonomy"

In this example, replace root with your MySQL username, password with your password and kingdom_assignment_taxonomy with the database name of your choice.

Please note the lack of a space between the parameter p and the password. Leave this parameter out if your database does not have a password.

After that, the program phyta-setup-taxonomy-db will help you set up the NCBI taxonomy database. Its command line options are the following.

--database-server, -d

Optional: The address of the MySQL database server (default: localhost)

--database-user, -u

Optional: The name of the database user (default: root)

--database-password, -p

Optional: The password of the database user (default: no password)

--database-name, -n

Optional: The name of the NCBI taxonomy database (default: kingdom_assignment_taxonomy)

--help, -h

Show a help message

Here is an example command consistent with the example above:

phyta-setup-taxonomy-db -d localhost -u root -p password -n kingdom_assignment_taxonomy

Phyta-setup-taxonomy-db will now download the NCBI taxonomy dump files and load them into your MySQL database. This might take a while.

Contributing to bio-phyta

  • Check out the latest master to make sure the feature hasn’t been implemented or the bug hasn’t been fixed yet

  • Check out the issue tracker to make sure someone already hasn’t requested it and/or contributed it

  • Fork the project

  • Start a feature/bugfix branch

  • Commit and push until you are happy with your contribution

  • Make sure to add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.

Copyright © 2011 Philipp Comans.

The MySQL schema used in phyta-setup-taxonomy-db and phyta-assign has been developed by Matthew Horton of the Department of Ecology and Evolution of the Division of Biological Sciences at the University of Chicago and is available at bergelson.uchicago.edu/Members/mhorton/taxonomydb.build .

See LICENSE.txt for further details.

Acknowledgements

Development of this program was supported by the Molecular Geo- and Palaeobiology Lab of the Department of Earth and Environmental Sciences and the initiative “Gleichstellung in Forschung und Lehre” of the Ludwig-Maximilians-University Munich (LMU).