ViralSeq


A Ruby Gem containing bioinformatics tools for processing viral NGS data.

Specifically for Primer ID sequencing and HIV drug resistance analysis.

The gem includes the CLI tools tcs, tcs_sdrm, tcs_log and locator.

tcs web app - https://primer-id.org/

Illustration for the Primer ID Sequencing

[Figure: Primer ID Sequencing]

Reference readings on Primer ID sequencing

Explanation of Primer ID sequencing
Primer ID MiSeq protocol
Application of Primer ID sequencing in COVID-19 research

Requirements

Required Ruby Version: >= 2.5

Required RubyGems version: >= 1.3.6

Install

    $ gem install viral_seq

Usage

Executables

`tcs`

Use the tcs executable to run the pipeline that processes Primer ID MiSeq sequencing data.

Web-based tcs analysis can be accessed at https://primer-id.org/

Example commands:

    $ tcs -p params.json # run TCS pipeline with params.json
    $ tcs -p params.json -i DIRECTORY
    # run TCS pipeline with params.json and DIRECTORY
    # if DIRECTORY is not defined in params.json
    $ tcs -dr -i DIRECTORY
    # run tcs-dr (MPID HIV drug resistance sequencing) pipeline
    # DIRECTORY needs to be given.
    $ tcs -j # CLI to generate params.json
    $ tcs -h # print out the help

sample params.json for the tcs-dr pipeline

`tcs_log`

Use the tcs_log script to pool run logs and TCS fasta files after one batch of tcs jobs. This command also generates log.html to visualize the sequencing runs.

Example file structure:

batch_tcs_jobs/  
      ├── lib1  
      ├── lib2  
      ├── lib3  
      ├── lib4  
      ├── ...

Example command:

    $ tcs_log batch_tcs_jobs

`tcs_sdrm`

Use the tcs_sdrm pipeline to analyze HIV-1 drug resistance mutations and recency.

Example command:

    $ tcs_sdrm libs_dir

libs_dir file structure:

libs_dir/
├── lib1
│   ├── lib1_RT
│   ├── lib1_PR
│   ├── lib1_IN
│   └── lib1_V1V3
├── lib2
│   ├── lib2_RT
│   ├── lib2_PR
│   ├── lib2_IN
│   └── lib2_V1V3
├── ...

Output data are written to a new directory named 'libs_dir_SDRM'.

Note: R and the following R libraries are required:

phangorn
ape
scales
ggforce
cowplot
magrittr
gridExtra

`locator`

Use the locator executable in a terminal to get the coordinates of the sequences in a FASTA file on the HIV/SIV reference genome.

    $ locator -i sequence.fasta -o sequence.fasta.csv

Some Examples

Load all ViralSeq classes by requiring 'viral_seq.rb' in your Ruby scripts.

    #!/usr/bin/env ruby
    require 'viral_seq'

Load nucleotide sequences from a FASTA format sequence file

    my_seqhash = ViralSeq::SeqHash.fa('my_seq_file.fasta')
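To make the input format concrete, here is a minimal pure-Ruby sketch of reading FASTA text into a name-to-sequence Hash. It is an independent illustration, not the gem's parser; the helper name `parse_fasta` and the toy records are assumptions made for this example.

```ruby
# Parse FASTA-formatted text into a Hash of { "name" => "SEQUENCE" }.
# Hypothetical helper for illustration; ViralSeq::SeqHash.fa may behave differently.
def parse_fasta(text)
  records = {}
  name = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty?                # skip blank lines
    if line.start_with?('>')
      name = line[1..-1]               # header line without the leading '>'
      records[name] = +''              # start a new, mutable sequence string
    elsif name
      records[name] << line.upcase     # sequences may span several lines
    end
  end
  records
end

fasta_text = <<~FASTA
  >seq1
  ACGT
  acgt
  >seq2
  TTGA
FASTA

records = parse_fasta(fasta_text)
records.each { |name, seq| puts "#{name}\t#{seq.length} nt" }
```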

Make an alignment (using MUSCLE)

    aligned_seqhash = my_seqhash.align

Filter nucleotide sequences with the reference coordinates (HIV Protease)

    qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
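As a quick sanity check on the coordinates used above, the following sketch works out the size of the region, assuming the two numbers are 1-based, inclusive HXB2 positions. The variable names are chosen for this example only.

```ruby
# HXB2 coordinates from the example above (assumed 1-based, inclusive).
pr_start = 2253
pr_end   = 2549

pr_length_nt = pr_end - pr_start + 1   # number of nucleotides in the range
pr_codons    = pr_length_nt / 3        # number of complete codons

puts "PR region: #{pr_length_nt} nt, #{pr_codons} codons, " \
     "in frame: #{(pr_length_nt % 3).zero?}"
```

The result, 297 nt or 99 codons, matches the length of the HIV-1 protease.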

Further filter out sequences with APOBEC3G/F hypermutations

    qc_seqhash = qc_seqhash.a3g[:filtered_seq]

Calculate nucleotide diversity π

    qc_seqhash.pi
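For readers unfamiliar with π, the sketch below computes it in its textbook form: the mean proportion of differing sites over all pairs of aligned sequences. This is a plain-Ruby illustration under that assumption and may not match how ViralSeq::SeqHash#pi handles gaps or ambiguous bases; the helper names and toy alignment are hypothetical.

```ruby
# Proportion of differing positions between two aligned, equal-length sequences.
def p_distance(a, b)
  raise ArgumentError, 'sequences must have equal length' unless a.length == b.length
  diffs = a.chars.zip(b.chars).count { |x, y| x != y }
  diffs.to_f / a.length
end

# Mean pairwise distance over all unordered pairs of sequences.
def pairwise_diversity(seqs)
  pairs = seqs.combination(2).to_a
  return 0.0 if pairs.empty?
  pairs.sum { |a, b| p_distance(a, b) } / pairs.size
end

# Toy alignment: four sequences of eight positions each.
aligned = %w[ACGTACGT ACGTACGA ACGTACGT ACCTACGT]
pi_value = pairwise_diversity(aligned)
puts pi_value.round(4)
```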

Calculate cut-off for minority variants based on Poisson model

    cut_off = qc_seqhash.pm
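The idea behind a Poisson-based cut-off can be sketched as follows: if residual errors at a site follow a Poisson distribution, the cut-off is the smallest variant count that is unlikely to arise from error alone. The error rate, the significance level and the helper names below are assumptions for illustration, not the values or code used by ViralSeq::SeqHash#pm.

```ruby
# Poisson probability mass P(X = k) for a given rate.
def poisson_pmf(k, rate)
  Math.exp(-rate) * (rate**k) / (1..k).reduce(1, :*)
end

# Smallest count k such that P(X >= k) < alpha when the number of
# erroneous observations follows Poisson(n_seqs * error_rate).
def poisson_cutoff(n_seqs, error_rate, alpha)
  rate = n_seqs * error_rate
  cumulative = 0.0                     # P(X < k)
  k = 0
  loop do
    tail = 1.0 - cumulative            # P(X >= k)
    return k if tail < alpha
    cumulative += poisson_pmf(k, rate)
    k += 1
  end
end

# Assumed inputs: 1000 sequences, error rate 1e-4 per site, alpha 1e-4.
cutoff = poisson_cutoff(1000, 0.0001, 0.0001)
puts "variants seen at least #{cutoff} times exceed the cut-off"
```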

Examine drug resistance mutations in the HIV PR region

    qc_seqhash.sdrm_hiv_pr(cut_off)

Known issues

~~have a conflict with rails.~~
~~Update on 03032021: the conflict remains, but in the Rails Gemfile you can set require: false and only require "viral_seq" when the module is needed in a controller.~~
The conflict seems to be resolved. It came from a combination of using ! as a function for factorial and the gem name viral_seq. @_@

Updates

Version-1.7.1-05122023

Add a size check for the raw sequences. If the actual size is smaller than the input params, error messages will be sent to users. If the actual size is greater than the input params, extra bases will be truncated.
Now allows mismatch for the primer region sequences. Forward primer region allows 2 nt differences and cDNA primer region allows 3 nt differences.
Bug fix.
TCS version to 2.5.2
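The size check and the mismatch tolerance described in this release can be sketched in plain Ruby as below. This is an illustration of the behavior as stated in the notes, not the pipeline's code; the helper names and toy sequences are hypothetical, and ambiguity codes in primers are not handled here.

```ruby
# Count positions at which two equal-length strings differ.
def mismatches(a, b)
  a.chars.zip(b.chars).count { |x, y| x != y }
end

# True if the read's primer region matches the primer within the allowed differences.
def primer_match?(read_region, primer, max_mismatch)
  read_region.length == primer.length &&
    mismatches(read_region, primer) <= max_mismatch
end

# Reject reads shorter than expected; truncate reads that are longer.
def fit_to_size(seq, expected)
  if seq.length < expected
    raise ArgumentError, "read is #{seq.length} nt, expected at least #{expected}"
  end
  seq[0, expected]
end

puts primer_match?('ACGTAC', 'ACGTTT', 2)   # forward primer region: up to 2 differences
puts primer_match?('AAGTAC', 'ACGTTT', 2)
puts fit_to_size('ACGTACGT', 5)
```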

Version-1.7.0-08242022

Add warnings if the tcs pipeline is executed from source instead of from the installed gem.
Optimized the ViralSeq::SeqHash#a3g hypermutation algorithm. It now allows an external reference other than the sample reference.

Version-1.6.4-07182022

Included region "P17" in the default tcs -d pipeline setting. tcs pipeline updated to version 2.5.1.
Loosened the locator params for the "V1V3" end region to handle rare alignment issues. The default "V1V3" region now ends at positions 7205 to 7210 instead of 7208.
tcs_sdrm now analyse "P17" region for pairwise diversity.

Version-1.6.3-02052022

Updated the ViralSeq::Muscle module to accompany the update of muscle from version 3.8.1 to 5.1.
Optimized the locator algorithm based on muscle v5.1.
Optimized the tcs_sdrm pipeline based on muscle v5.1.

Version-1.6.1-02022022

Fixed the nav bar in tcs_log html file.
Fixed a typo in tcs.

Version 1.6.0-01042022

Update the ViralSeq::TcsCore::detection_limit with pre-calculated values to save processing time.
Update tcs pipeline to v2.5.0. An HTML report will be generated by running the tcs_log script after the tcs pipeline.

Version 1.5.0-01042022

Added a function to calculate detection limit/sensitivity for minority variants (R required). ViralSeq::TcsCore::detection_limit
Added a function to get a sub SeqHash object given a range of nt positions. ViralSeq::SeqHash#nt_range
Added a function to quality check DNA sequences against the sample consensus for indels. ViralSeq::SeqHash#qc_indel
Added a function for DNA variant analysis. Return a Hash object that can output as a JSON file. ViralSeq::SeqHash#nt_variants
Added a function to check the size of sequences of a SeqHash object. ViralSeq::SeqHash#check_nt_size

Version 1.4.0-10132021

Added a function to calculate the false discovery rate (FDR, i.e., the Benjamini-Hochberg correction) for minority mutations detected in the sequences. ViralSeq::SeqHash#fdr
Updated the bin/tcs_sdrm script to add an FDR value to each DRM detected.
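For reference, the Benjamini-Hochberg procedure itself can be sketched in a few lines of Ruby. This is the standard textbook adjustment of a list of p-values, written independently of ViralSeq::SeqHash#fdr, whose inputs and return value may differ; the helper name and example p-values are assumptions.

```ruby
# Benjamini-Hochberg adjusted p-values, returned in the original order.
def benjamini_hochberg(p_values)
  m = p_values.size
  # Pair each p-value with its original index, then sort ascending.
  ranked = p_values.each_with_index.sort_by { |pval, _| pval }
  adjusted = Array.new(m)
  running_min = 1.0
  # Walk from the largest p-value down, enforcing monotonicity.
  ranked.each_with_index.reverse_each do |(pval, original_index), rank0|
    candidate = pval * m / (rank0 + 1)
    running_min = [running_min, candidate].min
    adjusted[original_index] = running_min
  end
  adjusted
end

p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
adjusted.each_with_index do |q, i|
  puts "p = #{p_values[i]}  adjusted = #{q.round(4)}"
end
```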

Version 1.3.0-08302021

Fixed a bug in the tcs pipeline.

Version 1.2.9-08022021

Fixed a bug when reading input primer sequences in lowercase.
Fixed a bug in the method ViralSeq::Math::RandomGaussian.

Version 1.2.8-07292021

Fixed an issue when reading .fastq files containing blank lines.

Version 1.2.7-07152021

Optimized the workflow of the tcs pipeline on raw data with uneven lengths. tcs version to v2.3.6.

Version 1.2.6-07122021

Optimized the workflow of the tcs pipeline in the "end-join/QC/Trimming" section. tcs version to v2.3.5.

Version 1.2.5-06232021

Add error rescue and reporting in the tcs pipeline. Error messages are stored in the .tcs_error file. tcs pipeline updated to v2.3.4.
Use simple majority for the consensus cut-off in the default setting of the tcs -dr pipeline.

Version 1.2.2-05272021

Fixed a bug in the tcs pipeline that sometimes causes SystemStackError. tcs pipeline upgraded to v2.3.2

Version 1.2.1-05172021

Added a function in R to check and install missing R packages for tcs_sdrm pipeline.

Version 1.2.0-05102021

Added tcs_sdrm pipeline as an executable. tcs_sdrm processes tcs-processed HIV MPID-NGS data for drug resistance mutations, recency and phylogenetic analysis.
Added function ViralSeq::SeqHash#sample.
Added recency determining function ViralSeq::Recency::define
Fixed a few bugs related to tcs_sdrm.

Version 1.1.2-04262021

Added function ViralSeq::DRMs.sdrm_json to export SDRM as json object.
Added a random string to the temp file names for muscle_bio to avoid issues when running scripts in parallel.
Added --keep-original flag to the tcs pipeline.
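The random-suffix approach to temp file names mentioned above can be sketched with Ruby's standard library. The helper name, the prefix and the use of `Dir.tmpdir` are assumptions for this example; the gem may place its temp files elsewhere.

```ruby
require 'securerandom'
require 'tmpdir'

# Build a temp file path with a random hex suffix so that parallel
# jobs do not write to the same file. Hypothetical helper.
def unique_temp_path(prefix, ext)
  File.join(Dir.tmpdir, "#{prefix}_#{SecureRandom.hex(8)}#{ext}")
end

path_a = unique_temp_path('muscle_in', '.fasta')
path_b = unique_temp_path('muscle_in', '.fasta')
puts path_a
puts path_b
```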

Version 1.1.1-04012021

Added a warning when paired_raw_sequence is less than 0.1% of total_raw_sequence.
Added option -i WORKING_DIRECTORY to the tcs script. If the params.json file does not contain the path to the working directory, the path given with -i is appended to the run params.
Added option -dr to the tcs script.

Version 1.1.0-03252021

Optimized the algorithm of end-join.
Fixed a bug in the tcs pipeline where combined tcs files were sometimes not saved.
Added tcs_log command to pool run logs and tcs files from one batch of tcs jobs.
Added the preset of MPID-HIVDR params file dr.json in /docs.
Add platform_format option in the json generator of the tcs Pipeline. Users can choose from 3 MiSeq platforms for processing their sequencing data. MiSeq 300x7x300 is the default option.

Version 1.0.14-03052021

Add a function ViralSeq::TcsCore.validate_file_name to check MiSeq paired-end file names.

Version 1.0.13-03032021

Fixed the conflict with rails.

Version 1.0.12-03032021

Fixed an issue that may cause conflicts with ActiveRecord.

Version 1.0.11-03022021

Fixed an issue when calculating Poisson cutoff for minority mutations ViralSeq::SeqHash.pm.
Fixed an issue loading class 'OptionParser' in some Ruby environments.

Version 1.0.10-11112020:

Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
tcs_json_generator is removed. This CLI is delivered within the tcs pipeline, by running tcs -j. The scripts are included in the /viral_seq/tcs_json.rb
The consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
A few optimizations.
TCS 2.1.0 delivered.
Tried parallel processing. Could not achieve the goal because the parallel gem by default can't pool data from the memory of child processes, and in_threads does not help with the speed.
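A simple majority call, where the most frequent nucleotide wins even below 50%, can be sketched as follows. This is an independent illustration rather than the code in tcs_core.rb; in particular, reporting ties as 'N' is an assumption made here, and the helper name is hypothetical.

```ruby
# Call the most frequent base at each position of equal-length sequences.
# No minimum frequency is required; ties are reported as 'N' (assumption).
def simple_majority_consensus(seqs)
  length = seqs.first.length
  (0...length).map do |i|
    counts = Hash.new(0)
    seqs.each { |s| counts[s[i]] += 1 }
    top = counts.values.max
    winners = counts.select { |_, c| c == top }.keys
    winners.size == 1 ? winners.first : 'N'
  end.join
end

# At the last position 'T' is called with only 2 of 5 reads (40%).
reads = %w[ACGT ACGA ATGC AGGG TCGT]
puts simple_majority_consensus(reads)
```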

Version 1.0.9-07182020:

Change ViralSeq::SeqHash#stop_codon and ViralSeq::SeqHash#a3g_hypermut return value to hash object.
TCS pipeline updated to version 2.0.1. Add optional export_raw: TRUE/FALSE in json params. If export_raw is TRUE, raw sequence reads (have to pass quality filters) will be exported, along with TCS reads.

Version 1.0.8-02282020:

TCS pipeline (version 2.0.0) added as executable. tcs - main TCS pipeline script. tcs_json_generator - step-by-step script to generate json file for tcs pipeline.
Methods added: ViralSeq::SeqHash#trim
Bug fix for several methods.

Version 1.0.7-01282020:

Several methods added, including ViralSeq::SeqHash#error_table ViralSeq::SeqHash#random_select
Improved performance for several functions.

Version 1.0.6-07232019:

Several methods added to ViralSeq::SeqHash, including ViralSeq::SeqHash#size ViralSeq::SeqHash#+ ViralSeq::SeqHash#write_nt_fa ViralSeq::SeqHash#mutation
Updated documentation and RSpec samples.

Version 1.0.5-07112019:

Update ViralSeq::SeqHash#sequence_locator. The program will try to determine the direction (+ or -) of the query sequence.
Update executable locator to include a direction column in the output .csv file.
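What "direction" means here can be shown with a reverse complement check. The sketch below is a deliberately naive stand-in that uses exact substring matching against a toy reference; the locator itself works on alignments against HIV/SIV reference genomes, and the helper names and sequences are assumptions for this example.

```ruby
# Reverse complement of a DNA sequence (A/C/G/T only).
def reverse_complement(seq)
  seq.upcase.reverse.tr('ACGT', 'TGCA')
end

# Naive direction guess: '+' if the query occurs in the reference as given,
# '-' if its reverse complement does, nil if neither is found.
def guess_direction(reference, query)
  return '+' if reference.include?(query.upcase)
  return '-' if reference.include?(reverse_complement(query))
  nil
end

toy_reference = 'ATGGGTGCGAGAGCGTCAGTATTAAGC'   # toy sequence, not a real reference genome
puts guess_direction(toy_reference, 'GGGTGCGAGA').inspect
puts guess_direction(toy_reference, 'TCTCGCACCC').inspect
puts guess_direction(toy_reference, 'TTTTTTTT').inspect
```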

Version 1.0.4-07102019:

Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
Fix bugs in bin locator

Version 1.0.3-07102019:

Bug fix.

Version 1.0.2-07102019:

Fixed a gem loading issue.

Version 1.0.1-07102019:

Add keyword argument :model to ViralSeq::SeqHashPair#join2.
Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
Updated documentation.

Version 1.0.0-07092019:

Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq

Development

Bug reports and pull requests are welcome on GitHub at https://github.com/ViralSeq/viral_seq. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the viral_seq project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.