Metadata-Version: 2.1
Name: nPhase
Version: 1.2.0
Summary: nPhase is a command line ploidy agnostic phasing pipeline and algorithm which phases samples of any ploidy with sequence alignment of long and short read data to a reference sequence.
Home-page: https://https://github.com/nPhasePipeline/nPhase
Author: Omar Abou Saada
Author-email: omaroakheart@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3.8
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/markdown

Toolkit to download datasets, generate virtual polyploids, phase them, and assess phasing quality.

# Phasing Toolkit

The Phasing Toolkit is an open source collection of scripts which automate the steps of alignment-based phasing operations.

Currently, the focus is on making it easy to benchmark phasing algorithms with a focus on polyploid tools. Ultimately the goal is to develop a general purpose phasing toolkit. Everyone is welcome and encouraged to contribute to this project.

# Installation

Install bioconda and set the correct channels by following the first two steps here: https://bioconda.github.io/user/install.html

Then you can create a new environment and install polyploidBenchmarking with the following commands

### Create new conda environment
conda create -n polyploidBenchmarking python=3.8

### Download prerequisite packages
conda activate polyploidBenchmarking  
conda install -c oakheart nphase  
conda install -c bioconda -c bioinf-mcb sra-downloader  
conda install whatshap  
conda install -c bioconda ncbi-genome-download  
conda install psutil

If you wish to test flopp you must download it manually (any help automating this step would be appreciated), instructions are available here:
https://github.com/bluenote-1577/flopp

Make sure "flopp" is in your $PATH (such that entering "flopp" into the command line launches it)

### Download the phasing toolkit scripts
git clone https://github.com/OmarOakheart/Phasing-Toolkit.git

# Quickstart

There are currently five scripts which all require a parameter file (benchmarkingParameters.tsv) as input. In its current state, the Phasing Toolkit is optimized to easily benchmark polyploid phasing algorithms on virtual polyploid datasets. A virtual polyploid is created by taking the sequencing reads of genetically distinct haploid or homozygous diploid organisms of the same species and merging them, thus using real reads to generate simulated polyploids. We can then run phasing algorithms on these simulated or virtual polyploids and compare the genotypes of the haploids to the predictions to calculate phasing accuracy metrics. This toolkit provides a way to perform all of these steps, from data download to accuracy metric calculation, with minimal effort.

![Alt text](img/polyploidBenchmarking.svg?raw=true "polyploid benchmarking")

Note: Virtual polyploids don't perfectly approximate real polyploid data and we welcome any contribution to this project for alternate ways of generating polyploid phasing benchmarking datasets.

### Parameter file (benchmarkingParameters.tsv)

``benchmarkingParameters.tsv``

>mainPath&emsp;/path/to/outputFolder/  
>longReads&emsp;ACA:ERR4352153&emsp;BMB:ERR4352154&emsp;CCN:ERR4352155&emsp;CRL:ERR4352156  
>longReadMethod&emsp;ont  
>shortReads&emsp;ACA:ERR1309429&emsp;BMB:ERR1308675&emsp;CCN:ERR1308732&emsp;CRL:ERR1308952  
>coverages&emsp;20&emsp;10  
>heterozygosityRates&emsp;0.05&emsp;0.1&emsp;0.5&emsp;1  
>reference&emsp;taxID:559292&emsp;group:fungi&emsp;speciesName:sCerevisiae  
>strainLists&emsp;BMB,CCN&emsp;ACA,BMB,CCN&emsp;ACA,BMB,CCN,CRL  
>genomeSize&emsp;12500000  
>threads&emsp;8

The parameter file contains all of the information required by the various scripts of the phasing toolkit. It is tab-separated and contains multiple distinct lines. Line order is not important. We provide an example file, benchmarkingParameters.tsv, which can be used as a template. A detailed description of this file is available at the end of the page.

### downloadData.py
Usage: `python3.8 downloadData.py benchmarkingParameters.tsv`

* Downloads reads using sra-downloader (https://github.com/s-andrews/sradownloader). Input: SRA Accession numbers associated with strain names
* Downloads and indexes a reference sequence using ncbi-genome-download (https://github.com/kblin/ncbi-genome-download). Input: taxID of the species, group according to the ncbi (fungi, viral, etc.), and species name.

### processReads.py [groundTruth]
Usage: `python3.8 processReads.py benchmarkingParameters.tsv groundTruth`
* Generates the ground truth dataset. Maps short reads and long reads, variants calls short reads

### hybridGenerator.py
Usage: `python3.8 hybridGenerator.py benchmarkingParameters.tsv`
* Generates virtual hybrids by merging the ground truth dataset reads at specified coverage levels.

### processReads.py [virtualHybrids]
Usage: `python3.8 processReads.py benchmarkingParameters.tsv virtualHybrids`
* Runs on the virtual polyploids generated by hybridGenerator.py. Maps short reads and long reads, variants calls and subsets short read VCFs, with and without indels

### phaseToolRunner.py
Usage: `python3.8 phaseToolRunner.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase [...]`
* Phases the virtual polyploids generated by hybridGenerator.py and processed by processReads.py [virtualHybrids]. Will run every phasing algorithm specified in the command line. This script also outputs performance metrics (runtime and memory usage). Currently only flopp, nphase and whatshap polyphase are implemented. Please feel free to contribute to the project by adding an implementation of another phasing algorithm you would like to see included.

### accuracyCalculator.py
Usage: `python3.8 accuracyCalculator.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase [...]`
* Calculates accuracy metrics of phasing results obtained by running phaseToolRunner.py and outputs them to a file and to stdout.

### Example script
With the example benchmarkingParameters.tsv file, you can run a full benchmarking pipeline using 4 strains to generate a virtual 2n, 3n and 4n, at coverage levels 10X/haplotype and 20X/haplotype, and randomly subset their variants to obtain VCF files at 1%, 0.5%, 0.1% and 0.05% heterozygosity, with and without indels. In total it generates 3 ploidies at 2 coverage levels and 4 heterozygosity levels each, with and without indels, for a total of 3\*2\*4\*2=48 virtual polyploids which are all phased by flopp, nphase and whatshap polyphase and their results analyzed in comparison with the known ground truth.

Please note that this can take a few days and over 100GB of hard drive space to run fully. You can modify the benchmarkingParameters.tsv file to reduce the number of tests to perform to test run the phasing toolkit faster and more efficiently. We provide a very complete benchmarking test here as an example to show the range of available options.

>python3.8 downloadData.py benchmarkingParameters.tsv  
>python3.8 processReads.py benchmarkingParameters.tsv groundTruth  
>python3.8 hybridGenerator.py benchmarkingParameters.tsv  
>python3.8 processReads.py benchmarkingParameters.tsv virtualHybrids  
>python3.8 phaseToolRunner.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase  
>python3.8 accuracyCalculator.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase

### Parameter file in detail
``benchmarkingParameters.tsv``

This file contains all of the information needed to share a full benchmark with anyone else. Here we describe it line by line:

Output folder which will be used to store all of the data we're going to download and generate  
>mainPath&emsp;/path/to/outputFolder/

Names of the strains we will phase and the accessions of their long reads, format is `strainName:SRA_Accession`  
>longReads&emsp;ACA:ERR4352153&emsp;BMB:ERR4352154&emsp;CCN:ERR4352155&emsp;CRL:ERR4352156  

Define the type of long read (ont or pacbio)  
>longReadMethod&emsp;ont  

Names of the strains we will phase and the accessions of their long reads, format is `strainName:SRA_Accession`. The names must be the same as for the long reads.  
>shortReads&emsp;ACA:ERR1309429&emsp;BMB:ERR1308675&emsp;CCN:ERR1308732&emsp;CRL:ERR1308952  

Coverage levels to test (10 means 10X per haplotype, for a diploid that means a total of 20X, for a tetraploid it means a total of 40X)  
>coverages&emsp;20&emsp;10  

Heterozygosity levels to test (0.5 means 0.5%, i.e. on average 1 heterosygous SNP every 2000 bp)  
>heterozygosityRates&emsp;0.05&emsp;0.1&emsp;0.5&emsp;1  

Reference information, must include the taxID of the species according to the ncbi, the ground name according to the ncbi, and a name for the reference sequence (speciesName). More information can be obtained from https://github.com/kblin/ncbi-genome-download.  
>reference&emsp;taxID:559292&emsp;group:fungi&emsp;speciesName:sCerevisiae  

Define the virtual polyploids to generate. `ACA,BMB,CCN` means it should generate a virtual triploid with reads from the strains ACA, BMB and CCN.  
>strainLists&emsp;BMB,CCN&emsp;ACA,BMB,CCN&emsp;ACA,BMB,CCN,CRL  

Reference genome size, needed to calculate coverage  
>genomeSize&emsp;12500000  

Number of threads to use when applicable  
>threads&emsp;8  

# How to contribute

If you're interested in helping with the development of this toolkit, please feel free to create issues on this page or email me at omaroakheart@gmail.com to discuss ways in which we can improve this project and make it more flexible and useful for the community. I'm very interested in extending this to (in no particular order):

* Improve the flexibility of tools provided to make the phasing toolkit more versatile (for example reusing the VCF subsetting method to make it easy for anyone to subset their own VCF)
* Include more default benchmarking datasets which push the limits of phasing methods in different ways (highly similar haplotypes, aneuploidy, LOH tracts, SVs...)
* Host a simple website which displays the results of different tools on different datasets
* Test how different parameters influence accuracy results
* Provide more insight into which variants are well-phased
* Investigate the effect of read and mapping quality on results
* Provide tools to analyze and visualize the results of phased data
* Discuss and develop accuracy metrics
* Discuss and develop the ideal input and output formats of polyploid phasing algorithms


