Metadata-Version: 2.1
Name: metapi
Version: 0.6.1
Summary: a pipeline to construct a genome catalogue from metagenomics data
Home-page: https://github.com/ohmeta/metapi
Author: Jie Zhu
Author-email: zhujie@genomics.cn
License: GPLv3
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: ruamel.yaml

# **metapi**

hello, metagenomics!

## brother project

* [anvio](https://github.com/merenlab/anvio)
* [atlas](https://github.com/pnnl/atlas)
* [sunbeam](https://github.com/sunbeam-labs/sunbeam)

## motivation

  we all need a metagenomics pipeline for academic research.

## principle

* bind intelligense together
  * [github](https://github.com/search?q=metagenomics)
  * why we here?
* do not make wheels
  * [biopython](https://github.com/biopython/biopython)
  * [bioperl](http://bioperl.org)
  * [seqan](https://github.com/seqan/seqan)
* make full use of pipeline execution engine
  * [make](https://www.gnu.org/software/make/manual/make.html)
  * [snakemake](https://bitbucket.org/snakemake/snakemake)
  * [common workflow language](https://github.com/common-workflow-language/common-workflow-language)
  * [workflow definition language](https://software.broadinstitute.org/wdl/)
* make full use of awesome bioinformatics tools
  * [bwa](https://github.com/lh3/bwa)
  * [samtools](https://github.com/samtools/samtools)
  * [spades](https://github.com/ablab/spades)
  * [idba](https://github.com/loneknightpy/idba)
  * [megahit](https://github.com/voutcn/megahit)
  * [metabat](https://bitbucket.org/berkeleylab/metabat)
  * [checkm](https://github.com/Ecogenomics/CheckM)
* robust and module, extensible, update
  * one rule, one module
  * one module, one analysis
* welcome to PR

## design

* execution module
    ```python
    # Snakefile
        include: "rules/step.smk"
        include: "rules/simulation.smk"
        include: "rules/fastqc.smk"
        include: "rules/trimming.smk"
        include: "rules/rmhost.smk"
        include: "rules/assembly.smk"
        include: "rules/alignment.smk"
        include: "rules/binning.smk"
        include: "rules/cobinning.smk"
        include: "rules/checkm.smk"
        include: "rules/dereplication.smk"
        include: "rules/classification.smk"
        include: "rules/annotation.smk"
        include: "rules/profilling.smk"
    ```

* analysis module
  * raw data report
  * quality control
  * remove host sequences
  * assembly
  * assembly evaluation
  * binning
  * checkm
  * dereplication
  * bins profile
  * taxonomy classification
  * genome annotation
  * function annotation

* test module
  * execution test
  * analysis test

## install

* install dependencies*
  * [snakemake](https://snakemake.readthedocs.io)
  * [pigz](https://zlib.net/pigz/)
  * [ncbi-genome-download](https://github.com/kblin/ncbi-genome-download)
  * [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq)
  * [OAFilter](https://github.com/Scelta/OAFilter)
  * [sickle](https://github.com/najoshi/sickle)
  * [fastp](https://github.com/OpenGene/fastp)
  * [MultiQC](https://github.com/ewels/MultiQC)
  * [bwa](https://github.com/lh3/bwa)
  * [samtools](https://github.com/samtools/samtools)
  * [spades](https://github.com/ablab/spades)
  * [idba](https://github.com/loneknightpy/idba)
  * [megahit](https://github.com/voutcn/megahit)
  * [quast](https://sourceforge.net/projects/quast/)
  * [MetaBat](https://bitbucket.org/berkeleylab/metabat)
  * [MaxBin2](http://downloads.jbei.org/data/microbial_communities/MaxBin/MaxBin.html)
  * [CheckM](https://github.com/Ecogenomics/CheckM)
  * [drep](https://github.com/MrOlm/drep)
  * [prokka](https://github.com/tseemann/prokka)
  * [metaphlan2](https://bitbucket.org/biobakery/metaphlan2)

  ```bash
  # in python3 environment
  conda install snakemake pigz ncbi-genome-download sickle-trim fastp bwa samtools \
                bbmap spades idba megahit maxbin2 prokka metabat2 drep quast checkm-genome
  pip install insilicoseq

  # in python2 envrionment
  conda install metaphlan2

  # database configuration
  wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
  mkdir checkm_data
  cd checkm_data
  tar -xzvf ../checkm_data_2015_01_16.tar.gz
  cd ..
  ln -s checkm_data checkm_data_latest

  # activate python3 environment where checkm in
  checkm data setRoot checkm_data_latest
  ```

* install metapi

    ```bash
    # recommand
    git clone https://github.com/ohmeta/metapi
    # or (maybe not latest)
    pip install metapi
    ```

## example

* snakemake了解一下:)

    ```python
    rule bwa_mem:
        input:
            r1 = "fastq/sample_1.fq.gz",
            r2 = "fastq/sample_2.fq.gz",
            ref = "ref/ref.index
        output:
            bam = "sample.sort.bam",
            stat = "sample_flagstat.txt"
        threads:
            8
        shell:
            '''
            bwa mem -t {threads} {input.ref} {input.r1} {input.r2} | \
            samtools view -@{threads} -hbS - | \
            tee >(samtools flagstat -@{threads} - > {output.stat}) | \
            samtools sort -@{threads} -o {output.bam} -
            '''
    ```

* a simulated metagenomics data test(uncomplete)

    ```bash
    # in metapi/example/basic_test directory
    cd example/basic_test

    # look
    snakemake --dag | dot -Tsvg > dat.svg
    ```
    <img src="examples/basic_test/dat.svg">

    ```bash
    # run on local
    snakemake

    # run on SGE cluster
    snakemake \
    --jobs 80 \
    --cluster "qsub -S /bin/bash -cwd -q {queue} -P {project} -l vf={mem},p={cores} -binding linear:{cores}"
    ```

* a real world metagenomics data process(uncomplete)

    ```bash
    # in metapipe directory
    # look
    cd metapi
    snakemake --dag | dot -Tsvg > ../docs/dat.svg
    ```
    <img src="docs/dat.svg">

    ```bash
    # run on local
    snakemake \
    --cores 8 \
    --snakefile metapi/Snakefile \
    --configfile metapi/metaconfig.yaml \
    --until all

    # run on SGE cluster
    snakemake \
    --snakefile metapi/Snakefile \
    --configfile metapi/metaconfig.yaml \
    --cluster-config metapi/metacluster.yaml \
    --jobs 80 \
    --cluster "qsub -S /bin/bash -cwd -q {cluster.queue} -P {cluster.project} -l vf={cluster.mem},p={cluster.cores} -binding linear:{cluster.cores} -o {cluster.output} -e {cluster.error}"
    --latency-wait 360 \
    --until all
    ```

## metapi command line interface

* init

    ```bash
    metapi --help

    usage: metapi [subcommand] [options]

    metapi, a metagenomics data process pipeline

    optional arguments:
        -h, --help     show this help message and exit
        -v, --version  print software version and exit

    available subcommands:

        init         a metagenomics project initialization
        simulation   a simulation on metagenomics data
        workflow     a workflow on real metagenomics data
    ```


    please supply samples.tsv    
    format     
    | id |    fq1     |     fq2    |  
    |----|------------|------------|  
    | s1 | s1.1.fq.gz | s1.2.fq.gz |  
    | s2 | s2.1.fq.gz | s2.2.fq.gz |    


   ```bash
   python /path/to/metapi/metapi/metapi.py init -d . -s samples.tsv -b raw -a metaspades
   ```

* list

    ```bash
    snakemake --snakefile /path/to/metapi/metapi/Snakefile --configfile metaconfig.yaml --list
    ``` 

* debug

    ```bash      
    snakemake --snakefile /path/to/metapi/metapi/Snakefile \
        --configfile metaconfig.yaml \
        -p -r -n --debug-dag \
        --until checkm_lineage_wf
    ```

* simulation

* workflow


