Metadata-Version: 2.1
Name: workflow.py
Version: 0.1.0
Summary: A set of scripts to convert multiple breseq analyses together and highlight variabls of interest.
Home-page: https://github.com/cdeitrick/isolate_parsers
Author: chris deitrick
Author-email: chrisdeitrick1@gmail.com
License: MIT
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Provides: i
Provides: s
Provides: o
Provides: l
Provides: a
Provides: t
Provides: e
Provides: p
Provides: a
Provides: r
Provides: s
Provides: e
Provides: r
Description-Content-Type: text/markdown
Requires-Dist: pandas (>=0.24.0)
Requires-Dist: loguru
Requires-Dist: xlrd
Provides-Extra: additional_support_for_parsing_files
Requires-Dist: beautifulsoup4 ; extra == 'additional_support_for_parsing_files'
Provides-Extra: to_run_tests
Requires-Dist: pytest ; extra == 'to_run_tests'

# isolate_parsers

## Usage
```
python isolateset_parser.py [-h] [-i FOLDER] [--no-fasta] [-w WHITELIST]
                            [-b BLACKLIST] [-m SAMPLE_MAP] [--filter-1000bp]

optional arguments:
  -h, --help            show this help message and exit
  -i FOLDER, --input FOLDER
                        The breseq folder to parse.
  --no-fasta            Whether to generate an aligned fasta file of all snps
                        in the breseq VCF file.
  -w WHITELIST, --whitelist WHITELIST
                        Samples not in the whitelist are ignored. Either a
                        comma-separated list of sample ids for a file with
                        each sample id occupying a single line.
  -b BLACKLIST, --blacklist BLACKLIST
                        Samples to ignore. See `--whitelist` for possible
                        input formats.
  -m SAMPLE_MAP, --sample-map SAMPLE_MAP
                        A file mapping sample ids to sample names. Use if the
                        subfolders in the breseqset folder are named
                        differently from the sample names. The file should
                        have two columns: `sampleId` and `sampleName`,
                        separated by a tab character.
  --filter-1000bp       Whether to filter out variants that occur within
                        1000bp of each other. Usually indicates a mapping
                        error.
```

## Input
The scripts expect a folder of individual breseq runs, with each folder named after the isolate/sample.
The scipts only require the `output.vcf`, `annotated.gd`, and `index.html` files located in each folder.
Example folder:
```
    .breseq_folder
    |-- sample1
    |   |-- data
    |   |   |-- output.vcf
    |   |-- output
    |   |   |-- index.html
    |   |   |-- evidence
    |   |   |   |-- annotated.gd
    |-- sample2
    |   |-- data
    |   |   |-- output.vcf
    |   |-- output
    |   |   |-- index.html
    |   |   |-- evidence
    |   |   |   |-- annotated.gd
    |-- sample3
    |   |-- data
    |   |   |-- output.vcf
    |   |-- output
    |   |   |-- index.html
    |   |   |-- evidence
    |   |   |   |-- annotated.gd
```

## Output
The scripts generate an excel file in the breseq run folder with 4 sheets: `comparison`, `variant`, `coverage`, and `junction`.
The `variant`, `coverage`, and `junction` tables are just the concatenated tables of all samples in the breseq run.

### Comparision table
A table in which every row represents a single mutation seen in the sample callset 
and samples are represented by columns with the alternate sequence for each sample.

| Sample1 | Sample2 | Sample3 | annotation                | description                                                                                     | gene                    | locusTag          | mutationCategory  | position | presentIn | presentInAllSamples | ref | seq id    | 
|---------|---------|---------|---------------------------|-------------------------------------------------------------------------------------------------|-------------------------|-------------------|-------------------|----------|-----------|---------------------|-----|-----------| 
| GG      | GG      | GG      | intergenic (+65/+20)      | putative lipoprotein/putative hydrolase                                                         | PFLU0045 - / - PFLU0046 | PFLU0045/PFLU0046 | small_indel       | 45881    | 3         | 1                   | G   | NC_012660 | 
| CC      | CC      | CC      | intergenic (+17/-136)     | microcin-processing peptidase 1. Unknown type peptidase. MEROPS family U62/hypothetical protein | PFLU0872 - / - PFLU0873 | PFLU0872/PFLU0873 | small_indel       | 985333   | 3         | 1                   | C   | NC_012660 | 
|         |         |         | intergenic (+57/+21)      | hypothetical protein/putative helicase                                                          | PFLU3154 - / - PFLU3155 | PFLU3154/PFLU3155 | small_indel       | 3447986  | 3         | 1                   |     | NC_012660 | 
| A       | A       | G       | M350I (ATG-ATA)           | putative GGDEF domain signaling protein                                                         | PFLU3571 -              | PFLU3571          | snp_nonsynonymous | 3959631  | 2         | 0                   | G   | NC_012660 | 
| A       | A       | C       | T238P (ACC-CCC)           | hybrid sensory histidine kinase in two-component regulatory system with UvrY                    | PFLU3777 -              | PFLU3777          | snp_nonsynonymous | 4173231  | 1         | 0                   | A   | NC_012660 | 
| G       | G       | GG      | coding (322/1476 nt)      | putative two-component system response regulator nitrogen regulation protein NR(I)              | PFLU4443 -              | PFLU4443          | small_indel       | 4908233  | 1         | 0                   | G   | NC_012660 | 

### Aligned fasta files

The scripts also generates 3 fasta files (`breseq.snp.fasta`, `breseq.amino.fasta`, `breseq.codon.fasta`)
with all nonsynonymous snps from each sample represented by the replacement bases, amino acids, and codons.
Example:
```
>reference
GA
>Sample1
AA
>Sample2
AA
>Sample3
GC
```

