Metadata-Version: 2.1
Name: sc2ts
Version: 0.0.3
Summary: Infer a succinct tree sequence from SARS-CoV-2 sequence alignments
Author: Jerome Kelleher
Author-email: jerome.kelleher@bdi.ox.ac.uk
License: MIT
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: analysis
Provides-Extra: setup_requires
License-File: LICENSE

# sc2ts
Infer a succinct tree sequence from SARS-CoV-2 variation data

**This is an early alpha version not intended for production use!!**

If you are interested in helping to develop sc2ts or would like to
work with the inferred ARGs, please get in touch.

## Installation

To run the downstream analysis utilities, install from pip using

```
python3 -m pip install sc2ts[analysis]
```

This installs matplotlib and some other heavyweight dependencies.
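
Note that some shells (zsh, for example) treat the square brackets specially,
so the package specifier may need to be quoted:

```
python3 -m pip install 'sc2ts[analysis]'
```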

For just running the inference tools, use

```
python3 -m pip install sc2ts
```

## Inference workflow

### Command line inference

Inference is intended to be run primarily from the command line,
most likely orchestrated via a shell script, Snakemake file, or similar.

The CLI is split into subcommands. Get help by running the CLI without arguments:

```
python3 -m sc2ts
```
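
Each subcommand also accepts the standard ``--help`` flag; for example, the
following should print the options for the ``daily-extend`` command described
below:

```
python3 -m sc2ts daily-extend --help
```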

### Import metadata to local database

Metadata for all samples must be available and provided as a tab-separated
file. We convert this standard text file to a SQLite database
so that we can quickly search for strains collected on a given day, without
loading the entire set each time.

```
python3 -m sc2ts import-metadata data/metadata.tsv data/metadata.db
```
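
The resulting file is an ordinary SQLite database, so as a quick sanity check
it can be inspected with the standard ``sqlite3`` command line tool (the table
layout is an internal detail, so here we just list the tables):

```
sqlite3 data/metadata.db "SELECT name FROM sqlite_master WHERE type='table';"
```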

**TODO: Document required fields**

### Import alignments

To provide fast access to the individual alignments, we store them in a local
database file. These must be imported before inference can be performed.

The basic approach is to use the ``import-alignments`` command, giving the
path to the ``alignments.db`` file we are creating and one or more
FASTA files to import into it.

```bash
python3 -m sc2ts import-alignments data/alignments.db data/alignments/*.fasta
```

By default the database file is updated each time, so this can be done
in stages.
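
For example, alignments arriving in separate batches can be imported one batch
at a time (the batch directory names here are purely illustrative):

```bash
python3 -m sc2ts import-alignments data/alignments.db data/batch-2021-06/*.fasta
python3 -m sc2ts import-alignments data/alignments.db data/batch-2021-07/*.fasta
```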

**TODO discuss the storage and time requirements for this step!**


### Run the inference

The basic approach is to run the ``daily-extend`` command, which performs the
extension operation day-by-day using the information
in the metadata DB.

```
python3 -m sc2ts daily-extend data/alignments.db data/metadata.db results/output-prefix
```

### Example run script

Here is a script used to run the inference for the Long ARG
in the preprint:

```bash
#!/bin/bash
set -e
precision=12
mismatches=3
max_submission_delay=30
max_daily_samples=1000
num_threads=40
datadir=data
run_id=upgma-mds-$max_daily_samples-md-$max_submission_delay-mm-$mismatches
resultsdir=results/$run_id
results_prefix=$resultsdir/$run_id-
logfile=logs/$run_id.log
# Setup the options
options="--num-threads $num_threads -vv -l $logfile "
options+="--max-submission-delay $max_submission_delay "
options+="--max-daily-samples $max_daily_samples "
options+="--precision $precision --num-mismatches $mismatches"
# Create the results dir and data paths
mkdir -p $resultsdir
alignments=$datadir/alignments2.db
metadata=$datadir/metadata.filtered.db
# NOTE: we can also start from a given date with the -b option
# basets="$results_prefix"2022-01-24.ts
# options+=" -b $basets"
python3 -m sc2ts daily-extend $alignments $metadata $results_prefix $options
```
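
The output for each day is a standard tskit tree sequence file, named using the
results prefix and the date (as in the commented-out ``basets`` line above).
One quick way to sanity-check a run is to summarise one of these files with the
tskit CLI (tskit is installed as a dependency of sc2ts); the path below simply
follows the naming pattern from the script:

```bash
python3 -m tskit info "$results_prefix"2022-01-24.ts
```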

## Licensing

The code is marked as licensed under the MIT license,
but because the current implementation uses the matching
engine from tsinfer (which is GPL licensed), this code is
effectively GPL as well.

However, we plan to switch out the matching engine for an
implementation provided by tskit, which is MIT licensed.
This will be done before the first official release.
