Metadata-Version: 2.1
Name: radinitio
Version: 1.2.0
Summary: A package for the simulation of RADseq data.
Home-page: http://catchenlab.life.illinois.edu/radinitio
Author: Angel G. Rivera-Colon <arcolon14@gmail.com>, Nicolas Rochette <rochette@illinois.edu>, Julian Catchen <jcatchen@illinois.edu>
Author-email: angelgr2@illinois.edu
Project-URL: Manual, http://catchenlab.life.illinois.edu/radinitio/manual/
Project-URL: Source, https://bitbucket.org/angelgr2/radinitio/src/default/
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: scipy
Requires-Dist: numpy
Requires-Dist: decoratio
Requires-Dist: msprime

# RADinitio

![RADinitio logo](image.png)

**RADinitio** is a pipeline for the assessment of RADseq experiments via prospective and retrospective data simulation. Sequencing data is generated *de novo* from a population of individuals via a coalescent simulation under a user-defined demographic model using **msprime**. The genetic variants in each sample are simulated in a genomic context that can impact downstream generation of RAD loci and sequencing reads. The per-individual sequences are then treated to an *in silico* RADseq library preparation. The components of the library are defined by the user, allowing for the exploration of parameters including restriction enzyme selection, insert size, sequencing coverage, and PCR duplicate distribution (generated using the **decoratio** software). **RADinitio** simulations can also be applied retrospectively by comparing and modelling sources of error in empirical datasets. The purpose of **RADinitio** is for researchers to fully explore possible variables in their data generation process to ensure that their protocol selection and library preparation is performed optimally, within the limitations of technical and experimental error.

## Citation

If you use **RADinitio** in your work, please cite the [2021 *Mol Ecol Resour*](https://doi.org/10.1111/1755-0998.13163) manuscript:

> Rivera-Colón, AG, Rochette, NC, Catchen, JM. (2021) Simulation with RADinitio improves RADseq experimental design and sheds light on sources of missing data. *Mol Ecol Resour* 21: 363– 378. <https://doi.org/10.1111/1755-0998.13163>

## Installation

**RADinitio** can be installed from the Python Package Index (PyPI) using:

```sh
python3 -m pip install radinitio
```

More information regarding this installation can be obtained at the **RADinitio** [PyPI Project page](https://pypi.org/project/radinitio/ "PyPI Project page").

Users can install **RADinitio** in their local Python directories using the `--user` flag:

```sh
python3 -m pip install radinitio --user
```

Alternatively, the software can be downloaded from the [Catchen Lab](http://catchenlab.life.illinois.edu/radinitio/ "Catchen Lab") website and installed using the following commands:

```sh
python3 -m pip install radinitio-version.tar.gz
```

Or for a local installation:

```sh
python3 -m pip install radinitio-version.tar.gz --user
```

For more information regarding the `pip` installation, please visit the [pip user-guide](https://pip.pypa.io/en/stable/user_guide/ "pip user-guide").

### Requirements and dependencies

**RADinitio** requires Python 3.7 or higher.

**RADinitio** depends on the following non-default Python packages:

* `msprime` ([Kelleher et al. 2016](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004842/); [Baumdicker et al. 2022](https://doi.org/10.1093/genetics/iyab229); <https://tskit.dev/msprime/docs/latest/intro.html>)
* `decoratio` ([Rochette et al. 2023](https://doi.org/10.1111/1755-0998.13800); <https://pypi.org/project/decoratio/>)
* `scipy` ([Harris et al. 2020](https://doi.org/10.1038/s41586-020-2649-2); <https://scipy.org/>)
* `numpy` ([Virtanen et al. 2020](https://doi.org/10.1038/s41592-019-0686-2); <https://numpy.org/>)

Running the `pip` installation, from both PyPI or the the direct installation will take care of all dependencies, including **msprime**, **decoratio**, **scipy**, **numpy**.

## RADinitio manual

A complete version of the **RADinitio** manual including instructions for installation, documentation of command line variables, description of output files, and a tutorial to run the pipeline can be found at the [**RADinitio** website](http://catchenlab.life.illinois.edu/radinitio/ "RADinitio website") (Note: needs to be updated to **RADinitio** v `1.2.0`).

## Pipeline structure

**RADinitio** is designed as a series of independent functions that simulate different aspects of the RADseq library preparation process. The pipeline can be broken down into three main steps:

### Variant generation and processing

Variants are generated with **msprime** from a user-defined demographic model. Independent simulations are run for different chromosomes/scaffolds present in a user-provided reference genome. The simulated variants are then projected against the reference genome to obtain the reference alleles, which are then converted into alternative alleles using a user-defined model. By default, **RADinitio** simulates using **msprime**'s island models; however, users can simulate more complex models using the Python API.

### Extraction of RAD alleles

The reference genome is *in silico* digested to obtain a series of reference RAD loci. This can be done using a single or double enzyme digestion, for single- and double-digest RADseq, respectively. The positions of the reference loci in the genome are then used to filter the simulated variants for all samples to include only variants present in RAD loci, improving downstream performance. When RAD alleles are extracted, the reference RAD loci are modified to include the corresponding genetic variants for each sample. This process can alter a cutsite's sequence, resulting in a dropped allele for that sample.

### Simulate library enrichment and sequencing

Once extracted, the alleles for each sample are randomly sampled to obtain paired-end sequences for each allele template. The alleles are sampled with replacement, proportional to the desired sequencing coverage of the library. Each iteration of the sampling is treated as an independent sequence template that is truncated to a random length sampled from a simulated insert size distribution (or digested with a second enzyme, as is the case with ddRAD). An optional PCR duplicate distribution, as specified using the **decoratio** software, can be applied to the sampling process, resulting in the duplicate sampling of unique template sequences. This process can also introduce random error to the sequence, simulating the generation of PCR errors. Finally, paired end sequences are generated from each template, each with its own unique (simulated) sequencing error.

The corresponding functions for each stage can be run independently. We do provide a wrapper script, `radinitio`, that calls the top-level `radinitio.simulate()` function which runs the pipeline from start to finish. Advanced users can run the pipeline through the Python API, which allows for the generation of more complex demographic models, define finer details of the library preparation process, and run components of the pipeline independently.

## Command line interface

The simplest way to run **RADinitio** is to execute the module via the `radinitio` command line
wrapper, which (by default) calls the top-level `radinitio.simulate()` function, which runs all the stages of the
pipeline.

### Usage & options

The `radinitio` program options are the following:

```sh
radinitio --genome path --out-dir dir [pipeline-stage] [--chromosomes path] \
    [(demographic model options)] [(library options)] [(pcr options)] [(advanced options)]
```

#### Pipeline stages (these options are mutually exclusive)

`--simulate-all` : Run all the **RADinitio** stages (simulate a population, make a library, and sequence the library) (Default)

`--make-population` : Simulate and process variants. Produces genome-wide VCF.

`--make-library-seq` : Simulate and sequence a RAD library. Requires exising `radinitio.make-population` run.

`--tally-rad-loci` : Calculate the number of kept RAD loci in the genome.

#### Required input/output

`-g`, `--genome` : (*str*) Path to reference genome (fasta file, may be gzipped). **Required**

`-o`, `--out-dir` : (*str*) Path to an output directory where all files will be written. **Required**

#### Input genome options

`-l`, `--chromosomes` : (*str*) File containing the list of chromosomes (one per line) to simulate. (default = `None`)

`-i`, `--min-seq-len` : (*int*) Minimum length in bp required to load a sequence from the genome fasta. (default = 0)

#### Demographic model (simple island model)

`-p`, `--n-pops`: (*int*) Number of populations (demes) in the island model. (default = 2)

`-n`, `--pop-eff-size` : (*float*) Effective population size of each simulated deme. (default = 5000)

`-s`, `--n-seq-indv` : (*int*) Number of individuals sampled from each population. (default = 10)

#### Library preparation/sequencing

`-b`, `--library-type` : (*str*) Library type (sdRAD or ddRAD). (default = `sdRAD`)

`-e`, `--enz` : (*str*) Restriction enzyme (*SbfI*, *PstI*, *EcoRI*, *BamHI*, etc.). (default = `SbfI`)

`-d`, `--enz2` : (*str*) Second restriction enzyme for double digest (*MspI*, *MseI*, *AluI*, etc.). (default = `MspI`)

`-m`, `--insert-mean` : (*int*) Insert size mean in bp. (default = 350)

`-t`, `--insert-stdev` : (*int*) Insert size standard deviation in bp. (default = 37)

`-j`, `--min-insert` : (*int* or *None*) Minimum insert length in bp; overrides `insert-mean` and `insert-stdev`. (default = `None`)

`-x`, `--max-insert` : (*int* or *None*) Maximum insert length in bp; overrides `insert-mean` and `insert-stdev`. (default = `None`)

`-v`, `--coverage` : (*int*) Sequencing coverage. (default = 20)

`-r`, `--read-length` : (*int*) Sequence read length. (default = 150)

`-f`, `--read-out-fmt` : (*str*) Format for the output reads, FASTA or FASTQ. (default = `fasta`)

#### PCR model options

`-c`, `--pcr-model` : (*str*) PCR amplification model from **decoratio**. Available models are log-normal (`lognormal`), log-skew-normal (`logskewnormal`), inherited efficiency (`inheff`), and inherited efficiency beta (`inheffbeta`). (default = `None`)

`-y`, `--pcr-cycles` : (*int*) Number of PCR cycles. (default = `None`)

`--pcr-deco-ratio` : (*float*) **De**pth/**co**mplexity ratio.  (default = 0.1)

`--pcr-inheff-mean`: (*float*) Mean amplification factor for inherited efficiency amplification models (`inheff` and `inheffbeta`). (default = 0.7)

`--pcr-inheff-sd` : (*float*) Standard deviation for the amplification factor in the inherited efficiency amplification models (`inheff` and `inheffbeta`). (default = 0.1)

`--pcr-lognormal-sd`: (*float*) Standard deviation for log-normal (`lognormal`) and log-skew-normal amplification models (`logskewnormal`).  (default = 0.2)

`--pcr-lognormal-skew`: Skew for log-skew-normal (`logskewnormal`) amplification model.  (default = -1)

#### make-library-seq()-specific options

`--make-pop-sim-dir` : (*str*) Directory containing a previous `radinitio.make-population` run. Cannot be the same as `out-dir`.

#### Additional options

`-V,` `--version` : Print program version.

`-h`, `--help` : Display this help message.

#### Advanced options (can be left default for the large majority of cases)

`--genome-rec-rate` : (*float*) MsprimeOptions: Average per-base per-generation recombination rate for the whole genome. (default = 3e-8)

`--genome-mut-rate` : (*float*) MsprimeOptions: Average per-base per-generation mutation rate for the whole genome. (default = 7e-8)

`--pop-mig-rate` : (*float*) MsprimeOptions: Total per-population per-generation immigration rate. (default = 0.001)

`--pop-growth-rate` : (*float*) MsprimeOptions: Population per-generation growth rate. (deault = 0.0)

`--lib-bar1-len` : (*int*) LibraryOptions: Length of the forward-read barcode in bp. (default = 5)

`--lib-bar2-len` : (*int*) LibraryOptions: Length of the reverse-read barcode in bp. (default = 0)

`--lib-5-error` : (*float*) LibraryOptions: Frequency of sequencing errors at the 5' end of the reads. (default = 0.001)

`--lib-3-error` : (*float*) LibraryOptions: Frequency of sequencing errors at the 3' end of the reads. (default = 0.01)

`--lib-min-dist` : (*int*)  LibraryOptions: Minimum distance between adjacent loci from different cutsites in bp. (default = 1000)

`--lib-base-len` : (*int*) LibraryOptions: Base length for reference locus extraction in bp. (default = 1000)

`--lib-max-prop-n` : (*float*) LibraryOptions: Maximum proportion of Ns that can be present in the reference locus sequence. (default = 0.1)

`--mut-indel-prob` : (*float*) MutationModel: Probability of indels. (default = 0.01)

`--mut-ins-del-ratio` : (*float*) MutationModel: Ratio of insertions to deletions. (default = 1.0)

`--mut-subs-model` : (*str*) MutationModel: Substitution Model, equal probability among all substitutions (`equal`), transitions are 2x more likely (`ts`), or transversions are 2x more likely (`tv`). (default = `equal`)

`--pcr-pol-error` : (*float*) PCRDups: Probability of PCR errors, based on the error rate of the Phusion HF polymerase. (default = 4.4e-7)

## Examples

### Getting started

The simplest **RADinitio** simulation requres just the specification of a reference genome fasta (`--genomne`) and an output directory (`--out-dir`). The reference sequence (`reference.fa.gz` in this example) is a fasta file containing the genomic sequences of one or more chromosomes/scaffolds. The reference sequences can be at a chromosome-level, or broken into smaller scaffold sequences. The file can be compressed (gzipped) or uncompressed.

Using the default parameters runs the whole simulation pipeline and generates all possible outputs (`--simulate-all`). This will simulate a standard island model for 2 populations, sequencing 10 individuals per population (20 total). The single-digest library will be made with the enzyme *SbfI*, using the default insert size distribution (mean 350 bp and stdev 35bp) and generating 2x150bp reads at 20x coverage.

```sh
radinitio --simulate-all \                 # Execute the complete pipeline (--simulate-all default)
    --genome ./genome/reference.fa.gz \    # Path to reference genome
    --out-dir ./simple-simulation/         # Path to output directory
```

### Controlling the input reference sequences

The `radinitio` command line options allow the user to control which sequences in the reference genome are used for the simulations, without the need to modify the input fasta file. The first example is using a chromosome list file (specified with `--chromosomes`). This file contains a list with the sequence IDs that the user wants to use for the simulation, one per line. The IDs in this file must be as they appear on the `reference.fa` (**NOTE**: without the "`>`", as this is part of the fasta header!). The file follows the following structure:

```sh
chr1
chr2
chr3
...
```

Only the sequences on the list will be used as input for the simulations. This is important when working with highly fragmented assemblies or those with small unplaced scaffolds, allowing the user to simulate only over the chromosome-scale sequences, for example. If this option is not provided, **RADinitio** will simulate over all the sequences in the input fasta.

Here we run the same default **RADinitio** simulation as before, but instead we provide a `chrom_list.txt` file containing the IDs of the target sequences of interest.

```sh
radinitio --simulate-all \                 # Execute the complete pipeline (--simulate-all default)
    --genome ./genome/reference.fa.gz \    # Path to reference genome
    --out-dir ./simple-simulation/ \       # Path to output directory
    --chromosomes ./genome/chrom_list.txt  # Path to the list of the target chromosomes
```

In addition to the use of `--chromosomes`, the input sequences can be filtered according to their length using the `--min-seq-len` option. The simulations will be performed only for sequences larger than the specified value. Here, we run the default **RADinitio** simulation, keeping only sequences larger than 1 Mbp.

```sh
radinitio --simulate-all \                 # Execute the complete pipeline (--simulate-all default)
    --genome ./genome/reference.fa.gz \    # Path to reference genome
    --out-dir ./simple-simulation/ \       # Path to output directory
    --min-seq-len 1000000                  # Only keep reference sequences larger than 1,000,000 bp
```

The `radinitio.log` file reports the number of sequences read from the reference genome fasta file, as well as those kept after selection/filtering. For example, here the reference genome contained 1,844 total sequences; however, we only selected 24 (e.g., 24 chromosomes) for downstream analysis.

```sh
Loading the genome...
    Read 1,844 sequences from the input FASTA.
    Loaded 24 chromsome/scaffold sequences.
```

### Modifying the library parameters

**RADinitio** allows the user to specify the paramters of their simulated RADseq library. By default, as done in the previous examples, the software simulates a single-digest RADseq (sdRAD) library, following the protocols published by [Baird et al. (2008)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0003376) and [Etter et al. (2011)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0018561). This library uses the *SbfI* restriction enzyme to generate a library with an insert size distribution of an mean length of 350bp (standard deviation 37bp), coverage of 20x, read length of 150bp (2x150bp paired-end), and no PCR amplification. The default simulation is equivalent to:

```sh
radinitio --simulate-all \                 # Execute the complete pipeline
    --genome ./genome/reference.fa.gz \    # Path to reference genome
    --out-dir ./simulations_sdRAD/ \       # Path to output directory
    --library-type sdRAD \                 # Simulate a single-digest library
    --enz SbfI \                           # Use the SbfI enzyme
    --insert-mean 350 \                    # Insert size mean of 350bp
    --insert-stdev 37 \                    # Insert size stdev of 37bp
    --coverage 20 \                        # Depth of coverage of 20x
    --read-length 150                      # Read length of 150bp (paired-end 2x150bp)
```

Instead, we can generate a library under completely different parameters. Here, we generate a double-digest RADseq (ddRAD) library, based on the [Peterson et al. (2012)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0037135) protocol. This library uses the *PstI* restriction enzyme as the main (or rare) cutter and *MspI* as the secondary (or common) cutter. Instead of the default insert size distribution (which uses insert size mean and standard deviation), we are specifying a fixed insert size range of 225-475 bp. We are also sequencing this library using 2x100bp paired-end reads (instead of the default 150bp) at a higher coverage of 30x. Additionally, we are retaining only sequences larger than 10 Kbp for the simulations.

```sh
radinitio --simulate-all \                 # Execute the complete pipeline
    --genome ./genome/reference.fa.gz \    # Path to reference genome
    --out-dir ./simulations_ddRAD/ \       # Path to output directory
    --min-seq-len 10000                    # Only keep reference sequences larger than 10,000 bp
    --library-type ddRAD \                 # Simulate a double-digest library
    --enz PstI \                           # Use the PstI enzyme as the main (rare) cutter
    --enz2 MspI \                          # Use the MspI enzyme as the secondary (common) cutter
    --min-insert 225 \                     # Min insert size of 225 bp
    --max-insert 475 \                     # Max insert size of 475 bp
    --coverage 30 \                        # Depth of coverage of 30x
    --read-length 100                      # Read length of 100bp (paired-end 2x100bp)
```

More advanced parameters for the library preparation options are available in the command line options (e.g., `--lib-bar1-len` to specify length of barcode sequences removed from the reads, or `--lib-5-error`/`--lib-3-error` to control sequencing errors). However, these optionns can be left default for the large majority of simulations. Advanced users can additionally control these options in the `radinitio.LibraryOptions()` class in the Python API.

### PCR duplicates model

**RADinitio** can be used to simulate an study PCR duplicates, a common artifact in empirical RADseq libraries, which have wide-spread implications for allelic dropout and genotyping accuracy. For further details on the effects of PCR duplicates in RADseq libraries, see [Rochette et al. (2023)](https://doi.org/10.1111/1755-0998.13800), [Rivera-Colón et al. (2021)](https://doi.org/10.1111/1755-0998.13163), and [Rivera-Colón & Catchen (2022)](https://doi.org/10.1007/978-1-0716-2313-8_7#DOI).

The **RADinitio** pipeline can simulate a PCR model, which generates a distribution of PCR clone sizes and errors, which are then used when sampling sequencing reads. This model introduces both PCR duplicate reads and mutations in the sequencing reads, mimicking the error patterns seen in real-life RADseq libraries. The generation of this PCR model is off by default; however, the user can define a the parameters of an amplification model from the `radinitio` wrapper command line options, or with the `radinitio.PCRDups()` class in the Python API.

When no `--pcr-model` is selection, the `radinitio.log` reports the following information:

```sh
PCR model options:
    PCR amplication model:   None (No PCR)
    PCR cycles:              None
    PCR duplicate rate:      0.00%
```

We see that no PCR model was selected (defaults to `None`). Thus. there were no PCR cycles applied, and the PCR duplicate rate is 0% (i.e., each individual sequencing reads was sampled from an unique template).

In contrast, the user can choose and apply a PCR model to the downstream data. Since version `1.2.0`, **RADinitio** uses the PCR models described by the **decoratio** software ([Rochette et al. 2023](https://doi.org/10.1111/1755-0998.13800)). These PCR models are largely comprised of two steps: 1) an amplification model (which describes the amplification probability and noise applied over some number of PCR cycles) and 2) a depth/complexity ratio which describes how the sequenced reads are sampled from the pool of amplified molecules.

The different PCR models can be specified in `radinitio` using the `--pcr-model` flag. Two main types of models are available, each with two main variants (4 models total). First, the inherited efficiency models described in [Best. et al. (2015)](https://doi.org/10.1038/srep14629). When choosing `--pcr-model inheff` the user controls the amplification probability and noise by controlling the mean (`--pcr-inheff-mean`) and standard deviation (`--pcr-inheff-sd`) of a normal distribution defining the amplification probability of the virtual molecules across a number of PCR amplification cycles (`--pcr-cycles`). A related model, `inheffbeta`, uses a beta distrubition for parametrization purposes instead of the normal distribution, and produces similar results. By default, `radinitio` uses 0.7 and 0.1 for the mean and standard deviation of these distributions, respectively.

The second class of amplification models defined the amplification probability based on a log-normal distribution. Both models (`lognormal` and `logskewnormal`) use the standard deviation of the distribution (specified with `--pcr-lognormal-sd`) as the main parameter controlling amplification probability and noise. The `logskewnormal` model used the skew of the distribution (`--pcr-lognormal-skew`) as an additional parameter. For more information of these models and their specification please check the **decoratio** manuscript ([Rochette et al. 2023](https://doi.org/10.1111/1755-0998.13800)) and documentation (<https://bitbucket.org/rochette/decoratio/src/master/>).

In addition to the amplification model, the other (and more important parameter) controlling the PCR duplicate distribution is the **de**pth/**co**mplexity **ratio**. This value describes the ratio of sequenced molecules (the depth of coverage) to the molecular complexity (the number of unique molecules). In other words, if sequencing at 30x, but the complexity is 10 molecules, the library has a depth/complexity ratio of 3:1. This means that, while it is possible to sequence 30 reads, only 10 of them can have unique information and the remaining 20 reads will be duplicates.`radinitio` users can specify this ratio using the `--pcr-deco-ratio` option. Smaller values (<1) will generally reduce the rate of PCR duplicates seen in the simulated library. See [Rochette et al. (2023)](https://doi.org/10.1111/1755-0998.13800) for further information regarding the depth/complexity ratio.

#### Simulating libraries with low/moderate PCR duplicate rate

Using the models and paramters described above, `radinitio` users can simulate a RADseq library containing a low-to-moderate number of PCR duplicates. We are using the default library contruction paramters (single-digest with *SbfI*, 20x coverage, 2x150bp reads, 350bp insert mean and 37bp insert stdev). Additionally, we are applying an inhertited efficiency amplification model (`--pcr-model inheff`). The mean amplification probability is 45% (`--pcr-inheff-mean 0.45`) with a standard deviation of 20% (`--pcr-inheff-sd 0.2`) for 12 cycles of PCR (`--pcr-cycles 12`). This means that a DNA molecule would have a 45% change of being amplified in each of the 12 PCR cycles. This distribution produces both lower and noisier amplification probabilities than the default inherited efficienty model (mean of 0.7 and stdev of 0.1). Finally, we are also setting the depth/complexity ratio to 1:10 (`--pcr-deco-ratio 0.10`).

```sh
radinitio --simulate-all \                 # Execute the complete pipeline
    --genome ./genome/reference.fa.gz \    # Path to reference genome
    --out-dir ./simulations_sdRAD/ \       # Path to output directory
    --library-type sdRAD \                 # Simulate a single-digest library
    --enz SbfI \                           # Use the SbfI enzyme
    --insert-mean 350 \                    # Insert size mean of 350bp
    --insert-stdev 37 \                    # Insert size stdev of 37bp
    --coverage 20 \                        # Depth of coverage of 20x
    --read-length 150 \                    # Read length of 150bp (paired-end 2x150bp)
    --pcr-model inheff \                   # Inherited efficiency amplification model
    --pcr-cycles 12 \                      # Amplify for 12 PCR cycles
    --pcr-inheff-mean 0.45 \               # Mean amplification of 45%
    --pcr-inheff-sd 0.2 \                  # Stdev amplification of 20%
    --pcr-deco-ratio 0.10                  # De/Co ratio of 1:10 or 0.1
```

After running this simulation, we can see the description of the PCR model on the `radinitio.log` file:

```sh
PCR model options:
    PCR amplication model:   inheff:12:0.45:0.2
    PCR cycles:              12
    depth/complexity ratio:  0.1
    PCR duplicates rate:     17.432%
```

This PCR model generated a PCR duplicate rate of 17.4%. We can see a per-clone size breakdown of the frequency of these duplicates (i.e., the proportion of clones with a given number of duplicates) in the `sequenced_clone_distrib.tsv` file. Additionally, the log reports the parameters of the model, including the number of cycles, depth/complexity ratio, and the "model string". This "model string" (`inheff:12:0.45:0.2` in this example) is the definition of the model, as specified by **decoratio**. In this example, the string describes (in order, separated by ":"): 1) the selected model (`inheff`), number of PCR cycles (`12`), mean amplification probability (`0.45`), and the amplification probability standard deviation (`0.2`). Different models used by **RADinitio** have different specifications and model string. See the **decoratio** documentation for additional details on model specifications.

#### Simulating libraries with high PCR duplicate rate

**RADinitio** users can also change the PCR amplification models to simulated RADseq libraries with elevated duplicate rates. While this is something researchers want to avoid in their empirical libraries, experimenting with high PCR duplicates in simulations may provide useful to, e.g., observe how the added noise affects the recovery and genotyping of loci and the reconstruction of biological information.

In this example, we repeat the general library construction process from above (single-digest with *SbfI*, 20x coverage, 2x150bp reads, 350bp insert mean); however, we modify the PCR paramters to 1) increase the noise of the inherited efficiency amplification model , and 2) drastically increase the depth/complexity ratio (from 0.1 to 10).

```sh
radinitio --simulate-all \                 # Execute the complete pipeline
    --genome ./genome/reference.fa.gz \    # Path to reference genome
    --out-dir ./simulations_sdRAD/ \       # Path to output directory
    --library-type sdRAD \                 # Simulate a single-digest library
    --enz SbfI \                           # Use the SbfI enzyme
    --insert-mean 350 \                    # Insert size mean of 350bp
    --insert-stdev 37 \                    # Insert size stdev of 37bp
    --coverage 20 \                        # Depth of coverage of 20x
    --read-length 150 \                    # Read length of 150bp (paired-end 2x150bp)
    --pcr-model inheff \                   # Inherited efficiency amplification model
    --pcr-cycles 12 \                      # Amplify for 12 PCR cycles
    --pcr-inheff-mean 0.25 \               # Mean amplification of 25%
    --pcr-inheff-sd 0.5 \                  # Stdev amplification of 50%
    --pcr-deco-ratio 10                    # De/Co ratio of 10:1 or 10
```

After completing the simulation, we evaluate the results provided in the `radinitio.log` file.

```sh
PCR model options:
    PCR amplication model:   inheff:12:0.25:0.5
    PCR cycles:              12
    depth/complexity ratio:  10
    PCR duplicates rate:     94.019%
```

We obtain 94% PCR duplicates! Notice how we didn't change the number of PCR cycles, after all their individual contribution to PCR duplicate is negligible (see [Rochette et al. (2023)](https://doi.org/10.1111/1755-0998.13800)). This increase in PCR duplicates is mainly the product of the depth/complexity ratio, as we are sequencing 10x more reads than there are available molecules in the library. This is further exacerbated by the increased noisiness in the PCR amplificaton. While this is an extreme case, it provides an example of the behavior of the PCR models in **RADinitio** and provides useful guidelines for the monitoring of empirical RADseq experiments when elevated PCR duplicate rates are observed.

WARNING: Very noisy PCR amplification models can take a long time to run. Similarly, models with *very* extreme values can also fail due to several reasons. In one example, if amplified clones reach very large sizes, they can trigger infinity-related errors when extracting the amplified clone size distribution.

### Changing the demographic model

The `radinitio` wrapper script simulates a population using an island model from **msprime** ([Kelleher et al. 2016](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004842/); [Baumdicker et al. 2022](https://doi.org/10.1093/genetics/iyab229)). By default, this model simulates two populations, each with an effective population size (Ne) of 5,000 individuals. Ten individuals are sampled and sequenced per population, for a total of 20 individuals in the library.

The wrapper script allow the user to modify some parameters of this model. For example, the user can change the number of populations (`--n-pops`), the effective size of the populations (`--pop-eff-size`), and the number of sampled individuals per population (`--n-seq-indv`). Other parameters, e.g., controlling the migration rate (`--pop-mig-rate`), growth rate (`--pop-growth-rate`), and mutation rate (`--genome-mut-rate`), are available in the advanced options.

As an example, we can run a new simulation (running the whole pipeline with `--simulate-all`) simulating a new island model with 4 populations, each with an effective population size (Ne) of 2,500 individuals. We sampled 25 indididuals per population, for a total of 100 sequenced individuals. We will also increase the total per-population, per-generation immigration rate between these populations to 2.5%. The library is generated with default parameters (e.g., single-digest with *SbfI*, coverage of 20x, read length of 150bp, insert distribution of 350bp, stdev 35bp, no PCR).

```sh
radinitio --simulate-all \                 # Execute the complete pipeline
    --genome ./genome/reference.fa.gz \    # Path to reference genome
    --out-dir ./sims_sdRAD_4pops/ \        # Path to output directory
    --n-pops 4 \                           # Number of populations in the island model
    --pop-eff-size 2500 \                  # Effecitve size (Ne) of the populations
    --n-seq-indv 25 \                      # Number of individuals to sequence per-population (100 total)
    --pop-mig-rate 0.025                   # Per generation immigration rate
    --library-type sdRAD \                 # Simulate a single-digest library
    --enz SbfI                             # Use the SbfI enzyme
```

The options in the wrapper are useful for more general simulations, e.g., studying the effects of genetic diversity over the final RADseq library. However, they are limited to a relatively simple island model and the parameters are applied equaly to all populations (i.e., all populations will be of the same size and migration rates will be symmetrical). More detailed demographic models are available through the [**msprime** documentation](https://tskit.dev/msprime/docs/latest/intro.html). **RADinitio** users can implement these complex simulations using the **msprime** Python API, after which they can generate *in silico* RADseq libraries from the simulated populations (see section on `--make-library-seq` below).

### Tallying RAD loci in the genome

Perhaps the common application of **RADinitio** is simulating an *in silico* digestion to estimate the number of RADseq loci in an experiment. This estimation is commonly used to estimate sequencing coverage and individuals in a library during experimental design (see section 2.3 in [Rivera-Colón & Catchen (2022)](https://doi.org/10.1007/978-1-0716-2313-8_7) for an example).

An estimation of the number of RADseq loci in a genome can be easily generated using the `--tally-rad-loci` option in the `radinitio` wrapper script. This option only identifies and filters the RADseq loci found in the genome based on the library properties, skipping the simulation of the population(s) and the generation of sequencing reads.

For example, here we tally the number of RADseq loci in the genome using the default library parameters (single-digest library with *SbfI*).

```sh
radinitio --tally-rad-loci \              # Execute just the tallying of loci
    --genome ./genome/reference.fa.gz \   # Path to reference genome
    --out-dir ./tally_sbfi_loci/ \        # Path to output directory
    --library-type sdRAD \                # Simulate a single-digest library
    --enz SbfI                            # Use the SbfI enzyme
```

The `radinitio.log` file will report the distribution of RAD loci found in the genome. In this case, based on the single-digest *SbfI* simulation we find the following information in the `radinitio.log` (for versions `1.2.0` or higher):

```sh
Extracting RAD loci...
    Found a total of 6,120 SbfI loci in the genome (use this number for coverage calculations).
    Removed 3 loci (0.0%) too close to the ends of the reference sequences.
    Removed 2 loci (0.0%) containing excess (>10.0%) of Ns in sequence.
    Removed 428 loci (7.0%) in close proximity (<1000bp) to another locus.
    Retained 5,687 loci (92.9%) for downstream analysis.

    See `./output/tally-rad-loci/ri_sdRAD_sbfI/ref_loci_vars/reference_rad_loci.stats.gz` for a detailed description of the per-cutsite position and status.
```

In this example, **RADinitio** foind 6,120 *SbfI* loci in the genome. Since this is a single-digest library, this means 3,060 instances of the *SbfI* recognition sequence (`CC|TGCAGG`), each yielding two loci in the negative and positive DNA strands. The two loci originating from a single restriction enzyme cutsite are treated as separate objects and are independently filtered. As stated by the log, these 6 thousand loci can be used to estimate sequencing coverage and number of pooled individuals in the real library.

However, not all loci are retained for downstream analysis. **RADinitio** filters certain loci based on the properties of the sequence. For example, here 3 loci are removed from the analysis as they are too close to the end of the sequence. By default, this distance is equal to the base size of each reference locus. By default, this distance is 1Kbp, and can be controlled with the `--lib-base-len` advanced option. These loci are removed since it would not be possible to extract a complete sequence from the reference. Also, loci can be removed if their reference sequence contains an excess of uncalled bases (or `N`s), by default 10% (controlled with the `--lib-max-prop-n` advanced option). These loci are removed to prevent extracting loci spanning large gaps or regions of uncertain sequences.

More commonly, loci are removed due to their proximity to one another, as happened for 428 (or 7% of loci here). A locus is removed when it is under 1,000 bp (by default, controlled by `--lib-min-dist`) away from a another locus originating from a different restriction enzyme cutsite. This filtering process serves two purposes: first, it mimics the real-life process of shearing efficiency during library preparation, where shearing efficiency decreases for smaller DNA molecules, reducing the chance of recovering a molecule of the right size. Secondly, it removes loci originating from tandemly duplicated and/or repetitive cutsites. Many of these loci would be overlapping from one another on a real RADseq library, and their enzymatic digestion and shearing would be impacted by their close proximity.

Lastly, the `reference_rad_loci.stats.gz` file (full path provided in the `radinitio.log`) provides a locus-by-locus description of the status of each loci. The file reports the ID for the sequence in which the locus is located (`#chrom_id`), the basepair coordinate in which the main cutsite was found (`cut_pos`), the direction of the tag (`tag_dir`) in the DNA strand, and the `status` of each locus. An individual locus can be kept for downstream analysis (`kept`) or removed for one of the reasons described above, e.g., too close the end of the sequence (`rm_chrom_end`) or too close to an ajacent locus (`rm_too_close`).

```sh
#chrom_id  cut_pos  tag_dir  status
chr1       11811    neg      kept
chr1       11811    pos      kept
chr2       329      neg      rm_chrom_end
chr2       329      pos      kept
chr3       24972    neg      kept
chr3       24972    pos      kept
chr4       3940     neg      kept
chr4       3940     pos      kept
chr4       15514    neg      kept
chr4       15514    pos      kept
chr5       5137     neg      kept
chr5       5137     pos      kept
chr5       22829    neg      kept
chr5       22829    pos      rm_too_close
chr5       22954    neg      rm_too_close
chr5       22954    pos      kept
chr6       14834    neg      rm_excess_Ns
chr6       14834    pos      rm_excess_Ns
chr6       27346    neg      kept
chr6       27346    pos      kept
...
```

We can alter the properties of the simulated library to look at how the number of estimated loci changes. In this example, we will simulate a new single-digest library changing the restriction enzyme from the default *SbfI* (whoch was an 8-bp recognition sequence) to *EcoRI* (which has the 6-bp recognition sequence `G|AATTC`).

```sh
radinitio --tally-rad-loci \              # Execute just the tallying of loci
    --genome ./genome/reference.fa.gz \   # Path to reference genome
    --out-dir ./tally_ecoRI_loci/ \       # Path to output directory
    --library-type sdRAD \                # Simulate a single-digest library
    --enz EcoRI                           # Use the EcoRI enzyme
```

From this simulation we obtain:

```sh
Extracting RAD loci...
    Found a total of 25,836 EcoRI loci in the genome (use this number for coverage calculations).
    Removed 14 loci (0.0%) too close to the ends of the reference sequences.
    Removed 26 loci (0.1%) containing excess (>10.0%) of Ns in sequence.
    Removed 4,648 loci (18.0%) in close proximity (<1000bp) to another locus.
    Retained 21,148 loci (81.8%) for downstream analysis.
```

As expected given its smaller restriction enzyme recognition sequence, we find far more *EcoRI* loci in the genome--25.8 thousand compared to the 6 thousand found with *SbfI*. When preparing an empirical library the user would have to take into account how this order of magnitude difference in the number of loci would impact sequencing coverage and the number of samples in a library. Most importantly, we see how the increase in the number of loci impact their filtering, as a higher proportion of loci are removed due to close proximity.

#### Tallying loci in ddRAD libraries

The identification of loci in single-digest loci in RADseq libraries depends only on the distribution of the restriction enzyme recognition sequence in the genome. In ddRAD libraries, the retained loci are impacted by the enzyme combination in addition to the insert size range (see Figure 4 in [Rivera-Colón et al. 2021](https://doi.org/10.1111/1755-0998.13163)). **RADinitio** allows users to simulate both the enzyme selection and insert size range when tallying loci from ddRAD libraries.

Here, we tally the number of loci in a ddRAD library generated using the *EcoRI* enzyme as the main cutter and *MseI* as the secondary (common) cutter. We are using the default insert size distribution (mean of 350bp and stdev of 37bp). This distribution corresponds to a mimumum insert size of 276bp and a max insert size of 424bp (i.e., can also be specified with the `--min-insert`/`--max-insert` options).

```sh
radinitio --tally-rad-loci \             # Execute just the tallying of loci
    --genome ./genome/reference.fa.gz \  # Path to the reference genome
    --out-dir ./count_ddrad_loci/ \      # Path to output directory
    --library-type ddRAD \               # Simulate double-digest library
    --enz EcoRI \                        # Use the EcoRI enzyme as the main (rare) cutter
    --enz2 MseI \                        # Use the MseI enzyme as the secondary (common) cutter
    --min-insert 276 \                   # Minimum insert size of 276 bp
    --min-insert 424                     # Maximum insert size of 424 bp
```

After running this simulation we find:

```sh
Extracting RAD loci...
    Found a total of 16,064 EcoRI-MseI loci (insert size range: 276-424bp) in the genome (use this number for coverage calculations).
    Note: an additional 9,550 loci were found outside the target insert size range.
    Removed 4 loci (0.0%) too close to the ends of the reference sequences.
    Removed 11 loci (0.1%) containing excess (>10.0%) of Ns in sequence.
    Removed 2,898 loci (18.0%) in close proximity (<1000bp) to another locus.
    Retained 13,151 loci (81.9%) for downstream analysis.
```

As for the single-digest examples above, for ddRAD libraries the `radinitio.log` file reports a total number of loci found in the genome (16 thousand in this example) which can be used for coverage estimations. Similarly, the log reports the filter of loci due to proximity with other loci and/or to the edge of sequences. A key difference between the single- and double-digest simulations is that **RADinitio** reports the presence of additional ddRAD loci present outside the size range. The line starting with `Note` reports that, in addition to our 16 thousand loci, an additional 9,550 are found outside the 276-424bp size range. These loci are reported with the status of `rm_no_dd_cuts` in the `reference_rad_loci.stats.gz` file. By changing the size range, the user can determine the number of additional loci retained under different library conditions.

Here, we simulate the same library as before (ddRAD with *EcoRI* and *MseI*); however, we increase the size selection range by about 100bp (50bp in each direction), from 225bp to 475bp.

```sh
radinitio --tally-rad-loci \             # Execute just the tallying of loci
    --genome ./genome/reference.fa.gz \  # Path to the reference genome
    --out-dir ./count_ddrad_loci/ \      # Path to output directory
    --library-type ddRAD \               # Simulate double-digest library
    --enz EcoRI \                        # Use the EcoRI enzyme as the main (rare) cutter
    --enz2 MseI \                        # Use the MseI enzyme as the secondary (common) cutter
    --min-insert 225 \                   # Minimum insert size of 225 bp
    --min-insert 475                     # Maximum insert size of 475 bp
```

After simulating a ddRAD library with this increased insert size range we find:

```sh
Extracting RAD loci...
    Found a total of 20,465 EcoRI-MseI loci (insert size range: 225-475bp) in the genome (use this number for coverage calculations).
    Note: an additional 5,149 loci were found outside the target insert size range.
    Removed 4 loci (0.0%) too close to the ends of the reference sequences.
    Removed 11 loci (0.1%) containing excess (>10.0%) of Ns in sequence.
    Removed 3,667 loci (17.9%) in close proximity (<1000bp) to another locus.
    Retained 16,783 loci (82.0%) for downstream analysis.
```

The number of found loci increased from 16 thousand to over 20 thousand. The 100bp-increase in the insert size range lead to about 4 thousand more loci being included in the library. We see that over 5 thousand more loci are still found outside the size range, and thus this parameter could be further optimized to recover more loci. Note, however, that many of these loci might not be recoverage experimentally, e.g., too small to be feasibly sequenceable (i.e., if they are smaller than the desired read length) or larger than the capicity of the PCR and/or sequencer. Thus, the objective should not be to recover all the available loci in the genome.

### Simulating a population and generating genomic variants

The `--make-population` option from the `radinitio` wrapper allows the user to run the initial stage of the **RADinitio** pipeline, simulating a population using the **msprime** island model and generating a genome-wide VCF with the simulated genetic variants. This mode is library agnostic, and does not extract RADseq loci nor generates sequencing reads. The purpose of this mode is to generate a fixed set of genetic variants that can then be processed under a variety of library configurations (see `--make-library-seq` example below).

In this example, we will run **RADinitio** under the `--make-population` mode. We will simulate an island model of 6 populations or demes, each with an effective population size of 10,000. We will sequence 25 individuals per population, or 150 individuals in total. Additionally, we will filter the input genome to only keep sequences larger than 1 Mbp.

```sh
radinitio --make-population \            # Simulate the population only
    --genome ./genome/reference.fa.gz \  # Path to the reference genome
    --min-seq-len 1000000 \              # Only keep reference sequences larger than 1,000,000 bp
    --out-dir ./mpop_p6_ne10k_n25/ \     # Path to output directory
    --n-pops 6 \                         # Simulate 6 populations
    --pop-eff-size 10000 \               # Each population has an Ne of 10,000
    --n-seq-indv 25                      # Sample 25 individuals per population
```

This mode will generate three main outputs (see description of outputs below):

1. An `msprime_vcfs/` directory with the raw simulated variants for each individual sequence
2. A genome-wide VCF containing the combined sequences from all simulated chromosomes (`ref_loci_vars/ri_master.vcf.gz`)
3. A population map file (`popmap.tsv`) containing the ID of each individual and assignment to a given population.

### Simulate a library from an existing population and variants

Using the `--make-library-seq` option from the `radinitio` wrapper, the user can generate an *in silico* RADseq library using an existing population simulation. Most commonly, this existing population will be the product of an `radinitio --make-populations` run; however, `--make-library-seq` can be generated from a custom **msprime** run made by the user.

To run `radinitio` in the `--make-library-seq` pipeline stage, the user most provide a directory containing an existing population simulation (`--make-pop-sim-dir`). This directory must not be the same output directory for the run, and most contain a genome-wide master VCF as well as a population map, as described above for `--make-population`.

Here, we take our previous simulation of 6 populations and 150 individuals and simulate a single-digest RADseq library. The library will have a mean insert size of 450 bp (stdev of 50bp), will be PCR amplified using a default inherited efficiency model for 9 cycles, and sequenced to 30x of coverage. We will additionally simulate over a specific set of chromosomes in the reference sequence.

```sh
radinitio --make-library-seq \                 # Simulate and sequence a library
    --genome ./genome/reference.fa.gz \        # Path to reference genome
    --chromosomes ./genome/chrom.list \        # Path to the list of the target chromosomes
    --out-dir ./SbfI_library/ \                # Path to the output directory
    --make-pop-sim-dir ./mpop_p6_ne10k_n25/ \  # Path to the previous `make-populations` run
    --library-type sdRAD \                     # Simulate single-digest library
    --enz SbfI \                               # Use the SbfI enzyme
    --insert-mean 450 \                        # Insert size mean of 450bp
    --insert-stdev 50 \                        # Insert size std deviation of 50bp
    --pcr-model inheff \                       # Inherited efficiency amplification model
    --pcr-cycles 9 \                           # Amplify for 9 PCR cycles
    --coverage 30 \                            # Sequence to 30x coverage
    --read-length 100 \                        # Read length of 100bp (2x100bp paired-end)
    --read-out-fmt fastq                       # Export reads in FASTQ format
```

Additionally, we will simulate a double-digest library over that same existing population. The ddRAD library will be generated with the *PstI* and *MspI* restriction enzymes, and size selected for a 250-450bp insert size range. The library will be PCR amplified using an inherited efficiency model for 9 cycles, and sequenced to 30x of coverage using 2x100bp reads. We will additionally simulate over a specific set of chromosomes in the reference sequence.

```sh
radinitio --make-library-seq \                 # Simulate and sequence a library
    --genome ./genome/reference.fa.gz \        # Path to reference genome
    --chromosomes ./genome/chrom.list \        # Path to the list of the target chromosomes
    --out-dir ./PstI-MspI_library/ \           # Path to the output directory
    --make-pop-sim-dir ./mpop_p6_ne10k_n25/ \  # Path to the previous `make-populations` run
    --library-type ddRAD \                     # Simulate double-digest library
    --enz PstI \                               # Use the PstI enzyme as the main (rare) cutter
    --enz2 MspI \                              # Use the MspI enzyme as the secondary (common) cutter
    --min-insert 250 \                         # Minimum insert size of 250 bp
    --min-insert 450                           # Maximum insert size of 450 bp
    --pcr-model inheff \                       # Inherited efficiency amplification model
    --pcr-cycles 9 \                           # Amplify for 9 PCR cycles
    --coverage 30 \                            # Sequence to 30x coverage
    --read-length 100 \                        # Read length of 100bp (2x100bp paired-end)
    --read-out-fmt fastq                       # Export reads in FASTQ format
```

### Additional examples

#### How to generate chromosome list

The simplest way to generate a chromosome list file is to extact the sequence IDs form the desired fasta. The sequences must not contain the "`>`" character as this is part of the fasta header convention and are not part of the ID. This can be done by pipping a few commands in Bash:

1. Open the gzipped file into the stream (`zcat`)
2. Select the lines starting with `>` (`grep`)
3. Remove any comments present after whitespace (`cut`)
4. Remove the `>` (`tr`)
5. Save to new file

```sh
# Create a chrom_list.tsv from all the sequences in the gzipped reference
zcat reference.fa.gz | grep '^>' | cut -f1 -d ' ' | tr -d '>' > chrom_list.tsv
```

The user can apply additional filters to remove additional sequences.

An additional way of generating a chromosome list file is by parsing the genomic index (`*.fai`) generated by `samtools faidx`. See [**SAMtools** documentation](http://www.htslib.org/doc/samtools-faidx.html) for additional options. From the gzipped fasta...

```sh
# Uncompress genome
gunzip reference.fa.gz
# Run samtools faidx
samtools faidx reference.fa
# Re-compress the genome
gzip reference.fa
```

The generated index follows the following structure (see [docs](http://www.htslib.org/doc/faidx.html)):

```sh
scaf1  37000     17         60  61
scaf2  60500     37651      60  61
chr1   37057500  99163      60  61
chr2   31613927  37774292   60  61
chr3   29895000  69915122   60  61
chr4   29166904  100308375  60  61
chr5   29222005  129961399  60  61
...
```

The first and second columns correspond to the sequence IDs (notice they don't have the "`>`") and lengths, respectively. We can generate a chromosome list by, for example, filtering the sequences by length. Here we filter the `fai` file to generate a chromosome list containing only sequences larger than 1Mbp using the program `awk`:

```sh
cat reference.fa.fai | awk '$2 > 1000000 {print $1}' > chrom_list.1mbp.tsv
```

## Output directories

The outputs of a default `radinitio --simulate-all` run are:

```sh
output_directory:
popmap.tsv  radinitio.log  sequenced_clone_distrib.tsv

output_directory/msprime_vcfs:
chr1.vcf.gz   chr7.vcf.gz  chr13.vcf.gz  chr19.vcf.gz
chr2.vcf.gz   chr8.vcf.gz  chr14.vcf.gz  chr20.vcf.gz
chr3.vcf.gz   chr9.vcf.gz  chr15.vcf.gz  chr21.vcf.gz
chr4.vcf.gz  chr10.vcf.gz  chr16.vcf.gz  chr22.vcf.gz
chr5.vcf.gz  chr11.vcf.gz  chr17.vcf.gz  chr23.vcf.gz
chr6.vcf.gz  chr12.vcf.gz  chr18.vcf.gz  chr24.vcf.gz

output_directory/rad_alleles:
msp_00.alleles.fa.gz    msp_10.alleles.fa.gz
msp_01.alleles.fa.gz    msp_11.alleles.fa.gz
msp_02.alleles.fa.gz    msp_12.alleles.fa.gz
msp_03.alleles.fa.gz    msp_13.alleles.fa.gz
msp_04.alleles.fa.gz    msp_14.alleles.fa.gz
msp_05.alleles.fa.gz    msp_15.alleles.fa.gz
msp_06.alleles.fa.gz    msp_16.alleles.fa.gz
msp_07.alleles.fa.gz    msp_17.alleles.fa.gz
msp_08.alleles.fa.gz    msp_18.alleles.fa.gz
msp_09.alleles.fa.gz    msp_19.alleles.fa.gz
dropped_alleles.tsv.gz

output_directory/rad_reads:
msp_00.1.fa.gz  msp_07.1.fa.gz  msp_14.1.fa.gz
msp_00.2.fa.gz  msp_07.2.fa.gz  msp_14.2.fa.gz
msp_01.1.fa.gz  msp_08.1.fa.gz  msp_15.1.fa.gz
msp_01.2.fa.gz  msp_08.2.fa.gz  msp_15.2.fa.gz
msp_02.1.fa.gz  msp_09.1.fa.gz  msp_16.1.fa.gz
msp_02.2.fa.gz  msp_09.2.fa.gz  msp_16.2.fa.gz
msp_03.1.fa.gz  msp_10.1.fa.gz  msp_17.1.fa.gz
msp_03.2.fa.gz  msp_10.2.fa.gz  msp_17.2.fa.gz
msp_04.1.fa.gz  msp_11.1.fa.gz  msp_18.1.fa.gz
msp_04.2.fa.gz  msp_11.2.fa.gz  msp_18.2.fa.gz
msp_05.1.fa.gz  msp_12.1.fa.gz  msp_19.1.fa.gz
msp_05.2.fa.gz  msp_12.2.fa.gz  msp_19.2.fa.gz
msp_06.1.fa.gz  msp_13.1.fa.gz
msp_06.2.fa.gz  msp_13.2.fa.gz

output_directory/ref_loci_vars:
reference_rad_loci_SbfI.fa.gz     ri_master.vcf.gz
reference_rad_loci_SbfI.stats.gz
```

### Top level directory

#### Population Map

A population map file (`popmap.tsv`) containing the ID of each individual and assignment to a given population. The popmap file is a tab-delimited file, with sample IDs on the first column, and population IDs on the second column. When running the `radinitio` wrapper, the sample IDs are labelled according to the **msprime** simulation have the `msp_` prefix (or `tsk_`, in **msprime** 1.0.0+ versions).

```sh
msp_00<tab>pop0
msp_01     pop0
msp_02     pop0
msp_03     pop1
msp_04     pop1
msp_05     pop1
...
```

#### PCR clones distribution

The `sequenced_clone_distrib.tsv` contains information regarding the sequenced clone distribution generated during the most recent run. This distribution is the product of **decoratio**'s PCR model and is representative of the PCR duplicates observed in the simulated reads, if any. By default, no PCR cycles are enabled, thus the model returns:

```sh
clone_size  n_errors  clone_prob
0           0         0
1           0         1.0
1           1         0
```

In this distribution, 100% of reads originate from a sequenced clone of size 1 (a clone containing a single molecule) without any errors. Since there was no PCR amplification, the original template molecule is sampled and "sequenced". However, if PCR cycles are defined by the user using the PCR models more complex clones can be sampled, resulting in PCR duplicates.

```sh
clone_size  n_errors  clone_prob
0           0         0
1           0         0.722149
1           1         8.58233e-05
2           0         0.211999
2           1         3.77304e-05
2           2         1.25639e-05
3           0         0.0521028
3           1         1.10092e-05
3           2         5.04165e-06
3           3         1.72311e-06
...
```

Here, the `clone_size` describes the number of molecules in a clone. A clone size of *n* means that *n* molecules where sampled from a single clone. *n* copies of the same sequence will be saved as reads, with *n - 1* being PCR duplicates. Moreover, clones can also contain molecules with PCR error (`n_errors`). A clone of size *n* can contain up to *n* molecules with error. `clone_prob` is the probability of sampling a clone with given size and error properties, and describes the distribution of PCR duplicates and errors in the simulated data.

#### Log file

The `radinitio.log` file is generated on the top level outoput directory. It contains information on the **RADinitio** version, a time stamp on the current run, as well as information on the different stages of the program.

### Msprime VCFs

The `msprime_vcfs/` directory containing the genetic variants simulated by **msprime** (derived from the [`tskit.TreeSequence.write_vcf()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.write_vcf) function) for each individual chromosome.

### Reference loci/vars directory

The `ref_loci_vars` contains the reference information for the simulated RAD loci as well as the genome-wide variants generated from the population simulation.

#### Genomic VCF

A genome-wide VCF containing the combined sequences from all simulated chromosomes (`ri_master.vcf.gz`). Here, the variants simulated by **msprime**, which may encode unspeficied alleles, are projected against the sequence of the reference genome to determine the identify of the reference alleles. To determine alterative alleles, **RADinitio** implements a `MutationModel()`, which can mutate the reference alleles based on a substitution model (specified with `--mut-subs-model`) and introduce indels at a specified rate (`--mut-indel-prob`).

#### Reference RAD loci

The reference loci file reports the sequence of all the reference loci present in the genome, i.e., the genomic sequence corresponding to the 1 Kbp spanning the extracted loci. These sequenced are exported as a gzipped fasta file (`reference_rad_loci.fa.gz`).

```sh
>t0000n ref_pos=chr1:10818-11817
TGCAGGACTCTGTACAGCAGTTCACACACTTATACCTCCTGAGACGTGTTGTACATTTGTC...
>t0001p ref_pos=chr1:11814-12813
TGCAGGATCAGACCATTGTGTTGTTATTGAGAGCTAGCGTTAGCTCCACCTCAAACCGGTA...
>t0002p ref_pos=chr2:332-1331
TGCAGGGAGGGTGGGAGAACTTTCTCCGTTAACAAAAAATGAGGCTCCGTCTCTGTGATTC...
>t0003n ref_pos=chr3:23979-24978
TGCAGGTCAGTGAACTCACCCCCACTGCAGGCGTCCTCTGGACTGAACCAACAGAAGCTGG...
>t0004n ref_pos=chr3:24974-25973
TGCAGGCGATCCAGTCCACAGACCGGCTACCTGGATCATGCCGACGTCCGACAGTGTCTCC...
```

The sequence IDs of each locus are composed of `t` for "true locus", number representing the ordering of the cutsite in the genome (e.g., `0000` for the first cutsute on position 11,811 of chr1), and either `n` or `p` to denote loci in the negative and positive strand, respectively. The fasta header also provides the coordinates of the locus sequence in the reference. Note that locus from the negative strand have been reverser complimented, with the restriction enzyme overhang present at the 5' of the sequence.

#### Locus Status

The `reference_rad_loci.stats.gz` file provides a locus-by-locus description of the status of each loci. The file reports the ID for the sequence in which the locus is located (`#chrom_id`), the basepair coordinate in which the main cutsite was found (`cut_pos`), the direction of the tag (`tag_dir`) in the DNA strand, and the `status` of each locus. An individual locus can be kept for downstream analysis (`kept`) or removed for one of the reasons described above, e.g., too close the end of the sequence (`rm_chrom_end`) or too close to an ajacent locus (`rm_too_close`). When simulating ddRAD libraries, the file will also report loci with additional secondary cutsites outside the insert size range (`rm_no_dd_cuts`).

```sh
#chrom_id  cut_pos  tag_dir  status
chr1       11811    neg      kept
chr1       11811    pos      kept
chr2       329      neg      rm_chrom_end
chr2       329      pos      kept
chr3       24972    neg      kept
chr3       24972    pos      kept
chr4       3940     neg      kept
chr4       3940     pos      kept
chr4       15514    neg      kept
chr4       15514    pos      kept
chr5       5137     neg      kept
chr5       5137     pos      kept
chr5       22829    neg      kept
chr5       22829    pos      rm_too_close
chr5       22954    neg      rm_too_close
chr5       22954    pos      kept
chr6       14834    neg      rm_excess_Ns
chr6       14834    pos      rm_excess_Ns
chr6       27346    neg      kept
chr6       27346    pos      kept
...
```

### RAD alleles

The `rad_alleles/` directoru contains per-individual gzipped fasta files contining the RAD alleles for each simulated sample. For each allele, the fasta headers contains the following information:

```sh
>reference_locus_id:sample_id:allele_i
```

Here, `reference_locus_id` referes to the original reference locus the allele was generated. `sample_id` is the **msprime** sample id, and `allele_i` refers to the allele number (`a1` or `a2` in diploid individuals).

```sh
>t0n:msp_0:a1
TGCAGGCACGCAAGCGGCTCTAGGGATTCCTTGAGAGGAAAAATGCAACGCTGGTTTAACG...
>t0n:msp_0:a2
TGCAGGCACGCAAGCGGCTCTAGGGATTCCTTGAGAGGAAAAATGCAACGCTGGTTTAACG...
>t1p:msp_0:a1
TGCAGGGGCCCCCGCACCACACGTTGGGAACCCCTGATTGACCAGAAACATATTAAAGTTG...
>t1p:msp_0:a2
TGCAGGGGCCCCCGCACCACACGTTGGGAACCCCTGATTGACCACAAACATATTAAAGTTG...
```

### RAD reads

The `rad_reads/` contains per-individual gzipped paired-end fasta (or fastq) files contining the RAD reads for each simulated sample. For each individual read, the fasta headers contains the following information:

```sh
reference_locus_id:sample_id:allele_i:clone_id:duplicate_i/read_pair
```

The basename of the read, which contains `reference_locus_id:sample_id:allele_i`, comes from the source locus of the reads. `clone_id` referes to the source template id, while `duplicate_i` is the duplicate number for each read in the clone. `read_pair` is the standard designation for paired-end reads, `/1` for forward read and `/2` for the corresponding paired read.

Notice in the fasta example below that for reads 3 and 4, they both have a `clone_id` of 3. Read 3 has a `clone_i` of 1, meaning is the first read in the clone. Read 4 has a `clone_id` of 2, meaning it is the second read in the clone and thus a duplicate product of PCR.

```sh
>t5n:msp_0:a2:1:1/1
TGCAGGCTTAGGCTACACCCAACGAGCCACTGGGTGCAGTTGGACCCGAGGGATCTGTAT...
>t4n:msp_0:a1:2:1/1
TGCAGGTCTCCTGCTACCACCATTCCACCTGTGCAGCTCATCCAGAATGCAGCAGCTCCA...
>t2n:msp_0:a2:3:1/1
TGCAGGAAGAGACGAACCCATCGCCAGTACCACCCCCTGTACATCTGACATCACTCTTAC...
>t2n:msp_0:a2:3:2/1
TGCAGGAAGAGACGAACCCATCGCCAGTACCACCCCCTGTACATCTGACATCACTCTTAC...
>t5p:msp_0:a1:4:1/1
TGCAGGGCTTCCACTCCATACTTGTAAAATTAGGGGAAAACACTTTTTTTAACCTCCCAG...
```

The contents of `rad_reads/` contain the main output of RADinitio. These reads behave as empirical paired-end reads and are fully compatible with the majority of bioinformatic software. Reads can be exported in both fasta (default) and fastq formats, as specified by the `--read-out-fmt` option in the `radinitio` wrapper. Both output formats report the sample allele information, as well as the indentity of PCR clones.

## Questions?

If you have any questions or are encountering any problems when running **RADinitio**, please drop your questions in the [Stacks-users mailing list](https://groups.google.com/g/stacks-users).

## Changelog

### RADinitio `1.2.0` - 2023 Jun 23

* Feature: Updated to `decoratio` PCR amplification models
* Feature: Updated reporting of RAD loci tally
* Feature: Several advanced options available in `radinitio` wrapper
* Bug Fix: Program now prints warning and stops when no sequences are loaded from genome.
* Other: General re-working and refactoring of main functions.
* Notes: Long overdue update! Finally had time to work on some very needed changes.

### RADinitio `1.1.1` - 2019 Sep 12

* Bug Fix: Fixed compatibility with updated version of `msprime` and `tskit`

### RADinitio `1.1.0` - 2019 Aug 06

* Feature: Added pipeline substage functions (`make-population`, `make-library-seq`, `tally-rad-loci`)
* Feature: Added log of dropped alleles when inserting variants.
* Bug Fix: Fixed report of insert sizes in various logs
* Bug Fix: Fixed error in `NlaIII` enzyme
* Other: **msprime** model generation now added as a function in `__init__`
* Other: Changed default value for template/read ratio in PCR model
* Other: Added `--version` flag in main wrapper. Also reported in other intermediary files.
* Other: General formatting updates to logs.

### RADinitio `1.0.3` - 2019 May 16

* Fixed bug when reporting number of loci in log
* Changed the requierements of the output directory - it now MUST exist

### RADinitio `1.0.2` - 2019 May 13

* Updated contents of package distribution build.

### RADinitio `1.0.1` - 2019 May 13

* FASTA parser now handles empty lines and comments.
* Program reports when non-canonical characters are present in the genome.

### RADinitio `1.0.0` - 2019 May 08

* First public release of RADinitio.

## Authors

**Angel G. Rivera-Colón**  
Department of Evolution, Ecology, and Behavior  
University of Illinois at Urbana-Champaign  
<angelgr2@illinois.edu> / <arcolon14@gmail.com>

**Nicolas Rochette**  
Department of Evolution, Ecology, and Behavior  
University of Illinois at Urbana-Champaign  
<nic.rochette@gmail.com>

**Julian Catchen**  
Department of Evolution, Ecology, and Behavior  
University of Illinois at Urbana-Champaign  
<jcatchen@illinois.edu>
