Metadata-Version: 2.4
Name: symclatron
Version: 0.9.1
Summary: symclatron: symbiont classifier
Keywords: bioinformatics,machine-learning,symbiosis,microbiology,classification,genomics
Author-email: "Juan C. Villada" <jvillada@lbl.gov>
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: POSIX :: Linux
License-File: LICENSE
Requires-Dist: typer>=0.12.3
Requires-Dist: pandas>=2.2.2
Requires-Dist: numpy>=1.26.4
Requires-Dist: xgboost>=2.1.1
Requires-Dist: shap>=0.45.1
Requires-Dist: scikit-learn>=1.5.1
Requires-Dist: tensorflow-cpu>=2.18.0
Requires-Dist: joblib>=1.4.2
Requires-Dist: pyhmmer>=0.11.1
Requires-Dist: pyrodigal>=3.6.0
Requires-Dist: psutil>=5.9.0
Project-URL: Bug Tracker, https://github.com/NeLLi-team/symclatron/issues
Project-URL: Documentation, https://github.com/NeLLi-team/symclatron#readme
Project-URL: Homepage, https://github.com/NeLLi-team/symclatron
Project-URL: Repository, https://github.com/NeLLi-team/symclatron

# symclatron: symbiont classifier

**ML-based classification of microbial symbiotic lifestyles**

symclatron is a tool that classifies microbial genomes (protein FASTA, or nucleotide FASTA with automatic protein prediction) into three symbiotic lifestyle categories:

- **Free-living**
- **Symbiont; Host-associated**
- **Symbiont; Obligate-intracellular**

## Installation and quick start

Recommended install paths are Pixi (recommended) or Mamba/Conda.

### Option 1: Pixi (recommended)

Install `pixi`:

```sh
curl -fsSL https://pixi.sh/install.sh | sh
```

More information about `pixi` can be found in the [pixi documentation](https://pixi.sh/).

Install, setup, and test:

```sh
pixi global install -c conda-forge -c bioconda -c https://repo.prefix.dev/astrogenomics symclatron
symclatron setup
symclatron test  # also tests contig (.fna) inputs when available
```

### Option 2: Mamba or Conda

```sh
mamba create -n symclatron-0.9.1 -c conda-forge -c bioconda -c https://prefix.dev/astrogenomics symclatron
mamba run -n symclatron-0.9.1 symclatron setup
mamba run -n symclatron-0.9.1 symclatron test  # also tests contig (.fna) inputs when available
```

## Setup data (required)

Before using `symclatron` for the first time, you need to download the required database files. This only needs to be done once.

```bash
symclatron setup
```

By default, `symclatron setup` downloads the database bundle from:

- `https://github.com/NeLLi-team/symclatron/releases/download/v<version>/symclatron_db.tar.gz` (preferred)
- `https://github.com/NeLLi-team/symclatron/releases/latest/download/symclatron_db.tar.gz` (fallback)
- `https://portal.nersc.gov/cfs/nelli/symclatron_db.tar.gz` (fallback)

Override the URL if needed:

```bash
symclatron setup --data-url https://example.org/symclatron_db.tar.gz
# or
export SYMCLATRON_DATA_URL=https://example.org/symclatron_db.tar.gz
symclatron setup
```

## Input file requirements

- **Input**: a directory with one genome per file
- **Supported FASTA types**:
  - **Proteins (recommended)**: `.faa` (also accepts common protein FASTA suffixes, optionally gzipped)
  - **Nucleotide contigs/assemblies**: `.fa`, `.fna`, `.fasta` (proteins predicted with `pyrodigal`)
  - **Nucleotide genes/CDS**: `.ffn`, `.fnn` (translated in-frame)
- **Quality**: Complete or near-complete genomes recommended, but good performance for MQ MAGs are expected

symclatron auto-detects whether each input file contains proteins, genes, or contigs and converts nucleotide inputs to proteins before running the standard workflow.
If your nucleotide file extensions are ambiguous, you can override detection with `--input-kind contigs` or `--input-kind genes`.

### Classify your genomes

```bash
# Protein FASTA input
symclatron classify --genome-dir /path/to/genomes/ --output-dir results/

# Nucleotide contigs/assemblies input (auto protein prediction)
symclatron classify --genome-dir /path/to/contigs/ --output-dir results/

# Ambiguous nucleotide files: force contig mode and only use .fna files
symclatron classify --genome-dir /path/to/inputs/ --input-kind contigs --input-ext .fna --output-dir results/
```

### Getting help

```bash
symclatron --help

# Command-specific help
symclatron classify --help
symclatron setup --help

# Show version and information
symclatron --version
```

### Classification command

The main classification command with all options:

```bash
symclatron classify [OPTIONS]
```

**Options:**

- `--genome-dir, -i`: Directory (or FASTA file) containing genome inputs (.faa/.fa/.fna/.fasta/.ffn/.fnn) [default: input_genomes]
- `--input-kind`: Force input kind: `auto`, `proteins`, `genes`, `contigs` [default: auto]
- `--input-ext`: Only include files with these extensions (repeatable), e.g. `--input-ext .fna` (also matches `.fna.gz`)
- `--output-dir, -o`: Output directory for results [default: output_symclatron]
- `--keep-tmp`: Keep temporary files for debugging
- `--threads, -t`: Number of threads for HMMER searches [default: 2]
- `--quiet, -q`: Suppress progress messages
- `--verbose`: Show detailed progress information

**Examples:**

```bash
# Basic usage
symclatron classify --genome-dir genomes/ --output-dir results/

# With more threads and keeping temporary files
symclatron classify -i genomes/ -o results/ --threads 8 --keep-tmp

# Quiet mode
symclatron classify --genome-dir genomes/ --quiet

# Verbose mode with detailed progress
symclatron classify --genome-dir genomes/ --verbose
```

## Results

The classification results are saved in the specified output directory:

### Main output files

1. **`symclatron_results.tsv`** - Main classification results with columns:
   - `taxon_oid` - Genome identifier
   - `completeness_UNI56` - Completeness metric based on universal marker genes
   - `confidence` - Overall confidence score for the classification
   - `classification` - Final classification label:
     - `Free-living`
     - `Symbiont;Host-associated`
     - `Symbiont;Obligate-intracellular`

2. **`classification_summary.txt`** - Summary report with statistics

3. **Log files** - Detailed execution logs with timestamps

### Debug files

When using `--keep-tmp`, intermediate files are preserved in `tmp/` directory for analysis.

## Performance

symclatron is designed for efficiency:

- **>2 minutes per genome** on consumer-level laptops
- **Most recent benchmark**: 306 genomes in ~162 minutes (1.9 min/genome)
- **Memory efficient** - suitable for standard workstations

## Container usage

### Apptainer/Singularity

Pull the latest container:

```bash
apptainer pull docker://docker.io/astrogenomics/symclatron:latest
```

## Citation

If you use symclatron in your research, please cite:

A genomic catalog of Earth’s bacterial and archaeal symbionts.
Juan C. Villada, Yumary M. Vasquez, Gitta Szabo, Ewan Whittaker-Walker, Miguel F. Romero, Sarina Qin, Neha Varghese, Emiley A. Eloe-Fadrosh, Nikos C. Kyrpides, SymGs data consortium, Axel Visel, Tanja Woyke, Frederik Schulz
bioRxiv 2025.05.29.656868; doi: https://doi.org/10.1101/2025.05.29.656868

## Support

- **Repository**: [https://github.com/NeLLi-team/symclatron](https://github.com/NeLLi-team/symclatron)
- **Issues**: [https://github.com/NeLLi-team/symclatron/issues](https://github.com/NeLLi-team/symclatron/issues)
- **Author**: Juan C. Villada <jvillada@lbl.gov>

