Metadata-Version: 2.4
Name: kalign-python
Version: 3.4.8
Summary: Python wrapper for the Kalign multiple sequence alignment engine
Keywords: bioinformatics,sequence alignment,multiple sequence alignment,MSA,computational biology,genomics,proteomics,phylogenetic analysis,evolutionary biology,biopython,scikit-bio,DNA alignment,protein alignment,RNA alignment,fast alignment,parallel alignment,SIMD optimization
Author-Email: Timo Lassmann <timolassmann@icloud.com>
Maintainer-Email: Timo Lassmann <timolassmann@icloud.com>
License-Expression: GPL-3.0-or-later
License-File: COPYING
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: C
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Project-URL: Homepage, https://github.com/TimoLassmann/kalign
Project-URL: Documentation, https://github.com/TimoLassmann/kalign/blob/main/README.md
Project-URL: Repository, https://github.com/TimoLassmann/kalign
Project-URL: Bug Tracker, https://github.com/TimoLassmann/kalign/issues
Project-URL: Changelog, https://github.com/TimoLassmann/kalign/blob/main/ChangeLog
Requires-Python: >=3.9
Requires-Dist: numpy>=1.19.0
Provides-Extra: biopython
Requires-Dist: biopython>=1.85; extra == "biopython"
Provides-Extra: skbio
Requires-Dist: scikit-bio>=0.6.3; extra == "skbio"
Provides-Extra: io
Requires-Dist: biopython>=1.85; extra == "io"
Provides-Extra: analysis
Requires-Dist: pandas>=2.3.0; extra == "analysis"
Requires-Dist: matplotlib>=3.9.4; extra == "analysis"
Provides-Extra: all
Requires-Dist: biopython>=1.85; extra == "all"
Requires-Dist: scikit-bio>=0.6.3; extra == "all"
Requires-Dist: pandas>=2.3.0; extra == "all"
Requires-Dist: matplotlib>=3.9.4; extra == "all"
Provides-Extra: benchmark
Requires-Dist: dash>=2.14; extra == "benchmark"
Requires-Dist: plotly>=5.18; extra == "benchmark"
Requires-Dist: pandas>=2.0; extra == "benchmark"
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: pytest-benchmark; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Requires-Dist: rich; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: biopython>=1.85; extra == "dev"
Requires-Dist: scikit-bio>=0.6.3; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=6.0; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: pytest-benchmark; extra == "test"
Requires-Dist: pytest-xdist; extra == "test"
Requires-Dist: rich; extra == "test"
Provides-Extra: docs
Requires-Dist: sphinx; extra == "docs"
Requires-Dist: sphinx-rtd-theme; extra == "docs"
Requires-Dist: myst-parser; extra == "docs"
Description-Content-Type: text/markdown

# Kalign Python Package

Python bindings for [Kalign](https://github.com/TimoLassmann/kalign), a fast multiple sequence alignment program for biological sequences (DNA, RNA, protein).

## Installation

```bash
pip install kalign-python
```

Optional dependencies for ecosystem integration:

```bash
pip install "kalign-python[biopython]"    # Biopython integration (fmt="biopython", I/O helpers)
pip install "kalign-python[skbio]"        # scikit-bio integration (fmt="skbio")
pip install "kalign-python[io]"           # I/O helpers (requires Biopython)
pip install "kalign-python[analysis]"     # pandas + matplotlib for downstream analysis
pip install "kalign-python[all]"          # all of the above
```

## Quick Start

```python
import kalign

sequences = [
    "ATCGATCGATCG",
    "ATCGTCGATCG",
    "ATCGATCATCG"
]

aligned = kalign.align(sequences, seq_type="dna")
for seq in aligned:
    print(seq)
```

## Core API

### `kalign.align()`

```python
aligned = kalign.align(
    sequences,              # list of str
    seq_type="auto",        # "auto", "dna", "rna", "protein", "divergent", "internal"
    gap_open=None,          # positive float, or None for defaults
    gap_extend=None,        # positive float, or None for defaults
    terminal_gap_extend=None,
    n_threads=None,         # int, or None for global default
    fmt="plain",            # "plain", "biopython", "skbio"
    ids=None,               # list of str (for biopython/skbio output)
)
```

Returns a list of aligned strings (default), a `Bio.Align.MultipleSeqAlignment` (`fmt="biopython"`), or a `skbio.TabularMSA` (`fmt="skbio"`).

### `kalign.align_from_file()`

Align sequences directly from a FASTA, MSF, or Clustal file:

```python
result = kalign.align_from_file("sequences.fasta", seq_type="protein")
for name, seq in zip(result.names, result.sequences):
    print(f"{name}: {seq}")
```

Returns an `AlignedSequences` named tuple with `.names` and `.sequences`.
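
Because the result is a named tuple, it can be unpacked positionally as well as accessed by attribute. The sketch below uses a stand-in `AlignedSequences` with only the two documented fields (the real class comes from `kalign`, and the sample names and sequences are made up) to show two common ways of consuming it:

```python
from collections import namedtuple

# Stand-in with the two documented fields; the real class comes from kalign.
AlignedSequences = namedtuple("AlignedSequences", ["names", "sequences"])

# Hypothetical data, in the shape align_from_file() is described as returning.
result = AlignedSequences(
    names=["seq1", "seq2"],
    sequences=["MKT-AY", "MKTLAY"],
)

names, sequences = result               # tuple unpacking
by_name = dict(zip(names, sequences))   # look up an aligned row by name

# Render the alignment as FASTA text without any extra dependency.
fasta_text = "".join(f">{name}\n{seq}\n" for name, seq in by_name.items())
print(fasta_text, end="")
```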

### `kalign.compare()`

Score a test alignment against a reference using the Sum-of-Pairs (SP) score:

```python
score = kalign.compare("reference.msf", "test.fasta")
print(f"SP score: {score:.1f}")  # 0 (no match) to 100 (identical)
```
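
To make the metric concrete, here is a minimal pure-Python sketch of one common definition of a sum-of-pairs score: the percentage of residue pairs aligned in the reference that are also aligned in the test alignment. It is an illustration only. The names `aligned_pairs` and `sp_score` are hypothetical, and `kalign.compare()` may differ in details such as which columns it scores.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each pair is ((seq_i, residue_i), (seq_j, residue_j)), where residue
    indices count non-gap characters only.
    """
    pairs = set()
    counters = [0] * len(alignment)  # next residue index for each sequence
    for column in zip(*alignment):
        present = []
        for seq_idx, char in enumerate(column):
            if char != "-":
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(reference, test):
    """Percentage of reference residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return 100.0 * len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


reference = ["AC-GT", "ACTGT"]
shifted = ["ACG-T", "ACTGT"]   # same sequences, gap placed one column later

print(sp_score(reference, reference))  # 100.0
print(sp_score(reference, shifted))    # 75.0
```

Moving the gap by one column breaks one of the four reference pairs, so the score drops from 100 to 75.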

### `kalign.write_alignment()`

Write aligned sequences to a file:

```python
kalign.write_alignment(aligned, "output.fasta", format="fasta", ids=ids)
```

Supported formats: `fasta`, `clustal`, `stockholm`, `phylip` (non-FASTA formats require Biopython).

## Threading

```python
import kalign

kalign.set_num_threads(4)        # set global default
n = kalign.get_num_threads()     # query current default

# or override per call
aligned = kalign.align(sequences, n_threads=8)
```

Thread settings are thread-local: `set_num_threads()` changes the default only for the calling thread, so different threads can use different defaults.
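
The following sketch illustrates what thread-local defaults mean in general, using `threading.local` from the standard library. It is not Kalign's implementation; `set_default_threads`, `get_default_threads`, and `FALLBACK_THREADS` are hypothetical names chosen for the example.

```python
import threading

_settings = threading.local()  # each thread sees its own attributes
FALLBACK_THREADS = 1           # hypothetical value used when nothing was set


def set_default_threads(n):
    """Store a default that is visible only to the calling thread."""
    if n < 1:
        raise ValueError("n must be >= 1")
    _settings.n_threads = n


def get_default_threads():
    """Return this thread's default, or the fallback if it never set one."""
    return getattr(_settings, "n_threads", FALLBACK_THREADS)


def worker(n, results, key):
    set_default_threads(n)
    results[key] = get_default_threads()


results = {}
threads = [
    threading.Thread(target=worker, args=(n, results, f"worker-{n}"))
    for n in (2, 8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)                 # each worker read back its own value
print(get_default_threads())   # main thread never set one, so the fallback
```

The practical consequence is that a default set in one thread should not be expected to carry over to worker threads; each worker sets its own or passes `n_threads` per call.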

## Utilities (`kalign.utils`)

Requires only NumPy (installed automatically):

```python
import kalign

aligned = kalign.align(sequences)

arr = kalign.utils.to_array(aligned)                          # numpy array
stats = kalign.utils.alignment_stats(aligned)                 # dict with gap_fraction, conservation, identity
consensus = kalign.utils.consensus_sequence(aligned, threshold=0.7)
matrix = kalign.utils.pairwise_identity_matrix(aligned)       # numpy array
trimmed = kalign.utils.remove_gap_columns(aligned)
region = kalign.utils.trim_alignment(aligned, start=2, end=10)
```
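
For readers who want to see what such statistics measure, the sketch below computes three of them in plain Python. The definitions are assumptions made for illustration (identity counted over columns where both rows have a residue, and only all-gap columns dropped); the functions in `kalign.utils` may define these quantities differently.

```python
def pairwise_identity(a, b):
    """Fraction of matching residues over columns where neither row has a gap."""
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        matches += x == y  # True counts as 1
    return matches / compared if compared else 0.0


def gap_fraction(alignment):
    """Share of all alignment cells that are gap characters."""
    total = sum(len(row) for row in alignment)
    gaps = sum(row.count("-") for row in alignment)
    return gaps / total if total else 0.0


def drop_all_gap_columns(alignment):
    """Remove columns in which every row has a gap."""
    keep = [col for col in zip(*alignment) if any(c != "-" for c in col)]
    if not keep:
        return ["" for _ in alignment]
    return ["".join(chars) for chars in zip(*keep)]


example = ["AC-GT-", "ACTGA-", "A-TGT-"]

print(pairwise_identity(example[0], example[1]))  # 0.75
print(round(gap_fraction(example), 3))            # 0.278
print(drop_all_gap_columns(example))              # ['AC-GT', 'ACTGA', 'A-TGT']
```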

## Biopython Integration

Requires the `biopython` extra: `pip install "kalign-python[biopython]"`.

```python
import kalign

# Return a Biopython MultipleSeqAlignment
aln = kalign.align(sequences, fmt="biopython", ids=["s1", "s2", "s3"])
print(aln.get_alignment_length())

# Write in various formats via Biopython
from Bio import AlignIO
AlignIO.write(aln, "output.clustal", "clustal")
```

### I/O helpers (`kalign.io`)

```python
sequences = kalign.io.read_fasta("input.fasta")
sequences, ids = kalign.io.read_sequences("input.fasta")

aligned = kalign.align(sequences)
kalign.io.write_fasta(aligned, "output.fasta", ids=ids)
kalign.io.write_clustal(aligned, "output.aln", ids=ids)
kalign.io.write_stockholm(aligned, "output.sto", ids=ids)
kalign.io.write_phylip(aligned, "output.phy", ids=ids)
```

## scikit-bio Integration

Requires the `skbio` extra: `pip install "kalign-python[skbio]"`.

```python
import kalign

# Returns a TabularMSA of DNA, RNA, or Protein depending on seq_type
aln = kalign.align(sequences, seq_type="dna", fmt="skbio")
print(type(aln))  # <class 'skbio.alignment._tabular_msa.TabularMSA'>
```

## Sequence Types

| String | Constant | Description |
|--------|----------|-------------|
| `"auto"` | `kalign.AUTO` | Auto-detect (default) |
| `"dna"` | `kalign.DNA` | DNA sequences |
| `"rna"` | `kalign.RNA` | RNA sequences |
| `"protein"` | `kalign.PROTEIN` | Protein sequences |
| `"divergent"` | `kalign.PROTEIN_DIVERGENT` | Divergent protein sequences |
| `"internal"` | `kalign.DNA_INTERNAL` | DNA with internal gap preference |

## Command-line Interface

```bash
kalign-py -i sequences.fasta -o aligned.fasta --format fasta --type protein
kalign-py -i sequences.fasta -o - --format clustal   # stdout
cat input.fa | kalign-py -i - -o aligned.fasta        # stdin
kalign-py --version
```
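
When driving the CLI from a Python script, it is usually safer to build the argument list explicitly than to format a shell string. The helper below is a hypothetical convenience (`build_kalign_cmd` is not part of the package) and uses only the flags shown above:

```python
import shlex


def build_kalign_cmd(input_path, output_path, fmt=None, seq_type=None):
    """Assemble a kalign-py argument list from the flags shown above."""
    cmd = ["kalign-py", "-i", input_path, "-o", output_path]
    if fmt is not None:
        cmd += ["--format", fmt]
    if seq_type is not None:
        cmd += ["--type", seq_type]
    return cmd


cmd = build_kalign_cmd(
    "sequences.fasta", "aligned.fasta", fmt="fasta", seq_type="protein"
)
print(shlex.join(cmd))
# kalign-py -i sequences.fasta -o aligned.fasta --format fasta --type protein

# To execute it (requires kalign-py on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```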

## Development

```bash
git clone https://github.com/TimoLassmann/kalign.git
cd kalign
uv pip install -e .
uv run pytest tests/python/ -v
```

Building from source requires Python 3.9+, CMake 3.18+, a C++11 compiler, and NumPy.

## Citation

If you use Kalign in your research, please cite:

> Lassmann, T. (2020). Kalign 3: multiple sequence alignment of large data sets.
> *Bioinformatics*, 36(6), 1928-1929.
> [doi:10.1093/bioinformatics/btz795](https://doi.org/10.1093/bioinformatics/btz795)

## License

GNU General Public License v3.0 or later. See [COPYING](COPYING).
