Metadata-Version: 2.4
Name: strvcf-annotator
Version: 0.2.1
Summary: STR annotation tool for VCF files
Home-page: https://github.com/acg-team/strvcf_annotator
Author: Olesia Kondrateva
Author-email: Olesia Kondrateva <xkdnoa@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/acg-team/strvcf_annotator
Project-URL: Repository, https://github.com/acg-team/strvcf_annotator
Project-URL: Documentation, https://github.com/acg-team/strvcf_annotator/#readme
Project-URL: Bug Tracker, https://github.com/acg-team/strvcf_annotator/issues
Keywords: strvcf,vcf,str,annotation,bioinformatics,genomics
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/x-rst
License-File: LICENSE
Requires-Dist: pysam>=0.22.0
Requires-Dist: pandas>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: ruff>=0.3.5; extra == "dev"
Requires-Dist: tox>=4.0.0; extra == "dev"
Requires-Dist: coverage>=7.0.0; extra == "dev"
Requires-Dist: Sphinx>=7.2.6; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: bump2version>=1.0.1; extra == "dev"
Requires-Dist: watchdog>=3.0.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

================
strvcf_annotator
================


STR (Short Tandem Repeat) annotation tool for VCF files.

`strvcf_annotator` is a Python library and CLI tool for annotating 
variants in VCF files that overlap short tandem repeat (STR) regions. 
The tool converts SNPs, SNVs and indels into full repeat sequences and adds STR metadata.

Features
--------

* **Dual Usage**: Works as both a library and CLI tool
* **Extensible**: Easy to add custom parsers for different VCF formats
* **Efficient**: Streaming support for large files

Installation
------------
Package is available through `PyPI <https://pypi.org/project/strvcf-annotator/>`_. To install, type:

.. code-block:: shell

    pip install strvcf-annotator
    
.. code-block:: shell

    # Install from source
    git clone https://github.com/acg-team/strvcf_annotator.git
    cd strvcf_annotator
    pip install -e .

.. code-block:: shell

    # Dev dependencies
    pip install -r requirements_dev.txt


Quick Start
-----------

Command Line
~~~~~~~~~~~~

.. code-block:: shell

    # Annotate a single VCF
    strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf

    # Batch-process a directory
    strvcf-annotator --input-dir vcf_files/ --str-bed repeats.bed --output-dir annotated/

    # With verbose logging
    strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf --verbose

Library Usage
~~~~~~~~~~~~~

.. code-block:: python

    from strvcf_annotator import STRAnnotator

    # Create the annotator
    annotator = STRAnnotator('repeats.bed')

    # Annotate a single file
    annotator.annotate_vcf_file('input.vcf', 'output.vcf')

    # Batch processing
    annotator.process_directory('vcf_files/', 'annotated/')

    # Streaming processing
    import pysam
    vcf_in = pysam.VariantFile('input.vcf')
    for record in annotator.annotate_vcf_stream(vcf_in):
        print(f"Repeat unit: {record.info['RU']}")

Input format
-------------

BED file with STR regions
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: shell

    CHROM   START   END     PERIOD  RU
    chr1    100     115     3       CAG
    chr1    200     212     4       ATCG
    chr2    300     318     3       GAT

- **CHROM**: Chromosome name
- **START**: Start position (0-based, BED format)
- **END**: End position (0-based, exclusive)
- **PERIOD**: Repeat unit length
- **RU**: Repeat unit sequence

VCF file
~~~~~~~~~~~~

A standard VCF with variants. Must contain:

- FORMAT field **GT** (genotype)
- Optional: **AD** (allelic depth), **DP** (total depth)

Output format
-------------

The annotated VCF contains additional fields:

INFO fields
~~~~~~~~~~~~

- **RU**: Repeat unit
- **PERIOD**: Repeat period (unit length)
- **REF**: Reference copy number
- **PERFECT**: TRUE if both alleles are perfect repeats

FORMAT fields
~~~~~~~~~~~~~

- **REPCN**: Genotype expressed as repeat copy numbers

Example
~~~~~~~~~~~~

.. code-block:: shell

    ##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit">
    ##INFO=<ID=PERIOD,Number=1,Type=Integer,Description="Repeat period">
    ##INFO=<ID=REF,Number=1,Type=Integer,Description="Reference copy number">
    ##INFO=<ID=PERFECT,Number=1,Type=String,Description="Perfect repeat indicator">
    ##FORMAT=<ID=REPCN,Number=2,Type=Integer,Description="Repeat copy number">

    #CHROM  POS  ID  REF         ALT             QUAL  FILTER  INFO                              FORMAT      Sample1
    chr1    101  .   CAGCAGCAG   CAGCAGCAGCAG    .     .       RU=CAG;PERIOD=3;REF=3;PERFECT=TRUE  GT:REPCN    0/1:3,4

Architecture
------------

.. code-block::

    src/strvcf_annotator/
    ├── __init__.py          # Public API
    ├── api.py               # Library API
    ├── cli.py               # CLI interface
    ├── core/                # Core functionality
    │   ├── annotation.py    # Annotation engine
    │   ├── vcf_processor.py # VCF processing
    │   ├── str_reference.py # STR reference management
    │   └── repeat_utils.py  # Repeat sequence utilities
    ├── parsers/             # Parser system
    │   ├── base.py          # Abstract parser interface
    │   └── generic.py       # Generic VCF parser
    └── utils/               # Utilities
        ├── vcf_utils.py     # VCF helpers
        └── validation.py    # Input validation

Extending functionality
-----------------------

Creating a custom parser
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from strvcf_annotator.parsers.base import BaseVCFParser

    class CustomParser(BaseVCFParser):
        def get_genotype(self, record, sample_idx):
            # Your logic for extracting the genotype
            pass
        
        def has_variant(self, record, sample_idx):
            # Your logic for determining if there is a variant
            pass
        
        def extract_info(self, record, sample_idx):
            # Your logic for extracting additional fields
            pass
        
        def validate_record(self, record):
            # Your logic for validating the record
            pass

    # Usage
    annotator = STRAnnotator('repeats.bed', parser=CustomParser())

Performance
-----------
- **Streaming processing**: Does not load the entire VCF into memory
- **Efficient lookup**: Uses sorted data for fast STR searches
- **Batch processing**: Supports processing multiple files

Troubleshooting
---------------

Issue: ModuleNotFoundError
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: shell

    # Install the package in editable (dev) mode
    pip install -e .

Issue: Unnormalized VCF
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This tool **only accepts normalized VCFs**. Please normalize with `bcftools` before running.
Example (produces a normalized, indexed VCF):

.. code-block:: shell

   # Replace reference.fa with the exact reference used for the VCF
   bcftools norm -f reference.fa -m input.vcf


Issue: Unsorted VCF
~~~~~~~~~~~~~~~~~~~

The tool automatically sorts the VCF in memory, but for large files pre-sorting is recommended:

.. code-block:: shell

    bcftools sort input.vcf -o sorted.vcf


Issue: Reference mismatch
~~~~~~~~~~~~~~~~~~~~~~~~~

If you see warnings about a reference mismatch, check:

- The correctness of the STR BED file
- Matching reference genome versions

Documentation
-------------

* `API Documentation <docs/API.md>`_
* `Examples <examples/>`_

Contributing
------------

Contributions are welcome! 
For major changes, please open an issue first 
to discuss what you’d like to change.
Please ensure:

1. All tests pass
2. Code follows existing style
3. New features include tests
4. Documentation is updated

License
-------

MIT License

Credits
-------

Test bed files were taken from `ConSTRain` repository `https://github.com/acg-team/ConSTRain`. 
