Metadata-Version: 2.4
Name: genvarloader
Version: 0.21.1
Requires-Dist: numpy
Requires-Dist: numba>=0.59.1
Requires-Dist: loguru
Requires-Dist: attrs
Requires-Dist: natsort
Requires-Dist: polars>=1.37.1
Requires-Dist: cyvcf2
Requires-Dist: pandera
Requires-Dist: pysam
Requires-Dist: pyarrow
Requires-Dist: pyranges
Requires-Dist: pydantic>=2,<3
Requires-Dist: more-itertools
Requires-Dist: tqdm
Requires-Dist: pybigwig
Requires-Dist: einops
Requires-Dist: tbb
Requires-Dist: joblib
Requires-Dist: pooch
Requires-Dist: awkward
Requires-Dist: hirola>=0.3,<0.4
Requires-Dist: seqpro>=0.9
Requires-Dist: genoray>=2.2.0,<3
Requires-Dist: genvarloader-cli>=0.1.0 ; extra == 'cli'
Provides-Extra: cli
License-File: LICENSE.txt
Summary: Pipeline for efficient genomic data processing.
Author-email: David Laub <dlaub@ucsd.edu>, Aaron Ho <aho@salk.edu>
Requires-Python: >=3.10, <3.14
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: documentation, https://genvarloader.readthedocs.io/en/stable/
Project-URL: issues, https://github.com/mcvickerlab/GenVarLoader/issues
Project-URL: source, https://github.com/mcvickerlab/GenVarLoader

<img src=docs/source/_static/gvl_logo.svg width="200">

[![PyPI version](https://badge.fury.io/py/genvarloader.svg)](https://pypi.org/project/genvarloader/)
[![Documentation Status](https://readthedocs.org/projects/genvarloader/badge/?version=latest)](https://genvarloader.readthedocs.io)
[![Downloads](https://static.pepy.tech/badge/genvarloader)](https://pepy.tech/project/genvarloader)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/genvarloader)](https://img.shields.io/pypi/dm/genvarloader)
[![GitHub stars](https://badgen.net/github/stars/mcvickerlab/GenVarLoader)](https://github.com/mcvickerlab/GenVarLoader)
[![bioRxiv](https://img.shields.io/badge/bioRxiv-2025.01.15.633240-b31b1b.svg)](https://www.biorxiv.org/content/10.1101/2025.01.15.633240)

## Features

GenVarLoader provides a fast, memory-efficient data structure for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. [Dalla-Torre et al.](https://www.biorxiv.org/content/10.1101/2023.01.11.523679)) or to train sequence-to-function models with genetic variation (e.g. [Celaj et al.](https://www.biorxiv.org/content/10.1101/2023.09.20.558508v1), [Drusinsky et al.](https://www.biorxiv.org/content/10.1101/2024.07.27.605449v1), [He et al.](https://www.biorxiv.org/content/10.1101/2024.10.15.618510v1), and [Rastogi et al.](https://www.biorxiv.org/content/10.1101/2024.09.23.614632v1)).

- Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with `bcftools consensus`)
- Generate haplotypes up to 1,000 times faster than reading a FASTA file
- Generate tracks up to 450 times faster than reading a BigWig
- **Supports indels** and re-aligns tracks to haplotypes that have them
- Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig

Documentation is available [here](https://genvarloader.readthedocs.io/). See our [preprint](https://www.biorxiv.org/content/10.1101/2025.01.15.633240) for benchmarking and implementation details.

## Installation

```bash
pip install genvarloader
```

A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/).
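
As a quick sanity check before or after installing, the sketch below reports whether the active interpreter falls within the supported range (`>=3.10, <3.14` per the package metadata) and whether PyTorch can be found. It assumes a `python3` executable on your `PATH`; the printed messages are illustrative only and are not produced by GenVarLoader.

```bash
# Check that the active interpreter is within the supported range (>=3.10, <3.14).
python3 -c 'import sys; ok = (3, 10) <= sys.version_info[:2] < (3, 14); print("python ok" if ok else "python unsupported")'

# Check whether PyTorch can be located, without actually importing it.
python3 -c 'import importlib.util as u; print("torch found" if u.find_spec("torch") else "torch not found; install it separately")'
```

If the second command reports that PyTorch is missing, follow the PyTorch instructions linked above for your platform.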

## Contributing

1. Clone the repo.
2. Assuming you have [Pixi](https://pixi.sh/latest/) installed, install the pre-commit hooks with `pixi run -e dev pre-commit`. If you skip this step, your PR will likely fail CI checks.
3. Activate and use the appropriate Pixi environment for your needs. A decent catch-all is `dev`, but you might need a different environment if using a GPU.

All tests (except those for the Rust extension code) use pytest and live under `tests/`. They ensure the code works as intended, so they must all pass before any feature is merged into `main` and subsequently released. The tests run automatically on every PR, and failing tests block the PR from being merged.

If your PR has merge conflicts, this is usually because the `main` branch received updates while you were working on your branch. In this case, please **rebase** your branch with `git rebase main` to resolve the conflicts, rather than creating a merge commit with `git merge main`.
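
The rebase workflow can be sketched as a small shell helper. This is a sketch under assumptions, not a project-provided script: `rebase_onto_main` is a hypothetical name, the remote is assumed to be called `origin`, and the helper rebases onto `origin/main` (rather than a local `main`) so that the latest upstream commits are picked up even if your local `main` is stale.

```bash
# Hypothetical helper: rebase the current branch onto the latest upstream main.
# Assumes the remote is named "origin"; adjust to match your setup.
rebase_onto_main() {
    # Update the remote-tracking branches, including origin/main.
    git fetch origin || return 1
    # Replay the current branch's commits on top of origin/main.
    if ! git rebase origin/main; then
        echo "Conflicts found: fix the files, 'git add' them, then run 'git rebase --continue'." >&2
        return 1
    fi
}
```

Run `rebase_onto_main` from your feature branch. Because a rebase rewrites the branch's commits, a branch that was already pushed usually needs `git push --force-with-lease` afterwards.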

> [!NOTE]
> Do not edit the version number in `pyproject.toml`. This is handled automatically by GitHub Actions.

