Metadata-Version: 2.1
Name: genvarloader
Version: 0.0.3
Summary: Pipeline for efficient genomic data processing.
Home-page: https://github.com/mcvickerlab/genome-loader
License: MIT
Author: David Laub
Author-email: dlaub@ucsd.edu
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: attrs (>=23.1.0,<24.0.0)
Requires-Dist: dask[array] (>=2023.9.3,<2024.0.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: natsort (>=8.4.0,<9.0.0)
Requires-Dist: numba (>=0.58.0,<0.59.0)
Requires-Dist: pandas (<2)
Requires-Dist: pandera (>=0.17.2,<0.18.0)
Requires-Dist: pgenlib (>=0.90.1,<0.91.0)
Requires-Dist: polars (>=0.19.8,<0.20.0)
Requires-Dist: pyarrow (>=13.0.0,<14.0.0)
Requires-Dist: pybigwig (>=0.3.22,<0.4.0)
Requires-Dist: pysam (>=0.22.0,<0.23.0)
Requires-Dist: ray (>=2.7.1,<3.0.0)
Requires-Dist: xarray (>=2023.9.0,<2024.0.0)
Project-URL: Repository, https://github.com/mcvickerlab/genome-loader
Description-Content-Type: text/markdown

# GenVarLoader
GenVarLoader aims to enable training sequence models on the personalized genomes of tens to hundreds of thousands of individuals.

## Installation
`pip install genvarloader`

PyTorch is not included as a dependency since installing it requires [special instructions](https://pytorch.org/get-started/locally/) that depend on your platform and hardware.

## Quick Start
```python
import genvarloader as gvl

reference = 'reference.fasta'
variants = 'variants.pgen' # highly recommended to convert VCFs to PGEN
regions_of_interest = 'regions.bed'
```
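PGEN files can be produced from a VCF with PLINK 2. A minimal sketch (the file names are placeholders; adjust the input path, output prefix, and any filtering flags for your data):

```shell
# Convert a VCF to the PGEN trio of files (.pgen/.pvar/.psam) with PLINK 2.
plink2 --vcf variants.vcf.gz --make-pgen --out variants
```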
Create a reader for each file that provides sequence data:
```python
ref = gvl.Fasta(name='ref', path=reference, pad='N')
var = gvl.Pgen(variants)
varseq = gvl.FastaVariants(name='varseq', fasta=ref, variants=var)
```
Put them together and get a `torch.DataLoader`:
```python
gvloader = gvl.GVL(
    readers=varseq,
    bed=regions_of_interest,
    fixed_length=1000,
    batch_size=16,
    max_memory_gb=8,
    batch_dims=['sample', 'ploid'],
    shuffle=True,
    num_workers=2
)

dataloader = gvloader.torch_dataloader()
```
And now you're ready to use the `dataloader` however you need to:
```python
# implement your training loop
for batch in dataloader:
    ...
```
