Metadata-Version: 2.4
Name: slafdb
Version: 0.5.2
Summary: Sparse Lazy Array Format - MVP for single-cell data
Author-email: Pavan Ramkumar <pavan.ramkumar@gmail.com>
License: Apache-2.0
Requires-Python: <3.14,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pylance>=2.0.0
Requires-Dist: polars>=1.36.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: pandas<3,>=2.1.0
Requires-Dist: pyarrow>=18.0.0
Requires-Dist: scipy<1.17,>=1.15.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: scanpy>=1.11.2
Requires-Dist: h5py>=3.10.0
Requires-Dist: requests>=2.32.4
Requires-Dist: typer>=0.9.0
Requires-Dist: rich>=13.0.0
Requires-Dist: tqdm>=4.67.0
Requires-Dist: build>=1.2.2.post1
Requires-Dist: lancedb>=0.29.0
Requires-Dist: lance-namespace-urllib3-client>=0.5.0
Requires-Dist: boto3>=1.40.10
Requires-Dist: smart-open[s3]>=7.4.1
Requires-Dist: huggingface-hub>=0.36.0
Requires-Dist: s3fs>=2025.9.0
Requires-Dist: modal>=1.3.1
Provides-Extra: ml
Requires-Dist: torch>=2.5.0; extra == "ml"
Requires-Dist: tiledb>=0.34.2; extra == "ml"
Requires-Dist: tiledbsoma>=1.17.1; extra == "ml"
Provides-Extra: advanced
Requires-Dist: igraph>=0.11.9; extra == "advanced"
Requires-Dist: leidenalg>=0.10.2; extra == "advanced"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff==0.12.2; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: marimo>=0.14.0; extra == "dev"
Requires-Dist: matplotlib>=3.10.0; extra == "dev"
Requires-Dist: seaborn>=0.13.0; extra == "dev"
Requires-Dist: psutil>=6.0.0; extra == "dev"
Requires-Dist: scvi-tools>=1.3.3; extra == "dev"
Requires-Dist: scdataset>=0.1.1; extra == "dev"
Requires-Dist: bionemo-scdl>=0.0.8; extra == "dev"
Requires-Dist: vortex-data>=0.33.2; extra == "dev"
Requires-Dist: datasets>=4.0.0; extra == "dev"
Requires-Dist: ray>=2.49.0; extra == "dev"
Requires-Dist: modal>=1.2.1; extra == "dev"
Requires-Dist: rclone-python>=0.1.23; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.5.0; extra == "docs"
Requires-Dist: mkdocs-material>=9.5.0; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == "docs"
Requires-Dist: mkdocs-literate-nav>=0.6.0; extra == "docs"
Requires-Dist: mkdocs-section-index>=0.3.0; extra == "docs"
Requires-Dist: mkdocs-autorefs>=0.4.0; extra == "docs"
Requires-Dist: mkdocs-awesome-pages-plugin>=2.9.0; extra == "docs"
Requires-Dist: mkdocs-macros-plugin>=1.0.0; extra == "docs"
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2.0; extra == "docs"
Requires-Dist: mkdocs-git-authors-plugin>=0.8.0; extra == "docs"
Requires-Dist: mkdocs-minify-plugin>=0.7.0; extra == "docs"
Requires-Dist: mkdocs-redirects>=1.2.0; extra == "docs"
Provides-Extra: test
Requires-Dist: pytest>=8.0.0; extra == "test"
Requires-Dist: pytest-cov>=6.2.0; extra == "test"
Requires-Dist: coverage>=7.9.1; extra == "test"
Requires-Dist: scanpy>=1.11.2; extra == "test"
Requires-Dist: h5py>=3.10.0; extra == "test"
Requires-Dist: psutil>=6.0.0; extra == "test"
Requires-Dist: torch>=2.5.0; extra == "test"
Requires-Dist: tiledbsoma>=1.17.1; extra == "test"
Provides-Extra: full
Requires-Dist: slafdb[advanced,ml]; extra == "full"
Dynamic: license-file

# SLAF (Sparse Lazy Array Format)

<div align="center">
  <img src="docs/assets/slaf-logo-light.svg" alt="SLAF Logo" width="400"/>
</div>

[![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Tests](https://img.shields.io/github/actions/workflow/status/slaf-project/slaf/ci.yml?branch=main&label=tests)](https://github.com/slaf-project/slaf/actions)
[![Coverage](https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/pavanramkumar/33e8b97f85afdc956a71edc623f5c2ba/raw/slaf-coverage.json)](https://github.com/slaf-project/slaf/actions)
[![Code style](https://img.shields.io/badge/code%20style-ruff-black.svg)](https://github.com/astral-sh/ruff)
[![PyPI](https://img.shields.io/badge/PyPI-0.5.2-blue.svg)](https://pypi.org/project/slafdb/)
[![PyPI Downloads](https://static.pepy.tech/badge/slafdb)](https://pepy.tech/projects/slafdb)

**SLAF** is a high-performance format for single-cell data that combines the power of SQL with lazy evaluation. It is built for large-scale single-cell analysis, with an emphasis on memory efficiency and production-ready ML capabilities.

**Be Lazy** (lazy APIs for AnnData and Scanpy) • **Write SQL** (arbitrary SQL to query the tables) • **Train Foundation Models** (with tokenizers and dataloaders)

## 🚀 Key Features

- **⚡ Fast**: SQL-level performance for data operations
- **💾 Memory Efficient**: Lazy evaluation loads only what you need
- **🔍 SQL Native**: Direct SQL queries on your data
- **🧬 Scanpy Compatible**: Drop-in replacement for AnnData workflows
- **⚙️ ML Ready**: Efficient tokenization and dataloaders for model training
- **🔧 Production Ready**: Built for large-scale single-cell analysis

## 📦 Installation

### Default Installation (Batteries Included)

The default installation includes core functionality, CLI tools, and data conversion capabilities:

```bash
# Using uv (recommended)
uv add slafdb

# Or pip
pip install slafdb
```

**What's included by default:**

- ✅ Core SLAF functionality (SQL queries, data structures)
- ✅ CLI tools (`slaf convert`, `slaf query`, etc.)
- ✅ Data conversion tools (scanpy, h5py for h5ad files)
- ✅ Rich console output and progress bars
- ✅ Cross-platform compatibility

**What's NOT included by default:**

Dependencies for:

- ❌ Machine learning features (PyTorch tokenizers)
- ❌ Advanced single-cell tools (igraph, leidenalg)
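
To see which optional dependencies are already present in an environment, a small check along these lines may help. This is a minimal sketch, not part of SLAF: the `OPTIONAL_MODULES` mapping and the `available_extras` helper are hypothetical, and the import names (`torch`, `igraph`, `leidenalg`) are assumed from the packages named above.

```python
from importlib.util import find_spec

# Hypothetical mapping from extra name to the import names it is assumed to provide.
OPTIONAL_MODULES = {
    "ml": ["torch"],
    "advanced": ["igraph", "leidenalg"],
}


def available_extras(modules=OPTIONAL_MODULES):
    """Map each extra to True if all of its modules can be located."""
    return {
        extra: all(find_spec(name) is not None for name in names)
        for extra, names in modules.items()
    }


for extra, ok in available_extras().items():
    print(f"{extra}: {'installed' if ok else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays cheap even when a heavy package such as PyTorch is installed.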

### Platform-Specific Notes

**Polars Compatibility:**

- **Linux/Windows**: Works with standard `polars`
- **macOS (Apple Silicon)**: May require `polars-lts-cpu` for compatibility

If you encounter polars-related issues on macOS, you have several options:

**Option 1: Manual platform-specific installation**

```bash
# For macOS Apple Silicon
pip install "polars-lts-cpu>=1.31.0"
pip install slafdb

# For Linux/Windows
pip install slafdb
```

**Option 2: Use uv with manual polars specification**

```bash
# For macOS Apple Silicon
uv add "polars-lts-cpu>=1.31.0"
uv add slafdb

# For Linux/Windows
uv add slafdb
```

**Note**: Package managers don't automatically choose between `polars` and `polars-lts-cpu` - you may need to specify the correct version for your platform.
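
If you script your environment setup, the platform rule above can be encoded in a few lines. The helper below is a hypothetical sketch (`suggest_polars_package` is not a SLAF or Polars function); it only mirrors the note that Apple Silicon may need `polars-lts-cpu`, and standard `polars` may still work fine on your machine.

```python
import platform


def suggest_polars_package(system=None, machine=None):
    """Return the Polars distribution to try first, per the notes above."""
    system = system or platform.system()
    machine = machine or platform.machine()
    # macOS on Apple Silicon reports system "Darwin" and machine "arm64".
    if system == "Darwin" and machine == "arm64":
        return "polars-lts-cpu"
    return "polars"


print(suggest_polars_package())
```

Passing `system` and `machine` explicitly makes it easy to check the rule for a platform other than the one you are running on.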

### Optional Dependencies

Add specific features as needed:

**Using uv:**

```bash
uv add "slafdb[ml]"
uv add "slafdb[advanced]"
uv add "slafdb[full]"
uv add "slafdb[dev]"
```

**Using pip:**

```bash
# Quote the requirement so shells such as zsh don't expand the brackets
pip install "slafdb[ml]"
pip install "slafdb[advanced]"
pip install "slafdb[full]"
pip install "slafdb[dev]"
```

### Development Installation

```bash
git clone https://github.com/slaf-project/slaf.git
cd slaf
uv sync --extra dev --extra test --extra docs
```

## 🚀 Quick Start

### Converting Your Data

Convert your existing single-cell data to SLAF format - **no extra dependencies required!**

```bash
# Convert AnnData (.h5ad) to SLAF
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf

# Convert 10x Genomics data
slaf convert path/to/10x/filtered_feature_bc_matrix output.slaf
```

### Basic Usage

```python
from slaf import SLAFArray

# Load a SLAF dataset
slaf = SLAFArray("path/to/dataset.slaf")

# Describe the dataset
print(slaf.info())

# Execute SQL queries directly
results = slaf.query("""
    SELECT batch, COUNT(*) as count
    FROM cells
    GROUP BY batch
    ORDER BY count DESC
""")
print(results)
```

### Filtering Data

```python
# Filter cells by metadata
filtered_cells = slaf.filter_cells(
    batch="batch1",
    total_counts=">1000"
)

# Filter genes
filtered_genes = slaf.filter_genes(
    highly_variable=True
)

# Get expression submatrix
expression = slaf.get_submatrix(
    cell_selector=filtered_cells,
    gene_selector=filtered_genes
)
```
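
The `total_counts=">1000"` argument packs a comparison operator and a threshold into one string. How SLAF interprets such strings internally is not shown here; the sketch below is one hypothetical way to turn them into predicates, using only the standard library. The names `parse_condition` and `_OPS` are illustrative and not part of the SLAF API.

```python
import operator
import re

# Supported comparison operators (an assumed set, for illustration only).
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
# Two-character operators come first so ">=" is not matched as ">".
_PATTERN = re.compile(r"^\s*(>=|<=|!=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_condition(text):
    """Turn a string such as '>1000' into a predicate on numbers."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported condition: {text!r}")
    compare = _OPS[match.group(1)]
    threshold = float(match.group(2))
    return lambda value: compare(value, threshold)


keep = parse_condition(">1000")
counts = [250, 1000, 1500, 4200]
print([c for c in counts if keep(c)])  # [1500, 4200]
```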

## 🦥 Be Lazy - Lazy AnnData & Scanpy Integration

SLAF provides lazy versions of AnnData and Scanpy operations that only compute when needed:

```python
from slaf.integrations.anndata import read_slaf
import scanpy as sc

# Load as lazy AnnData
adata = read_slaf("path/to/dataset.slaf")
print(f"Type: {type(adata)}")  # LazyAnnData
print(f"Expression matrix type: {type(adata.X)}")  # LazyExpressionMatrix

# Apply scanpy operations (lazy)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)

# Still lazy - no computation yet
print(f"Still lazy: {type(adata.X)}")

# Compute when needed
real_adata = adata.compute()  # Returns a real AnnData object
```

### Lazy Computation Control

```python
# Compute specific parts
expression_matrix = adata.X.compute()  # Just the expression matrix
cell_metadata = adata.obs              # Cell metadata
gene_metadata = adata.var              # Gene metadata

# Or compute everything at once
real_adata = adata.compute()
```

### Lazy Slicing

```python
# All slicing operations are lazy
subset = adata[:100, :50]  # Lazy slice
filtered = adata[adata.obs['n_genes_by_counts'] > 1000]  # Lazy filtering
```
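
To make the idea of deferred computation concrete, here is a toy, standard-library-only sketch. It is not how `LazyAnnData` is implemented; the `LazyRows` class and the `normalize_total` function below are hypothetical and only illustrate the pattern of queuing operations and running them on `compute()`.

```python
class LazyRows:
    """Toy lazy container: records operations, runs them on compute()."""

    def __init__(self, rows, ops=()):
        self._rows = rows        # source data, never modified
        self._ops = tuple(ops)   # pending per-row transformations

    def map_rows(self, fn):
        # Queue the transformation instead of applying it now.
        return LazyRows(self._rows, self._ops + (fn,))

    def __getitem__(self, row_slice):
        # Slicing (with a slice object) narrows the source rows but
        # still defers the queued transformations.
        return LazyRows(self._rows[row_slice], self._ops)

    def compute(self):
        out = []
        for row in self._rows:
            for fn in self._ops:
                row = fn(row)
            out.append(row)
        return out


def normalize_total(row, target=10.0):
    # Scale a row so its values sum to `target`; leave all-zero rows as-is.
    total = sum(row)
    return [v * target / total for v in row] if total else list(row)


data = [[1.0, 1.0, 2.0], [0.0, 5.0, 5.0], [3.0, 0.0, 1.0]]
lazy = LazyRows(data).map_rows(normalize_total)[:2]  # nothing computed yet
print(lazy.compute())  # [[2.5, 2.5, 5.0], [0.0, 5.0, 5.0]]
```

Because the slice is applied before `compute()`, the third row is never normalized, which is the main payoff of deferring work.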

## 🔍 Write SQL - Direct Database Access

SLAF stores data in three main tables that you can query directly with SQL:

### Database Schema

- **`cells`**: Cell metadata and QC metrics
- **`genes`**: Gene metadata and annotations
- **`expression`**: Sparse expression matrix data
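
The three-table layout can be tried out without SLAF installed. The sketch below uses Python's built-in `sqlite3` as a stand-in engine, with made-up rows. Column names such as `cell_integer_id`, `gene_integer_id`, and `value` are borrowed from the example queries in this README; the exact SLAF schema may contain more columns, and SLAF does not use SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cells (cell_integer_id INTEGER, cell_id TEXT, batch TEXT);
    CREATE TABLE genes (gene_integer_id INTEGER, gene_id TEXT);
    CREATE TABLE expression (
        cell_integer_id INTEGER, gene_integer_id INTEGER, value REAL
    );
    INSERT INTO cells VALUES (0, 'cell_a', 'batch1'), (1, 'cell_b', 'batch2');
    INSERT INTO genes VALUES (0, 'GENE1'), (1, 'GENE2'), (2, 'GENE3');
    -- Sparse layout: only non-zero entries get a row.
    INSERT INTO expression VALUES (0, 0, 3.0), (0, 2, 1.0), (1, 1, 4.0);
""")

# Count expressed genes and total expression per cell.
rows = conn.execute("""
    SELECT c.cell_id, COUNT(*) AS genes_expressed, SUM(e.value) AS total
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    GROUP BY c.cell_id
    ORDER BY c.cell_id
""").fetchall()
print(rows)  # [('cell_a', 2, 4.0), ('cell_b', 1, 4.0)]
```

Storing one row per non-zero entry is what keeps the `expression` table compact for sparse single-cell matrices.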

### SQL Queries

```python
# Get expression data for specific cells
cell_expression = slaf.query("""
    SELECT
        c.cell_id,
        c.total_counts,
        COUNT(e.gene_integer_id) as genes_expressed,
        AVG(e.value) as avg_expression
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    WHERE c.batch = 'batch1'
    GROUP BY c.cell_id, c.total_counts
    ORDER BY genes_expressed DESC
    LIMIT 10
""")

# Find highly expressed genes
high_expr_genes = slaf.query("""
    SELECT
        g.gene_id,
        COUNT(e.cell_integer_id) as cells_expressing,
        AVG(e.value) as avg_expression
    FROM genes g
    JOIN expression e ON g.gene_integer_id = e.gene_integer_id
    GROUP BY g.gene_id
    HAVING COUNT(e.cell_integer_id) > 100
    ORDER BY avg_expression DESC
    LIMIT 10
""")
```

## 🧠 Train Foundation Models - ML Training

SLAF provides efficient tokenization and dataloaders for training foundation models:

### Tokenization

```python
from slaf.ml import SLAFTokenizer

# Create tokenizer for Geneformer style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="geneformer",
    vocab_size=50000,
    n_expression_bins=10
)

# Geneformer tokenization (gene sequence only)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Example gene IDs
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    max_genes=2048
)

# Create tokenizer for scGPT style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="scgpt",
    vocab_size=50000,
    n_expression_bins=10
)

# scGPT tokenization (gene-expression pairs)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Gene IDs
expr_sequences = [[0.5, 0.8, 0.2], [0.9, 0.1, 0.7]]  # Expression values
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    expr_sequences=expr_sequences,
    max_genes=1024
)
```
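
The `input_ids` and `attention_mask` outputs, and the `n_expression_bins` parameter, can be illustrated with a small pure-Python sketch. This is not `SLAFTokenizer`'s implementation: the helpers `pad_and_mask` and `bin_expression`, the padding ID of `0`, and the equal-width binning are all assumptions made for illustration, and real tokenizers typically also add special tokens.

```python
def pad_and_mask(sequences, max_genes, pad_id=0):
    """Truncate or right-pad each sequence and build a matching mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = list(seq[:max_genes])
        padding = max_genes - len(kept)
        input_ids.append(kept + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(kept) + [0] * padding)
    return input_ids, attention_mask


def bin_expression(values, n_bins, max_value):
    """Map values in [0, max_value] to integer bins 0..n_bins-1."""
    bins = []
    for v in values:
        clipped = min(max(v, 0.0), max_value)
        bins.append(min(int(clipped / max_value * n_bins), n_bins - 1))
    return bins


ids, mask = pad_and_mask([[1, 2, 3], [4, 5, 6, 7, 8]], max_genes=4)
print(ids)   # [[1, 2, 3, 0], [4, 5, 6, 7]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
print(bin_expression([0.5, 0.25, 1.0], n_bins=10, max_value=1.0))  # [5, 2, 9]
```

Fixed-length rows like these are what allow a batch of cells with different numbers of expressed genes to be stacked into a single tensor.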

### DataLoader for Training

```python
from slaf.ml import SLAFDataLoader

# Create DataLoader
dataloader = SLAFDataLoader(
    slaf_array=slaf,
    tokenizer_type="geneformer",  # or "scgpt"
    batch_size=32,
    max_genes=2048
)

# Use with PyTorch training
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    cell_ids = batch["cell_ids"]

    # Your training step here; `model` is your own PyTorch module,
    # assumed in this example to return a scalar loss
    loss = model(input_ids, attention_mask=attention_mask)
    loss.backward()
```

## 🛠️ Command Line Interface

### Data Conversion

```bash
# Convert AnnData to SLAF (included by default)
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf --format hdf5
```

### Data Querying

```bash
# Execute SQL query
slaf query dataset.slaf "SELECT * FROM cells LIMIT 10"

# Save results to CSV
slaf query dataset.slaf "SELECT * FROM cells" --output cells.csv
```

### Dataset Information

```bash
slaf info dataset.slaf
```

## 📚 Documentation

- [SLAF Documentation](https://slaf-project.github.io/slaf/)
- [Quickstart](https://slaf-project.github.io/slaf/getting-started/quickstart/)
- [API Reference](https://slaf-project.github.io/slaf/api/)
- [Examples](https://slaf-project.github.io/slaf/examples/getting-started/)
- [User Guide](https://slaf-project.github.io/slaf/user-guide/how-slaf-works/)
- [Contributing](https://slaf-project.github.io/slaf/development/contributing/) — setup, workflow, and how to contribute
- [Maintainers Guide](https://slaf-project.github.io/slaf/development/maintaining/)

## 💬 Community

- [Discord](https://discord.gg/7Q95RVhURe) — chat, questions, and updates

## 🙏 Acknowledgments

Built on top of

- [Lance](https://lancedb.github.io/lance/) for cloud-native, efficient columnar storage
- [Polars](https://pola.rs/) for lazy, composable, in-memory, zero-copy data processing
