Metadata-Version: 2.4
Name: spatialcheckpoint
Version: 0.1.0
Summary: Spatial heterogeneity profiling of immune checkpoints in spatial transcriptomics
License-Expression: MIT
Keywords: spatial transcriptomics,immune checkpoint,bioinformatics,single-cell,machine learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: scanpy>=1.10
Requires-Dist: squidpy>=1.4
Requires-Dist: anndata>=0.10
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.4
Requires-Dist: lightgbm>=4.0
Requires-Dist: xgboost>=2.0
Requires-Dist: shap>=0.45
Requires-Dist: lifelines>=0.28
Requires-Dist: matplotlib>=3.8
Requires-Dist: seaborn>=0.13
Requires-Dist: pyyaml>=6.0
Requires-Dist: typer>=0.9
Requires-Dist: rich>=13.0
Requires-Dist: tqdm>=4.66
Requires-Dist: optuna>=3.0
Requires-Dist: scipy>=1.11
Requires-Dist: imbalanced-learn>=0.11
Requires-Dist: requests>=2.28
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# SpatialCheckpoint

[![PyPI version](https://img.shields.io/pypi/v/spatialcheckpoint.svg)](https://pypi.org/project/spatialcheckpoint/)
[![Python](https://img.shields.io/pypi/pyversions/spatialcheckpoint.svg)](https://pypi.org/project/spatialcheckpoint/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

**Spatial heterogeneity profiling of immune checkpoints in spatial transcriptomics data.**

SpatialCheckpoint is a bioinformatics pipeline that integrates spatial gene expression profiling, consensus clustering, ensemble ML classification, SHAP interpretability, and clinical survival analysis to characterize immune checkpoint heterogeneity across the tumor microenvironment.

---

## Features

- **Spatial profiling** — region-based checkpoint expression across tumor core, invasive margin, stroma, and immune-enriched zones
- **80+ spatial features** — co-localization scores, spatial gradients, Moran's I autocorrelation, region ratios
- **Archetype discovery** — consensus KMeans + NMF across 6 fixed immune archetypes
- **Ensemble classification** — LightGBM + XGBoost + MLP + Random Forest with SMOTE and Optuna HPO
- **SHAP interpretability** — global and per-class feature importance
- **Clinical associations** — Kaplan-Meier curves, Cox proportional hazards, logistic regression on OS/PFS
- **Bundled gene panel** — 44 curated immune checkpoint genes across 6 functional categories

---

## Installation

```bash
pip install spatialcheckpoint
```

For development:

```bash
git clone https://github.com/yourorg/SpatialCheckpoint.git
cd SpatialCheckpoint
pip install -e ".[dev]"
```

**Requirements:** Python ≥ 3.10

---

## Quick Start

### 5-Minute Demo (Synthetic Data)

The following demo runs entirely on synthetic data — no real Visium files required.

```python
import numpy as np
import pandas as pd
import scanpy as sc
import spatialcheckpoint as scp

print(f"SpatialCheckpoint v{scp.__version__}")

# ── 1. Gene panel ────────────────────────────────────────────────────────────
genes = scp.get_all_checkpoint_genes()
print(f"Checkpoint panel: {len(genes)} genes")
print(f"  e.g. {genes[:5]}")

co_inhibitory = scp.get_category_genes("co_inhibitory_receptors")
print(f"Co-inhibitory receptor genes: {co_inhibitory}")

# ── 2. Synthetic Visium slide ────────────────────────────────────────────────
rng = np.random.default_rng(42)
n_spots, n_genes = 200, 100
checkpoint_genes_subset = genes[:8]
random_genes = [f"GENE{i:04d}" for i in range(n_genes - len(checkpoint_genes_subset))]
all_genes = random_genes + checkpoint_genes_subset

X = rng.negative_binomial(n=2, p=0.5, size=(n_spots, n_genes)).astype(float)
adata = sc.AnnData(X=X)
adata.var_names = pd.Index(all_genes)

# Spatial coordinates (20 × 10 grid)
gx, gy = np.meshgrid(np.arange(20), np.arange(10))
coords = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
coords += rng.uniform(-0.1, 0.1, size=coords.shape)
adata.obsm["spatial"] = coords

# Region annotations
regions = ["tumor_core", "invasive_margin", "stroma", "immune_enriched", "necrotic"]
def assign_region(x: float, y: float) -> str:
    """Toy rule mapping a grid position to a region label."""
    if x < 5 and y < 5:
        return "tumor_core"
    if x < 10 and y < 8:
        return "invasive_margin"
    if x >= 15:
        return "immune_enriched"
    if y >= 8:
        return "necrotic"
    return "stroma"

region_list = [assign_region(x, y) for x, y in coords]
adata.obs["region_type"] = pd.Categorical(region_list, categories=regions)

# ── 3. Spatial feature extraction ───────────────────────────────────────────
engineer = scp.SpatialFeatureEngineer(adata, checkpoint_genes_subset)
features = engineer.extract_all_features(sample_id="demo_sample")
print(f"\nFeature matrix: {features.shape[0]} samples × {features.shape[1]} features")
print(f"  Feature columns (first 5): {list(features.columns[:5])}")

# ── 4. Archetype discovery ───────────────────────────────────────────────────
# Build a multi-sample feature matrix (simulate 30 samples)
n_samples, n_feats = 30, features.shape[1]
feat_data = rng.standard_normal((n_samples, n_feats))
sample_ids = [f"sample_{i:03d}" for i in range(n_samples)]
feature_matrix = pd.DataFrame(feat_data, index=sample_ids, columns=features.columns)

cancer_types = rng.choice(["BRCA", "CRC", "NSCLC"], size=n_samples)
metadata = pd.DataFrame({"cancer_type": cancer_types}, index=sample_ids)

discovery = scp.SpatialArchetypeDiscovery(feature_matrix, metadata)
result = discovery.consensus_clustering(k_range=(2, 5), n_iterations=30)

print(f"\nConsensus clustering:")
print(f"  Optimal k = {result['optimal_k']}")
print(f"  Label distribution: {dict(pd.Series(result['labels']).value_counts())}")

char_df = discovery.characterize_archetypes(result["labels"])
print(f"\nArchetype characterization:")
print(char_df[["archetype_name", "n_samples"]].to_string())

# ── 5. NMF soft membership ───────────────────────────────────────────────────
nmf_result = discovery.run_nmf(k=result["optimal_k"])
print(f"\nNMF decomposition:")
print(f"  W (membership weights): {nmf_result['W'].shape}")
print(f"  H (archetype profiles): {nmf_result['H'].shape}")
print(f"  Explained variance: {nmf_result['explained_variance']:.3f}")
```
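
The `W` matrix from step 5 holds one row of non-negative membership weights per sample. As a rough illustration of how such weights can be read, the sketch below row-normalizes a made-up matrix into proportions and picks the dominant component. The function name `soft_membership` and the demo values are hypothetical; this is not code from the package.

```python
import numpy as np

def soft_membership(W: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Row-normalize non-negative NMF weights and pick the dominant component."""
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    # Guard against all-zero rows so the division is always defined.
    safe = np.where(row_sums == 0, 1.0, row_sums)
    proportions = W / safe
    dominant = proportions.argmax(axis=1)
    return proportions, dominant

# Hypothetical weights for 3 samples and 2 components (not real output).
W_demo = np.array([[3.0, 1.0], [0.5, 1.5], [0.0, 0.0]])
proportions, dominant = soft_membership(W_demo)
print(proportions)
print(dominant)
```

An all-zero row stays all-zero after normalization and falls back to component 0, so such samples are worth flagging separately.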

---

### Python API

#### 1. Data Preprocessing

```python
import spatialcheckpoint as scp

# From Space Ranger output directory
preprocessor = scp.SpatialDataPreprocessor(spaceranger_out_path="path/to/spaceranger/output")
adata = preprocessor.load_visium()
adata = preprocessor.quality_control(adata, min_genes=200, max_mt_pct=25.0)
adata = preprocessor.normalize(adata)
adata.write_h5ad("data/processed/sample01_preprocessed.h5ad")

# Or from an existing H5AD
preprocessor = scp.SpatialDataPreprocessor(h5_path="existing_data.h5ad")
```

#### 2. Load & Cache

```python
loader = scp.SpatialDataLoader(processed_dir="data/processed/")
adata = loader.load("sample01")   # returns cached .h5ad if present
```

#### 3. Checkpoint Profiling

```python
genes = scp.get_all_checkpoint_genes()   # 44 genes, 6 functional categories

profiler = scp.SpatialCheckpointProfiler(adata, genes)
region_expr = profiler.expression_by_region()   # DataFrame: region × gene
hotspots    = profiler.checkpoint_hotspot_detection()  # Moran's I per gene
```
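
For readers unfamiliar with Moran's I, the following is a minimal NumPy sketch of the global statistic with binary distance-threshold weights. It is an illustration of the statistic only, under the assumption of a simple radius-based neighbourhood; the package's own hotspot detection may use a different weighting scheme, and `morans_i` is a hypothetical helper.

```python
import numpy as np

def morans_i(values: np.ndarray, coords: np.ndarray, radius: float) -> float:
    """Global Moran's I with binary weights: w_ij = 1 if 0 < dist(i, j) <= radius."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    diffs = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    weights = ((dist > 0) & (dist <= radius)).astype(float)
    centered = values - values.mean()
    denom = (centered ** 2).sum()
    s0 = weights.sum()
    if denom == 0 or s0 == 0:
        return float("nan")  # constant signal or no neighbours
    num = (weights * np.outer(centered, centered)).sum()
    return float(len(values) / s0 * num / denom)

# Toy 6 x 6 grid of spots.
gx, gy = np.meshgrid(np.arange(6), np.arange(6))
grid = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
clustered = (grid[:, 0] < 3).astype(float)                 # left half high
checker = ((grid[:, 0] + grid[:, 1]) % 2).astype(float)    # alternating pattern

i_clustered = morans_i(clustered, grid, radius=1.0)
i_checker = morans_i(checker, grid, radius=1.0)
print(i_clustered, i_checker)
```

Values near +1 indicate spatially clustered expression, values near -1 indicate a dispersed (checkerboard-like) pattern, and values near 0 indicate no spatial structure.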

#### 4. Spatial Feature Engineering

```python
engineer = scp.SpatialFeatureEngineer(adata, genes)
features  = engineer.extract_all_features(sample_id="sample01")
# → DataFrame with 80+ columns: co-localization, gradients, Moran's I, region ratios
```
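
To make two of the feature families above concrete, here is a hedged sketch of a region expression ratio and a distance-based gradient in plain NumPy. Both helpers (`region_ratio`, `distance_gradient`) and the toy data are assumptions made for illustration; the actual feature definitions inside `SpatialFeatureEngineer` may differ.

```python
import numpy as np

def region_ratio(expr, regions, numerator, denominator, eps=1e-9):
    """Mean expression in one region divided by mean expression in another."""
    expr = np.asarray(expr, dtype=float)
    regions = np.asarray(regions)
    num_mask = regions == numerator
    den_mask = regions == denominator
    if not num_mask.any() or not den_mask.any():
        return float("nan")  # one of the regions is absent from this slide
    return float(expr[num_mask].mean() / (expr[den_mask].mean() + eps))

def distance_gradient(expr, coords, center):
    """Least-squares slope of expression against distance from a reference point."""
    dist = np.linalg.norm(np.asarray(coords, float) - np.asarray(center, float), axis=1)
    slope, _intercept = np.polyfit(dist, np.asarray(expr, float), deg=1)
    return float(slope)

# Toy data: five spots on a line (hypothetical values).
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
regions = np.array(["tumor_core", "tumor_core", "invasive_margin",
                    "invasive_margin", "stroma"])
expr = np.array([1.0, 1.0, 4.0, 6.0, 3.0])

margin_vs_core = region_ratio(expr, regions, "invasive_margin", "tumor_core")
slope = distance_gradient(2.0 * coords[:, 0] + 1.0, coords, center=(0.0, 0.0))
print(margin_vs_core, slope)
```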

#### 5. Co-localization Analysis

```python
lr_pairs = scp.get_ligand_receptor_pairs()   # [{ligand, receptor, alias}]
analyzer = scp.CheckpointColocalizationAnalyzer(adata, genes)
coloc_df = analyzer.compute_colocalization()
```
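
One simple way to think about ligand-receptor co-localization is to correlate ligand expression at each spot with receptor expression averaged over that spot's neighbourhood. The sketch below does exactly that; it is an assumed scoring scheme chosen for illustration, not necessarily what `compute_colocalization()` implements, and both helper names are hypothetical.

```python
import numpy as np

def neighborhood_mean(values, coords, radius):
    """Average a per-spot signal over all spots within `radius` (self included)."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    weights = (dist <= radius).astype(float)
    return weights @ np.asarray(values, dtype=float) / weights.sum(axis=1)

def colocalization_score(ligand, receptor, coords, radius):
    """Pearson correlation between ligand and neighbourhood-averaged receptor."""
    smoothed = neighborhood_mean(receptor, coords, radius)
    return float(np.corrcoef(np.asarray(ligand, dtype=float), smoothed)[0, 1])

# Six spots on a line; the ligand is expressed on the left half only.
line = np.column_stack([np.arange(6.0), np.zeros(6)])
ligand = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

score_same = colocalization_score(ligand, ligand, line, radius=1.0)
score_opposite = colocalization_score(ligand, 1.0 - ligand, line, radius=1.0)
print(score_same, score_opposite)
```

A receptor that shares the ligand's territory scores strongly positive, while one confined to the opposite half scores equally strongly negative.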

#### 6. Archetype Discovery

```python
# feature_matrix: DataFrame (n_samples × n_features)
# sample_metadata: DataFrame with 'cancer_type' column, same index as feature_matrix
discovery = scp.SpatialArchetypeDiscovery(feature_matrix, sample_metadata)

cc     = discovery.consensus_clustering(k_range=(2, 8), n_iterations=100)
labels = cc["labels"]           # integer cluster labels
char   = discovery.characterize_archetypes(labels)   # archetype names, top features

nmf    = discovery.run_nmf(k=cc["optimal_k"])
# nmf["W"]  → (n_samples, k) soft membership weights
# nmf["H"]  → (k, n_features) archetype profiles
```
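
The idea behind consensus clustering can be sketched in a few lines: cluster many random subsamples and record how often each pair of samples lands in the same cluster. The example below is a simplified illustration under stated assumptions. It uses a plain NumPy Lloyd's iteration as a stand-in for a library KMeans, a hypothetical 80% subsampling rate, and made-up data; it does not reproduce the package's k-selection.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=25):
    """Plain Lloyd's k-means; a stand-in for a library KMeans."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def consensus_matrix(X, k, n_iterations=30, subsample=0.8, seed=0):
    """Fraction of co-sampled runs in which each pair of samples shares a cluster."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    size = max(k, int(round(subsample * n)))
    for _ in range(n_iterations):
        idx = rng.choice(n, size=size, replace=False)
        labels = kmeans_labels(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)

# Two well-separated synthetic groups of 15 samples each.
demo_rng = np.random.default_rng(1)
X_demo = np.vstack([demo_rng.normal(0.0, 0.5, size=(15, 4)),
                    demo_rng.normal(8.0, 0.5, size=(15, 4))])
C = consensus_matrix(X_demo, k=2)
within = C[:15, :15].mean()
between = C[:15, 15:].mean()
print(within, between)
```

A consensus matrix close to block-diagonal (entries near 0 or 1) suggests a stable choice of `k`; many intermediate values suggest the clusters are not reproducible.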

#### 7. Train the Ensemble Classifier

```python
trainer = scp.ArchetypeModelTrainer(
    feature_matrix=feature_matrix,
    archetype_labels=labels,
    output_dir="models/",
)
results = trainer.run(n_optuna_trials=30)
# results["model"]         → trained ensemble
# results["test_metrics"]  → accuracy, F1, AUC
```
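
How the four base models are combined is not described here. One common approach is soft voting, where the per-model class probabilities are averaged, optionally with weights. The sketch below shows that technique on made-up probabilities; `soft_vote` is a hypothetical helper and should not be read as the package's ensemble logic.

```python
import numpy as np

def soft_vote(probas, weights=None):
    """Average per-model class probabilities; return (mean_proba, predicted_class)."""
    # Stacked shape: (n_models, n_samples, n_classes)
    stacked = np.stack([np.asarray(p, dtype=float) for p in probas])
    mean_proba = np.average(stacked, axis=0, weights=weights)
    return mean_proba, mean_proba.argmax(axis=1)

# Hypothetical probabilities from two models for two samples and two classes.
model_a = [[0.6, 0.4], [0.2, 0.8]]
model_b = [[0.3, 0.7], [0.4, 0.6]]

equal_proba, equal_pred = soft_vote([model_a, model_b])
weighted_proba, weighted_pred = soft_vote([model_a, model_b], weights=[3, 1])
print(equal_pred, weighted_pred)
```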

#### 8. SHAP Explanations

```python
explainer = scp.ArchetypeExplainer(results["model"], feature_matrix)
shap_df   = explainer.global_feature_importance()   # DataFrame: feature × archetype
```
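
A feature × archetype importance table is typically obtained by averaging absolute SHAP values over samples. The sketch below assumes SHAP values arranged as `(n_samples, n_features, n_classes)` and uses made-up numbers; the layout and the helper name `mean_abs_importance` are assumptions, not the package's internals.

```python
import numpy as np

def mean_abs_importance(shap_values):
    """Collapse (n_samples, n_features, n_classes) SHAP values to (n_features, n_classes)."""
    shap_values = np.asarray(shap_values, dtype=float)
    if shap_values.ndim != 3:
        raise ValueError("expected shape (n_samples, n_features, n_classes)")
    return np.abs(shap_values).mean(axis=0)

# Hypothetical SHAP values: 2 samples, 2 features, 2 classes.
demo_shap = np.array([
    [[1.0, -2.0], [0.0, 0.5]],
    [[-3.0, 2.0], [0.0, 1.5]],
])
importance = mean_abs_importance(demo_shap)
# Feature indices per class, most important first.
ranking = np.argsort(-importance, axis=0)
print(importance)
print(ranking)
```

Taking the absolute value before averaging matters: positive and negative contributions would otherwise cancel and hide influential features.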

---

### CLI

```bash
# Download a registered dataset
spatialcheckpoint download BRCA_visium_10x

# Download all BRCA datasets
spatialcheckpoint download all --cancer-type BRCA

# Preprocess raw Visium output or H5AD
spatialcheckpoint preprocess path/to/spaceranger/  data/processed/
spatialcheckpoint preprocess sample.h5ad           data/processed/

# Run full spatial analysis on a preprocessed sample
spatialcheckpoint analyze sample01

# Discover archetypes from a feature matrix CSV
spatialcheckpoint discover results/sample01/features.csv --k-min 2 --k-max 8

# Train the archetype classifier
spatialcheckpoint classify features.csv archetype_labels.csv --model-dir models/

# Generate publication figures (requires prior analyze run)
spatialcheckpoint figures --results-dir results/ --output-dir paper/figures/
```

---

## Gene Panel

The bundled panel covers **44 genes** across 6 functional categories:

| Category | Genes (examples) |
|----------|-----------------|
| Co-inhibitory receptors | `PDCD1` (PD-1), `CTLA4`, `LAG3`, `HAVCR2` (TIM-3), `TIGIT` |
| Co-inhibitory ligands | `CD274` (PD-L1), `PDCD1LG2` (PD-L2), `LGALS9` (Galectin-9) |
| Novel checkpoints | `VSIR` (VISTA), `CD276` (B7-H3), `VTCN1` (B7-H4) |
| Innate checkpoints | `CD47`, `SIRPA`, `LILRB1`, `LILRB2` |
| Immune enzymes | `IDO1`, `ENTPD1` (CD39), `NT5E` (CD73), `ARG1` |
| Co-stimulatory reference | `CD28`, `ICOS`, `TNFRSF4` (OX40), `TNFRSF9` (4-1BB) |

```python
import spatialcheckpoint as scp

all_genes    = scp.get_all_checkpoint_genes()                      # sorted list of 44 genes
receptors    = scp.get_category_genes("co_inhibitory_receptors")   # 9 genes
cell_markers = scp.get_immune_cell_markers()                       # {cell_type: [genes]}
lr_pairs     = scp.get_ligand_receptor_pairs()                     # [{ligand, receptor, alias}]
```

---

## Archetypes

Samples are grouped by consensus clustering and described in terms of six predefined spatial immune archetypes:

| Archetype | Spatial signature |
|-----------|------------------|
| `Checkpoint-Hot` | High checkpoint expression, high immune infiltration, strong spatial co-localization |
| `Checkpoint-Cold` | Low checkpoint and immune activity throughout the tissue |
| `Checkpoint-Excluded` | Checkpoint expression concentrated at invasive margin; immune cells at periphery |
| `Checkpoint-Mismatch` | Checkpoint and immune signals spatially separated (non-overlapping) |
| `Innate-Dominant` | CD47/SIRPα axis dominant over adaptive checkpoints |
| `Novel-Enriched` | VISTA / B7-H3 / B7-H4 enriched over canonical PD-1/PD-L1 axis |

---

## Pipeline Architecture

```
Raw Visium data (Space Ranger dir or H5AD)
  → SpatialDataPreprocessor      QC, normalize → 'counts' / 'log1p' layers
  → SpatialDataLoader            cache-aware H5AD loader
  → SpatialCheckpointProfiler    region-based expression
                                 (tumor_core, invasive_margin, stroma,
                                  immune_enriched, necrotic)
  → SpatialFeatureEngineer       80+ features per slide:
                                  co-localization, gradients, Moran's I,
                                  region expression ratios
  → SpatialArchetypeDiscovery    consensus KMeans + delta-area k-selection
                                  + NMF soft membership
  → ArchetypeModelTrainer        LightGBM + XGBoost + MLP + RF ensemble,
                                  SMOTE oversampling, RFECV feature selection,
                                  Optuna hyperparameter optimization
  → ArchetypeExplainer           SHAP global / per-class feature importance
  → ClinicalAssociationAnalyzer  KM curves, Cox PH, logistic regression (OS/PFS)
  → Visualization                spatial plots, publication-ready figures
```
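
The Kaplan-Meier curves produced at the clinical step rest on the product-limit estimator. As a reminder of what that estimator computes, here is a small NumPy sketch on made-up follow-up data. The package lists `lifelines` as its survival dependency; this hand-written version is for illustration only and `kaplan_meier` is a hypothetical helper.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Product-limit survival estimate at each distinct event time."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    times = np.unique(durations[events])
    survival = []
    s = 1.0
    for t in times:
        at_risk = (durations >= t).sum()              # still under observation at t
        deaths = ((durations == t) & events).sum()    # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return times, np.array(survival)

# Hypothetical follow-up in months; event = 1 means death, 0 means censored.
km_times, km_survival = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
print(km_times, km_survival)
```

Censored samples never trigger a drop in the curve, but they do leave the risk set, which is why the last step is larger than the first.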

**Key data contracts:**
- Spatial coordinates in `adata.obsm['spatial']`
- Region annotations in `adata.obs['region_type']` (categorical)
- Preprocessed files: `data/processed/{sample_id}_preprocessed.h5ad`
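
The first two contracts can be checked before running the pipeline. The sketch below is a duck-typed check that works on any object exposing `.obsm` and `.obs` the way AnnData does; `check_contracts` is a hypothetical helper, and the stand-in object built with `SimpleNamespace` only mimics those two attributes.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

def check_contracts(adata) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if "spatial" not in adata.obsm:
        problems.append("missing adata.obsm['spatial']")
    else:
        coords = np.asarray(adata.obsm["spatial"])
        if coords.ndim != 2 or coords.shape[1] < 2:
            problems.append("adata.obsm['spatial'] should be (n_spots, 2)")
    if "region_type" not in adata.obs.columns:
        problems.append("missing adata.obs['region_type']")
    elif not isinstance(adata.obs["region_type"].dtype, pd.CategoricalDtype):
        problems.append("adata.obs['region_type'] should be categorical")
    return problems

# Stand-ins for AnnData objects (hypothetical, for demonstration only).
good = SimpleNamespace(
    obsm={"spatial": np.zeros((3, 2))},
    obs=pd.DataFrame({"region_type": pd.Categorical(["stroma", "stroma", "necrotic"])}),
)
bad = SimpleNamespace(obsm={}, obs=pd.DataFrame({"region_type": ["stroma", "necrotic"]}))

good_problems = check_contracts(good)
bad_problems = check_contracts(bad)
print(good_problems, bad_problems)
```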

---

## Output Files

| Path | Contents |
|------|----------|
| `results/{sample_id}/features.csv` | 80+ spatial features |
| `results/{sample_id}/region_expression.csv` | Region × gene expression stats |
| `results/{sample_id}/hotspots.csv` | Moran's I per gene |
| `results/{sample_id}/colocalization.csv` | Ligand-receptor co-occurrence |
| `results/archetypes/archetype_labels.csv` | Sample → archetype assignment |
| `results/archetypes/archetype_characteristics.csv` | Per-archetype feature profiles |
| `results/archetypes/nmf_W.csv`, `nmf_H.csv` | NMF membership weights (`W`) and archetype profiles (`H`) |
| `models/archetype_classifier.joblib` | Serialized ensemble model |
| `paper/figures/` | Publication-ready PDF/PNG plots |
| `paper/tables/` | Feature importance and archetype CSV tables |

---

## API Reference

### Gene Set Utilities

| Function | Description |
|----------|-------------|
| `get_all_checkpoint_genes()` | Sorted list of 44 checkpoint gene symbols |
| `get_category_genes(category)` | Genes for a specific functional category |
| `get_immune_cell_markers()` | `{cell_type: [genes]}` reference marker dictionary |
| `get_ligand_receptor_pairs()` | List of `{ligand, receptor, alias}` pairs |

### Core Classes

| Class | Module | Purpose |
|-------|--------|---------|
| `SpatialDataPreprocessor` | `data.preprocess` | QC, normalize, dual-input (Space Ranger or H5AD) |
| `SpatialDataLoader` | `data.loader` | Cache-aware loader for preprocessed H5ADs |
| `SpatialCheckpointProfiler` | `analysis.spatial_expression` | Region-based expression, hotspot detection |
| `SpatialFeatureEngineer` | `analysis.spatial_features` | 80+ spatial feature extraction |
| `CheckpointColocalizationAnalyzer` | `analysis.colocalization` | Ligand-receptor spatial co-occurrence |
| `SpatialArchetypeDiscovery` | `model.archetype_discovery` | Consensus clustering + NMF |
| `SpatialArchetypeClassifier` | `model.classifier` | Ensemble classifier (LGBM+XGB+MLP+RF) |
| `ArchetypeModelTrainer` | `model.trainer` | Full train pipeline with HPO |
| `ArchetypeExplainer` | `model.explainer` | SHAP global/per-class importance |

---

## Development

```bash
# Clone and install in dev mode
git clone https://github.com/yourorg/SpatialCheckpoint.git
cd SpatialCheckpoint
pip install -e ".[dev]"

# Run tests (uses synthetic fixtures — no real data needed)
pytest tests/ -v

# Lint
ruff check src/
```

### Testing with synthetic data

All tests use synthetic fixtures from `tests/conftest.py`. No real Visium files are required:

- 200-spot × 100-gene AnnData with spatial coordinates and region labels
- 50-sample × 80-feature DataFrame
- Clinical data with OS, PFS, and ICI response

---

## Dependencies

Core: `scanpy`, `squidpy`, `anndata`, `pandas`, `numpy`, `scipy`, `scikit-learn`

ML: `lightgbm`, `xgboost`, `shap`, `imbalanced-learn`, `optuna`

Stats: `lifelines`

Viz: `matplotlib`, `seaborn`

CLI: `typer`, `rich`

Utilities: `pyyaml`, `tqdm`, `requests`

Heavy dependencies (`squidpy`, `lightgbm`, `xgboost`, `lifelines`, `optuna`, `shap`, `imbalanced-learn`) are imported with `try/except` fallbacks, so partial functionality remains available even when they are not installed.

---

## Citation

If you use SpatialCheckpoint in your research, please cite:

```bibtex
@article{spatialcheckpoint2025,
  title   = {SpatialCheckpoint: Spatial heterogeneity profiling of immune checkpoints
             in spatial transcriptomics},
  author  = {},
  journal = {},
  year    = {2025},
}
```

---

## License

MIT License — see [LICENSE](LICENSE) for details.
