Metadata-Version: 2.4
Name: turboquant-gpu
Version: 0.1.1
Summary: TurboQuant KV cache compression for LLM inference — cuTile GPU kernels
Author: Anirudh Bharadwaj Vangara
License-Expression: MIT
Keywords: quantization,kv-cache,llm,inference,cutile,cuda,gpu,attention,blackwell,hopper,h100,b200
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch
Requires-Dist: scipy
Provides-Extra: gpu
Requires-Dist: cuda-tile; extra == "gpu"
Dynamic: license-file

# turboquant-gpu

**5.02x KV cache compression for LLM inference** — GPU-accelerated cuTile kernels with PyTorch fallback.

```bash
pip install turboquant-gpu
```
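
The cuTile kernels come in via the `gpu` extra (plain installs get the PyTorch fallback):

```bash
pip install "turboquant-gpu[gpu]"
```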

## quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant_gpu import TurboQuantEngine
import torch

model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")
tok   = AutoTokenizer.from_pretrained(model_id)

engine = TurboQuantEngine(head_dim=128, total_bits=3, device="cuda")
result = engine.generate(model, tok, "The key to efficient LLM inference is")

print(result["text"])
print(f"{result['tokens']} tokens | {result['stats']['ratio']:.1f}x compression")
```

## how it works

Implements the [TurboQuant](https://arxiv.org/abs/2501.09747) algorithm:

1. **normalize + rotate** — random orthogonal rotation (Pi) makes coordinates near-Gaussian
2. **Lloyd-Max quantize** — optimal scalar quantization against N(0, 1/d)

Both keys and values are compressed to 3 bits per coordinate via MSE-optimal Lloyd-Max
quantization, then reconstructed for standard attention. The package also includes
fused attention kernels with QJL bias correction (a 1-bit sign sketch of the
quantization residual), but these are not yet exposed in the high-level API.
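
A minimal PyTorch sketch of the compress/decompress round trip (illustrative only; the
package implements this in cuTile kernels, and the QJL sign sketch is omitted here):

```python
import torch

d = 128                                  # head_dim
g = torch.Generator().manual_seed(0)

# random orthogonal rotation Pi, shared by compress and decompress
Pi, _ = torch.linalg.qr(torch.randn(d, d, generator=g))

# fit a 3-bit (8-level) Lloyd-Max codebook for N(0, 1/d) with 1-D k-means
samples = torch.randn(100_000, generator=g) / d**0.5
codebook = samples.quantile(torch.linspace(0.0625, 0.9375, 8))
for _ in range(50):
    idx = (samples[:, None] - codebook[None, :]).abs().argmin(dim=1)
    for j in range(8):
        if (idx == j).any():
            codebook[j] = samples[idx == j].mean()

def compress(x):                         # x: (..., d) key or value vectors
    norms = x.norm(dim=-1, keepdim=True)
    z = (x / norms) @ Pi.T               # normalize, then rotate
    codes = (z[..., None] - codebook).abs().argmin(dim=-1)
    return codes.to(torch.uint8), norms  # 3-bit codes plus per-vector norms

def decompress(codes, norms):
    z_hat = codebook[codes.long()]       # dequantize
    return (z_hat @ Pi) * norms          # un-rotate, rescale
```

The headline 5.02x (vs. the raw 16/3 ≈ 5.3x for fp16 inputs) is consistent with the
per-vector norms and QJL signs being stored alongside the 3-bit codes.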

## step-by-step api

```python
engine = TurboQuantEngine(head_dim=128, total_bits=3, device="cuda")

# run prefill to populate the KV cache (model and tok from the quick start)
inputs = tok("your prompt here", return_tensors="pt").to("cuda")
out    = model(**inputs, use_cache=True)

compressed = engine.compress_kv_cache(out.past_key_values)
cache      = engine.build_cache(compressed)
stats      = engine.compression_stats(out.past_key_values)

# or just do it all in one call:
result = engine.generate(model, tok, "your prompt here")
```

## gpu support

Written in [cuTile](https://docs.nvidia.com/cuda/cutile-python/) for cross-architecture portability.
Falls back to PyTorch if cuTile or a compatible driver isn't available.

| GPU family | Architecture | Status |
|------------|-------------|--------|
| A100       | Ampere      | supported (PyTorch fallback) |
| H100       | Hopper      | supported |
| B200/B300  | Blackwell   | supported + swizzle fast path |
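
To see which row applies to your machine, a generic PyTorch capability check (not part
of this package's API) is enough:

```python
import torch

# compute capability: 8.x = Ampere, 9.x = Hopper, 10.x = Blackwell
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    path = "cuTile path" if major >= 9 else "PyTorch fallback"  # per the table above
    print(f"sm_{major}{minor}: {path}")
```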

## kernels

| kernel | what it does | status |
|--------|-------------|--------|
| `compress_keys` | normalize → rotate(Pi^T) → Lloyd-Max quantize → QJL signs | used |
| `compress_values` | normalize → rotate(Pi^T) → Lloyd-Max quantize | used |
| `decompress_values` | dequantize → un-rotate(Pi) → scale by norms | used |
| `attention_scores` | asymmetric dot product with QJL correction | included, not in API |
| `fused_attention` | scores + online softmax + V accumulation | included, not in API |
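
For orientation, the streaming pattern behind `fused_attention` (blockwise scores,
online softmax, V accumulation) looks like this in plain PyTorch. This is a sketch of
the general technique for a single query, not the cuTile code, and it skips the QJL
correction:

```python
import torch

def fused_attention_reference(q, K, V, block=64):
    """Online-softmax attention for one query q over K/V, processed in blocks."""
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))   # running max of logits
    l = torch.tensor(0.0)             # running softmax denominator
    acc = torch.zeros(d)              # running weighted sum of V rows
    for s in range(0, K.shape[0], block):
        k_blk, v_blk = K[s:s + block], V[s:s + block]
        scores = (k_blk @ q) / d**0.5         # block of attention logits
        m_new = torch.maximum(m, scores.max())
        scale = torch.exp(m - m_new)          # rescale earlier partial sums
        p = torch.exp(scores - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l                    # equals softmax(K @ q / sqrt(d)) @ V
```

Its output matches `torch.softmax((K @ q) / d**0.5, dim=0) @ V` to floating-point
tolerance, which is what makes the single-pass kernel formulation possible.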

## license

MIT
