Metadata-Version: 2.1
Name: csaq-quant
Version: 0.3.8
Summary: Causal Salience-Aware Quantization - Mixed precision LLM weights with self-speculative decoding
Home-page: https://github.com/Omdeepb69/csaq-quant
Author: Omdeep Borkar
Author-email: omdeepborkar@gmail.com
Keywords: quantization,llm,compression,inference,causal salience,mixed precision,pytorch,speculative decoding
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.35.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: safetensors
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Provides-Extra: eval
Requires-Dist: accelerate>=0.24.0; extra == "eval"

# CSAQ: Causal Salience-Aware Quantization

**Causal Salience-Aware Quantization (CSAQ)** is a high-performance LLM weight-quantization engine designed to hit precisely defined fractional bit budgets (e.g., exactly 4.0 bits/weight) using mixed-precision formats. Instead of relying on magnitude-based proxies, as methods such as AWQ or GPTQ do, CSAQ measures causal salience directly with first-order Taylor approximations and combines it with co-activation interaction graphs.
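
In the usual first-order Taylor view, a weight's impact on the loss is approximated by |w · ∂L/∂w|. The snippet below is a minimal sketch of that computation in plain PyTorch; the function name `taylor_salience` and the single-batch reduction are illustrative assumptions, not CSAQ's actual API.

```python
import torch

def taylor_salience(model, batch):
    """Minimal sketch (not CSAQ's API): first-order Taylor salience.

    Approximates each weight's causal impact as |w * dL/dw|,
    accumulated from a single calibration batch.
    """
    model.zero_grad()
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    scores = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            # First-order estimate of the loss change if this weight were zeroed
            scores[name] = (param.detach() * param.grad.detach()).abs()
    return scores
```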

## Features

- **Multi-Bit Mixed Precision**: Replaces static, uniform bit-width settings by automatically distributing the available bit widths (1, 2, 4, 8, 16) according to each weight's measured impact, minimizing degradation on critical model pathways.
- **Top-K Jaccard Co-Activation Graphs**: Discovers sets of weights that frequently fire together and groups them into "Atomic Cliques".
- **Shared-Scale Architecture**: Assigns low-precision bit widths to the follower weights in a clique while reusing the quantization scale (S) and zero-point (Z) of the clique's high-salience *Leader*, aggressively compressing parameters without losing scale context.
- **Constant Memory Footprint**: Tracks Jaccard co-activation statistics with an online bit-vector union/intersection accumulator (see the sketch after this list), keeping memory usage flat and avoiding out-of-memory (OOM) errors during calibration.
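
To make the constant-memory co-activation tracking concrete, here is a minimal sketch of an online Jaccard accumulator. It is an illustration under assumptions (dense float counters instead of packed bit-vectors, a simple activation threshold), not CSAQ's internal implementation; the class name `JaccardAccumulator` is hypothetical.

```python
import torch

class JaccardAccumulator:
    """Hypothetical sketch: constant-memory pairwise Jaccard co-activation.

    Memory stays at O(D^2) counters no matter how many calibration tokens
    stream through `update`, which is what keeps calibration from OOMing.
    """

    def __init__(self, dim: int, threshold: float = 0.0):
        self.threshold = threshold
        self.inter = torch.zeros(dim, dim)  # running |A_i ∩ A_j| counts
        self.count = torch.zeros(dim)       # running |A_i| counts

    @torch.no_grad()
    def update(self, activations: torch.Tensor) -> None:
        # activations: (tokens, dim) -> binary "fired" indicator per channel
        fired = (activations > self.threshold).float()
        self.inter += fired.T @ fired       # pairwise co-firing counts
        self.count += fired.sum(dim=0)

    def jaccard(self) -> torch.Tensor:
        # |A_i ∩ A_j| / |A_i ∪ A_j|, with the union via inclusion-exclusion
        union = self.count[:, None] + self.count[None, :] - self.inter
        return self.inter / union.clamp(min=1.0)
```

Cliques can then be read off as groups of channels whose pairwise Jaccard similarity exceeds a threshold, e.g. the `clique_threshold=0.85` used in the Quick Start below.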

## Installation

Install using pip:

```bash
pip install csaq-quant
```
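
The package metadata also declares two optional extras, `dev` (pytest, black, isort) and `eval` (accelerate), which can be pulled in with the standard pip extras syntax:

```bash
pip install "csaq-quant[eval]"  # adds accelerate for evaluation workflows
pip install "csaq-quant[dev]"   # adds pytest, black, isort for development
```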

## Quick Start

### Python API

You can apply CSAQ programmatically through the core `quantize` entry point, managing constraints with `CSAQConfig`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from csaq import quantize, CSAQConfig, build_calibration_data

# 1. Load your standard HF LLM
model_id = "Qwen/Qwen1.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu")

# 2. Extract representative calibration data
calib_data = build_calibration_data(tokenizer, n=32, seq_len=128)

# 3. Configure the fractional bit budget and allowed bit widths (e.g., target exactly 4 bits on average)
config = CSAQConfig(
    target_bits=4.0, 
    bit_options=[1, 2, 4, 8, 16],
    clique_threshold=0.85
)

# 4. Run the quantization pipeline
quantized_model, info = quantize(
    model=model, 
    calib_data=calib_data, 
    config=config, 
    verbose=True
)
```
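
As a follow-up sanity check, assuming `quantize` returns a standard Hugging Face causal LM (an assumption about the return type, not something documented here), you can inspect the in-memory weight footprint and smoke-test generation with ordinary PyTorch/transformers calls:

```python
# Rough in-memory footprint of the returned model's tensors (reflects the
# tensor dtypes actually held in memory, not the packed on-disk format).
size_mib = sum(p.numel() * p.element_size() for p in quantized_model.parameters()) / 2**20
print(f"Parameter footprint: {size_mib:.1f} MiB")

# Smoke-test generation through the standard transformers interface.
inputs = tokenizer("The capital of France is", return_tensors="pt")
output_ids = quantized_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```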

## License

MIT License
