Metadata-Version: 2.4
Name: step-distill
Version: 0.1.9
Summary: Distill verifiable chain-of-thought reasoning into small language models via hierarchical step supervision
Project-URL: Homepage, https://github.com/ductaiphan/step-distill
Project-URL: Documentation, https://ductaiphan.github.io/step-distill
Project-URL: Repository, https://github.com/ductaiphan/step-distill
Project-URL: HuggingFace Model, https://huggingface.co/ductaiphan/NanoReason-3B
Project-URL: Paper, https://github.com/ductaiphan/step-distill
License: MIT
License-File: LICENSE
Keywords: chain-of-thought,distillation,llm,lora,math,nlp,reasoning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: peft>=0.9.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: torch>=2.1.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: transformers>=4.38.0
Requires-Dist: typer>=0.9.0
Provides-Extra: all
Requires-Dist: accelerate>=0.27.0; extra == 'all'
Requires-Dist: black>=24.0.0; extra == 'all'
Requires-Dist: datasets>=2.17.0; extra == 'all'
Requires-Dist: fastapi>=0.115.0; extra == 'all'
Requires-Dist: httpx>=0.27.0; extra == 'all'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'all'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'all'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'all'
Requires-Dist: pytest-cov>=4.1.0; extra == 'all'
Requires-Dist: pytest>=8.0.0; extra == 'all'
Requires-Dist: ruff>=0.4.0; extra == 'all'
Requires-Dist: tensorboard>=2.15.0; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.24.0; extra == 'all'
Requires-Dist: websockets>=12.0; extra == 'all'
Provides-Extra: app
Requires-Dist: fastapi>=0.115.0; extra == 'app'
Requires-Dist: uvicorn[standard]>=0.24.0; extra == 'app'
Requires-Dist: websockets>=12.0; extra == 'app'
Provides-Extra: dev
Requires-Dist: black>=24.0.0; extra == 'dev'
Requires-Dist: httpx>=0.27.0; extra == 'dev'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'dev'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Provides-Extra: train
Requires-Dist: accelerate>=0.27.0; extra == 'train'
Requires-Dist: datasets>=2.17.0; extra == 'train'
Requires-Dist: tensorboard>=2.15.0; extra == 'train'
Description-Content-Type: text/markdown

<div align="center">

# 🧠 step-distill

**Distill Verifiable Chain-of-Thought Reasoning into Small Language Models**

*Open-source framework implementing Step-Aware LoRA — hierarchical step supervision that transfers structured mathematical reasoning from large teacher models into compact student models deployable on commodity hardware.*

[![PyPI version](https://img.shields.io/pypi/v/step-distill.svg)](https://pypi.org/project/step-distill/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![HuggingFace Model](https://img.shields.io/badge/🤗%20Model-NanoReason--3B-orange)](https://huggingface.co/ductaiphan/NanoReason-3B)
[![GitHub Stars](https://img.shields.io/github/stars/ductaiphan/step-distill?style=social)](https://github.com/ductaiphan/step-distill)
[![Coverage](https://codecov.io/gh/ductaiphan/step-distill/branch/main/graph/badge.svg)](https://codecov.io/gh/ductaiphan/step-distill)
[![Paper](https://img.shields.io/badge/Paper-Springer%20LNCS-red)](https://github.com/ductaiphan/step-distill)

<p align="center">
  <a href="#-quick-start">Quick Start</a> •
  <a href="#-demo-app">Demo App</a> •
  <a href="#-train-your-own">Train Your Own</a> •
  <a href="#-results">Results</a> •
  <a href="#-architecture">Architecture</a> •
  <a href="#-citation">Citation</a>
</p>

</div>

---

## 🤖 Context for Antigravity (Read Before Writing Any Code)

> This section is for the AI building this codebase. Read fully before starting.

### What already exists — do NOT re-implement

All of the following have been **experimentally validated** on Kaggle 2×T4:

| Artifact | Status | Notes |
|---|---|---|
| Training pipeline | ✅ Done | Kaggle notebook, 2×T4, 2 epochs |
| NanoReason-3B weights | ✅ Done | Uploaded to `ductaiphan/NanoReason-3B` |
| GSM8K results | ✅ Done | 78.7% zero-shot exact-match |
| Ablation study | ✅ Done | 5 variants, per-step accuracy |
| RFS/SVR metrics | ✅ Done | RFS=100%, SVR=0% on 1,319 problems |

### What you BUILD

A clean, installable Python package wrapping the above research. This is **software engineering**, not research. Algorithms are known — implement them cleanly and match the API signatures exactly.

### Critical technical facts (from actual code)

1. **Tag format used in training** (from `StepType.tag` in code):
   ```
   #### UNDERSTAND ####
   #### PLAN ####
   #### EXECUTE ####
   #### VERIFY ####
   ```
   Use exactly this format — four `#` characters, the stage name in uppercase, four `#` characters.

2. **Teacher model**: `Qwen/Qwen2.5-32B-Instruct-AWQ` (quantized, served via vLLM). For the framework, any OpenAI-compatible API works.

3. **Training hardware**: 2×T4 (29GB total). Inference/demo: single T4 (16GB). Peak training VRAM: ~9.8GB per GPU.

4. **LoRA rank**: Fixed at 16. Do not implement adaptive rank — the paper uses static rank=16.

5. **Validated loss weights**: α=0.15, β=0.35, γ=0.35, δ=0.15, ε=0.10. Use these exact values as defaults.

6. **Training**: seed=42, batch_size=1, grad_accum=4, max_seq_length=1024, LR=1e-4, warmup=300, cosine decay, label_smoothing=0.1, 2 epochs.

7. **Data format** (from actual `.jsonl` files):
   ```json
   {
     "id": "math_9650",
     "question": "...",
     "cot_4steps": {
       "understand": "...",
       "plan": "...",
       "execute": "...",
       "verify": "..."
     },
     "ground_truth": "697",
     "is_correct": true
   }
   ```

8. **No build step for frontend**. Vanilla JS + CSS. Single HTML file. No npm/webpack/React.
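
As a concrete illustration of facts 1 and 7 together, here is a minimal sketch (a hypothetical helper, not part of the package API) that splits a raw trace on the exact `#### STAGE ####` delimiters into the `cot_4steps` dict shape:

```python
import re

# Hypothetical helper: split a trace on the exact tag format from fact 1
# into the cot_4steps dict shape from fact 7.
STAGE_TAG = re.compile(r"^#### (UNDERSTAND|PLAN|EXECUTE|VERIFY) ####$", re.MULTILINE)

def split_trace(text: str) -> dict[str, str]:
    """Return {"understand": ..., "plan": ..., "execute": ..., "verify": ...}."""
    parts = STAGE_TAG.split(text)
    # parts = [preamble, "UNDERSTAND", body, "PLAN", body, ...]
    return {
        parts[i].lower(): parts[i + 1].strip()
        for i in range(1, len(parts) - 1, 2)
    }

trace = (
    "#### UNDERSTAND ####\nGiven: 16 eggs/day.\n"
    "#### PLAN ####\nSubtract, then multiply.\n"
    "#### EXECUTE ####\n16 - 3 - 4 = 9; 9 * 2 = 18.\n"
    "#### VERIFY ####\nBack-check: 9 * 2 = 18.\n"
)
steps = split_trace(trace)
```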

### Build priority order

```
Priority 1 (Demo-critical — build first):
  Phase 1 → 2 → 4 → 9 → 10 → 11

Priority 2 (Training pipeline):
  Phase 3 → 5 → 6 → 7

Priority 3 (Completeness):
  Phase 8 → 12 → 13 → 14 → 15
```

---

## ✨ What is step-distill?

`step-distill` is an open-source framework for **structured reasoning distillation** — training small language models (1B–14B) to produce transparent, step-by-step mathematical reasoning by learning from a larger teacher model.

**The core problem with standard SFT:** Uniform cross-entropy treats every token equally. A mistyped connector word and a catastrophically wrong formula receive the same gradient penalty. This teaches surface text patterns, not reasoning structure.

**Our solution:** Decompose reasoning into 4 cognitively isolated stages and penalize planning/execution errors proportionally to their causal impact.

```
Teacher (32B AWQ)                    Student (3B)
     │                                    │
     │  Generate structured traces        │
     │ ─────────────────────────────────► │  Runs on
     │  #### UNDERSTAND ####              │  single T4 GPU
     │  #### PLAN ####                    │  (~9.8GB VRAM)
     │  #### EXECUTE ####                 │  $0 on Kaggle
     │  #### VERIFY ####                  │
```

---

## 🚀 Key Features

| Feature | Description |
|---|---|
| 🎯 **Step-Aware Loss** | Asymmetric gradient: PLAN/EXECUTE weighted 2.33× over UNDERSTAND |
| 📊 **Novel Metrics** | RFS + SVR — measure process quality, not just final accuracy |
| 🔌 **Plug-and-Play** | Swap teacher/student models via one YAML config line |
| 💻 **T4-Optimized** | Single T4 inference; 2×T4 training via SDPA + AMP + grad checkpointing |
| 🎓 **Ready Model** | `NanoReason-3B`: 78.7% zero-shot GSM8K, out of the box |
| 🇻🇳 **Vietnamese Math** | VNHSGE benchmark support — cross-lingual transfer finding |
| 🖥️ **Demo App** | Student-friendly streaming UI with 4-step reasoning cards |
| 📦 **One-line Install** | `pip install step-distill` |

---

## ⚡ Quick Start

```bash
pip install step-distill
```

```python
from step_distill import NanoReason

model = NanoReason.from_pretrained("ductaiphan/NanoReason-3B")

result = model.solve(
    "Janet's ducks lay 16 eggs per day. She eats 3 for breakfast "
    "and bakes muffins with 4. She sells the rest at $2/egg. "
    "How much does she make daily?"
)

print(result)
# ┌─ UNDERSTAND ──────────────────────────────────┐
# │ Given: 16 eggs/day, eat 3, bake 4, sell @$2   │
# │ Goal: calculate daily earnings                 │
# └───────────────────────────────────────────────┘
# ┌─ PLAN ────────────────────────────────────────┐
# │ Step 1: remaining = 16 - 3 - 4                │
# │ Step 2: earnings = remaining × $2             │
# └───────────────────────────────────────────────┘
# ┌─ EXECUTE ─────────────────────────────────────┐
# │ 16 - 3 = 13 → 13 - 4 = 9 → 9 × 2 = 18       │
# └───────────────────────────────────────────────┘
# ┌─ VERIFY ──────────────────────────────────────┐
# │ Back-check: 9 × $2 = $18 ✓                   │
# └───────────────────────────────────────────────┘
# Answer: $18

print(result.answer)  # "18"
print(result.rfs)     # 1.0
```

```bash
# Launch student-friendly demo app
step-distill demo
# → http://localhost:8000
```

---

## 🎓 Demo App

- 🎨 **4 colored reasoning cards** — UNDERSTAND (blue) / PLAN (purple) / EXECUTE (amber) / VERIFY (emerald)
- ⚡ **Real-time token streaming** — watch the model think step by step
- 📝 **Problem bank** — 15 curated Vietnamese + English problems
- 🔲 **Toggle mode** — "Show Steps / Answer Only" for classroom use
- 📱 **Mobile responsive**

---

## 🏋️ Train Your Own

```python
from step_distill import TeacherPipeline, StepAwareTrainer, StepDistillConfig

# Step 1: Generate training data
pipeline = TeacherPipeline(
    teacher_model="Qwen/Qwen2.5-32B-Instruct",  # or AWQ quantized
    temperature=0.6,
    top_p=0.95,
)
dataset = pipeline.generate(
    sources=["gsm8k", "math", "vnhsge"],
    output_path="./data/step_traces.jsonl",
    quality_filters=["syntax", "math", "consistency"],
)

# Step 2: Fine-tune with Step-Aware LoRA
config = StepDistillConfig(
    student_model="Qwen/Qwen2.5-3B-Instruct",
    # Validated loss weights from paper
    alpha=0.15,    # UNDERSTAND
    beta=0.35,     # PLAN       ← highest: planning errors cascade
    gamma=0.35,    # EXECUTE    ← highest: arithmetic correctness
    delta=0.15,    # VERIFY
    epsilon=0.10,  # Transition penalty
    # LoRA
    lora_rank=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    # Training
    num_epochs=2, learning_rate=1e-4, warmup_steps=300,
    batch_size=1, gradient_accumulation_steps=4,
    max_seq_length=1024, label_smoothing=0.1,
    # Efficiency
    use_sdpa=True, mixed_precision="fp16",
    gradient_checkpointing=True,
)
trainer = StepAwareTrainer(config=config, train_data="./data/step_traces.jsonl")
trainer.train(output_dir="./nanoreason-3b")

# Step 3: Evaluate
from step_distill import Evaluator
results = Evaluator("./nanoreason-3b").evaluate(
    benchmarks=["gsm8k", "math", "vnhsge"],
    metrics=["accuracy", "rfs", "svr"],
)
print(results.summary())
```

---

## 📐 Architecture

### 4-Stage Cognitive Decomposition

Grounded in **Pólya's problem-solving framework** (1945):

```
┌──────────────┬──────────────┬──────────────┬────────────────┐
│  UNDERSTAND  │     PLAN     │   EXECUTE    │    VERIFY      │
│   α = 0.15   │   β = 0.35   │   γ = 0.35   │   δ = 0.15     │
│              │              │              │                │
│ Extract:     │ Retrieve:    │ Substitute:  │ Back-calc:     │
│ • Variables  │ • Theorems   │ • Values     │ • Check result │
│ • Givens     │ • Formulas   │ • Arithmetic │ • Verify logic │
│ • Objective  │ • Algorithm  │ • Steps      │                │
└──────────────┴──────────────┴──────────────┴────────────────┘
         ε = 0.10 (Transition Penalty — enforces stage order)
```

### Step-Aware Loss

```
L_step = α·L_U + β·L_P + γ·L_E + δ·L_V + ε·L_tr

where L_s = -(1/|M_s|) · Σ_{t: M_s[t]=1} log p_θ(y_t | y_{<t}, x)

β/α = γ/α = 2.33 — PLAN/EXECUTE errors penalized 2.33× heavier
Weights sum to 1.10 — importance multipliers, not a probability simplex
```
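
The formula above can be exercised numerically. A minimal sketch, assuming `token_logprobs[t]` holds log p(y_t | y_<t, x) and `masks[s]` is the binary M_s from the parser; the transition term is passed in precomputed, since the real L_tr scores stage-order violations:

```python
import math

# Per-stage weights from the paper: alpha, beta, gamma, delta.
WEIGHTS = {"understand": 0.15, "plan": 0.35, "execute": 0.35, "verify": 0.15}

def step_aware_loss(token_logprobs, masks, epsilon=0.10, l_tr=0.0):
    loss = epsilon * l_tr                      # epsilon * L_tr
    for stage, w in WEIGHTS.items():
        picked = [lp for lp, m in zip(token_logprobs, masks[stage]) if m]
        if picked:                             # L_s = -(1/|M_s|) * sum masked log-probs
            loss += w * (-sum(picked) / len(picked))
    return loss

# Four tokens, one per stage, each with probability 0.5:
lp = [math.log(0.5)] * 4
masks = {
    "understand": [1, 0, 0, 0],
    "plan":       [0, 1, 0, 0],
    "execute":    [0, 0, 1, 0],
    "verify":     [0, 0, 0, 1],
}
loss = step_aware_loss(lp, masks)  # stage weights sum to 1.0, so loss = ln 2
```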

### Two-Tier Parser

```
Raw text with #### TAGS ####
    → Tier 1 (Regex): locate tag boundaries
    → Tier 2 (Rules): classify each token → {U, P, E, V, IGNORE}
    → Output: M_U, M_P, M_E, M_V ∈ {0,1}^L (binary mask matrices)
```
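
A toy illustration of that two-tier flow, operating over whitespace tokens for readability (the real parser builds masks over tokenizer ids and returns numpy bool arrays):

```python
import re

TAG = re.compile(r"#### (UNDERSTAND|PLAN|EXECUTE|VERIFY) ####")

def label_masks(text: str) -> dict[str, list[bool]]:
    labels: list[str] = []
    current = "ignore"
    for line in text.splitlines():
        m = TAG.fullmatch(line.strip())
        if m:                                  # Tier 1: a tag line opens a new stage
            current = m.group(1).lower()
            labels += ["ignore"] * len(line.split())   # tag tokens are masked out
        else:                                  # Tier 2: tokens inherit the open stage
            labels += [current] * len(line.split())
    stages = ("understand", "plan", "execute", "verify", "ignore")
    return {s: [label == s for label in labels] for s in stages}

masks = label_masks(
    "#### PLAN ####\nadd then multiply\n#### EXECUTE ####\n2 + 3 = 5"
)
```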

### Central Finding: Form–Function Dissociation

```
RFS = 100.0%  ← perfect structural form — achievable via SFT ✅
SVR = 0.0%    ← functional self-correction — requires RL signals ⚠️
```

**Perfect Teacher Syndrome:** Training on 18,044 perfect trajectories produces perfect *form* (RFS=100%) but cannot produce functional *self-correction* (SVR=0%) because the training data contains zero examples of VERIFY catching an error. This is a distribution mismatch — not a bug, but a theoretically grounded finding that precisely motivates GRPO as the next step.

---

## 📊 Results

### GSM8K Zero-Shot (1,319 test problems)

| Model | Params | GSM8K | RFS | SVR |
|---|---|---|---|---|
| **NanoReason-3B (ours)** | **3B** | **78.7%** | **100.0%** | 0.0%* |
| Qwen2.5-3B SFT-LoRA | 3B | 77.9% | — | — |
| Qwen2.5-3B-Instruct | 3B | 83.9%† | — | — |
| DeepSeek-R1-Distill-1.5B | 1.5B | 40.9% | — | — |

*SVR=0%: documented limitation — see [Perfect Teacher Syndrome](#central-finding-formfunction-dissociation)
†83.9% without structured reasoning (Alignment Tax — an intentional trade-off for pedagogy)

### Ablation Study (per-step accuracy, 1 epoch)

| Variant | U% | **P%** | E% | V% | Avg |
|---|---|---|---|---|---|
| **NanoReason-3B** | 92.1 | **91.1** | 97.1 | 93.3 | **93.4** |
| Zero-Penalty | 92.1 | 91.2 | 97.4 | 93.4 | 93.5 |
| LoRA Rank-32 | 91.4 | 86.4 | 94.7 | 90.0 | 90.6 |
| Uniform Loss | 91.2 | 86.6 | 94.6 | 89.7 | 90.5 |
| No-CoT | 94.3 | N/A | N/A | N/A | — |

**Key:** Step-Aware Loss → **+4.5pp on PLAN** vs Uniform Loss. This is the most direct validation of asymmetric gradient pressure.

### Training Dynamics

```
Total loss: 1.928 → 1.723 (−10.6%, 2 epochs, ~8,500 steps)
Per-step improvement:  UNDERSTAND +14.1%  PLAN +11.4%
                       EXECUTE    +8.9%   VERIFY +8.1%
```

---

## 📦 Novel Evaluation Metrics

### RFS — Reasoning Format Score

```python
# Measures Form Competence — structural adherence
RFS(g) = (1/4) · Σ_{s ∈ {U,P,E,V}} 1_s(g)
# where 1_s(g) = 1 if stage s present with ≥10 chars content
RFS_full = fraction of outputs where RFS(g) = 1.0
# NanoReason-3B: RFS_full = 100.0%
```
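
A minimal runnable sketch of the definition above, over traces already parsed into stage dicts (stage keys follow the data format shown earlier in this README):

```python
STAGES = ("understand", "plan", "execute", "verify")

def rfs(steps: dict[str, str]) -> float:
    # 1_s(g) = 1 iff stage s is present with at least 10 chars of content
    return sum(len(steps.get(s, "").strip()) >= 10 for s in STAGES) / 4

def rfs_full(traces: list[dict[str, str]]) -> float:
    # Fraction of outputs with all four stages present
    return sum(rfs(t) == 1.0 for t in traces) / len(traces)

full = {s: "x" * 12 for s in STAGES}          # every stage has >= 10 chars
partial = dict(full, verify="short")          # VERIFY under the threshold
```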

### SVR — Self-Verification Rate

```python
# Measures Functional Competence — genuine self-correction
SVR = |{g: VERIFY catches error AND reverses prior answer}| / N
# NanoReason-3B: SVR = 0.0% — Perfect Teacher Syndrome
# Requires RL training to improve (future work)
```

---

## ⚙️ Configuration Reference

```yaml
# step_distill_config.yaml
student_model: "Qwen/Qwen2.5-3B-Instruct"
teacher_model: "Qwen/Qwen2.5-32B-Instruct"

loss_weights:
  alpha: 0.15    # UNDERSTAND
  beta: 0.35     # PLAN
  gamma: 0.35    # EXECUTE
  delta: 0.15    # VERIFY
  epsilon: 0.10  # Transition penalty

lora:
  rank: 16
  alpha: 32
  dropout: 0.05
  target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

training:
  epochs: 2
  learning_rate: 1.0e-4
  min_lr: 1.0e-6
  scheduler: cosine_with_warmup
  warmup_steps: 300
  batch_size: 1
  gradient_accumulation_steps: 4
  max_seq_length: 1024
  weight_decay: 0.01
  max_grad_norm: 1.0
  label_smoothing: 0.1

efficiency:
  use_sdpa: true
  mixed_precision: fp16
  gradient_checkpointing: true
  # Peak VRAM: ~9.8GB per T4 GPU
```

---

## 🔌 Extensibility

```python
# Custom dataset
from step_distill import DataSource
class MySource(DataSource):
    def load(self) -> list[dict]:
        return [{"question": "...", "answer": "..."}, ...]

# Custom teacher (any OpenAI-compatible API)
pipeline = TeacherPipeline(teacher_model="gpt-4o", api_key="...")

# Custom student (any HuggingFace CausalLM)
config = StepDistillConfig(student_model="meta-llama/Llama-3.1-8B-Instruct")
```

---

## 🛠️ CLI Reference

```bash
step-distill generate --teacher Qwen/Qwen2.5-32B-Instruct \
    --sources gsm8k math vnhsge --output ./data/traces.jsonl

step-distill train --config config.yaml --data ./data/traces.jsonl \
    --output ./my-model

step-distill eval --model ./my-model \
    --benchmarks gsm8k math vnhsge --metrics accuracy rfs svr

step-distill demo --model ductaiphan/NanoReason-3B --port 8000

step-distill solve "If x + 5 = 12, what is x?"

step-distill run --config config.yaml   # full pipeline
```

---

## 📁 Project Structure

```
step-distill/
├── step_distill/
│   ├── __init__.py          # Exports: NanoReason, StepAwareTrainer,
│   │                        #   StepDistillConfig, Evaluator, TeacherPipeline
│   ├── core/
│   │   ├── model.py         # NanoReason: from_pretrained, solve, solve_stream
│   │   ├── trainer.py       # StepAwareTrainer: overrides compute_loss()
│   │   ├── loss.py          # StepAwareLoss: the core algorithm
│   │   ├── parser.py        # LabelMatrixParser: 2-tier Regex+Rule
│   │   └── metrics.py       # RFSMetric, SVRMetric, AccuracyMetric, Evaluator
│   ├── data/
│   │   ├── pipeline.py      # TeacherPipeline: async generation + 3 quality filters
│   │   ├── sources.py       # DataSource ABC + GSM8K, MATH, VNHSGE implementations
│   │   ├── filters.py       # QualityFilter: syntax, math correctness, step consistency
│   │   └── formatter.py     # PromptFormatter: system prompt, JSON schema
│   ├── config/
│   │   ├── config.py        # StepDistillConfig (Pydantic v2, validated)
│   │   └── defaults.yaml    # Exact paper hyperparameters
│   ├── app/
│   │   ├── server.py        # FastAPI: POST /api/solve, WS /api/stream
│   │   └── static/
│   │       ├── index.html   # Single HTML file — no build step
│   │       ├── style.css    # Design system
│   │       └── app.js       # Streaming + UI logic (vanilla JS)
│   └── cli/
│       └── main.py          # Typer CLI
├── tests/
│   ├── test_loss.py         # Analytical values — must be exact
│   ├── test_parser.py       # All edge cases for tag parsing
│   ├── test_metrics.py      # Must reproduce 78.7% / 100% / 0%
│   └── test_inference.py    # Quick Start end-to-end
├── examples/
│   ├── basic_inference.py
│   ├── train_custom_student.py
│   └── evaluate_model.py
├── pyproject.toml
├── README.md
├── CONTRIBUTING.md
├── CHANGELOG.md
└── LICENSE
```

---

## 🗺️ Development Phases

### Phase 1 — Project Scaffold
- `pyproject.toml` (see STRATEGY.md for exact spec)
- `step_distill/__init__.py` with stub exports
- `Makefile`: install, test, build, publish
- CI: GitHub Actions, Python 3.10/3.11/3.12

**Done when:** `pip install -e .` works, `from step_distill import NanoReason` imports.

---

### Phase 2 — Pydantic Config
- `StepDistillConfig` with all fields from config reference above
- `from_yaml()`, `from_dict()` class methods
- Validator: weight sum ≈ 1.10 (within 0.01)
- Validator: lora_rank is power of 2

**Done when:** Config round-trips YAML→Pydantic→dict. Validators raise on bad input.
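
The two validators can be sketched as plain functions (the package itself would express these as Pydantic v2 field/model validators):

```python
import math

def check_weights(alpha, beta, gamma, delta, epsilon):
    # Loss weights must sum to ~1.10 (within 0.01)
    total = alpha + beta + gamma + delta + epsilon
    if not math.isclose(total, 1.10, abs_tol=0.01):
        raise ValueError(f"loss weights sum to {total:.3f}, expected ~1.10")

def check_lora_rank(rank: int):
    # Power-of-two check via the bit trick: n & (n - 1) == 0 for powers of 2
    if rank < 1 or rank & (rank - 1):
        raise ValueError(f"lora_rank={rank} is not a power of 2")
```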

---

### Phase 3 — Label Matrix Parser
- `LabelMatrixParser.parse(text, tokenizer) → dict[str, np.ndarray[bool]]`
- Tag format: `#### UNDERSTAND ####` (case-sensitive, exact spacing)
- Edge cases: missing stages, wrong order, partial tags
- Keys: "understand", "plan", "execute", "verify", "ignore"

**Done when:** Correctly parses all samples in `data/sample_traces.jsonl`.

---

### Phase 4 — Step-Aware Loss
- `StepAwareLoss(alpha, beta, gamma, delta, epsilon, label_smoothing=0.0)`
- `forward(logits, labels, mask_dict) → torch.Tensor`
- Per-stage normalization by `|M_s|` (not total length)
- Causal shift: correct +1 offset before masking
- Transition penalty: penalize stage-order violations

**Done when:** Unit tests pass with known analytical values (single-stage input → exact CE value).
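
The causal +1 offset can be sketched in pure Python, assuming the standard HF causal-LM convention that logits at position t score token t+1, so labels and stage masks shift by one before masked averaging:

```python
import math

def stage_ce(position_logprobs, labels, mask):
    """position_logprobs[t][v] = log-prob of vocab item v for the token at t+1."""
    logp_part = position_logprobs[:-1]    # drop last position (nothing to predict)
    labels_part = labels[1:]              # drop first label (never predicted)
    mask_part = mask[1:]                  # shift the stage mask identically
    picked = [lp[y] for lp, y, m in zip(logp_part, labels_part, mask_part) if m]
    return -sum(picked) / len(picked)     # L_s normalized by |M_s|, not total length

# Three positions, vocab of 2, uniform log p = log 0.5 everywhere:
lp = [[math.log(0.5)] * 2 for _ in range(3)]
loss = stage_ce(lp, labels=[0, 1, 0], mask=[0, 1, 1])  # = ln 2
```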

---

### Phase 5 — Training Pipeline
- `StepAwareTrainer(config, train_data)`
- `train(output_dir) → TrainingResult`
- Overrides `compute_loss()` with `StepAwareLoss`
- LoRA via PEFT, all `target_modules` from config
- Auto-enables SDPA + AMP + gradient checkpointing from config flags
- Per-stage accuracy callback every N steps
- Multi-GPU via `accelerate` (2×T4 config)

**Done when:** 10-step smoke test on T4 16GB, no OOM, per-stage accuracy logged.

---

### Phase 6 — Teacher Pipeline
- `TeacherPipeline(teacher_model, temperature=0.6, top_p=0.95)`
- Async generation with retry and batch support
- 3-layer quality filters (syntax → math → step consistency)
- Resume support: skip already-generated IDs
- Output format matches `data/sample_traces.jsonl` exactly (8 fields)

**Done when:** Generates 100 MATH traces with ≥90% pass rate.

---

### Phase 7 — Dataset Sources
- `DataSource` ABC with `load() → list[dict]`, `sample(n) → list[dict]`
- `GSM8KSource`, `MATHSource` via HuggingFace datasets
- `VNHSGESource` bundled in package (`step_distill/data/vnhsge/`)
  - 225 training samples, test set withheld

**Done when:** All 3 sources load. VNHSGE returns correct sample count.

---

### Phase 8 — Evaluation Metrics
- `RFSMetric.compute(responses) → float`
- `SVRMetric.compute(responses) → float`
- `AccuracyMetric.compute(preds, golds) → float` with normalization:
  - Strip `$`, `\boxed{}`, commas, trailing units
  - Fraction tolerance (float comparison within 1e-3)
- `Evaluator(model_path).evaluate(benchmarks, metrics) → EvalResults`
- `EvalResults.summary()` → rich table, `.to_dict()` → JSON

**Done when:** `Evaluator("ductaiphan/NanoReason-3B")` reproduces `{gsm8k: 78.7%, rfs: 100.0%, svr: 0.0%}` within ±0.1pp.
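
A hypothetical normalizer following the rules above — strip `$`, `\boxed{}`, commas, and trailing units, then compare numerically within 1e-3, evaluating fractions like `9/2` as floats:

```python
import re
from fractions import Fraction

def normalize(ans: str) -> str:
    boxed = re.search(r"\\boxed\{([^{}]*)\}", ans)
    if boxed:
        ans = boxed.group(1)                           # unwrap \boxed{...}
    ans = ans.replace("$", "").replace(",", "").strip()
    num = re.match(r"-?\d+(?:\.\d+)?(?:/\d+)?", ans)   # drops trailing units
    return num.group(0) if num else ans

def answers_match(pred: str, gold: str, tol: float = 1e-3) -> bool:
    def to_float(s: str) -> float:
        return float(Fraction(s)) if "/" in s else float(s)
    p, g = normalize(pred), normalize(gold)
    try:
        return abs(to_float(p) - to_float(g)) <= tol   # numeric tolerance
    except ValueError:
        return p == g                                  # fall back to string match
```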

---

### Phase 9 — NanoReason Inference Wrapper
- `NanoReason.from_pretrained(model_id, device="auto")`
  - Auto-detect: LoRA adapter vs merged model
- `solve(question, **gen_kwargs) → ReasoningResult`
  - Default: greedy, repetition_penalty=1.1, max_new_tokens=300
- `solve_stream(question) → AsyncGenerator[tuple[str, str], None]`
  - Yields `(token, stage_name)`
- `ReasoningResult`: `.answer`, `.steps: dict[str, str]`, `.rfs: float`, `.__repr__()` → boxes

**Done when:** Quick Start code block runs in <30s on T4. `result.answer == "18"`.
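
A sketch of consuming the `(token, stage)` stream, with a stub async generator standing in for a loaded model's `solve_stream`:

```python
import asyncio

async def fake_solve_stream(question: str):
    # Stub: yields (token, stage) pairs as the real solve_stream would.
    for pair in [("x = 12 - 5", "plan"), ("12 - 5 = 7", "execute"), ("7", "verify")]:
        yield pair

async def group_by_stage(stream) -> dict[str, str]:
    cards: dict[str, str] = {}
    async for token, stage in stream:
        cards[stage] = cards.get(stage, "") + token   # append to the active card
    return cards

cards = asyncio.run(group_by_stage(fake_solve_stream("If x + 5 = 12, what is x?")))
```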

---

### Phase 10 — Demo Backend (FastAPI)
- `POST /api/solve` → `ReasoningResult` JSON
- `WS /api/stream` → `{"token": str, "stage": str}` messages
- `GET /api/problems` → 15-item problem bank (see STRATEGY.md)
- `GET /api/health` → `{"status": "ok", "model": model_id}`
- Model loads at startup via lifespan

**Done when:** All endpoints respond correctly. WebSocket streams 3 test problems without drop.

---

### Phase 11 — Demo Frontend (Vanilla JS)
**Design tokens (exact — do not deviate):**

```css
--bg:       #0f0f13;
--card-bg:  rgba(255,255,255,0.04);
--border:   rgba(255,255,255,0.08);
--grad:     linear-gradient(135deg, #6366f1, #8b5cf6);
--understand: #3b82f6;
--plan:       #8b5cf6;
--execute:    #f59e0b;
--verify:     #10b981;
```

**UX requirements:**
- Stage cards slide in from left with 200ms staggered delay
- Token streaming character-by-character in active card
- "Show Steps / Answer Only" toggle
- Problem bank sidebar, click → auto-fill
- Mobile responsive (min-width: 320px)
- No build step

**Done when:** Runs on localhost:8000, all 15 problems solve, and it looks like a funded-startup product.

---

### Phase 12 — CLI
- Typer app with 6 subcommands: `generate`, `train`, `eval`, `demo`, `solve`, `run`
- Rich output: progress bars, result tables
- `--config yaml` support on all commands

**Done when:** All CLI commands in CLI Reference section work.

---

### Phase 13 — Tests & CI
- Unit tests for all core modules
- `test_metrics.py` must reproduce paper results
- ≥80% coverage
- GitHub Actions: Python 3.10/3.11/3.12

**Done when:** `make test` green, coverage badge ≥80%.

---

### Phase 14 — Documentation
- MkDocs Material with auto API reference
- Tutorials: quickstart, training, custom dataset, metrics deep-dive
- GitHub Pages deploy on main push

---

### Phase 15 — PyPI Release
- Version `0.1.0` published
- `banner.png`, `demo.gif`, `CHANGELOG.md`, `CONTRIBUTING.md`
- Issue templates enabled

**Done when:** `pip install step-distill && step-distill demo` on fresh Python 3.10.

---

## 🤗 Pre-trained Models

| Model | Params | GSM8K | Download |
|---|---|---|---|
| **NanoReason-3B** | 3B | **78.7%** | [🤗 Hub](https://huggingface.co/ductaiphan/NanoReason-3B) |
| NanoReason-7B *(v0.2.0)* | 7B | TBD | Soon |

---

## 📄 Citation

```bibtex
@inproceedings{phan2026stepawarelora,
  title     = {Step-Aware {LoRA}: Distilling Verifiable Chain-of-Thought Reasoning
               into Small Language Models via Hierarchical Step Supervision},
  author    = {Phan, Duc Tai and {Trinh Tran}, Trung Duc},
  booktitle = {Proceedings of [Conference]},
  series    = {Lecture Notes in Computer Science},
  publisher = {Springer},
  year      = {2026},
  note      = {University of Transport Ho Chi Minh City, Vietnam}
}
```

---

## 🤝 Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Priority areas:
- GRPO with negative trajectories → fix SVR=0%
- NanoReason-7B and 14B
- GGUF export for llama.cpp
- Multi-GPU DDP training

---

## 📜 License

MIT — use freely, modify freely.

---

<div align="center">

**Built with ❤️ at University of Transport Ho Chi Minh City, Vietnam**

*Making high-quality AI reasoning accessible to every student — offline, on commodity hardware, at zero cost.*

[⭐ Star](https://github.com/ductaiphan/step-distill) · [🐛 Issues](https://github.com/ductaiphan/step-distill/issues) · [💡 Discussions](https://github.com/ductaiphan/step-distill/discussions)

</div>