Metadata-Version: 2.4
Name: thinkbooster
Version: 0.1.0
Summary: ThinkBooster: a unified framework for test-time compute scaling of LLM reasoning
Author-email: "List of contributors: https://github.com/IINemo/thinkbooster/graphs/contributors" <artemshelmanov@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/IINemo/thinkbooster
Project-URL: Repository, https://github.com/IINemo/thinkbooster
Project-URL: Issues, https://github.com/IINemo/thinkbooster/issues
Keywords: llm,reasoning,test-time-scaling,best-of-n,reasoning-evaluation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: transformers>=4.56.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: numpy>=1.23.5
Requires-Dist: tqdm>=4.64.0
Requires-Dist: parse>=1.19.0
Requires-Dist: hydra-core>=1.2.0
Requires-Dist: omegaconf>=2.2.0
Requires-Dist: python-dotenv>=0.19.0
Requires-Dist: lm-polygraph>=0.6.0
Requires-Dist: pylatexenc>=2.10
Requires-Dist: sympy>=1.12
Requires-Dist: regex>=2023.0.0
Requires-Dist: plotly>=5.0.0
Requires-Dist: word2number>=1.1
Requires-Dist: wandb>=0.15.0
Requires-Dist: evalplus>=0.3.1
Requires-Dist: vllm<0.13.0,>=0.12.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=24.0.0; extra == "dev"
Requires-Dist: isort==7.0.0; extra == "dev"
Requires-Dist: flake8>=7.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=5.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "docs"
Requires-Dist: myst-parser>=0.18.0; extra == "docs"
Provides-Extra: service
Requires-Dist: fastapi>=0.104.0; extra == "service"
Requires-Dist: uvicorn[standard]>=0.24.0; extra == "service"
Requires-Dist: pydantic>=2.0.0; extra == "service"
Requires-Dist: pydantic-settings>=2.0.0; extra == "service"
Requires-Dist: httpx>=0.25.0; extra == "service"
Requires-Dist: python-multipart>=0.0.6; extra == "service"
Requires-Dist: python-json-logger>=2.0.7; extra == "service"
Dynamic: license-file

<div align="center">
  <img src="assets/logo.png" alt="ThinkBooster logo" width="140" />
  <h1>ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning</h1>
</div>

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/)
[![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://thinkbooster.s3.us-east-1.amazonaws.com/thinkbooster.pdf)

[Quick Start](#quick-start) | [Key Features](#key-features) | [Strategies](#supported-strategies) | [Visual Debugger](#visual-debugger) | [Documentation](#documentation)

ThinkBooster is an open-source framework for **test-time compute scaling (TTS)** of large language models. It implements nine state-of-the-art scaling strategies — beam search, best-of-N, self-consistency, DeepConf, MUR, phi-decoding, and more — guided by four scorer families: process reward models (PRMs), uncertainty estimators, LLM-as-a-critic, and ReProbes. The framework also includes an evaluation pipeline for math, science, and coding benchmarks, an OpenAI-compatible endpoint gateway, and an interactive visual debugger for inspecting strategy behavior step by step.

---

## Key Features

- **9 scaling strategies** — beam search, best-of-N, self-consistency, DeepConf, MUR, phi-decoding, extended thinking, uncertainty CoT, and adaptive scaling (online and offline)
- **4 scorer families** — process reward models (PRMs), uncertainty/confidence scores, LLM-as-a-critic, and ReProbes; with configurable aggregation (min, mean, max, product) and sliding window
- **OpenAI-compatible endpoint gateway** — drop-in replacement for any OpenAI SDK; select strategy and scorer via URL path; enables "Pro reasoning mode" for any LLM deployment
- **Visual debugger** — interactive web UI for comparing strategies, inspecting step-by-step reasoning traces and confidence signals
- **Evaluation pipeline** — math (MATH-500, OlympiadBench, GaoKao, AIME), science (GPQA-Diamond), and coding (HumanEval+, MBPP+, KernelBench) with crash-resistant resume
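
The aggregation options above can be pictured with a short self-contained sketch (illustrative only; `aggregate` is a hypothetical stand-in, not the framework's API — see the configuration docs for the real option names):

```python
import math

def aggregate(step_scores, mode="mean", window=None):
    """Collapse per-step scores into one trajectory score.

    `window` keeps only the last N steps (sliding window); None uses all steps.
    """
    scores = step_scores[-window:] if window else step_scores
    if mode == "min":
        return min(scores)
    if mode == "max":
        return max(scores)
    if mode == "mean":
        return sum(scores) / len(scores)
    if mode == "product":
        return math.prod(scores)
    raise ValueError(f"unknown aggregation mode: {mode}")

scores = [0.9, 0.8, 0.4, 0.7]
print(aggregate(scores, "min"))             # -> 0.4 (one weak step sinks the trajectory)
print(aggregate(scores, "mean", window=2))  # mean over the last two steps only
```

Intuitively, `min` penalizes a single bad reasoning step anywhere in the trajectory, while a sliding window focuses the score on the most recent steps.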

---

## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/IINemo/thinkbooster.git
cd thinkbooster

# Create conda environment
conda create -n thinkbooster python=3.11 -y
conda activate thinkbooster

# Install dependencies
./setup.sh

# Configure API keys
cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY
```

### REST API

```bash
pip install -e ".[service]"
python service_app/main.py   # starts on http://localhost:8001
```

Use with any OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8001/v1/beam_search/prm",
    api_key="<YOUR_API_KEY>",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content":
        "Find the number of ordered pairs (x, y) of "
        "positive integers satisfying x + 2y = 2xy."}],
    extra_body={
        "max_tokens": 8192, "tts_beam_size": 4,
    },
)
print(response.choices[0].message.content)
```

The `base_url` encodes the scaling strategy and scorer (`beam_search/prm`). To switch strategy, just change the URL — no other code changes needed.
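
For scripting across strategies, that URL convention can be captured in a tiny helper (a sketch: the `endpoint` function is hypothetical, and only the `beam_search`/`prm` pair appears in this README — the Service API Guide lists the valid path segments):

```python
BASE = "http://localhost:8001/v1"

def endpoint(strategy: str, scorer: str) -> str:
    """Build the gateway base URL for a strategy/scorer pair."""
    return f"{BASE}/{strategy}/{scorer}"

print(endpoint("beam_search", "prm"))
# -> http://localhost:8001/v1/beam_search/prm
```

Passing the result as `base_url` when constructing the OpenAI client is the only thing that changes between strategies.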

See [Service API Guide](docs/service/api_guide.md) for the full reference.

### Run an Experiment

```bash
# Beam search on GSM8K (3 samples for quick verification)
python scripts/run_tts_eval.py \
  --config-name experiments/beam_search/gsm8k/window_all/mean/beam_search_vllm_qwen25_math_7b_instruct_gsm8k_prm \
  dataset.subset=3
```

Results are saved to `outputs/` with full config snapshots for reproducibility. Add `--resume` to continue interrupted runs.

---

## Visual Debugger

The interactive debugger lets you compare multiple TTS strategies side by side on the same problem. Inspect per-step decisions (escalate, stop, prune, select), view confidence and uncertainty signals, and drill into sampled candidates and tree expansions.

<table border="0">
<tr>
<td width="40%"><img src="https://github.com/user-attachments/assets/e1fec504-d6f7-49d8-85e3-bf42d4e7baec" alt="Visual Debugger — main interface" width="100%" /></td>
<td valign="middle"><b>Main interface.</b> Select a cached example or enter a custom math/science/coding problem. Choose any strategy (beam search, best-of-N, MUR, …) and scorer (PRM, uncertainty, LLM-as-a-critic) and run it directly from the browser.</td>
</tr>
<tr><td colspan="2"><br/></td></tr>
<tr>
<td width="40%"><img src="https://github.com/user-attachments/assets/21c7fc24-7507-46e3-9ce3-34cb6a37d7b5" alt="Step-by-step reasoning inspector" width="100%" /></td>
<td valign="middle"><b>Step inspector.</b> Replay the strategy execution step by step. Each entry in the reasoning timeline shows the operation (select, prune, escalate), the candidates considered, their scores, and the full text of the chosen step.</td>
</tr>
<tr><td colspan="2"><br/></td></tr>
<tr>
<td width="40%"><img src="https://github.com/user-attachments/assets/df03cc3e-a933-4b6c-aa96-f35ab3e9b986" alt="Trajectory tree visualization" width="100%" /></td>
<td valign="middle"><b>Trajectory tree.</b> Global branching view of the entire strategy run. Nodes represent reasoning steps; the orange path highlights the final selected trajectory. Useful for understanding how beam search or tree-of-thought explores and prunes the search space.</td>
</tr>
</table>

After starting the REST API service, open:

```
http://localhost:8001/debugger
```

See [service_app/README.md](service_app/README.md) for details on cached examples and custom input modes.

---

## Supported Strategies

| Strategy | Online/Offline | LLM Access | Prefill | Description |
|---|---|---|---|---|
| Best-of-N | Offline | Black-box | No | Sample N solutions, select best by scorer |
| Majority Voting | Offline | Black-box | No | Sample N solutions, select answer by majority vote |
| Beam Search (ToT) | Online | Black-box | Yes | Explore tree of reasoning paths, prune by score |
| Extended Thinking | Online | Black-box | Yes | Control reasoning budget to force longer CoT |
| MUR | Online | White-box | Yes | Allocate more compute only on uncertain steps |
| DeepConf Online | Online | White-box | Yes | Steer generation toward high-confidence tokens |
| DeepConf Offline | Offline | White-box | No | Rerank candidates by model confidence scores |
| Phi-decoding | Online | White-box | Yes | Foresight sampling and adaptive pruning |
| Uncertainty CoT | Online | White-box | Yes | Generate multiple trajectories when uncertain |
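
To make the offline strategies concrete, here is a minimal sketch of best-of-N and majority voting (toy stand-ins only: `generate` and `score` represent an arbitrary black-box sampler and scorer, not the framework's interfaces):

```python
from collections import Counter
from itertools import cycle

def best_of_n(generate, score, n=4):
    """Sample n candidate solutions; return the one the scorer ranks highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

def majority_vote(answers):
    """Self-consistency: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]

# Deterministic toy sampler and a scorer that rewards the correct answer.
sampler = cycle(["x = 1", "no solution", "x = 2"])
pick = best_of_n(lambda: next(sampler),
                 score=lambda a: 1.0 if a == "x = 2" else 0.0)
print(pick)                            # -> x = 2
print(majority_vote(["7", "7", "9"]))  # -> 7
```

The online strategies in the table differ in that they intervene during generation (scoring, pruning, or escalating at each step) rather than only after complete solutions have been sampled.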

---

## Project Structure

```
thinkbooster/
├── llm_tts/              # Core library
│   ├── strategies/       # TTS strategy implementations
│   ├── models/           # Model wrappers (vLLM, HuggingFace, API)
│   ├── scorers/          # Step scoring (PRM, uncertainty, voting)
│   ├── evaluation/       # Correctness evaluation (exact match, LLM judge)
│   └── datasets/         # Dataset loaders and utilities
├── config/               # Hydra configuration system
├── scripts/              # Evaluation scripts (run_tts_eval.py)
├── service_app/          # REST API service + visual debugger
├── tests/                # Test suite with strategy registry
├── docs/                 # Documentation
└── lm-polygraph/         # Submodule: uncertainty estimation
```

See [Project Structure](docs/getting_started/project_structure.md) for a detailed architecture overview.

---

## Documentation

- [Project Structure](docs/getting_started/project_structure.md) — architecture and component descriptions
- [Evaluation Protocol](docs/evaluation/README.md) — datasets, metrics (accuracy, tokens, FLOPs), and reporting
- [Strategy Registration](docs/core/strategy_registration.md) — how to add new strategies with tests
- [Service API Guide](docs/service/api_guide.md) — REST API reference and configuration
- [DeepConf Guide](docs/strategies/deepconf.md) — confidence-based test-time scaling

---

## Contributing

We welcome contributions! Whether it's a new strategy, scorer, dataset, or bug fix — see the [Contributing Guide](docs/getting_started/contributing.md) for setup instructions, development workflow, and coding standards.

---

## Citation

If you use ThinkBooster in your research, please cite:

```bibtex
@misc{thinkbooster2026,
  title     = {ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning},
  author    = {Smirnov, Vladislav and Nguyen, Chieu and Senichev, Sergey and Ta, Minh Ngoc and Fadeeva, Ekaterina and Vazhentsev, Artem and Galimzianova, Daria and Rozanov, Nikolai and Mazanov, Viktor and Ni, Jingwei and Wu, Tianyi and Kiselev, Igor and Sachan, Mrinmaya and Gurevych, Iryna and Nakov, Preslav and Baldwin, Timothy and Shelmanov, Artem},
  note      = {Preprint},
  year      = {2026},
  url       = {https://thinkbooster.s3.us-east-1.amazonaws.com/thinkbooster.pdf}
}
```

---

## Troubleshooting

<details>
<summary>vLLM engine fails to start</summary>

**Corrupted torch compile cache:** If you see `RuntimeError: Engine core initialization failed`:

```bash
rm -rf ~/.cache/vllm/torch_compile_cache/
```

**Missing C compiler:** If Triton can't find `gcc`:

```bash
conda install -c conda-forge gcc_linux-64 gxx_linux-64 -y
ln -s $CONDA_PREFIX/bin/x86_64-conda-linux-gnu-gcc $CONDA_PREFIX/bin/gcc
ln -s $CONDA_PREFIX/bin/x86_64-conda-linux-gnu-g++ $CONDA_PREFIX/bin/g++
```

</details>

<details>
<summary>ANTLR version mismatch warnings</summary>

```
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
```

This is expected — Hydra uses ANTLR 4.9.3, while latex2sympy2 was built against 4.7.2. Both work correctly.

</details>

---

## License

This project is licensed under the MIT License — see the [LICENSE](LICENSE) file for details.
