Metadata-Version: 2.4
Name: beyondbench
Version: 0.2.1
Summary: BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
Author: Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang
Author-email: Gaurav Srivastava <gks@vt.edu>
Maintainer-email: Gaurav Srivastava <gks@vt.edu>
License: Apache-2.0
Project-URL: Homepage, https://github.com/ctrl-gaurav/BeyondBench
Project-URL: Documentation, https://github.com/ctrl-gaurav/BeyondBench#readme
Project-URL: Repository, https://github.com/ctrl-gaurav/BeyondBench
Project-URL: Bug Tracker, https://github.com/ctrl-gaurav/BeyondBench/issues
Project-URL: Changelog, https://github.com/ctrl-gaurav/BeyondBench/blob/main/CHANGELOG.md
Project-URL: Paper, https://arxiv.org/abs/2509.24210
Keywords: language-models,evaluation,benchmark,reasoning,nlp,ai,machine-learning,llm-evaluation,computational-reasoning,contamination-resistant
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Testing
Classifier: Framework :: Pytest
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.5.0
Requires-Dist: transformers>=4.47.0
Requires-Dist: numpy>=2.0.0
Requires-Dist: pandas>=2.2.0
Requires-Dist: tqdm>=4.67.0
Requires-Dist: tiktoken>=0.12.0
Requires-Dist: sentencepiece>=0.2.0
Requires-Dist: click>=8.1.0
Requires-Dist: rich>=14.0.0
Requires-Dist: colorama>=0.4.6
Requires-Dist: scipy>=1.14.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: jsonschema>=4.23.0
Requires-Dist: psutil>=6.0.0
Requires-Dist: tabulate>=0.9.0
Requires-Dist: accelerate>=1.0.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: openai
Requires-Dist: openai>=2.0.0; extra == "openai"
Provides-Extra: gemini
Requires-Dist: google-genai>=1.0.0; extra == "gemini"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40.0; extra == "anthropic"
Provides-Extra: vllm
Requires-Dist: vllm>=0.7.0; extra == "vllm"
Provides-Extra: all-apis
Requires-Dist: openai>=2.0.0; extra == "all-apis"
Requires-Dist: google-genai>=1.0.0; extra == "all-apis"
Requires-Dist: anthropic>=0.40.0; extra == "all-apis"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.0.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.3.0; extra == "dev"
Requires-Dist: ruff>=0.8.0; extra == "dev"
Requires-Dist: black>=24.0.0; extra == "dev"
Requires-Dist: isort>=5.13.0; extra == "dev"
Requires-Dist: flake8>=7.0.0; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Requires-Dist: pre-commit>=3.6.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=2.0.0; extra == "docs"
Requires-Dist: myst-parser>=3.0.0; extra == "docs"
Provides-Extra: viz
Requires-Dist: matplotlib>=3.9.0; extra == "viz"
Requires-Dist: seaborn>=0.13.0; extra == "viz"
Requires-Dist: tabulate>=0.9.0; extra == "viz"
Requires-Dist: plotly>=5.20.0; extra == "viz"
Requires-Dist: kaleido>=0.2.1; extra == "viz"
Provides-Extra: serve
Requires-Dist: fastapi>=0.115.0; extra == "serve"
Requires-Dist: uvicorn[standard]>=0.34.0; extra == "serve"
Provides-Extra: dashboard
Requires-Dist: gradio>=4.0.0; extra == "dashboard"
Provides-Extra: full
Requires-Dist: beyondbench[all-apis,dashboard,dev,docs,serve,viz]; extra == "full"
Requires-Dist: vllm>=0.7.0; extra == "full"
Dynamic: license-file

# BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

[![Paper](https://img.shields.io/badge/Paper-arXiv%3A2509.24210-red?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2509.24210)
[![ICLR 2026](https://img.shields.io/badge/ICLR-2026-blue?style=for-the-badge)](https://iclr.cc/)
[![PyPI](https://img.shields.io/pypi/v/beyondbench.svg?style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/beyondbench/)
[![Python](https://img.shields.io/badge/Python-3.10%2B-blue?style=for-the-badge&logo=python&logoColor=white)](https://pypi.org/project/beyondbench/)
[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg?style=for-the-badge)](https://github.com/ctrl-gaurav/BeyondBench/blob/main/LICENSE)
[![Stars](https://img.shields.io/github/stars/ctrl-gaurav/BeyondBench?style=for-the-badge&logo=github)](https://github.com/ctrl-gaurav/BeyondBench/stargazers)

**101+ Models Evaluated | 79 Reasoning Tasks | 138 Variations | >10^15 Unique Instances**

[Explore Leaderboard](https://ctrl-gaurav.github.io/BeyondBench/) | [Read Paper](https://arxiv.org/abs/2509.24210) | [GitHub](https://github.com/ctrl-gaurav/BeyondBench) | [Documentation](https://github.com/ctrl-gaurav/BeyondBench/blob/main/docs/DOCUMENTATION.md)

---

## What is BeyondBench?

BeyondBench introduces a **benchmark-free approach** to evaluating reasoning in language models: instead of relying on static benchmarks, it **dynamically generates** novel problems across **79 distinct reasoning tasks** with **138 variations**, so models cannot rely on memorized solutions and must demonstrate **genuine reasoning ability**.

### Key Features

- **Dynamic Problem Generation** — Problem space >10^15 unique instances, zero risk of data contamination
- **Three Difficulty Levels** — Easy (44 tasks), Medium (15 tasks, 49 variations), Hard (20 tasks, 68 variations)
- **Multi-Backend Support** — OpenAI, Gemini, Anthropic APIs + vLLM and HuggingFace Transformers
- **Contamination-Resistant** — No static benchmark memorization, novel problems every run
- **Comprehensive Metrics** — Accuracy, instruction-following compliance, token efficiency
- **101+ Models Evaluated** — Open-source and proprietary, regularly updated
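
To illustrate the generation idea (a toy sketch, not BeyondBench's actual generator API), a seeded arithmetic task can mint a fresh, self-verifying instance per seed:

```python
import random

def make_sum_instance(seed: int, n: int = 5, lo: int = -99, hi: int = 99):
    """Toy illustration only: derive a fresh 'sum' problem from a seed.

    BeyondBench's real generators are richer, but the principle is the
    same: each run can draw unseen instances with a known answer, so
    solutions cannot be memorized from training data.
    """
    rng = random.Random(seed)
    numbers = [rng.randint(lo, hi) for _ in range(n)]
    prompt = f"Compute the sum of: {numbers}"
    return prompt, sum(numbers)

prompt, answer = make_sum_instance(seed=42)
print(prompt, "->", answer)
```

Because the instance space is combinatorial, two runs with different seeds almost never repeat a problem, which is what makes memorization useless.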

---

## Installation

### From PyPI

```bash
pip install beyondbench
```

### With Optional Dependencies

```bash
# All API clients (OpenAI, Gemini, Anthropic)
pip install beyondbench[all-apis]

# vLLM support (requires CUDA)
pip install beyondbench[vllm]

# Everything (all APIs + vLLM + dev tools + visualization)
pip install beyondbench[full]
```

### From Source

```bash
git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .
```

---

## Quick Start

### Interactive Wizard

```bash
beyondbench
```

### Command Line

```bash
# Evaluate GPT-4o on the easy suite
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy

# Evaluate a local model with vLLM
beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all

# Evaluate Claude on hard tasks
beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard

# List all available tasks
beyondbench list-tasks
```

### Python API

```python
from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize model handler
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key="your-api-key"
)

# Run evaluation
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)

# Print results
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")
```
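
The snippet above suggests `results` is a plain dict; assuming that (the `summary` key is the only structure taken from the README, and the stand-in value below is made up), it can be persisted with the standard library:

```python
import json
from pathlib import Path

# Stand-in for the dict returned by engine.run_evaluation() above;
# only the 'summary' key shown in this README is assumed here.
results = {"summary": {"avg_accuracy": 0.8356}}

# Write the full results dict next to the engine's output_dir.
out = Path("./results/summary.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(results, indent=2))
print(json.loads(out.read_text())["summary"]["avg_accuracy"])
```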

---

## New in v0.2.1 (Apr 17, 2026)

Critical packaging fix — four subpackages (`parsers.strategies`, `configs`, `eval`, `prompts`) were missing from the v0.2.0 PyPI wheel, causing `ImportError` on fresh installs. No API changes.

## New in v0.2.0 (Apr 16, 2026)

- **79 Reasoning Tasks**: 44 easy + 15 medium (49 variations) + 20 hard (68 variations)
- **Multi-GPU Parallel Evaluation**: Automatic batch auto-tuning and tensor parallelism
- **Plugin SDK**: Create and share custom tasks with `beyondbench plugin scaffold`
- **Gradio Dashboard**: Real-time evaluation monitoring with `--dashboard` flag
- **Response Caching**: Skip redundant API calls across runs
- **Universal Parser**: Unified parsing engine with confidence scoring
- **1000+ Tests**: Comprehensive unit, integration, and end-to-end test coverage
- **API Server**: FastAPI REST API with WebSocket support, launched with `beyondbench serve`

---

## Supported Backends

| Backend | Models | Features |
|---------|--------|----------|
| **OpenAI** | GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini | Reasoning effort control |
| **Gemini** | Gemini 2.5 Pro, Gemini 2.5 Flash | Thinking budget configuration |
| **Anthropic** | Claude Sonnet 4, Claude Opus 4 | Latest Claude models |
| **vLLM** | Any HuggingFace model | Batch processing, tensor parallelism |
| **Transformers** | Any HuggingFace model | CPU/GPU inference |

---

## Task Suites

### Easy Suite (44 Tasks)
Arithmetic (sum, multiplication, subtraction, division, absolute\_difference), Statistics (mean, median, mode), Counting (odd\_count, even\_count, count\_negative, count\_unique, and more), Extrema (find\_maximum, find\_minimum, second\_maximum, range, and more), Sequences (sorting, longest\_increasing\_subsequence, alternating\_sum, sum\_of\_digits), Comparison

### Medium Suite (15 Tasks, 49 Variations)
Fibonacci Sequence (6 variations), Algebraic Sequence (10), Geometric Sequence (10), Prime Sequence (11), Complex Pattern (12)

### Hard Suite (20 Tasks, 68 Variations)
Tower of Hanoi, N-Queens, Graph Coloring, Boolean SAT, Sudoku, Cryptarithmetic, Matrix Chain, Modular Systems, Constraint Optimization, Logic Grid Puzzles
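
Hard-suite instances like these lend themselves to programmatic answer checking. As a toy illustration (not BeyondBench's actual checker), the minimal move count for a Tower of Hanoi instance has a closed form:

```python
def hanoi_min_moves(n_disks: int) -> int:
    """Minimal moves to solve Tower of Hanoi with n disks: 2**n - 1."""
    return 2 ** n_disks - 1

# A freshly generated instance can be scored against this closed form,
# with no static answer key to leak into training data.
print(hanoi_min_moves(3))  # -> 7
```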

---

## Leaderboard (Top 5)

| Rank | Model | Overall | Instruction Following |
|------|-------|---------|-----------------------|
| 1 | **GPT-5\*** | **83.56%** | 96.15% |
| 2 | **GPT-5-Nano\*** | **82.04%** | 93.58% |
| 3 | **GPT-5-Mini\*** | **81.67%** | 94.23% |
| 4 | **o3\*** | **80.36%** | 94.96% |
| 5 | **o4-Mini\*** | **79.04%** | 95.30% |

*\*Models use reasoning/thinking tokens. Full results for 101+ models on the [leaderboard](https://ctrl-gaurav.github.io/BeyondBench/).*

---

## Environment Variables

```bash
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..."
```
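
Since `python-dotenv` ships as a core dependency, the same keys can also be kept in a local `.env` file rather than your shell profile (whether BeyondBench loads it automatically is not stated here, so treat this as a convention, not a guarantee). A typical `.env` mirrors the exports above:

```bash
# .env: same keys as above, kept out of your shell history
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
ANTHROPIC_API_KEY=sk-ant-...
```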

---

## Citation

If you use BeyondBench in your research, please cite our paper (accepted at **ICLR 2026**):

```bibtex
@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,
      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},
      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},
      year={2025},
      eprint={2509.24210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24210},
}
```

---

## Links

- **Leaderboard**: [ctrl-gaurav.github.io/BeyondBench](https://ctrl-gaurav.github.io/BeyondBench/)
- **GitHub**: [github.com/ctrl-gaurav/BeyondBench](https://github.com/ctrl-gaurav/BeyondBench)
- **Paper**: [arXiv:2509.24210](https://arxiv.org/abs/2509.24210)
- **Documentation**: [Full Docs](https://github.com/ctrl-gaurav/BeyondBench/blob/main/docs/DOCUMENTATION.md) | [Usage Guide](https://github.com/ctrl-gaurav/BeyondBench/blob/main/docs/USAGE.md)
- **Issues**: [GitHub Issues](https://github.com/ctrl-gaurav/BeyondBench/issues)
- **Email**: [gks@vt.edu](mailto:gks@vt.edu), [xuanw@vt.edu](mailto:xuanw@vt.edu)

---

**Made with care by the BeyondBench Team** | Virginia Tech, Department of Computer Science | Amazon AGI

*Advancing the frontier of AI reasoning evaluation, one benchmark at a time.*

**License**: Apache-2.0
