Metadata-Version: 2.4
Name: eval-protocol
Version: 0.2.24
Summary: The official Python SDK for Eval Protocol (EP.) EP is an open protocol that standardizes how developers author evals for large language model (LLM) applications.
Author-email: Fireworks AI <info@fireworks.ai>
License-Expression: MIT
Project-URL: Homepage, https://github.com/fireworks-ai/eval-protocol
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: dataclasses-json>=0.5.7
Requires-Dist: uvicorn>=0.15.0
Requires-Dist: python-dotenv>=0.19.0
Requires-Dist: openai>=1.78.1
Requires-Dist: aiosqlite
Requires-Dist: aiohttp
Requires-Dist: mcp>=1.9.2
Requires-Dist: PyYAML>=5.0
Requires-Dist: datasets>=3.0.0
Requires-Dist: fsspec
Requires-Dist: hydra-core>=1.3.2
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: gymnasium>=0.29.0
Requires-Dist: httpx>=0.24.0
Requires-Dist: anthropic>=0.59.0
Requires-Dist: ipykernel>=6.30.0
Requires-Dist: jupyter>=1.1.1
Requires-Dist: toml>=0.10.0
Requires-Dist: loguru>=0.6.0
Requires-Dist: docstring-parser>=0.15
Requires-Dist: rich>=12.0.0
Requires-Dist: psutil>=5.8.0
Requires-Dist: litellm>=1.0.0
Requires-Dist: addict>=2.4.0
Requires-Dist: deepdiff>=6.0.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: websockets>=15.0.1
Requires-Dist: fastapi>=0.116.1
Requires-Dist: pytest>=6.0.0
Requires-Dist: pytest-asyncio>=0.21.0
Requires-Dist: peewee>=3.18.2
Requires-Dist: backoff>=2.2.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest-httpserver; extra == "dev"
Requires-Dist: werkzeug>=2.0.0; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Requires-Dist: transformers>=4.0.0; extra == "dev"
Requires-Dist: types-setuptools; extra == "dev"
Requires-Dist: types-requests; extra == "dev"
Requires-Dist: types-PyYAML; extra == "dev"
Requires-Dist: types-docker; extra == "dev"
Requires-Dist: versioneer>=0.20; extra == "dev"
Requires-Dist: openai>=1.78.1; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: e2b; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Requires-Dist: docker==7.1.0; extra == "dev"
Requires-Dist: ipykernel>=6.30.0; extra == "dev"
Requires-Dist: jupyter>=1.1.1; extra == "dev"
Requires-Dist: pip>=25.1.1; extra == "dev"
Requires-Dist: haikus==0.3.8; extra == "dev"
Requires-Dist: syrupy>=4.0.0; extra == "dev"
Provides-Extra: trl
Requires-Dist: torch>=1.9; extra == "trl"
Requires-Dist: trl>=0.7.0; extra == "trl"
Requires-Dist: peft>=0.7.0; extra == "trl"
Requires-Dist: transformers>=4.0.0; extra == "trl"
Requires-Dist: accelerate>=0.28.0; extra == "trl"
Provides-Extra: openevals
Requires-Dist: openevals>=0.1.0; extra == "openevals"
Provides-Extra: fireworks
Requires-Dist: fireworks-ai>=0.19.19; extra == "fireworks"
Provides-Extra: box2d
Requires-Dist: swig; extra == "box2d"
Requires-Dist: gymnasium[box2d]>=0.29.0; extra == "box2d"
Requires-Dist: Pillow; extra == "box2d"
Provides-Extra: langfuse
Requires-Dist: langfuse>=2.0.0; extra == "langfuse"
Provides-Extra: huggingface
Requires-Dist: datasets>=3.0.0; extra == "huggingface"
Requires-Dist: transformers>=4.0.0; extra == "huggingface"
Provides-Extra: adapters
Requires-Dist: langfuse>=2.0.0; extra == "adapters"
Requires-Dist: datasets>=3.0.0; extra == "adapters"
Requires-Dist: transformers>=4.0.0; extra == "adapters"
Provides-Extra: langsmith
Requires-Dist: langsmith>=0.1.86; extra == "langsmith"
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == "bigquery"
Requires-Dist: google-auth>=2.0.0; extra == "bigquery"
Provides-Extra: svgbench
Requires-Dist: selenium>=4.0.0; extra == "svgbench"
Provides-Extra: pydantic
Requires-Dist: pydantic-ai>=1.0.2; extra == "pydantic"
Provides-Extra: supabase
Requires-Dist: supabase>=2.18.1; extra == "supabase"
Provides-Extra: chinook
Requires-Dist: psycopg2-binary>=2.9.10; extra == "chinook"
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.3.0; extra == "langchain"
Provides-Extra: braintrust
Requires-Dist: braintrust[otel]; extra == "braintrust"
Provides-Extra: langgraph
Requires-Dist: langgraph>=0.6.7; extra == "langgraph"
Requires-Dist: langchain-core>=0.3.75; extra == "langgraph"
Provides-Extra: langgraph-tools
Requires-Dist: langgraph>=0.6.7; extra == "langgraph-tools"
Requires-Dist: langchain>=0.3.0; extra == "langgraph-tools"
Requires-Dist: langchain-fireworks>=0.3.0; extra == "langgraph-tools"
Dynamic: license-file

# Eval Protocol (EP)

[![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)

**The open-source toolkit for building your internal model leaderboard.**

When you have multiple AI models to choose from—different versions, providers,
or configurations—how do you know which one is best for your use case?

## 🚀 Features

- **Custom Evaluations**: Write evaluations tailored to your specific business needs
- **Auto-Evaluation**: Stack-rank models using LLMs as judges with just model traces using out-of-the-box evaluators
- **RL Environments via MCP**: Build reinforcement learning environments using the Model Control Protocol (MCP) to simulate user interactions and advanced evaluation scenarios
- **Consistent Testing**: Test across various models and configurations with a unified framework
- **Resilient Runtime**: Automatic retries for unstable LLM APIs and concurrent execution for long-running evaluations
- **Rich Visualizations**: Built-in pivot tables and visualizations for result analysis
- **Data-Driven Decisions**: Make informed model deployment decisions based on comprehensive evaluation results

## Quick Examples

### Basic Model Comparison

Compare models on a simple formatting task:

```python test_bold_format.py
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import default_single_turn_rollout_processor, evaluation_test

@evaluation_test(
    input_messages=[
        [
            Message(role="system", content="Use bold text to highlight important information."),
            Message(role="user", content="Explain why evaluations matter for AI agents. Make it dramatic!"),
        ],
    ],
    completion_params=[
        {"model": "fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct"},
        {"model": "openai/gpt-4"},
        {"model": "anthropic/claude-3-sonnet"}
    ],
    rollout_processor=default_single_turn_rollout_processor,
    mode="pointwise",
)
def test_bold_format(row: EvaluationRow) -> EvaluationRow:
    """Check if the model's response contains bold text."""
    assistant_response = row.messages[-1].content

    if assistant_response is None:
        row.evaluation_result = EvaluateResult(score=0.0, reason="No response")
        return row

    has_bold = "**" in str(assistant_response)
    score = 1.0 if has_bold else 0.0
    reason = "Contains bold text" if has_bold else "No bold text found"

    row.evaluation_result = EvaluateResult(score=score, reason=reason)
    return row
```

### Using Datasets

Evaluate models on existing datasets:

```python
from eval_protocol.pytest import evaluation_test
from eval_protocol.adapters.huggingface import create_gsm8k_adapter

@evaluation_test(
    input_dataset=["development/gsm8k_sample.jsonl"],  # Local JSONL file
    dataset_adapter=create_gsm8k_adapter(),  # Adapter to convert data
    completion_params=[
        {"model": "openai/gpt-4"},
        {"model": "anthropic/claude-3-sonnet"}
    ],
    mode="pointwise"
)
def test_math_reasoning(row: EvaluationRow) -> EvaluationRow:
    # Your evaluation logic here
    return row
```


## 📚 Resources

- **[Documentation](https://evalprotocol.io)** - Complete guides and API reference
- **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** - Community discussions
- **[GitHub](https://github.com/eval-protocol/python-sdk)** - Source code and examples

## Installation

**This library requires Python >= 3.10.**

### Basic Installation

Install with pip:

```bash
pip install eval-protocol
```

### Recommended Installation with uv

For better dependency management and faster installs, we recommend using [uv](https://docs.astral.sh/uv/):

```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install eval-protocol
uv add eval-protocol
```

### Optional Dependencies

Install with additional features:

```bash
# For Langfuse integration
pip install 'eval-protocol[langfuse]'

# For HuggingFace datasets
pip install 'eval-protocol[huggingface]'

# For all adapters
pip install 'eval-protocol[adapters]'

# For development
pip install 'eval-protocol[dev]'
```

## License

[MIT](LICENSE)
