Metadata-Version: 2.4
Name: agentpytest
Version: 0.0.0
Summary: Pytest for AI agents — record, score, replay, regress. Vendorable, judge-agnostic, domain-neutral. (Placeholder release; full library coming soon.)
Project-URL: Homepage, https://github.com/piyushbhavsarr/agentpytest
Project-URL: Repository, https://github.com/piyushbhavsarr/agentpytest
Project-URL: Documentation, https://github.com/piyushbhavsarr/agentpytest/tree/main/docs
Project-URL: Changelog, https://github.com/piyushbhavsarr/agentpytest/blob/main/CHANGELOG.md
Project-URL: Issues, https://github.com/piyushbhavsarr/agentpytest/issues
Author: Piyush Bhavsar
License-Expression: MIT
License-File: LICENSE
Keywords: agents,ai,evaluation,llm,pytest,regression,testing,trajectory
Classifier: Development Status :: 1 - Planning
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# agentpytest

> **Pytest for AI agents.** Record trajectories, score them with any LLM judge, replay counterfactuals, and gate your CI with statistical rigor — no server, no SaaS, no vendor lock.

[![PyPI](https://img.shields.io/pypi/v/agentpytest)](https://pypi.org/project/agentpytest/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![CI](https://github.com/piyushbhavsarr/agentpytest/actions/workflows/test.yml/badge.svg)](https://github.com/piyushbhavsarr/agentpytest/actions)

```bash
pip install agentpytest
```

---

## What it is

`agentpytest` is a **pytest plugin** that turns agent behavior into regression tests. It records each run as a JSON cassette, scores trajectories with an LLM judge of your choice (Anthropic, OpenAI, Gemini, Groq, Llama, local Ollama — anything LiteLLM supports), and tells you whether a change is a real regression or stochastic noise.

**It works for any agent in any domain** — coding agents, SRE runbooks, sales outreach, healthcare scheduling, finance automation. You write the agent once, point `agentpytest` at it, pick a judge, and you have regression tests.
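
For a feel of what gets committed to git, here is a rough sketch of a cassette. The field names are illustrative only, not a schema commitment:

```json
{
  "task": "Refactor auth.py to use TokenStore",
  "spans": [
    {"span_id": "s1", "tool": "read_file",
     "args": {"path": "auth.py"}, "result": "...file contents..."},
    {"span_id": "s2", "tool": "write_file",
     "args": {"path": "auth.py", "content": "..."}, "result": "ok"}
  ],
  "final_output": "auth.py now uses TokenStore for all token reads."
}
```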

## 60-second example

```python
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion, right_tool, redundant_call

judge = Judge("gemini/gemini-2.5-pro")

@trajectory_test(cassette="cassettes/refactor.json", epochs=5)
def test_refactor():
    # my_agent is your own agent entrypoint, defined in your codebase
    result = my_agent("Refactor auth.py to use TokenStore")
    assert goal_completion(result, judge=judge).value > 0.8
    assert right_tool(result, judge=judge).value > 0.9
    assert redundant_call(result).value < 0.1
```

```bash
pytest                           # diffs against cassette, fails on regression
pytest --update-trajectories     # regenerate cassette after intentional changes
agentpytest regress \
  --baseline runs/main.json \
  --candidate runs/pr-412.json   # CI gate with p-value + effect size
```
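
Conceptually, the `regress` gate is a paired significance test over per-task judge scores. A minimal, dependency-free sketch of a paired permutation test, illustrating the statistic rather than agentpytest's actual implementation:

```python
import random

def paired_permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-task scores."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / len(diffs)  # mean paired difference = effect size
    hits = 0
    for _ in range(n_resamples):
        # Under the null, the sign of each paired difference is exchangeable,
        # so flip each sign with probability 1/2 and recompute the mean.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(resampled) >= abs(observed):
            hits += 1
    return observed, hits / n_resamples  # (effect size, two-sided p-value)

# Per-task judge scores for the same tasks under baseline vs. candidate
effect, p = paired_permutation_test(
    baseline=[0.82, 0.78, 0.91, 0.66], candidate=[0.70, 0.65, 0.88, 0.60]
)
```

Gating on effect size plus p-value means a PR fails only when the score drop is both material and unlikely to be run-to-run noise.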

## Why it exists

The agent-eval ecosystem has converged everywhere except the test layer:

| Layer | Owned by |
|---|---|
| Model abstraction | LiteLLM |
| Trace storage + UI | Phoenix, Langfuse, Weave |
| Benchmarks / leaderboards | Inspect AI, SWE-bench, τ-bench |
| Hosted eval platforms | Braintrust, LangSmith, Patronus |
| **The pytest layer in your repo** | **agentpytest** |

`agentpytest` fills the missing slot. It's the thing you `pip install` and run locally — like `pytest`, like `mypy`, like any other dev tool you'd commit alongside your code.

## What makes it different

- **Pytest-native, vendorable, no server.** `pip install`, write tests, commit cassettes to git. No telemetry, no account, runs offline.
- **Any judge model.** Anthropic, OpenAI, Gemini, Groq, xAI, DeepSeek, Mistral, local vLLM/Ollama. Swap with one config line. Ensembles supported.
- **Agent-model decoupled from judge-model.** Your agent runs OpenAI; judge it with Gemini or local Qwen. Independent configs by design.
- **Statistical regression detection in core.** Bootstrap and paired-permutation tests with effect size + CI ship in the OSS library, not behind a paywall.
- **Counterfactual replay.** `fork_from(cassette, span_id, mutate)` replays a trajectory up to step N, mutates one tool response, and runs the agent forward, so you can debug "what if this tool had failed" without rerunning the whole trace (see the sketch after this list).
- **Repo harness — eval on YOUR PRs.** Point at your repo and a list of past merged PRs; the agent attempts each one, the lib scores its diff against what your team actually shipped. Nothing else does this.
- **TRAIL failure-mode detectors.** ~20 named failure modes (tool-call repetition, goal deviation, format error, retry storms, hallucinated tool output, ...) shipped as scorers.
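
Here is how a counterfactual-replay test might read, assuming the `fork_from(cassette, span_id, mutate)` signature from the list above; the import path, the span dict shape, and the `mutate` semantics are guesses until the API is published:

```python
from agentpytest import Judge, fork_from  # fork_from import path is assumed
from agentpytest.scorers import goal_completion

judge = Judge("gemini/gemini-2.5-pro")

def test_refactor_survives_read_failure():
    # Replay cassettes/refactor.json up to span "s1", fail that tool call,
    # then run the agent forward from the mutated state.
    result = fork_from(
        "cassettes/refactor.json",
        span_id="s1",
        mutate=lambda span: {**span, "result": "IOError: permission denied"},
    )
    assert goal_completion(result, judge=judge).value > 0.5
```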

## Comparison

| Feature | agentpytest | Inspect AI | Phoenix | Braintrust | DeepEval | LangSmith | Ragas |
|---|---|---|---|---|---|---|---|
| Pytest-native, vendorable, no server | ✅ | ⚠️ harness | ❌ | ❌ SaaS | ✅ | ❌ SaaS | ✅ |
| Cassette record/replay with tool-lock | ✅ | ⚠️ cache | ❌ | ❌ | ❌ | ❌ | ❌ |
| Mutate-and-replay debugging | ✅ | ❌ | ⚠️ prompt-only | ❌ | ❌ | ❌ | ❌ |
| Statistical CI gate | ✅ | ⚠️ epochs | ❌ | ✅ paid | ❌ | ⚠️ paid | ❌ |
| Eval on YOUR repo's past PRs | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Any judge (~100 providers) | ✅ | ✅ | ⚠️ | ⚠️ | ⚠️ | ⚠️ | ⚠️ |
| Judge ensemble + variance signal | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| TRAIL failure-mode detectors | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |

## Works for any domain

Same library, different rubrics:

| Domain | Example test |
|---|---|
| **Coding agents** | `assert diff_minimality(r, judge=judge).value > 0.8` |
| **SRE / incident response** | `assert dependency_order(r, expected_dag=runbook).value == 1.0` |
| **Outbound sales** | `assert can_spam_compliance(r, judge=judge).value > 0.95` |
| **Healthcare scheduling** | `assert phi_leak_check(r, judge=judge).value == 1.0` |
| **Finance / spend** | `assert policy_adherence(r, judge=judge, policy=spend_policy).value > 0.99` |

A coding-agent harness is built in from day one; a `Harness` plugin interface, sketched below, lets SRE / sales / healthcare / finance harnesses (PagerDuty, Salesforce, Epic, NetSuite) ship as separate packages.
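
The plugin surface is not final; one plausible shape for that interface, written as a `typing.Protocol` purely for illustration:

```python
from typing import Any, Protocol

class Harness(Protocol):
    """Illustrative shape for a domain harness plugin; not the final API."""

    def tasks(self) -> list[dict[str, Any]]:
        """Enumerate task specs, e.g. past merged PRs or runbook scenarios."""
        ...

    def setup(self, task: dict[str, Any]) -> Any:
        """Prepare the environment for one task (checkout, sandbox, fixtures)."""
        ...

    def score(self, task: dict[str, Any], result: Any) -> float:
        """Compare the agent's result against the task's reference outcome."""
        ...
```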

## What you need to provide

| Required | Optional |
|---|---|
| An agent function | Expected tool sequence |
| A task / prompt | Custom rubric text |
| A judge model + key | Past PR list (for repo harness) |
| | Domain policy documents |

**No labeled corpus.** The cassette **is** the ground truth, captured from the first run you accept. The judge evaluates against rubrics, not strings.
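
A domain policy document plugs in as plain text. This sketch mirrors the finance row in the table above; the cassette path, prompt, and policy text are invented for illustration:

```python
from agentpytest import Judge, trajectory_test
from agentpytest.scorers import policy_adherence

judge = Judge("gemini/gemini-2.5-pro")

SPEND_POLICY = """
Purchases over $500 require a second approver.
Flagged vendors must be escalated, never auto-approved.
"""

@trajectory_test(cassette="cassettes/spend.json", epochs=3)
def test_spend_policy():
    # my_agent is your own agent entrypoint
    result = my_agent("Process this week's expense queue")
    assert policy_adherence(result, judge=judge, policy=SPEND_POLICY).value > 0.99
```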

## Install & quickstart

```bash
pip install agentpytest

export ANTHROPIC_API_KEY=...    # or OPENAI_API_KEY, GEMINI_API_KEY, etc.
```

```python
# tests/test_my_agent.py
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion

judge = Judge("anthropic/claude-sonnet-4-6")

@trajectory_test(cassette="cassettes/hello.json", epochs=3)
def test_hello():
    # my_agent is your own agent entrypoint, defined in your codebase
    result = my_agent("greet the user politely")
    assert goal_completion(result, judge=judge).value > 0.8
```

```bash
pytest tests/test_my_agent.py
```

First run records the cassette. Future runs diff against it.

## Documentation

- [Quickstart](https://agentpytest.dev/quickstart) — first test in 5 minutes
- [Concepts](https://agentpytest.dev/concepts) — trajectories, cassettes, scorers, judges
- [Scorers reference](https://agentpytest.dev/scorers) — every built-in scorer
- [Repo harness guide](https://agentpytest.dev/repo-harness) — eval on your own PRs
- [Domain cookbooks](https://agentpytest.dev/cookbooks) — SRE, sales, healthcare, finance, coding
- [CI integration](https://agentpytest.dev/ci) — GitHub Actions, GitLab, CircleCI

## Examples

Runnable example repos, one per domain:

- [`agentpytest-examples-coding`](https://github.com/agentpytest/examples-coding) — Claude Code-style agent
- [`agentpytest-examples-sre`](https://github.com/agentpytest/examples-sre) — incident-response agent
- [`agentpytest-examples-sales`](https://github.com/agentpytest/examples-sales) — outbound prospecting
- [`agentpytest-examples-healthcare`](https://github.com/agentpytest/examples-healthcare) — appointment booking
- [`agentpytest-examples-finance`](https://github.com/agentpytest/examples-finance) — expense classification

## Status

`v0.1.0-alpha` in development; the current PyPI upload is a placeholder. The alpha targets core (`trajectory_test`, cassettes), 5 scorers (`goal_completion`, `right_tool`, `right_args`, `redundant_call`, `dependency_order`), `regression_test`, and OTel export. Roadmap in [`ROADMAP.md`](./ROADMAP.md).

## License

MIT. No CLA. Fork freely.

## Contributing

Issues and PRs welcome. See [`CONTRIBUTING.md`](./CONTRIBUTING.md). One rule: **no PR that adds a server, dashboard, or hosted feature.** This stays a library.

## Acknowledgements

Built on the shoulders of [LiteLLM](https://github.com/BerriAI/litellm), [agent-vcr](https://github.com/Jarvis2021/agent-vcr), [pytest](https://pytest.org), and [OpenTelemetry](https://opentelemetry.io). TRAIL detector taxonomy inspired by [Patronus AI's published research](https://arxiv.org/html/2505.08638v1).
