Metadata-Version: 2.4
Name: tailtester
Version: 0.2.6
Summary: The pytest for AI agents — auto-generate and run tests for any AI agent
Project-URL: Homepage, https://github.com/avansaber/tailtest
Project-URL: Documentation, https://github.com/avansaber/tailtest
Project-URL: Repository, https://github.com/avansaber/tailtest
Author: AvanSaber Inc.
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: agents,ai,evaluation,llm,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Requires-Dist: click>=8.1
Requires-Dist: httpx>=0.27
Requires-Dist: jinja2>=3.1
Requires-Dist: litellm>=1.40
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: watchfiles>=1.0
Provides-Extra: dev
Requires-Dist: pyright>=1.1; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.7; extra == 'dev'
Description-Content-Type: text/markdown

# tailtest

**The pytest for AI agents.**

> "You don't write tests. You build your agent -- we watch, we learn, we test."

![Status: Work in Progress](https://img.shields.io/badge/status-work%20in%20progress-yellow)
![License: Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue)

---

## The Problem

93% of developers don't test their AI agents. The tooling doesn't exist, the patterns aren't established, and the only serious option -- Promptfoo -- just got acquired by OpenAI. There is no vendor-neutral, open-source, developer-first testing tool for AI agents. If you ship an agent today, you're shipping it blind.

## What This Will Be

- **Position 0**: Observes your development process and auto-generates tests
- **Deterministic + LLM-judged + red-team assertions** in a single framework
- **Any framework**: LangChain, CrewAI, PydanticAI, OpenAI Agents SDK, raw API calls
- **Any model**: OpenAI, Anthropic, Google, Ollama, anything via litellm
- **CLI-first, CI/CD native** -- exit codes, JUnit XML, parallel execution
- **Built-in red-teaming**: prompt injection, jailbreak, PII extraction, OWASP compliance
- **Production monitoring** with automatic regression test generation from failures
- **Zero telemetry, fully local, Apache 2.0** -- no data leaves your machine, ever

## Quick Start (Future)

```bash
pip install tailtester
tailtest scan .
tailtest run
```

Three commands. No config files. No account creation. Meaningful test results in under 3 minutes.

## Example Test

```python
from tailtest import agent_test, expect

# `agent` is your agent under test, however your project wires it up
# (LangChain, CrewAI, PydanticAI, or a raw API client exposing an async chat method).

@agent_test
async def test_order_lookup():
    response = await agent.chat("What's the status of order #12345?")
    # Deterministic assertions: checked against the recorded response, no LLM calls.
    expect(response).to_call_tool("lookup_order")
    expect(response).tool_called_with("lookup_order", order_id="12345")
    expect(response).to_contain("order")
    expect(response).no_pii()
    expect(response).latency_under(3000)
    expect(response).cost_under(0.50)

@agent_test
async def test_response_quality():
    response = await agent.chat("Explain your return policy")
    # LLM-judged assertions: scored by a judge model (local via Ollama by default).
    expect(response).faithful_to(context="Returns accepted within 30 days...")
    expect(response).helpful()
    expect(response).tone("professional", "empathetic")

@agent_test(retries=10)
async def test_reliability():
    # Reliability assertion: the test runs 10 times and at least 95% must pass.
    response = await agent.chat("What are your business hours?")
    expect(response).to_contain("9am")
    expect(response).pass_rate(0.95)
```

Deterministic assertions (cost, latency, tool calls, PII) run instantly at zero cost.
LLM-judged assertions (faithfulness, tone, quality) default to a local model via Ollama.
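
For a concrete sense of how a judged assertion stays model-agnostic, here is a minimal sketch that routes a faithfulness check through litellm (a tailtest dependency). The `completion` call and the `"ollama/llama3"` model-string format are litellm's; the `judge_faithfulness` helper, its prompt, and its default model are illustrative assumptions, not the tailtest API:

```python
# Hypothetical sketch -- not the tailtest API. Shows how a judged assertion
# can run against any litellm-supported model, including a local Ollama one.
from litellm import completion

def judge_faithfulness(response: str, context: str, model: str = "ollama/llama3") -> bool:
    verdict = completion(
        model=model,  # e.g. "gpt-4o-mini", "claude-3-5-haiku-20241022", "ollama/llama3"
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nResponse:\n{response}\n\n"
                       "Is the response faithful to the context? Answer yes or no.",
        }],
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")
```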

## Architecture

```
+-------------------+     +-------------------+     +-------------------+
|  CONTEXT ENGINE   | --> |  TEST GENERATOR   | --> |   TEST RUNNER     |
|                   |     |                   |     |                   |
|  Scan codebase    |     |  Deterministic    |     |  Parallel exec    |
|  Watch file edits |     |  LLM-judged       |     |  Record / replay  |
|  Ingest OTel      |     |  Red-team         |     |  CI/CD mode       |
|  Detect framework |     |  Regression       |     |  JUnit XML output |
+-------------------+     +-------------------+     +-------------------+
                                                            |
                                                            v
                                                    +-------------------+
                                                    |  ASSERTION ENGINE |
                                                    |                   |
                                                    |  Deterministic    |
                                                    |  LLM-judged       |
                                                    |  Reliability      |
                                                    +-------------------+
```
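
To make the stage boundaries concrete, here is a conceptual sketch of the pipeline in plain Python. Every name and return shape below is illustrative only; this is not tailtest's internal API:

```python
# Conceptual sketch of the pipeline above (illustrative names, not internals).
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    prompt: str
    assertions: list[str]  # e.g. ["to_call_tool", "no_pii", "latency_under"]

def scan_codebase(path: str) -> dict:
    """Context engine: detect the framework, tools, and prompts in use."""
    return {"framework": "openai", "tools": ["lookup_order"]}

def generate_tests(context: dict) -> list[TestCase]:
    """Test generator: derive deterministic, LLM-judged, and red-team cases."""
    return [
        TestCase(name=f"test_{tool}", prompt=f"Exercise {tool}", assertions=["to_call_tool"])
        for tool in context["tools"]
    ]

def run_tests(cases: list[TestCase]) -> None:
    """Test runner: execute cases (in parallel, in the real thing) and hand
    each result to the assertion engine."""
    for case in cases:
        print(f"running {case.name}: asserting {case.assertions}")

run_tests(generate_tests(scan_codebase(".")))
```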

## What We Are NOT Building

- Not a dashboard-first enterprise product (that's Braintrust)
- Not a framework-specific tool (that's LangSmith)
- Not a security-only scanner (that's Promptfoo/OpenAI now)
- Not a cloud-required service (runs fully local, forever)

## Current Status

**Phases 1-9 complete, Phase 10 in progress.** The core engine is built and published; the current release is v0.2.6 on PyPI and npm.

| Metric | Value |
|--------|-------|
| Python files | ~165 |
| Lines of code | ~27,000 |
| Internal tests | 1097 passing in 28s |
| CLI commands | 20 (init, scan, run, generate, redteam, watch, guard, ingest, record, replay, report, doctor, drift, status, suggest, predict, optimize, mcp-serve, wrap, interview) |
| Assertion types | 26 (12 deterministic + 7 LLM-judge + 5 reliability + tier ordering) |
| Framework detectors | 6 (OpenAI, Anthropic, LangChain, CrewAI, PydanticAI, generic) |
| Red-team attacks | 64 across 8 categories |
| OWASP checks | 20 (LLM Top 10 + Agent Top 10) |
| MCP server tools | 6 (LLM-powered with keyword fallback) |
| Report formats | 6 (terminal, JUnit XML, JSON, HTML, compliance text, compliance HTML) |
| Example projects | 5 (hello-world, openai-assistant, crewai-research, raw-api-agent, acme-support) |

See `examples/` for sample agent projects demonstrating the full pipeline.

## Tech Stack

- **Python 3.11+** with **uv** for package management
- **Click** for CLI, **Pydantic v2** for data models
- **litellm** for model-agnostic LLM calls
- **asyncio + httpx** for parallel test execution (pattern sketched below)
- **opentelemetry-sdk** for production trace ingestion
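
The snippet below is a generic illustration of that asyncio + httpx fan-out pattern, not tailtest internals; the endpoint URL and response shape are assumptions:

```python
# Fan out several agent requests concurrently and collect the replies.
# Generic sketch: the /chat endpoint and JSON shape are assumed for illustration.
import asyncio
import httpx

async def run_case(client: httpx.AsyncClient, prompt: str) -> str:
    resp = await client.post("http://localhost:8000/chat", json={"prompt": prompt})
    resp.raise_for_status()
    return resp.json()["reply"]

async def main() -> None:
    prompts = [
        "What's the status of order #12345?",
        "Explain your return policy",
        "What are your business hours?",
    ]
    # One shared client, all requests in flight at once.
    async with httpx.AsyncClient(timeout=30.0) as client:
        replies = await asyncio.gather(*(run_case(client, p) for p in prompts))
    for prompt, reply in zip(prompts, replies):
        print(f"{prompt!r} -> {reply[:60]!r}")

if __name__ == "__main__":
    asyncio.run(main())
```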

## Contributing

This project is in early development. Contribution guidelines will be published soon.

## License

[Apache 2.0](LICENSE)
