Metadata-Version: 2.4
Name: probegpt
Version: 0.3.1
Summary: Automated red-team discovery for AI models
License-Expression: MIT
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: pydantic-ai>=1.63.0
Requires-Dist: openai>=2.0.0
Requires-Dist: azure-identity>=1.15.0
Requires-Dist: rich>=13.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: textual>=8.0.0

# probegpt

Automated red-team discovery for AI models. Generates adversarial probe prompts across 8 attack strategies, sends them to a target model, judges responses against your objective, and iteratively refines the best-scoring probes using intelligent seed grading.

## Install

```bash
pipx install probegpt
```

## Usage

```bash
probegpt
```

Full-screen TUI. Set up your models in **Config**, then go to **Run Discovery**, enter your objective, and hit **Run**.

## Models

Any of the following providers can be assigned independently to the **target**, **generator**, and **judge** roles:

| Provider | Auth |
|----------|------|
| Azure OpenAI | `az login` |
| OpenRouter | `OPENROUTER_API_KEY` env var |
| GCP Vertex AI | `gcloud auth application-default login` |
| AWS Bedrock | `aws configure` |
| Cerebras | `CEREBRAS_API_KEY` env var |
| Mistral | `MISTRAL_API_KEY` env var |

Config is saved to `~/.outofthebox/config.json`.
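For reference, a saved config might resemble the sketch below — every field name and value here is an illustrative assumption, since the actual schema isn't documented in this README:

```json
{
  "target":    { "provider": "azure_openai", "deployment": "gpt-4o" },
  "generator": { "provider": "openrouter",   "model": "some/model-id" },
  "judge":     { "provider": "mistral",      "model": "mistral-large-latest" },
  "strategies": { "objective": 2.0, "mutation": 2.0, "encoding": 1.0 }
}
```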

## How it works

1. Loads 9 seed probes covering known red-team techniques
2. Each iteration distributes candidates across all enabled **attack strategies** (weighted)
3. Every probe is sent to the **target** model; the **judge** scores the response 0.0 → 1.0
4. Probes scoring ≥ 0.5 are sent to the **seed grader**, which analyzes *why* the probe worked and extracts technique tags
5. Novel, high-transferability probes are added to a weighted seed pool; the generator is told which techniques are working
6. Results optionally exported to JSON
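The loop above can be sketched in Python. All function names, signatures, and the result shape are illustrative assumptions, not probegpt's actual API:

```python
def run_discovery(objective, seeds, generate, send, judge, grade,
                  iterations=3, candidates_per_iter=8, threshold=0.5):
    """Sketch of the discovery loop: generate -> probe target -> judge ->
    grade successful probes -> grow the seed pool and feed back hints.
    `generate`, `send`, `judge`, and `grade` are hypothetical stand-ins
    for the generator, target, judge, and seed-grader models."""
    pool = list(seeds)   # weighted seed pool, grows over iterations
    hints = []           # working technique tags fed back to the generator
    results = []
    for _ in range(iterations):
        probes = generate(objective, pool, hints, candidates_per_iter)
        for probe in probes:
            response = send(probe)                      # target model
            score = judge(objective, probe, response)   # 0.0 -> 1.0
            results.append({"probe": probe, "score": score})
            if score >= threshold:
                graded = grade(probe, response)  # tags, novelty, transferability
                if graded["novelty"] > 0:        # duplicates are discarded
                    pool.append({"probe": probe, "score": score, **graded})
                hints.extend(graded["tags"])
    return results
```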

## Attack strategies

Eight strategies run in parallel each iteration. Candidates are distributed proportionally to their configured weights.

| Strategy | Technique family |
|----------|-----------------|
| **objective** | Fresh probes written directly for the objective |
| **mutation** | Variations of high-scoring seeds from the pool |
| **encoding** | base64, ROT13, leetspeak, fragmentation, reversal — half LLM-generated, half programmatic |
| **roleplay** | Evil-twin personas, simulation frames, nested hierarchy, game context |
| **persuasion** | Emotional appeal, urgency, flattery, logical fallacy, social proof |
| **linguistic** | Typos, phonetic spelling, double negatives, logic tricks, passive voice |
| **structural** | Payload splitting, chain-of-thought hijack, prefix/suffix injection, context stuffing |
| **distraction** | Benign prefix, topic pivot, attention splitting, distract-and-attack |

Enable/disable strategies and adjust their relative weights in **Config → Strategies**.
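One way to realize the proportional split described above is largest-remainder rounding; this is a sketch under that assumption, since probegpt's exact allocation scheme isn't documented here:

```python
def allocate_candidates(total, weights):
    """Split `total` candidate slots across enabled strategies in
    proportion to their configured weights, using largest-remainder
    rounding so the counts always sum to `total`."""
    enabled = {k: w for k, w in weights.items() if w > 0}
    budget = sum(enabled.values())
    shares = {k: total * w / budget for k, w in enabled.items()}
    counts = {k: int(s) for k, s in shares.items()}
    # hand leftover slots to the largest fractional remainders
    leftover = total - sum(counts.values())
    by_remainder = sorted(enabled, key=lambda k: shares[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts
```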

## Intelligent seed grading

After each successful probe (score ≥ 0.5), an LLM agent analyzes it and assigns:

- **technique tags** — what made it work (57-tag vocabulary)
- **novelty** — how different it is from existing seeds (duplicates are discarded)
- **transferability** — how likely the technique is to generalize to other objectives

From iteration 2 onward, the generator receives the top working technique tags as a feedback hint, and mutation sampling is weighted toward high-scoring, novel, transferable seeds.
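The weighted mutation sampling might look like the sketch below — the weight formula (score × novelty × transferability) is an assumption for illustration, not probegpt's actual weighting:

```python
import random

def sample_seed(pool, rng=random):
    """Pick one seed for mutation, favoring high-scoring, novel,
    transferable entries. Weight formula is a hypothetical choice;
    a zero-novelty (duplicate-like) seed gets zero weight."""
    weights = [s["score"] * s["novelty"] * s["transferability"] for s in pool]
    return rng.choices(pool, weights=weights, k=1)[0]
```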

## Requirements

- Python 3.11+
- At least one configured provider for each role (target, generator, judge)
