Metadata-Version: 2.4
Name: proxym
Version: 0.1.36
Summary: Intelligent AI proxy with multi-provider routing, semantic caching, and delta context buffers
Author-email: Tom Sapletta <tom@sapletta.com>
License-Expression: Apache-2.0
Requires-Python: >=3.11
Requires-Dist: click>=8.1.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: litellm>=1.55.0
Requires-Dist: pydantic-settings>=2.5.0
Requires-Dist: pydantic>=2.9.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: redis>=5.0.0
Requires-Dist: rich>=13.9.0
Requires-Dist: structlog>=24.4.0
Requires-Dist: tiktoken>=0.8.0
Requires-Dist: uvicorn[standard]>=0.30.0
Requires-Dist: watchfiles>=0.24.0
Requires-Dist: xxhash>=3.5.0
Provides-Extra: dev
Requires-Dist: mypy>=1.12.0; extra == 'dev'
Requires-Dist: pre-commit>=4.0.0; extra == 'dev'
Requires-Dist: ruff>=0.7.0; extra == 'dev'
Provides-Extra: gpu
Requires-Dist: sentence-transformers>=3.3.0; extra == 'gpu'
Requires-Dist: torch>=2.4.0; extra == 'gpu'
Provides-Extra: jetson
Requires-Dist: onnxruntime-gpu; extra == 'jetson'
Provides-Extra: test
Requires-Dist: fakeredis>=2.25.0; extra == 'test'
Requires-Dist: playwright>=1.48.0; extra == 'test'
Requires-Dist: pytest-asyncio>=0.24.0; extra == 'test'
Requires-Dist: pytest-cov>=5.0.0; extra == 'test'
Requires-Dist: pytest-httpx>=0.32.0; extra == 'test'
Requires-Dist: pytest-timeout>=2.3.0; extra == 'test'
Requires-Dist: pytest>=8.3.0; extra == 'test'
Requires-Dist: respx>=0.21.0; extra == 'test'
Provides-Extra: tts
Requires-Dist: piper-tts>=1.2.0; extra == 'tts'
Provides-Extra: vm
Requires-Dist: clonebox>=1.1.2; extra == 'vm'
Provides-Extra: voice
Requires-Dist: faster-whisper>=1.0.0; extra == 'voice'
Requires-Dist: numpy>=1.26.0; extra == 'voice'
Requires-Dist: sounddevice>=0.4.0; extra == 'voice'
Description-Content-Type: text/markdown

# Proxym — Intelligent Multi-Provider LLM Gateway

> A local proxy that unifies 10 providers, 15 models, NVIDIA Jetson Orin, and a delta context buffer
> behind a single OpenAI-compatible API. Budget: $20–60/month instead of $150+.

```
┌─────────────────────────────────────────────────┐
│  IDE (Roo Code / Cline / Continue.dev / Aider)  │
│           ↓ localhost:4000                      │
├─────────────────────────────────────────────────┤
│              Proxym (FastAPI)                   │
│  ┌─────────┐ ┌────────────┐ ┌───────────────┐   │
│  │Analyzer │→│  Router    │→│  LiteLLM      │   │
│  │(tier+   │ │(cost+      │ │(10 providers  │   │
│  │ caps)   │ │ fallbacks) │ │ 15 models)    │   │
│  └─────────┘ └────────────┘ └───────────────┘   │
│       ↑            ↑              ↑             │
│  Delta Buffer  Redis Cache  Budget Ledger       │
├─────────────────────────────────────────────────┤
│  Ollama (Jetson Orin / GPU / CPU)               │
└─────────────────────────────────────────────────┘
```

## Features

- **Content-based routing** — analyzes your prompt to pick the cheapest model that can handle the task (Opus 4.6 for architecture, Haiku 4.5 for typos)
- **10 providers, 15 models** — Anthropic, OpenAI, Google, DeepSeek, Groq, OpenRouter, Mistral, Together, Fireworks, Cerebras + local Ollama
- **Delta context buffer** — watches `code2llm` output and sends only file diffs, not full context (saves 60–80% tokens)
- **Budget enforcement** — daily/monthly USD limits with per-request caps
- **Fallback chains** — if Anthropic is rate-limited, auto-fallback to OpenAI → DeepSeek → local
- **OpenAI-compatible API** — drop-in replacement for any tool expecting OpenAI format
- **Voice Chat Interface** — natural language management via DSL + LLM fallback, optional STT/TTS
- **MCP Self-Server** — exposes proxym management as LLM tools at `/mcp/self/tools/*`
- **Docker + Podman + Quadlet** — development, production, and systemd-native deployments

## Quick Start

### Option A: Local Python

```bash
git clone https://github.com/wronai/proxym && cd proxym
bash scripts/setup.sh
# Edit .env with your API keys
proxym serve
```

### Option B: Docker Compose

```bash
cp .env.example .env
# Edit .env with your API keys
docker compose up -d
```

### Option C: Jetson Orin

```bash
docker build -f Dockerfile.jetson -t proxym:jetson .
docker run --runtime nvidia --gpus all \
  -p 4000:4000 -p 11434:11434 \
  --env-file .env \
  proxym:jetson
```

### Test it

```bash
curl http://localhost:4000/health

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-proxy-local-dev" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "balanced",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
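
Since the endpoint is OpenAI-compatible, the official `openai` Python SDK works unchanged; only `base_url` and the key differ. A minimal sketch, with the key and port taken from the curl example above:

```python
# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at the local proxy.
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-proxy-local-dev",  # your master key from .env
)

resp = client.chat.completions.create(
    model="balanced",  # proxy alias, resolved to a concrete model server-side
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```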

## Model Routing Strategy

The proxy analyzes each prompt and picks the optimal model:

| Task Type | Tier | Model Selected | Cost/1M tokens |
|-----------|------|----------------|----------------|
| "Fix this typo" | trivial | Cerebras Llama 70B / DeepSeek V3 | $0.27–$0.60 |
| "What does this function do?" | operational | Haiku 4.5 / Gemini Flash | $0.15–$1.00 |
| "Implement a REST endpoint" | standard | Sonnet 4.6 / GPT-4.1 | $2.00–$3.00 |
| "Refactor auth across 20 files" | complex | Sonnet 4.6 / Gemini Pro | $3.00–$10.00 |
| "Debug this race condition step by step" | deep | Opus 4.6 / DeepSeek R1 | $0.55–$5.00 |

### Model Aliases

Use these as the `model` parameter for explicit routing:

| Alias | Routes To | When to Use |
|-------|-----------|-------------|
| `cheap` | Haiku 4.5 | Debug, validation, simple Q&A |
| `balanced` | Sonnet 4.6 | Default coding, implementation |
| `premium` | Opus 4.6 | Complex refactoring, architecture |
| `free` | Gemini 2.5 Flash | Planning, analysis (free tier) |
| `local` | Qwen 3B (Ollama) | Offline, privacy, autocomplete |

### Configuring Model Aliases

Aliases are configurable via environment variables (see `.env.example`):

```env
PROXYM_ALIAS_CHEAP=anthropic:claude-3-5-haiku-20241022
PROXYM_ALIAS_BALANCED=anthropic:claude-3-5-sonnet-20241022
PROXYM_ALIAS_PREMIUM=anthropic:claude-3-opus-20240229
PROXYM_ALIAS_FREE=google:gemini-1.5-flash
```

Format: `provider:model-name`, or just `provider` to use that provider's default model. The corresponding API key must also be set (e.g., `ANTHROPIC_API_KEY` for the `anthropic` provider).
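
As an illustration of that format, resolving an alias value could be as simple as the following (a hypothetical helper, not the shipped config code):

```python
def resolve_alias(value: str) -> tuple[str, str | None]:
    """Split 'provider:model-name'; model is None when only a provider
    is given, meaning the provider's default model applies."""
    provider, sep, model = value.partition(":")
    return provider, (model if sep else None)

assert resolve_alias("anthropic:claude-3-opus-20240229") == (
    "anthropic", "claude-3-opus-20240229"
)
assert resolve_alias("google") == ("google", None)
```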

### Automatic Routing

Without a model alias, the proxy analyzes your message:

```bash
# Automatically routes to cheap model
curl -X POST localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -d '{"messages": [{"role": "user", "content": "What is a for loop?"}]}'

# Automatically routes to premium model
curl -X POST localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -d '{"messages": [{"role": "user", "content": "Refactor the entire auth module to microservices"}]}'

# Force a tier with header
curl -X POST localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "X-Task-Tier: deep" \
  -d '{"messages": [{"role": "user", "content": "Why does this deadlock?"}]}'
```

## Delta Context Buffer

The proxy maintains a buffer of your project files (from `code2llm` output) and sends only diffs to the LLM, dramatically reducing token usage.

### Setup

```bash
# Terminal 1: Generate code2llm output
pip install code2llm
code2llm ./ -f all -o ./project --no-chunk

# Terminal 2: Start the watcher
proxym-watch --watch ./project --proxy http://localhost:4000

# Terminal 3: Query with context injection
curl -X POST localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "X-Inject-Context: true" \
  -d '{"messages": [{"role": "user", "content": "Explain the auth module"}]}'
```

### How It Works

1. `code2llm` generates project analysis files in `./project/`
2. `proxym-watch` watches the directory with `watchfiles`
3. On change, it computes a unified diff against the last-sent snapshot
4. Only changed portions are sent to the proxy as a `<context_delta>` block
5. When you add `X-Inject-Context: true`, the delta is injected into the system prompt

- **Before** (full context every request): ~120K tokens × $3/1M = $0.36/request
- **After** (delta only): ~5K tokens × $3/1M = $0.015/request → **96% savings**
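
A rough sketch of step 3 using only the standard library (the real buffer in `src/proxym/cache/` may track files differently; the `.md` glob below is an assumption about code2llm's output):

```python
import difflib
from pathlib import Path

def compute_delta(snapshot: dict[str, str], project_dir: str) -> str:
    """Unified diff of each tracked file against the last-sent snapshot."""
    chunks = []
    for path in sorted(Path(project_dir).glob("*.md")):  # assumed output extension
        old = snapshot.get(path.name, "")
        new = path.read_text()
        if new != old:
            diff = difflib.unified_diff(
                old.splitlines(keepends=True),
                new.splitlines(keepends=True),
                fromfile=f"a/{path.name}",
                tofile=f"b/{path.name}",
            )
            chunks.append("".join(diff))
            snapshot[path.name] = new  # remember what was sent
    return f"<context_delta>\n{''.join(chunks)}</context_delta>" if chunks else ""
```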

## IDE Integration

### Roo Code

Settings → Provider: OpenAI Compatible
- API Base: `http://localhost:4000`
- API Key: (your master key from .env)
- Sticky Models per mode:
  - Architect → `free`
  - Code → `balanced`
  - Debug → `cheap`
  - Custom Opus → `premium`

### Cline / Continue.dev / Aider

Same pattern: point API base to `http://localhost:4000` with your master key.

## Deployment

### Docker Compose (Development)

```bash
docker compose up -d          # proxy + redis + ollama
docker compose logs -f proxy  # watch logs
```

### Docker Compose + Traefik (Production)

```bash
docker compose -f docker-compose.prod.yml up -d
# Access via https://proxym.local
```

### Podman Quadlet (Systemd-native)

```bash
# Copy quadlet files
mkdir -p ~/.config/containers/systemd
cp quadlet/*.container quadlet/*.network ~/.config/containers/systemd/

# Build and tag image
podman build -t localhost/proxym:latest .

# Create config dir
mkdir -p ~/.config/proxym
cp .env ~/.config/proxym/.env

# Enable and start
systemctl --user daemon-reload
systemctl --user start proxym
systemctl --user status proxym
```

### Jetson Orin

The Jetson Dockerfile bundles Ollama + Proxym in a single container:

```bash
docker build -f Dockerfile.jetson -t proxym:jetson .
docker run --runtime nvidia --gpus all \
  -p 4000:4000 -p 11434:11434 \
  -v ~/ollama:/root/.ollama \
  --env-file .env \
  proxym:jetson
```

Models available on Jetson Orin 8GB:
- `qwen2.5-coder:1.5b` — autocomplete (~1GB, ~30 tok/s)
- `qwen2.5-coder:3b` — code generation (~2GB, ~18 tok/s)
- `phi3:3.8b` — general tasks (~2.5GB, ~15 tok/s)

## API Reference

### `POST /v1/chat/completions`

OpenAI-compatible. Extra features:

| Header | Description |
|--------|-------------|
| `X-Task-Tier` | Force tier: `trivial\|operational\|standard\|complex\|deep` |
| `X-Inject-Context` | `true` to inject latest code2llm delta |

Response includes `_proxy` metadata:
```json
{
  "choices": [...],
  "_proxy": {
    "model_id": "anthropic/sonnet-4.6",
    "tier": "standard",
    "cost_usd": 0.000045,
    "routing_reason": "tier=standard, cost=$0.0000",
    "elapsed_ms": 1234.5,
    "fallback_index": 0
  }
}
```
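
The extra headers and the `_proxy` block are easiest to exercise with a raw HTTP client; a sketch using `httpx` (already a proxym dependency), with the dev key from the Quick Start:

```python
import httpx

resp = httpx.post(
    "http://localhost:4000/v1/chat/completions",
    headers={
        "Authorization": "Bearer sk-proxy-local-dev",
        "X-Task-Tier": "deep",  # force the 'deep' tier for this request
    },
    json={"messages": [{"role": "user", "content": "Why does this deadlock?"}]},
    timeout=120,
)
meta = resp.json()["_proxy"]
print(f"{meta['model_id']} ({meta['tier']}) "
      f"cost=${meta['cost_usd']:.6f} in {meta['elapsed_ms']:.0f} ms")
```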

### `GET /v1/models`

List all available models with pricing and capabilities.

### `GET /v1/budget`

Current spend vs. limits.

### `POST /v1/context/delta`

Receive context delta from the watcher client.

### `GET /v1/context/stats`

Delta buffer statistics.

## Dashboard Guides

- `docs/DASHBOARD_VOICE_NOVNC.md`

## CLI Reference

```bash
# Server
proxym serve                    # start the proxy server
proxym serve --port 4001        # custom port

# Status & models
proxym status                   # system overview (costs, budget, VMs)
proxym models                   # list all available models with pricing

# Accounts
proxym accounts list            # list all accounts
proxym accounts add --name Work --provider anthropic --api-key sk-ant-...
proxym accounts costs           # cost breakdown per account

# VMs (requires pip install proxym[vm])
proxym vm list                  # list VMs
proxym vm create --tool windsurf --account work-anthropic
proxym vm start windsurf-work   # start a VM
proxym vm open windsurf-work    # open SPICE viewer
proxym vm ssh windsurf-work     # SSH into VM
proxym vm switch windsurf-work --project other-project
proxym vm stop windsurf-work
proxym vm snapshot windsurf-work

# Browser profiles
proxym browser list             # detect Firefox/Chrome profiles on host
proxym browser assign default-release --account abc123
proxym browser sync windsurf-work
proxym browser snapshot windsurf-work

# Interactive chat (DSL first, LLM fallback)
proxym chat                      # text mode
proxym chat --voice              # microphone input (Whisper STT)
proxym chat --tts                # text-to-speech responses
proxym chat --voice --tts        # full voice loop
```

### Chat DSL Examples

The chat command first tries to match your input against a built-in DSL (zero tokens, instant).
Unmatched phrases are forwarded to the LLM with proxym MCP tools.

```
You: status                     → GET /dashboard/system (DSL)
You: show VMs                   → GET /dashboard/vms (DSL)
You: costs                      → GET /dashboard/costs (DSL)
You: start windsurf-work        → POST /dashboard/vms/windsurf-work/start (DSL)
You: why is the proxy slow?     → forwarded to LLM with tools
```

Customize DSL patterns: copy `src/proxym/cli/dsl.yaml` to `~/.config/proxym/dsl.yaml`.
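
To illustrate the DSL-first flow, a matcher over such a pattern file could look like this (the YAML schema here is invented for the example; check the shipped `dsl.yaml` for the real one):

```python
import re
import yaml

# Invented schema for illustration; the real dsl.yaml may differ.
PATTERNS = yaml.safe_load("""
- pattern: 'status'
  action: 'GET /dashboard/system'
- pattern: 'start (?P<vm>[\\w-]+)'
  action: 'POST /dashboard/vms/{vm}/start'
""")

def match_dsl(text: str) -> str | None:
    """Return the dashboard action for a DSL phrase, or None (fall back to LLM)."""
    for entry in PATTERNS:
        m = re.fullmatch(entry["pattern"], text.strip())
        if m:
            return entry["action"].format(**m.groupdict())
    return None

assert match_dsl("status") == "GET /dashboard/system"
assert match_dsl("start windsurf-work") == "POST /dashboard/vms/windsurf-work/start"
assert match_dsl("why is the proxy slow?") is None  # forwarded to the LLM
```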

## Testing

```bash
# Unit tests (no external services needed)
pytest tests/ --ignore=tests/test_e2e.py -v

# E2E tests (mock LiteLLM, no real API calls)
pytest tests/test_e2e.py -v -m e2e

# All tests with coverage
pytest tests/ -v --cov=proxym --cov-report=html
```

## Budget Examples

### $25/month (casual, 4h/day)

```env
DAILY_BUDGET_USD=1.5
MONTHLY_BUDGET_USD=25
```

Autocomplete: local Ollama ($0) → Planning: Gemini free tier ($0) → Coding: Sonnet 4.6 (~$15/mo) → Complex: skip Opus, use DeepSeek R1 (~$5/mo)

### $60/month (intensive, 8h/day)

```env
DAILY_BUDGET_USD=3.0
MONTHLY_BUDGET_USD=60
```

Full model spectrum with Opus 4.6 for 2–3 complex tasks/week.
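
Conceptually, budget enforcement is a running ledger consulted before each dispatch; a minimal sketch (the real logic lives in `src/proxym/router/strategy.py` and also handles per-request caps and fallbacks):

```python
from dataclasses import dataclass

@dataclass
class BudgetLedger:
    """Minimal illustration of a daily spend limit."""
    daily_limit_usd: float
    spent_today_usd: float = 0.0

    def allow(self, estimated_cost_usd: float) -> bool:
        """False means: reject, or fall back to a cheaper/local model."""
        return self.spent_today_usd + estimated_cost_usd <= self.daily_limit_usd

    def record(self, actual_cost_usd: float) -> None:
        self.spent_today_usd += actual_cost_usd

ledger = BudgetLedger(daily_limit_usd=3.0)  # DAILY_BUDGET_USD from the example above
if ledger.allow(0.02):
    ledger.record(0.02)  # book the actual cost reported by the upstream call
```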

## Project Structure

```
proxym/
├── src/proxym/
│   ├── main.py              # FastAPI app + OpenAI-compatible endpoint
│   ├── ctl.py               # Unified CLI (proxym command)
│   ├── config.py            # Pydantic settings from .env
│   ├── providers/__init__.py # Model registry (15 models, 10 providers)
│   ├── router/
│   │   ├── __init__.py      # Content analyzer (tier classification)
│   │   └── strategy.py      # Router + cost ledger + fallbacks
│   ├── cache/__init__.py    # Delta context buffer
│   ├── middleware/__init__.py # Auth + cost tracking
│   ├── watch/__init__.py    # File watcher client
│   ├── accounts/__init__.py # Account manager (multi-provider vault)
│   ├── dashboard/__init__.py # REST API for dashboard + CLI
│   ├── cli/
│   │   ├── dsl.py           # DSL parser (regex pattern matching)
│   │   ├── chat.py          # Interactive chat command
│   │   ├── voice.py         # Whisper STT input
│   │   └── tts.py           # espeak/Piper TTS output
│   ├── mcp/
│   │   ├── self_server.py   # MCP tool endpoints for LLM
│   │   ├── registry.py      # MCP server registry
│   │   └── router.py        # MCP tool routing
│   └── virt/
│       ├── __init__.py      # VMOrchestrator (CloneBox + virsh)
│       ├── clonebox_adapter.py # CloneBox CLI wrapper
│       ├── browser_profiles.py # Firefox/Chrome profile detection
│       └── profiles/        # CloneBox YAML templates per tool
│           ├── windsurf.clonebox.yaml
│           ├── cursor.clonebox.yaml
│           ├── vscode.clonebox.yaml
│           ├── jetbrains.clonebox.yaml
│           └── browser.clonebox.yaml
├── tests/
│   ├── test_analyzer.py         # Tier classification tests
│   ├── test_router.py           # Router strategy + budget tests
│   ├── test_delta_buffer.py     # Delta computation tests
│   ├── test_clonebox_adapter.py # CloneBox adapter tests
│   ├── test_browser_profiles.py # Browser detection tests
│   ├── test_accounts.py         # Account management tests
│   ├── test_dsl.py              # DSL pattern matching tests (35)
│   ├── test_chat.py             # Chat formatter tests (8)
│   ├── test_self_mcp.py         # MCP self-server tests (8)
│   ├── test_dashboard*.py       # Dashboard API tests
│   └── test_e2e.py              # Full HTTP API tests
├── docker-compose.yml       # Development (proxy + redis + ollama)
├── docker-compose.prod.yml  # Production (+ traefik)
├── Dockerfile               # Standard build
├── Dockerfile.jetson        # Jetson Orin (ARM64 + CUDA)
├── quadlet/                 # Podman systemd integration
├── traefik/                 # Reverse proxy config
└── scripts/
    ├── setup.sh             # First-time setup
    └── jetson-entrypoint.sh # Jetson startup script
```

## License

Apache License 2.0 - see [LICENSE](LICENSE) for details.

## Author

Created by **Tom Sapletta** - [tom@sapletta.com](mailto:tom@sapletta.com)
