Metadata-Version: 2.4
Name: vaultlayer
Version: 0.1.2
Summary: AI compute arbitrage CLI — move GPU training jobs between clouds automatically
Author: VaultLayer
License: MIT
Keywords: gpu,cloud,training,arbitrage,mlops,ai
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Distributed Computing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: click>=8.1.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: redis>=5.0.0
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: boto3>=1.35.0
Requires-Dist: anthropic>=0.40.0
Provides-Extra: server
Requires-Dist: fastapi>=0.115.0; extra == "server"
Requires-Dist: uvicorn[standard]>=0.29.0; extra == "server"
Requires-Dist: supabase>=2.4.0; extra == "server"
Requires-Dist: resend>=2.0.0; extra == "server"
Requires-Dist: stripe>=9.0.0; extra == "server"
Requires-Dist: pydantic[email]>=2.0.0; extra == "server"
Requires-Dist: python-multipart>=0.0.9; extra == "server"

# VaultLayer

> Run AI training jobs across 11 GPU cloud providers — 60–70% cheaper than AWS on-demand, with 99.9% job completion SLA, using your existing commands unchanged. 93% GPU cloud market coverage (~$39B addressable).

```bash
pip install vaultlayer
vaultlayer run python train.py --model llama-3-7b --epochs 10
```

```
✓ Job completed in 4h 32m
✓ 1 interruption recovered automatically (AWS Spot → Lambda H100)
✓ Saved $142.40 vs AWS On-Demand
→ View full report: https://vaultlayer.pages.dev/jobs/j-0042
```

---

## What It Does

VaultLayer sits between your training script and the cloud. It:

- **Checkpoints automatically** — syncs model weights + optimizer state to a zero-egress R2 Vault on every save
- **Detects interruptions** — intercepts AWS/GCP/Azure termination signals before your job dies
- **Migrates instantly** — provisions a replacement node on the cheapest available provider and resumes from last checkpoint
- **Tracks savings** — shows real-time cost vs what you would have paid on AWS On-Demand

No changes to your PyTorch or JAX code. No YAML configs. No PhD-level infra knowledge required.
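
To make the interruption-detection step concrete: AWS exposes a Spot interruption notice at the instance metadata path `/latest/meta-data/spot/instance-action`. It returns HTTP 404 while nothing is scheduled and a small JSON document once an interruption is pending. The sketch below only parses such a response. The names `SpotNotice`, `parse_spot_notice`, and `seconds_remaining` are hypothetical and are not part of VaultLayer's API; how VaultLayer itself watches for termination is not specified in this README.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SpotNotice:
    """Hypothetical container for a pending Spot interruption."""
    action: str         # "terminate", "stop", or "hibernate"
    deadline: datetime  # UTC time at which AWS will act on the instance


def parse_spot_notice(status_code: int, body: str) -> Optional[SpotNotice]:
    """Return the pending interruption, or None if nothing is scheduled."""
    if status_code == 404:
        return None  # no interruption notice has been issued
    if status_code != 200:
        raise RuntimeError(f"unexpected metadata status: {status_code}")
    payload = json.loads(body)
    # AWS reports the deadline in UTC, e.g. "2017-09-18T08:22:00Z".
    deadline = datetime.strptime(payload["time"], "%Y-%m-%dT%H:%M:%SZ")
    return SpotNotice(payload["action"], deadline.replace(tzinfo=timezone.utc))


def seconds_remaining(notice: SpotNotice, now: datetime) -> float:
    """Time left to checkpoint before the instance is reclaimed."""
    return max(0.0, (notice.deadline - now).total_seconds())
```

A watcher built on this would poll the endpoint every few seconds and trigger a final checkpoint as soon as `parse_spot_notice` returns a notice, since AWS gives roughly two minutes of warning for Spot terminations.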

### Commands

```bash
# Training
vaultlayer run python train.py                                         # run with full protection
vaultlayer run --data s3://bucket/prefix python train.py              # mirror S3→R2 then run
vaultlayer run --data r2://my-dataset python train.py                 # use dataset already in R2
vaultlayer run --regions eu-central-1,eu-west-1 python train.py      # GDPR-only regions
vaultlayer run --excluded-regions cn-north-1 python train.py          # never use China region
vaultlayer stop <job-id>                                               # graceful stop + checkpoint
vaultlayer logs <job-id> [--tail N] [--follow]                        # stream logs from R2

# Dataset storage (no S3 required)
vaultlayer sync ./data --dataset-id my-dataset                        # upload local data → R2
vaultlayer sync s3://bucket/prefix --dataset-id my-dataset            # mirror S3 → R2 (one-time egress)
vaultlayer datasets                                                    # list datasets + storage costs
vaultlayer datasets --delete my-dataset                                # delete + stop billing

# Region discovery
vaultlayer regions list-all                                            # list all valid regions + compliance notes
vaultlayer regions current                                             # show current provisioning region
```

---

## Supported Providers

| Provider | Type | Status |
|---|---|---|
| AWS EC2 Spot | Hyperscaler | ✅ Live |
| Lambda Labs | Neocloud | ✅ Live |
| CoreWeave | Neocloud | ✅ Live |
| RunPod | Neocloud | ✅ Live |
| Vast.ai | Neocloud | ✅ Live |
| Voltage Park | Neocloud | ✅ Live |
| Crusoe | Neocloud | ✅ Live |
| Nebius | Neocloud | ✅ Live |
| Hyperstack | Neocloud | ✅ Live |
| GCP | Hyperscaler | ✅ Live |
| Azure | Hyperscaler | ✅ Live |
| AWS On-Demand | Hyperscaler | ✅ Last-resort fallback |

11 providers live — 93% GPU cloud market coverage (~$39B addressable, ~$21B migratable training).

---

## Model Size Support

| Model Size | Method | Checkpoint Size | Status |
|---|---|---|---|
| 7B | Full fine-tune | ~69 GB | ✅ MVP |
| 13B | Full fine-tune | ~125 GB | ✅ MVP |
| 30B | Full fine-tune | ~288 GB | ✅ MVP |
| 70B | QLoRA (4-bit) | ~46 GB | ✅ MVP |
| 70B | Full fine-tune | ~782 GB | 🔜 Phase 2 |

---

## Tech Stack

| Layer | Technology | Cost |
|---|---|---|
| Code + Docs | GitHub (this repo) | Free |
| CI/CD | GitHub Actions | Free (2k min/mo) |
| Vault / Storage | Cloudflare R2 | Free up to 10 GB |
| Agent Runtime | Railway | Free $5/mo credit |
| Webhooks | Cloudflare Workers | Free 100k req/day |
| Agent Message Queue | Upstash Redis | Free 10k cmd/day |

---

## Repository Structure

```
vaultlayer/
├── README.md
├── docs/
│   ├── PRD.md              # Full product requirements
│   ├── ARCHITECTURE.md     # System design + agent topology
│   └── AGENTS.md           # Agent specs + build order
├── dashboard/
│   └── index.html          # Savings dashboard prototype
└── src/
    ├── cli/
    │   ├── main.py
    │   ├── run.py
    │   ├── checkpoint_template.py
    │   └── init.py
    ├── vaultlayer/
    │   └── _resume_hook.py
    ├── agents/
    │   ├── orchestration/
    │   ├── pricing/
    │   ├── watchdog/
    │   │   └── signals.py
    │   ├── vault/
    │   ├── broker/
    │   ├── finops/
    │   └── namespace/
    └── shared/
```

---

## SLA

**99.9% job completion rate** — not node uptime. Jobs survive infrastructure failures.
Recovery SLA: interrupted job resumes within 10 minutes from last checkpoint.
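
Resuming from the last checkpoint only works if a checkpoint is never left half-written when a node is reclaimed. The following is a minimal sketch of that pattern, assuming a JSON-serializable state and a local directory. The function names and the `step-XXXXXXXX.json` layout are illustrative assumptions, not VaultLayer's actual checkpoint format, and the loss update is a stand-in for a real optimizer step.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so a kill mid-write cannot corrupt it."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final = ckpt_dir / f"step-{step:08d}.json"
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # atomic rename within the same filesystem
    return final


def load_latest(ckpt_dir: Path) -> Optional[dict]:
    """Return the newest complete checkpoint, or None on a fresh start."""
    # Zero-padded step numbers make lexical order equal numeric order.
    files = sorted(ckpt_dir.glob("step-*.json"))
    if not files:
        return None
    return json.loads(files[-1].read_text())


def train(ckpt_dir: Path, total_steps: int, save_every: int = 100) -> int:
    """Run (or resume) training and return the step it started from."""
    resumed = load_latest(ckpt_dir)
    start = resumed["step"] + 1 if resumed else 0
    loss = resumed["state"]["loss"] if resumed else 1.0
    for step in range(start, total_steps):
        loss *= 0.99  # placeholder for a real optimizer step
        if (step + 1) % save_every == 0:
            save_checkpoint(ckpt_dir, step, {"loss": loss})
    return start
```

Calling `train` again on a replacement node that sees the same checkpoint directory picks up at the step after the last saved one, so at most `save_every` steps of work are repeated.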

---

## Dataset Storage (No S3 Required)

VaultLayer's Neutral Zone (Cloudflare R2) is a first-class storage provider. Users with no AWS or
cloud storage account can upload training data directly and train from it on any provider.

```bash
# Upload from your laptop / on-prem server
vaultlayer sync ./training-data --dataset-id my-dataset

# Train — data is mounted at /mnt/vaultlayer on every provisioned node
vaultlayer run --data r2://my-dataset python train.py

# See what you're storing and the monthly cost
vaultlayer datasets
```

**Pricing:**

| Action | Cost |
|--------|------|
| Upload (local → R2) | Free |
| Storage | $0.020 / GB / month (Cloudflare R2 base rate of $0.015 plus a 30% markup is $0.0195, rounded up) |
| Read (R2 → training node) | $0.00 (R2 charges no egress fees) |
| S3 mirror (one-time) | AWS egress charge (~$0.09/GB, first 100 GB/month free) |

**Storage quotas by plan:**

| Plan | Storage limit |
|------|--------------|
| Free | 10 GB |
| Pro | 500 GB |
| Enterprise | Unlimited |

Deleting a dataset with `vaultlayer datasets --delete <id>` is a soft delete: billing stops
immediately, and the R2 objects are purged within 24 hours.
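
As a worked example of the tables above, the sketch below estimates the monthly storage bill, the one-time S3 mirror charge, and whether a dataset fits a plan's quota. The rates and quotas are copied from this README; the function names are hypothetical, and the AWS egress figure is the approximate rate quoted above rather than a live price.

```python
STORAGE_USD_PER_GB_MONTH = 0.020  # storage rate from the pricing table
AWS_EGRESS_USD_PER_GB = 0.09      # approximate AWS egress rate from the pricing table
AWS_FREE_EGRESS_GB = 100          # free AWS egress per month from the pricing table
PLAN_QUOTA_GB = {"free": 10, "pro": 500, "enterprise": None}  # None means unlimited


def monthly_storage_cost(size_gb: float) -> float:
    """Recurring monthly charge for keeping a dataset in R2."""
    return round(size_gb * STORAGE_USD_PER_GB_MONTH, 2)


def s3_mirror_cost(size_gb: float, free_gb_left: float = AWS_FREE_EGRESS_GB) -> float:
    """One-time AWS egress charge for mirroring S3 to R2."""
    billable_gb = max(0.0, size_gb - free_gb_left)
    return round(billable_gb * AWS_EGRESS_USD_PER_GB, 2)


def fits_plan(size_gb: float, plan: str) -> bool:
    """True if the dataset is within the plan's storage quota."""
    quota = PLAN_QUOTA_GB[plan]
    return quota is None or size_gb <= quota


if __name__ == "__main__":
    size = 300  # GB, an arbitrary example size
    print(monthly_storage_cost(size), s3_mirror_cost(size), fits_plan(size, "pro"))
```

For a 300 GB dataset this gives about $6.00 per month in storage and a one-time mirror charge of about $18.00 if none of the free AWS egress allowance has been used, and the dataset fits the Pro plan but not the Free plan.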

---

## Region Control

VaultLayer can provision nodes in any AWS region that has GPU capacity. By default it uses
the region from your `vaultlayer init` configuration.

**Restrict to specific regions** (e.g. GDPR compliance — EU data stays in EU):
```bash
vaultlayer run --regions eu-central-1,eu-west-1 python train.py
```

**Exclude regions** (e.g. avoid China, GovCloud, sanctioned territories):
```bash
vaultlayer run --excluded-regions cn-north-1,cn-northwest-1 python train.py
```

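If both flags are given, `--excluded-regions` takes priority: a region that appears in both lists is not used. If neither is given, any region is allowed except those blocked by default (see the compliance note below).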
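
The precedence between the two flags can be sketched as a simple filter. This is an illustration only: `resolve_regions` and `DEFAULT_BLOCKED` are hypothetical names, and the default block list is assumed from the compliance note in this section rather than taken from VaultLayer's source.

```python
from typing import Iterable, List, Optional

# Assumption: regions this README says are blocked by default.
DEFAULT_BLOCKED = frozenset({"cn-north-1", "cn-northwest-1", "ru-central-1"})


def resolve_regions(
    candidates: Iterable[str],
    regions: Optional[Iterable[str]] = None,
    excluded_regions: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the candidate regions a job may use, in their original order."""
    include = set(regions) if regions else None  # None means no allow-list
    exclude = set(excluded_regions or ()) | DEFAULT_BLOCKED
    allowed = []
    for region in candidates:
        if include is not None and region not in include:
            continue  # not on the --regions allow-list
        if region in exclude:
            continue  # exclusion wins even if the region was allow-listed
        allowed.append(region)
    return allowed
```

For example, with `regions=["eu-central-1", "eu-west-1"]` and `excluded_regions=["eu-west-1"]`, only `eu-central-1` remains, because the exclusion is applied after the allow-list.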

**Discover regions:**
```bash
vaultlayer regions list-all    # all GPU-capable regions with compliance notes
vaultlayer regions current     # show which region your credentials point to
```

> **Compliance note:** H100/A100 exports to certain regions (China, Russia, some Middle East countries)
> may require a US Bureau of Industry and Security (BIS) export license. VaultLayer blocks
> `cn-north-1`, `cn-northwest-1`, and `ru-central-1` by default via OFAC screening.
> Use `--regions` to limit jobs to GDPR-compliant EU regions.

---

## Getting Started

```bash
pip install vaultlayer
vaultlayer init
vaultlayer run python train.py
```

---

## License

Private — © 2026 VaultLayer
