Metadata-Version: 2.4
Name: gradient-desc
Version: 0.1.8
Summary: Git-like deterministic checkpointing for ML training
Author: Malhar Shah, Akshat Shah
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch
Requires-Dist: numpy
Dynamic: license-file

Gradient
========

Git-like deterministic checkpointing for ML training with anchor + delta
checkpoints, forkable branches, and a workspace/repo hierarchy.

**[Watch the Demo Video](https://www.youtube.com/watch?v=ttNAmlQhHwI)**

**[Visit the site](https://www.gradient-desc.com)**

**[Read the docs](https://www.gradient-desc.com/docs)**

Highlights
----------
- **Anchor + delta checkpointing** to reduce storage by up to 80%.
- **Deterministic resume** (model state, RNG, optimizer, scheduler).
- **Branching and forking** from any checkpoint ref.
- **Workspace/repo hierarchy** for organizing multiple models.
- **Auto-create mode**: specify workspace + repo and everything is created automatically.
- **Git-style CLI** with workspace and repo management commands.
- **Manifest-based run metadata** for dashboards and tooling.

Install
-------

```bash
pip install gradient-desc
```

Required dependencies: `torch` and `numpy` (installed automatically).

Quick Start
-----------

### Zero Setup (Auto-Create Mode)

The simplest way to get started is to specify a workspace and repo name:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from gradient import GradientEngine

model = nn.Linear(4, 1)
opt = optim.Adam(model.parameters(), lr=1e-3)

# Both workspace and repo are auto-created!
engine = GradientEngine.attach(
    model, opt,
    workspace="./my_workspace",
    repo="my_model"
)
engine.autocommit(every=5)

start = engine.current_step
for step in range(start + 1, start + 21):
    loss = (model(torch.randn(32, 4)) ** 2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
    engine.maybe_commit(step)
```

### CLI-Initialized Workflow

For more control, initialize workspace and repo explicitly:

```bash
# Initialize workspace
gradient workspace init ./ml-experiments

# Create a repo for your model
cd ml-experiments
gradient repo init gpt4 --description "GPT-4 training runs"

# Check status
gradient workspace status
```

Then in your training script:

```python
from gradient import GradientEngine

# Auto-discovers workspace/repo from current directory
engine = GradientEngine.attach(model, optimizer)
```

Workspace/Repo Hierarchy
------------------------

Gradient organizes checkpoints in a Git-like hierarchy:

```
my_workspace/           # Workspace (contains multiple repos)
├── .gradient/          # Workspace marker
│   └── config.json
├── gpt4/               # Repo (one model)
│   ├── .gradient-repo/ # Repo marker
│   │   └── config.json
│   ├── manifest.json
│   ├── ckpt_main_s0.pt
│   └── ckpt_main_s100.pt
└── llama/              # Another repo
    └── ...
```

- **Workspace**: Contains multiple repos (one per model/project)
- **Repo**: Contains branches and checkpoints for a single model

CLI
---

### Workspace Commands

```bash
gradient workspace init [path]        # Initialize a new workspace
gradient workspace status             # Show all repos in workspace
```

### Repo Commands

```bash
gradient repo init <name> [-d DESC]   # Create a new repo in workspace
gradient repo list                    # List all repos
```

### Training Commands

```bash
gradient status                       # Show current repo status
gradient resume <ref> -- python train.py
gradient fork <from_ref> <new_branch> [--reset-optimizer] [--seed N] -- python train.py
```

### Checkpoint Refs

Refs use the format `branch@step`:
- `main@100` - step 100 on main branch
- `experiment@50` - step 50 on experiment branch
- `latest` - most recent checkpoint on current branch

### Environment Variables

Set by the CLI for training script handoff:
- `GRADIENT_WORKSPACE`: workspace path
- `GRADIENT_REPO`: repo name
- `GRADIENT_RESUME_REF`: checkpoint ref to resume from
- `GRADIENT_BRANCH`: branch name override
- `GRADIENT_AUTOCOMMIT`: auto-commit interval
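
A training script can pick these up with the standard library. A minimal sketch (the dict shape and `None` defaults here are assumptions for illustration, not package behavior):

```python
import os

def read_gradient_env() -> dict:
    """Collect the CLI handoff variables into a plain dict."""
    autocommit = os.environ.get("GRADIENT_AUTOCOMMIT")
    return {
        "workspace": os.environ.get("GRADIENT_WORKSPACE"),
        "repo": os.environ.get("GRADIENT_REPO"),
        "resume_ref": os.environ.get("GRADIENT_RESUME_REF"),
        "branch": os.environ.get("GRADIENT_BRANCH"),
        # the interval arrives as a string; convert it for use in a loop
        "autocommit": int(autocommit) if autocommit else None,
    }
```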

Public API
----------

### Import Surface

```python
from gradient import (
    GradientEngine,
    GradientConfig,
    # Workspace/Repo management
    WorkspaceConfig,
    RepoConfig,
    init_workspace,
    init_repo,
    find_workspace,
    find_repo,
    resolve_context,
)
```

### GradientEngine.attach

Attach to a model and optimizer for checkpointing:

```python
# Auto-create mode (simplest)
engine = GradientEngine.attach(
    model, optimizer,
    workspace="./my_workspace",
    repo="my_model"
)

# Auto-discover from current directory
engine = GradientEngine.attach(model, optimizer)

# With explicit config
engine = GradientEngine.attach(
    model, optimizer,
    scheduler=lr_scheduler,
    config=GradientConfig(
        workspace_path="./my_workspace",
        repo_name="my_model",
        branch="experiment",
    )
)
```

Behavior:
- Auto-creates workspace and repo if both are explicitly provided
- Auto-discovers from current directory if inside an initialized repo
- Respects CLI environment variables for handoff
- Creates `manifest.json` on first attach

### Checkpoint Operations

```python
engine.commit(step, message="")      # Write checkpoint (anchor or delta)
engine.resume("main@100")            # Resume from ref
engine.resume_latest()               # Resume latest on current branch
engine.fork(
    from_ref="main@100",
    new_branch="experiment",
    reset_optimizer=False,
    reset_scheduler=False,
    reset_rng_seed=None,
    message=""
)
```

### Training + Commit Patterns

```python
# Periodic auto-commit
engine.autocommit(every=10)
start = engine.current_step

for step in range(start + 1, start + 1001):
    loss = train_step(...)
    engine.maybe_commit(step)
```

```python
# Manual milestone commits
for step in range(start + 1, start + 501):
    loss = train_step(...)

    if step in {1, 50, 100, 250, 500}:
        engine.commit(step, message=f"milestone step {step}")
```

### Training Loop Helpers

```python
engine.autocommit(every=10)          # Set auto-commit interval
engine.maybe_commit(step)            # Commit if step matches interval
engine.current_step                  # Step resumed from (0 for fresh run)
```

### Properties

```python
engine.workspace_path                # Path to workspace
engine.repo_name                     # Current repo name
engine.repo_path                     # Full path to repo
engine.branch                        # Current branch name
```

### Extensibility

Register external state (RL envs, curriculum, etc.):

```python
engine.register_state(
    "env_state",
    getter=lambda: env.get_state(),
    setter=lambda s: env.set_state(s)
)
```

GradientConfig
--------------

```python
GradientConfig(
    workspace_path="./my_workspace",
    repo_name="my_model",
    branch="main",
    reanchor_interval=None,
    compression="auto",  # "off" | "auto" | "aggressive"
)
```

Notes:
- `reanchor_interval`: force new anchor after N delta checkpoints
- `compression`: lightweight delta compression mode

Manifest Format
---------------

`manifest.json` is created in each repo and updated on every commit:

```json
{
  "repo_name": "my_model",
  "checkpoints": [
    {
      "step": 10,
      "branch": "main",
      "file": "ckpt_main_s10.pt",
      "type": "delta"
    }
  ]
}
```
