Metadata-Version: 2.4
Name: matcha-gpu
Version: 0.2.6
Summary: GPU energy metering for AI training workloads
Author-email: Keeya Labs <hello@keeyalabs.com>
License-Expression: Apache-2.0
Project-URL: Homepage, https://usematcha.dev
Project-URL: Repository, https://github.com/keeyalabs/matcha-gpu
Keywords: gpu,energy,power,training,nvml,observability
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: nvidia-ml-py
Dynamic: license-file

<p align="center">
  <b>matcha</b>
</p>

<h3 align="center">GPU energy metering for AI training workloads</h3>

<div align="center">

[![PyPI version](https://img.shields.io/pypi/v/matcha-gpu?color=4ade80)](https://pypi.org/project/matcha-gpu/)
[![PyPI Downloads](https://static.pepy.tech/badge/matcha-gpu/month)](https://pepy.tech/projects/matcha-gpu)
[![Python](https://img.shields.io/pypi/pyversions/matcha-gpu?color=4ade80)](https://pypi.org/project/matcha-gpu/)
[![License](https://img.shields.io/badge/license-Apache%202.0-4ade80)](https://opensource.org/licenses/Apache-2.0)

</div>

<p align="center">
  Measure GPU energy consumption of any training run. One command. Zero overhead. Zero code changes.
</p>

---

## Install

```bash
pip install matcha-gpu
```

Requires an NVIDIA GPU with the NVIDIA driver installed; power readings come from NVML, which ships with the driver.

## Quick Start

Prefix your training command with `matcha run`:

```bash
matcha run torchrun --standalone --nproc_per_node=1 train_gpt.py
```

Your training runs at full speed. Matcha appends one line at the end:

```
matcha_energy total:364722J (101.31Wh) duration:746.0s avg_power:489W peak_power:700W samples:7449
```

That's it. No code changes. No config files. Works with any training script.

## How It Works

Matcha runs a lightweight background thread that polls GPU power via NVML at 100ms intervals. Your training process runs natively — no stdout interception, no wrapper overhead. When training finishes, Matcha computes total energy using trapezoidal integration of instantaneous power readings.
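The integration step can be sketched in a few lines. This is an illustrative standalone function over `(timestamp, watts)` pairs, not Matcha's internals; the name and sample format are assumptions:

```python
def trapezoid_energy(samples):
    """Integrate (timestamp_s, power_w) samples into total energy in joules."""
    total_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of one trapezoid: mean power over the interval times its duration
        total_j += (p0 + p1) / 2.0 * (t1 - t0)
    return total_j

# A constant 500 W draw sampled every 100 ms for 10 s integrates to 5000 J
samples = [(i * 0.1, 500.0) for i in range(101)]
```

Trapezoidal integration handles uneven sample spacing gracefully, which matters when the polling thread occasionally gets delayed by the OS scheduler.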

## Commands

### `matcha run` — Zero-overhead energy measurement

```bash
# Measure total energy for any training command
matcha run python train.py
matcha run torchrun --standalone --nproc_per_node=1 train_gpt.py
matcha run deepspeed --num_gpus=4 train.py --deepspeed ds_config.json
```

### `matcha wrap` — Per-step energy breakdown

```bash
# See energy for each training step (some overhead)
matcha wrap -p python train.py
```

Matcha parses your script's stdout for step markers (`step 10`, `iter 10`, `[10/1000]`, etc.) and reports the energy consumed between consecutive steps. Useful for diagnosing energy spikes and inefficient training phases.
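Marker matching along these lines would cover the formats above. A hedged sketch: the patterns and function are illustrative, not Matcha's actual parser:

```python
import re

# Illustrative patterns for common step markers; Matcha's real set may differ
STEP_PATTERNS = [
    re.compile(r"\bstep[:\s]+(\d+)", re.IGNORECASE),
    re.compile(r"\biter[:\s]+(\d+)", re.IGNORECASE),
    re.compile(r"\[(\d+)/\d+\]"),
]

def parse_step(line):
    """Return the step number found in a stdout line, or None if no marker matches."""
    for pattern in STEP_PATTERNS:
        m = pattern.search(line)
        if m:
            return int(m.group(1))
    return None
```

Lines without a recognized marker are simply attributed to the current step's interval.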

### `matcha monitor` — Live GPU power monitoring

```bash
# Watch GPU power draw in real time
matcha monitor
matcha monitor --gpu 0 --window 2.0
```
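A `--window` smoothing like the one above can be built from a fixed-size deque. This is a minimal sketch under assumed semantics (window in seconds at a 100 ms polling interval); the class and parameter names are hypothetical, not Matcha's API:

```python
from collections import deque

class RollingPower:
    """Average instantaneous power readings over a sliding time window."""

    def __init__(self, window_s, interval_s=0.1):
        # Number of samples that fit in the window at the polling interval
        self.buf = deque(maxlen=max(1, int(round(window_s / interval_s))))

    def add(self, watts):
        self.buf.append(watts)

    def average(self):
        return sum(self.buf) / len(self.buf) if self.buf else 0.0
```

The `maxlen` deque drops the oldest sample automatically, so the monitor loop only ever calls `add` and `average`.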

### Python SDK

```python
import matcha

m = matcha.init()

for step in range(num_steps):  # num_steps: however many steps you train for
    m.step_start()
    # ... your training code, unchanged ...
    energy = m.step_end(step)

summary = m.finish()
# summary.total_energy_j, summary.energy_kwh, summary.j_per_step
```
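The summary fields relate by plain unit conversions: the 364722 J run above is 364722 / 3600 ≈ 101.31 Wh, and dividing by the 746 s duration recovers the ~489 W average. Minimal helpers (names are hypothetical, not part of the SDK):

```python
def joules_to_wh(joules):
    # 1 Wh = 3600 J
    return joules / 3600.0

def average_power_w(joules, duration_s):
    # Average power is total energy divided by elapsed time
    return joules / duration_s

# The run above: 364722 J over 746 s -> ~101.31 Wh at ~489 W average
```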

## Example Output

```
step:1/20000 train_loss:6.9357 train_time:356ms step_avg:356.10ms
step:2/20000 train_loss:16.7414 train_time:725ms step_avg:362.47ms
...
step:1709/20000 val_loss:2.2111 val_bpb:1.3095 train_time:600097ms step_avg:351.14ms
final_int8_zlib_roundtrip_exact val_loss:2.21311047 val_bpb:1.31072868
matcha_energy total:364722J (101.31Wh) duration:746.0s avg_power:489W peak_power:700W samples:7449
```

## Tested On

- NVIDIA H100 80GB HBM3 — training nanoGPT variants at 500-700W
- Works with `torchrun`, `deepspeed`, `accelerate`, or plain `python`
- Compatible with PyTorch, JAX, and any framework that runs on NVIDIA GPUs

## Why

GPU rental is expensive. Electricity is cheap. But knowing your energy profile tells you whether your GPU is actually working hard or sitting idle — and that directly maps to training efficiency and cost.

```
10-minute H100 training run:
  Energy cost:   $0.012 (101 Wh @ $0.12/kWh)
  Compute cost:  $0.48  (RunPod @ $2.90/hr)

  → Compute is ~40x the energy cost
  → Optimizing energy/step = faster training = less rental time
```
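As a quick check of that arithmetic (the prices are the example's assumptions, not live rates):

```python
energy_wh = 101.31              # from the matcha_energy line above
electricity_usd_per_kwh = 0.12  # example residential rate
gpu_usd_per_hr = 2.90           # example RunPod H100 rate
run_minutes = 10

energy_cost = energy_wh / 1000 * electricity_usd_per_kwh   # ~$0.012
compute_cost = run_minutes / 60 * gpu_usd_per_hr           # ~$0.48
ratio = compute_cost / energy_cost                         # ~40x
```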

## Roadmap

- [ ] Multi-GPU support (aggregate across 8xH100)
- [ ] Log file tailing for zero-overhead per-step attribution
- [ ] JSONL output for downstream analysis
- [ ] Go sidecar binary for production deployments
- [ ] Carbon footprint estimation by region

## Built by

[Keeya Labs](https://keeyalabs.com) · [usematcha.dev](https://usematcha.dev)

## License

Apache 2.0
