Metadata-Version: 2.4
Name: hpc-runner
Version: 0.8.1
Summary: Unified HPC job submission across multiple schedulers
Project-URL: Homepage, https://github.com/sjalloq/hpc-runner
Project-URL: Repository, https://github.com/sjalloq/hpc-runner
Author: Shareef Jalloq
License-Expression: MIT
Keywords: cluster,hpc,job-submission,pbs,sge,slurm
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: System :: Clustering
Classifier: Topic :: System :: Distributed Computing
Requires-Python: >=3.10
Requires-Dist: jinja2>=3.0
Requires-Dist: rich-click>=1.7
Requires-Dist: textual>=6.11
Requires-Dist: tomli>=2.0; python_version < '3.11'
Provides-Extra: all
Requires-Dist: build; extra == 'all'
Requires-Dist: furo>=2024.0.0; extra == 'all'
Requires-Dist: hatch-vcs; extra == 'all'
Requires-Dist: mypy>=1.19; extra == 'all'
Requires-Dist: pre-commit; extra == 'all'
Requires-Dist: pytest-asyncio; extra == 'all'
Requires-Dist: pytest-cov; extra == 'all'
Requires-Dist: pytest>=7.0; extra == 'all'
Requires-Dist: ruff>=0.15; extra == 'all'
Requires-Dist: sphinx>=7.0; extra == 'all'
Requires-Dist: twine; extra == 'all'
Provides-Extra: dev
Requires-Dist: build; extra == 'dev'
Requires-Dist: furo>=2024.0.0; extra == 'dev'
Requires-Dist: hatch-vcs; extra == 'dev'
Requires-Dist: mypy>=1.19; extra == 'dev'
Requires-Dist: pre-commit; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.15; extra == 'dev'
Requires-Dist: sphinx>=7.0; extra == 'dev'
Requires-Dist: twine; extra == 'dev'
Description-Content-Type: text/markdown

# hpc-runner

**Unified HPC job submission across multiple schedulers**

Write your jobs once, run them on any cluster - SGE, Slurm, PBS, or locally for testing.

## Features

- **Unified CLI** - Same commands work across SGE, Slurm, PBS
- **Python API** - Programmatic job submission with dependencies and pipelines
- **Auto-detection** - Automatically finds your cluster's scheduler
- **Interactive TUI** - Monitor jobs with a terminal dashboard
- **Job Dependencies** - Chain jobs with afterok, afterany, afternotok
- **Array Jobs** - Batch processing with throttling support
- **Virtual Environment Handling** - Automatic venv activation on compute nodes
- **Module Integration** - Load environment modules in job scripts
- **Dry-run Mode** - Preview generated scripts before submission

## Installation

```bash
pip install hpc-runner
```

Or with uv:

```bash
uv pip install hpc-runner
```

## Quick Start

### CLI

```bash
# Basic job submission
hpc run python train.py

# With resources
hpc run --cpu 4 --mem 16G --time 4:00:00 "python train.py"

# GPU job
hpc run --queue gpu --cpu 4 --mem 32G "python train.py --epochs 100"

# Preview without submitting
hpc run --dry-run --cpu 8 "make -j8"

# Interactive session
hpc run --interactive bash

# Array job
hpc run --array 1-100 "python process.py --task-id \$SGE_TASK_ID"

# Wait for completion
hpc run --wait python long_job.py
```

### Python API

```python
from hpc_runner import Job

# Create and submit a job
job = Job(
    command="python train.py",
    cpu=4,
    mem="16G",
    time="4:00:00",
    queue="gpu",
)
result = job.submit()

# Wait for completion
status = result.wait()
print(f"Exit code: {result.returncode}")

# Read output
print(result.read_stdout())
```

### Job Dependencies

```python
from hpc_runner import Job

# First job
preprocess = Job(command="python preprocess.py", cpu=8, mem="32G")
result1 = preprocess.submit()

# Second job runs after first succeeds
train = Job(command="python train.py", cpu=4, mem="48G", queue="gpu")
train.after(result1, type="afterok")
result2 = train.submit()
```

### Pipelines

```python
from hpc_runner import Pipeline

with Pipeline("ml_workflow") as p:
    p.add("python preprocess.py", name="preprocess", cpu=8)
    p.add("python train.py", name="train", depends_on=["preprocess"], queue="gpu")
    p.add("python evaluate.py", name="evaluate", depends_on=["train"])

results = p.submit()
p.wait()
```

## Scheduler Support

| Scheduler | Status | Notes |
|-----------|--------|-------|
| SGE | Fully implemented | qsub, qstat, qdel, qrsh |
| Local | Fully implemented | Run as subprocess (for testing) |
| Slurm | Planned | sbatch, squeue, scancel |
| PBS | Planned | qsub, qstat, qdel |

### Auto-detection Priority

1. `HPC_SCHEDULER` environment variable
2. SGE (`SGE_ROOT` or `qstat` available)
3. Slurm (`sbatch` available)
4. PBS (`qsub` with PBS)
5. Local fallback

## Configuration

hpc-runner uses TOML configuration files. Location priority:

1. `--config /path/to/config.toml`
2. `./hpc-runner.toml`
3. `./pyproject.toml` under `[tool.hpc-runner]`
4. Git repository root `hpc-runner.toml`
5. `~/.config/hpc-runner/config.toml`
6. Package defaults

### Example Configuration

```toml
[defaults]
cpu = 1
mem = "4G"
time = "1:00:00"
inherit_env = true

[schedulers.sge]
parallel_environment = "smp"
memory_resource = "mem_free"
purge_modules = true

[types.gpu]
queue = "gpu"
resources = [{name = "gpu", value = 1}]

[types.interactive]
queue = "interactive"
time = "8:00:00"
```

Use named job types:

```bash
hpc run --job-type gpu "python train.py"
```

### SGE Configuration

SGE clusters vary widely in how resources are named. The `[schedulers.sge]`
section lets you match your site's conventions without touching job definitions.

**How job fields map to SGE flags:**

| Job Field | SGE Flag | Configurable Via |
|-----------|----------|------------------|
| `cpu`     | `-pe <pe_name> <slots>` | `parallel_environment` |
| `mem`     | `-l <resource>=<value>` | `memory_resource` |
| `time`    | `-l <resource>=<value>` | `time_resource` |
| `queue`   | `-q <queue>` | direct |
| `resources` | `-l <name>=<value>` | direct |

**Full `[schedulers.sge]` reference:**

```toml
[schedulers.sge]
# Resource naming -- these must match your site's SGE configuration
parallel_environment = "smp"    # PE name for CPU slots (some sites use "mpi", "threaded", etc.)
memory_resource = "mem_free"    # Memory resource name (common alternatives: "h_vmem", "virtual_free")
time_resource = "h_rt"          # Time limit resource name (commonly "h_rt")

# Output handling
merge_output = true             # Merge stderr into stdout (-j y)

# Module system
purge_modules = true            # Run 'module purge' before loading job modules
silent_modules = false          # Suppress module command output (-s flag)
module_init_script = ""         # Path to module init script (auto-detected if empty)

# Environment
expand_makeflags = true         # Expand $NSLOTS in MAKEFLAGS for parallel make
unset_vars = []                 # Environment variables to unset in jobs
                                # e.g. ["https_proxy", "http_proxy"]
```

**Fully populated config example:**

```toml
[defaults]
scheduler = "auto"
cpu = 1
mem = "4G"
time = "1:00:00"
queue = "batch.q"
use_cwd = true
inherit_env = true
stdout = "hpc.%N.%J.out"
modules = ["gcc/12.2", "python/3.11"]
resources = [
  { name = "scratch", value = "20G" }
]

[schedulers.sge]
parallel_environment = "smp"
memory_resource = "mem_free"
time_resource = "h_rt"
merge_output = true
purge_modules = true
silent_modules = false
expand_makeflags = true
unset_vars = ["https_proxy", "http_proxy"]

[tools.python]
cpu = 4
mem = "16G"
time = "4:00:00"
queue = "short.q"
modules = ["-", "python/3.11"]   # leading "-" replaces the list instead of merging
resources = [
  { name = "tmpfs", value = "8G" }
]

[types.interactive]
queue = "interactive.q"
time = "8:00:00"
cpu = 2
mem = "8G"

[types.gpu]
queue = "gpu.q"
cpu = 8
mem = "64G"
time = "12:00:00"
resources = [
  { name = "gpu", value = 1 }
]
```

This config can also be embedded in `pyproject.toml` under `[tool.hpc-runner]`.

## TUI Monitor

Launch the interactive job monitor:

```bash
hpc monitor
```

Key bindings:
- `q` - Quit
- `r` - Refresh
- `u` - Toggle user filter (my jobs / all)
- `/` - Search
- `Enter` - View job details
- `Tab` - Switch tabs

## CLI Reference

```
hpc run [OPTIONS] COMMAND

Options:
  --job-name TEXT       Job name
  --cpu INTEGER         Number of CPUs
  --mem TEXT            Memory (e.g., 16G, 4096M)
  --time TEXT           Time limit (e.g., 4:00:00)
  --queue TEXT          Queue/partition name
  --directory PATH      Working directory
  --module TEXT         Module to load (repeatable)
  --array TEXT          Array spec (e.g., 1-100, 1-100%5)
  --depend TEXT         Job dependencies
  --inherit-env         Inherit environment (default: true)
  --no-inherit-env      Don't inherit environment
  --interactive         Run interactively (qrsh/srun)
  --local               Run locally (no scheduler)
  --dry-run             Show script without submitting
  --wait                Wait for completion
  --keep-script         Keep job script for debugging
  -h, --help            Show help

Other commands:
  hpc status [JOB_ID]   Check job status
  hpc cancel JOB_ID     Cancel a job
  hpc monitor           Interactive TUI
  hpc config show       Show active configuration
```

## Development

```bash
# Setup environment
source sourceme
source sourceme --clean  # Clean rebuild

# Run tests
pytest
pytest -v
pytest -k "test_job"

# Type checking
mypy src/hpc_runner

# Linting
ruff check src/hpc_runner
ruff format src/hpc_runner
```

## License

MIT License - see LICENSE file for details.
