Metadata-Version: 2.4
Name: clserve
Version: 0.1.2
Summary: CLI tool for serving LLM models on Alps
Author: Nathan Ranchin
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: jinja2>=3.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: click>=8.0
Requires-Dist: prettytable>=3.0
Requires-Dist: huggingface_hub>=0.20.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"

# clserve

A CLI tool for serving LLMs on Alps with SGLang.

## Installation

```bash
pip install -e /path/to/clserve
```

Or from the clserve directory:

```bash
pip install -e .
```

## Quick Start

```bash
# Serve a model using predefined config
clserve serve deepseek-v3

# Check status of all clserve jobs
clserve status

# Get the endpoint URL by model name
clserve url deepseek-v3

# Stop the serving job by model name
clserve stop deepseek-v3
```

## Commands

### `clserve serve`

Start serving a model.

```bash
# Serve with predefined configuration
clserve serve deepseek-v3
clserve serve llama-405b
clserve serve qwen3-235b

# Serve with multiple workers
clserve serve deepseek-v3 --workers 2 --use-router

# Serve a custom model
clserve serve my-org/my-model --tp-size 4 --nodes-per-worker 1

# Serve a small model with 4 instances per node
clserve serve llama-8b --num-gpus-per-worker 1 --use-router
```

**Options:**
- `--workers, -w`: Number of workers (default: 1)
- `--nodes-per-worker, -n`: Nodes per worker (default: 1)
- `--partition, -p`: SLURM partition (default: normal)
- `--environment, -e`: Container environment (default: sglang_gb200)
- `--tp-size`: Tensor parallel size (default: 1)
- `--dp-size`: Data parallel size (default: 1)
- `--ep-size`: Expert parallel size (default: 1)
- `--num-gpus-per-worker`: GPUs per worker process (1, 2, or 4)
- `--use-router/--no-router`: Enable or disable the load-balancing router
- `--router-policy`: Router policy (cache_aware, random, round_robin)
- `--time-limit, -t`: Job time limit in HH:MM:SS (default: 04:00:00)
- `--job-name, -j`: Custom job name

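When `clserve serve` is launched from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of clserve: `build_serve_command` is a hypothetical helper that only emits flags from the list above and leaves everything else to clserve's own defaults.

```python
import shlex


def build_serve_command(model, *, workers=None, nodes_per_worker=None,
                        tp_size=None, num_gpus_per_worker=None,
                        use_router=None, time_limit=None):
    """Return an argv list for `clserve serve` (hypothetical helper)."""
    cmd = ["clserve", "serve", model]
    # Value-taking flags are added only when explicitly set, so the
    # predefined model config or the CLI defaults apply otherwise.
    valued = {
        "--workers": workers,
        "--nodes-per-worker": nodes_per_worker,
        "--tp-size": tp_size,
        "--num-gpus-per-worker": num_gpus_per_worker,
        "--time-limit": time_limit,
    }
    for flag, value in valued.items():
        if value is not None:
            cmd += [flag, str(value)]
    # The router switch is a boolean flag pair.
    if use_router is not None:
        cmd.append("--use-router" if use_router else "--no-router")
    return cmd


cmd = build_serve_command("deepseek-v3", workers=2, use_router=True)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where `clserve` is installed.
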
### `clserve status`

Show status of serving jobs.

```bash
# Show all running jobs
clserve status

# Show status for a specific job
clserve status 12345

# Show status for jobs serving a model
clserve status deepseek-v3
```

### `clserve url`

Get the endpoint URL for a serving job by model name.
If multiple jobs are serving the same model, you'll be prompted to select one.

```bash
# Get URL by model name
clserve url deepseek-v3

# Get URL by full model path
clserve url deepseek-ai/DeepSeek-V3.1
```
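
The URL is only useful once the server has finished loading the model. Below is a minimal polling sketch. It assumes the server answers `GET /v1/models` once it is ready, which is typical for OpenAI-compatible servers but is not confirmed by this README. `http_probe` and `wait_until_ready` are illustrative names, not part of clserve.

```python
import time
import urllib.request


def http_probe(base_url, timeout=5.0):
    """Return True if GET <base_url>/v1/models answers with HTTP 200.

    The /v1/models route is an assumption about the served API.
    """
    url = f"{base_url.rstrip('/')}/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, timeouts and refused connections are OSError subclasses.
        return False


def wait_until_ready(base_url, probe=http_probe, attempts=60, delay=10.0,
                     sleep=time.sleep):
    """Poll until probe(base_url) succeeds; return False if attempts run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready("http://10.0.0.1:30000")` with the URL printed by `clserve url` blocks for at most `attempts * delay` seconds plus probe time.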

### `clserve stop`

Stop serving jobs by model name.
If multiple jobs are serving the same model, you'll be prompted to select one.

```bash
# Stop by model name (prompts for a selection if several jobs match)
clserve stop deepseek-v3

# Stop all jobs for a model
clserve stop deepseek-v3 --all

# Stop all running jobs
clserve stop --all
```

### `clserve models`

List available predefined model configurations.

```bash
clserve models
```

### `clserve logs`

Get the log directory for a job by model name.
If multiple jobs are serving the same model, you'll be prompted to select one.
Logs are stored in `~/.clserve/logs/<job_id>/`.

```bash
clserve logs deepseek-v3
tail -f "$(clserve logs deepseek-v3)/log.out"
```
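
If the job ID is already known, the log can also be read without going through the CLI. This is a small sketch based on the directory layout and the `log.out` file name shown above. `log_file` and `tail` are hypothetical helpers, and the `root` parameter exists only so the default location can be overridden.

```python
from collections import deque
from pathlib import Path


def log_file(job_id, root=None):
    """Path to log.out for a job, following ~/.clserve/logs/<job_id>/."""
    base = Path(root) if root is not None else Path.home() / ".clserve" / "logs"
    return base / str(job_id) / "log.out"


def tail(path, n=20):
    """Return the last n lines of a text file without trailing newlines."""
    with open(path, "r", errors="replace") as fh:
        # deque with maxlen keeps only the final n lines while streaming.
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]
```

For example, `print("\n".join(tail(log_file(12345))))` prints the last 20 lines of job 12345's log.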

## Predefined Model Configurations

The following models have optimized configurations:

| Alias | Model | TP Size | Nodes/Worker | Description |
|-------|-------|---------|--------------|-------------|
| deepseek-v3 | deepseek-ai/DeepSeek-V3.1 | 16 | 4 | DeepSeek V3.1 MoE (FP8) |
| deepseek-r1 | deepseek-ai/DeepSeek-R1 | 16 | 4 | DeepSeek R1 reasoning model |
| llama-405b | meta-llama/Llama-3.1-405B-Instruct | 16 | 4 | Llama 3.1 405B |
| llama-70b | meta-llama/Llama-3.1-70B-Instruct | 4 | 1 | Llama 3.1 70B |
| llama-8b | meta-llama/Llama-3.1-8B-Instruct | 1 | 1 | Llama 3.1 8B (4x per node) |
| qwen3-235b | Qwen/Qwen3-235B-A22B-Instruct-2507 | 8 | 2 | Qwen3 235B MoE |
| qwen3-coder-480b | Qwen/Qwen3-Coder-480B-A35B-Instruct | 16 | 4 | Qwen3 Coder 480B MoE |
| qwen3-32b | Qwen/Qwen3-32B | 2 | 1 | Qwen3 32B (2x per node) |
| qwen3-8b | Qwen/Qwen3-8B | 1 | 1 | Qwen3 8B (4x per node) |
| apertus-8b | swiss-ai/Apertus-8B-Instruct-2509 | 1 | 1 | Apertus 8B (4x per node) |

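The TP Size and Nodes/Worker columns are related by simple arithmetic. The sketch below assumes 4 GPUs per node, which is inferred from the "4x per node" rows and the `--num-gpus-per-worker` choices rather than stated outright. The function names are illustrative only.

```python
import math

GPUS_PER_NODE = 4  # assumption inferred from the table, not a documented constant


def nodes_per_worker(tp_size, gpus_per_node=GPUS_PER_NODE):
    """Nodes needed to host one worker with the given tensor parallel size."""
    return math.ceil(tp_size / gpus_per_node)


def instances_per_node(tp_size, gpus_per_node=GPUS_PER_NODE):
    """Instances that fit on one node; meaningful only if tp_size <= gpus_per_node."""
    return gpus_per_node // tp_size


# (alias, TP size, Nodes/Worker) copied from the table above
rows = [
    ("deepseek-v3", 16, 4),
    ("llama-70b", 4, 1),
    ("llama-8b", 1, 1),
    ("qwen3-235b", 8, 2),
    ("qwen3-32b", 2, 1),
]
for alias, tp, nodes in rows:
    assert nodes_per_worker(tp) == nodes
    if nodes == 1:
        print(f"{alias}: {instances_per_node(tp)} instance(s) per node")
```
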
## Examples

### Serve DeepSeek V3 with default config

```bash
clserve serve deepseek-v3
```

This will:
- Use 4 nodes with TP=16
- Start the model on the cluster
- Print the job ID and instructions for retrieving the endpoint URL

### Serve with multiple workers and router

```bash
clserve serve deepseek-v3 --workers 2 --use-router
```

This starts two workers of 4 nodes each (8 nodes in total) and puts a router in front of them to balance requests.

### Serve a small model efficiently

```bash
clserve serve llama-8b
```

The predefined config runs 4 instances per node behind a router for high throughput.

### Full workflow example

```bash
# Start serving
clserve serve deepseek-v3
# Output: Job ID: 12345

# Wait for startup, then get the URL by model name
clserve url deepseek-v3
# Output: http://10.0.0.1:30000

# Use the API
curl http://10.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3.1", "messages": [{"role": "user", "content": "Hello!"}]}'

# When done, stop the job by model name
clserve stop deepseek-v3
```
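
The same request as the `curl` call can be sent from Python using only the standard library. This is a sketch under the assumption that the response follows the usual OpenAI-style shape (`choices[0].message.content`). `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build the POST request for <base_url>/v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed response shape: OpenAI-style chat completion.
    return body["choices"][0]["message"]["content"]
```

With a running job, `chat("http://10.0.0.1:30000", "deepseek-ai/DeepSeek-V3.1", "Hello!")` would return the model's answer as a string.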

## Architecture

clserve unifies single-node and multi-node deployments into a single template:

- **Single node, full GPU**: `--nodes-per-worker 1 --num-gpus-per-worker 4`
- **Multi-node distributed**: `--nodes-per-worker 4 --tp-size 16`
- **Multiple instances per node**: `--num-gpus-per-worker 1 --use-router`

The router is automatically configured when needed for load balancing across multiple worker processes.

## Job Naming

Jobs are automatically named with a `clserve_` prefix followed by a random ID:
- `clserve_abc123`
- `clserve_xyz789`

This prefix allows clserve to filter and show only its own jobs in `clserve status`,
without interference from other SLURM jobs.

You can override the name with `--job-name`, but keep the `clserve_` prefix; without it, the job may not appear in `clserve status`.

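The prefix-based filtering described above can be sketched in a few lines. This illustrates the idea and is not clserve's actual code: the 6-character hexadecimal ID is an assumption (the README only says "random ID"), and `make_job_name` and `own_jobs` are hypothetical names.

```python
import secrets

PREFIX = "clserve_"


def make_job_name(prefix=PREFIX):
    """Prefix plus a random ID (6 hex characters; the ID format is assumed)."""
    return prefix + secrets.token_hex(3)


def own_jobs(job_names, prefix=PREFIX):
    """Keep only the job names that carry the clserve prefix."""
    return [name for name in job_names if name.startswith(prefix)]


queue = ["clserve_abc123", "train_run_7", "clserve_xyz789", "jupyter"]
print(own_jobs(queue))  # ['clserve_abc123', 'clserve_xyz789']
```

A job submitted under a custom name such as `my_experiment` would be dropped by `own_jobs`, which is why keeping the prefix matters.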