Metadata-Version: 2.4
Name: chutes
Version: 0.5.5rc44
Summary: Chutes development kit and CLI.
Home-page: https://github.com/rayonlabs/chutes
Author: Jon Durbin
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: aiohttp[speedups]<4,>=3.10
Requires-Dist: backoff<3,>=2.2
Requires-Dist: requests>=2.32
Requires-Dist: loguru>=0.7.2
Requires-Dist: fastapi>=0.110
Requires-Dist: uvicorn<0.39,>=0.32.0
Requires-Dist: pydantic<3,>=2.9
Requires-Dist: orjson>=3.10
Requires-Dist: setuptools>=0.75
Requires-Dist: substrate-interface>=1.7.11
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.12.5
Requires-Dist: graval>=0.2.6
Requires-Dist: prometheus-client>=0.21.0
Requires-Dist: cryptography
Requires-Dist: psutil
Requires-Dist: pyjwt>=2.10.1
Requires-Dist: netifaces
Requires-Dist: pyudev
Requires-Dist: aiofiles>=23
Requires-Dist: semver
Requires-Dist: huggingface_hub
Requires-Dist: setproctitle
Requires-Dist: cllmv==0.1.3
Provides-Extra: dev
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: wheel; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: license-expression
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: summary

# Chutes!

This package provides the command line interface and development kit for use with the chutes.ai platform.

The miner code is available [here](https://github.com/rayonlabs/chutes-miner), and validator/API code [here](https://github.com/rayonlabs/chutes-api).

## 📚 Glossary

Before getting into the weeds, it might be useful to understand the terminology.

### 🐳 image

Images are simply docker images that all chutes (applications) will run on within the platform.

Images must meet a few requirements:
- Contain a cuda installation, preferably version 12.2-12.6
- Contain clinfo, opencl dev libraries, clblast, openmi, etc.
- Contain a python 3.10+ installation, where `python` and `pip` are contained within the executable path `PATH`

__*We HIGHLY, HIGHLY recommend you start with our base image: parachutes/python:3.12 to avoid dependency hell*__

### 🪂 chute

A chute is essentially an application that runs on top of an image, within the platform.  Think of a chute as a single FastAPI application.

### λ cord

A cord is a single function within the chute.  In the FastAPI analogy, this would be a single route & method.

### ✅ graval

GraVal is the graphics card validation library used to help ensure the GPUs that miners claim to be running are authentic/correct.
The library performs VRAM capacity checks, matrix multiplications seeded by device information, etc.

You don't really need to know anything about graval, except that it runs as middleware within the chute to decrypt traffic from the validator and perform additional validation steps (filesystem checks, device info challenges, pings, etc.)

## 🔐 Register

Currently, to become a user on the chutes platform, you must have a Bittensor wallet and hotkey, as authentication is performed via Bittensor hotkey signatures.
Once you are registered, you can create API keys that can be used with a simple "Authorization" header in your requests.

If you don't already have a wallet, you can create one by installing `bittensor<8`, e.g. `pip install 'bittensor<8'`  _note: you can use the newer bittensor-wallet package but it requires rust, which is absurd_

Then, create a coldkey and hotkey according to the library you installed, e.g.:
```bash
btcli wallet new_coldkey --n_words 24 --wallet.name chutes-user
btcli wallet new_hotkey --wallet.name chutes-user --n_words 24 --wallet.hotkey chutes-user-hotkey
```

Once you have your hotkey, just run:
```bash
chutes register
```

*__Don't override CHUTES_API_URL unless you are developing chutes, you can just stop here!__*

To use a development environment, simply set the `CHUTES_API_URL` environment variable accordingly to whatever your dev environment endpoint is, e.g.:
```bash
CHUTES_API_URL=https://api.chutes.dev chutes register
```

Once you've completed the registration process, you'll have a file in `~/.chutes/config.ini` which contains the configuration for using chutes.

## 🔑 Create API keys

You can create API keys, optionally limiting the scope of each key, with the `chutes keys` subcommand, e.g.:

Full admin access:
```bash
chutes keys create --name admin-key --admin
```

Access to images:
```bash
chutes keys create --name image-key --images
```

Access to a single chute.
```bash
chutes keys create --name foo-key --chute-ids 5eda1993-9f4b-5426-972c-61c33dbaf541
```

## 👨‍💻 Developer deposit

*_As of 2025-10-02, this is no longer required! You must have >= $50 balance to build images, and there is a deployment fee (also mentioned in this doc) to deploy chutes_*

### Return the developer deposit

To get your deposit back, perform a POST to the `/return_developer_deposit` endpoint, e.g.:
```bash
curl -XPOST https://api.chutes.ai/return_developer_deposit \
  -H 'content-type: application/json' \
  -H 'authorization: cpk_...' \
  -d '{"address": "5EcZsewZSTxUaX8gwyHzkKsqT3NwLP1n2faZPyjttCeaPdYe"}'
```

## 🛠️ Building an image

The first step in getting an application onto the chutes platform is to build an image.
This SDK includes an image creation helper library as well, and we have a recommended base image which includes python 3.12 and all necessary cuda packages: `parachutes/python:3.12`

Here is an entire chutes application, which has an image that includes `vllm` -- let's store it in `llama1b.py`:

```python
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
from chutes.image import Image

image = (
    Image(username="chutes", name="vllm", tag="0.6.3", readme="## vLLM - fast, flexible llm inference")
    .from_base("parachutes/python:3.12")
    .run_command("pip install 'vllm<0.6.4' wheel packaging")
    .run_command("pip install flash-attn")
    .run_command("pip uninstall -y xformers")
)

chute = build_vllm_chute(
    username="chutes",
    readme="## Meta Llama 3.2 1B Instruct\n### Hello.",
    model_name="unsloth/Llama-3.2-1B-Instruct",
    image=image,
    node_selector=NodeSelector(
        gpu_count=1,
    ),
)
```

The `chutes.image.Image` class includes many helper directives for environment variables, adding files, installing python from source, etc.

To build this image, you can use the chutes CLI:
```bash
chutes build llama1b:chute --public --wait --debug
```

Explanation of the flags:
- `--public` means we want this image to be public/available for ANY user to use -- use with care but we do like public/open source things!
- `--wait` means we want to stream the docker build logs back to the command line.  All image builds occur remotely on our platform, so without the `--wait` flag you just have to wait for the image to become available, whereas with this flag you can see real-time logs/status.
- `--debug` additional debug logging

## 🚀 Deploying a chute

Once you have an image that is built and pushed and ready for use (see above), you can deploy applications on top of those.

To use the same example `llama1b.py` file outlined in the image building section above, we can deploy the llama-3.2-1b-instruct model with:
```bash
chutes deploy llama1b:chute
```

*Note: this will ERROR and show you the deployment fee, as a safety mechanism, so you can confirm you want to accept that fee*

To acknowledge and accept the fee you must pass `--accept-fee`, e.g. `chutes deploy llama1b:chute --accept-fee`

### Deployment fee

You are charged a one-time deployment fee per chute, equivalent to 3 times the hourly rate based on the node selector (meaning, `gpu_count` * cheapest compatible GPU type hourly rate). There is no deployment fee for any updates to existing chutes.

For example, if the `node_selector` has `gpu_count=1` and nothing else, the cheapest compatible GPU is $0.1/hr, so your deployment fee is $0.3.

### Node selector configuration

Be sure to carefully craft the `node_selector` option within the chute, to ensure the code runs on GPUs appropriate to the task.
```python
node_selector=NodeSelector(
    gpu_count=1,
    # All options.
    # gpu_count: int = Field(1, ge=1, le=8)
    # min_vram_gb_per_gpu: int = Field(16, ge=16, le=80)
    # include: Optional[List[str]] = None
    # exclude: Optional[List[str]] = None
),
```

The most important fields are `gpu_count` and `min_vram_gb_per_gpu`.  If you wish to include specific GPUs, you can do so, where the `include` (or `exclude`) fields are the short identifier per model, e.g. `"a6000"`, `"a100"`, etc.  [All supported GPUs and their short identifiers](https://github.com/rayonlabs/chutes-api/blob/main/api/gpu.py)

### Scaling & billing of user-deployed chutes

All user-created chutes are charged at the standard hourly (per gpu, based on your `gpu_count` value in node selector), based on the cheapest compatible GPU type in the `node_selector` definition: https://api.chutes.ai/pricing
For example, if your chute can run on either a100 or h100, you are charged as though all instances are a100, even if it happens to deploy on h100s.

You can configure how much the chute will scale up, how quickly it scales up, and how quickly to spin down with the following flags:
```python
chute = Chute(
    ...,
    concurrency=10,
    max_instances=3,
    scaling_threshold=0.5,
    shutdown_after_seconds=300
)
```

#### concurrency (int, default=1)
This controls the maximum number of requests each instance can handle concurrently, which is dependent entirely on your code. For vLLM and SGLang template chutes, this value can be fairly high, e.g. 32+

#### max_instances (int, default=1)
Maximum number of instances that can be active at a time.

#### scaling_threshold (float, default=0.75)
The ratio of average requests in flight per instance that will trigger creation of another instance, when the number of instances is lower than the configured `max_instances` value.  For example, if your `concurrency` is set to 10, and your `scaling_threshold` is 0.5, and `max_instances` is 2 and you have one instance now, you will trigger a scale up of another instance once the platform observes you have 5 or more requests on average in flight consistently (i.e., you are using 50% of the concurrency supported by your chute).

#### shutdown_after_seconds (int, default=300)
The number of seconds to wait after the last request (per instance) before shutting down the instance to avoid incurring any additional charges.

#### Billable items and mechanism

Deployment fee: You are charged a one-time deployment fee per chute, equivalent to 3 times the hourly rate based on the node selector (meaning, `gpu_count` * cheapest compatible GPU type hourly rate). No deployment fee for any updates to existing chutes.

You are charged the standard hourly rate while any instance is hot, based on your criteria specified above, up through last request timestamp + `shutdown_after_seconds`

You are not charged for "cold start" times (e.g., downloading the model, downloading the chute image, etc.).  You are, however, charged for the `shutdown_after_seconds` seconds of compute while the instance is hot but not actively being called, because it keeps the instance hot.

For example:
- deploy a chute at 12:00:00 (new chute, one time node-selector based deployment fee, let's say a single 3090 at $0.12/hr = $0.36 total fee)
  - `max_instances` set to 1, `shutdown_after_seconds` set to 300
- send requests to the chute and/or call warmup endpoint: 12:00:01 (no charge)
- first instance becomes hot and ready for use: 12:00:30 (billing at $0.12/hr starts here)
- continuously send requests to the instance (no per-request inference charges)
- stop sending requests at 12:05:00
  - triggers the instance shutdown timer based on `shutdown_after_seconds` for 5 minutes...
- instance chutes down 12:10:00 (billing stops here)

Total charges are: $0.36 deployment fee + 5 minutes at $0.12/hr of active compute + 5 minutes `shutdown_after_seconds` = $0.38

Now, suppose you want to use that chute again:
- start requests at 13:00:00
- instance becomes hot at 13:00:30 (billing starts at $0.12/hr here)
- stop requests at 13:05:30
- instance stays hot due to `shutdown_after_seconds` for 5 minutes

Total additional charges = 5 minutes active compute + 5 minute shutdown delay = 10 minutes @ $0.12/hr = $0.02

*If you share a chute with another user, they also pay standard rates for usage on the chute!*

## 👥 Sharing a chute

For any user-deployed chutes, the chutes are private, but they can be shared. You can either use the `chutes share` entrypoint, or call the API endpoint directly.

```bash
chutes share --chute-id unsloth/Llama-3.2-1B-Instruct --user-id anotheruser
```

The `--chute-id` parameter can either be the chute name or the UUID.
Likewise, `--user-id` can be either the username or the user's UUID.

### Billing

When you share a chute with another user, you authorize that user to trigger the chute to scale up, and *you* as the chute owner are charged the hourly rate while it's running.

When the user you shared the chute with calls the chute, they are charged the standard rate (dependent on chute type, e.g. per million token for llms, per step on diffusion models, per second otherwise).

## ⚙️  Building custom/non-vllm chutes

Chutes are in fact completely arbitrary, so you can customize to your heart's content.

Here's an example chute showing some of this functionality:
```python
import asyncio
from typing import Optional
from pydantic import BaseModel, Field
from fastapi.responses import FileResponse
from chutes.image import Image
from chutes.chute import Chute, NodeSelector

image = (
    Image(username="chutes", name="foo", tag="0.1", readme="## Base python+cuda image for chutes")
    .from_base("parachutes/python:3.12")
)

chute = Chute(
    username="test",
    name="example",
    readme="## Example Chute\n\n### Foo.\n\n```python\nprint('foo')```",
    image=image,
    concurrency=4,
    node_selector=NodeSelector(
        gpu_count=1,
        # All options.
        # gpu_count: int = Field(1, ge=1, le=8)
        # min_vram_gb_per_gpu: int = Field(16, ge=16, le=80)
        # include: Optional[List[str]] = None
        # exclude: Optional[List[str]] = None
    ),
    allow_external_egress=False,
)


class MicroArgs(BaseModel):
    foo: str = Field(..., max_length=100)
    bar: int = Field(0, gte=0, lte=100)
    baz: bool = False


class FullArgs(MicroArgs):
    bunny: Optional[str] = None
    giraffe: Optional[bool] = False
    zebra: Optional[int] = None


class ExampleOutput(BaseModel):
    foo: str
    bar: str
    baz: Optional[str]


@chute.on_startup()
async def initialize(self):
    self.billygoat = "billy"
    print("Inside the startup function!")


@chute.cord(minimal_input_schema=MicroArgs)
async def echo(self, input_args: FullArgs) -> str:
    return f"{self.billygoat} says: {input_args}"


@chute.cord()
async def complex(self, input_args: MicroArgs) -> ExampleOutput:
    return ExampleOutput(foo=input_args.foo, bar=input_args.bar, baz=input_args.baz)


@chute.cord(
    output_content_type="image/png",
    public_api_path="/image",
    public_api_method="GET",
)
async def image(self) -> FileResponse:
    return FileResponse("parachute.png", media_type="image/png")


async def main():
    print(await echo("bar"))

if __name__ == "__main__":
    asyncio.run(main())
```

The main thing to notice here are the various the `@chute.cord(..)` decorators and `@chute.on_startup()` decorator.

Any code within the `@chute.on_startup()` decorated function(s) are executed when the application starts on the miner, it does not run in the local/client context.

Any function that you decorate with `@chute.cord()` becomes a function that runs within the chute, i.e. not locally - it's executed on the miners' hardware.

It is very important to give type hints to the functions, because the system will automatically generate OpenAPI schemas for each function for use with the public/hostname based API using API keys instead of requiring the chutes SDK to execute.

For a cord to be available from the public, subdomain based API, you need to specify `public_api_path` and `public_api_method`, and if the return content type is anything other than `application/json`, you'll want to specify that as well.

You can also spin up completely arbitrary webservers and do "passthrough" cords which pass along the request to the underlying webserver. This would be useful for things like using a webserver written in a different programming language, for example.

To see an example of passthrough functions and more complex functionality, see the [vllm template chute/helper](https://github.com/rayonlabs/chutes/blob/main/chutes/chute/template/vllm.py)

It is also very important to specify `concurrency=N` in your `Chute(..)` constructor.  In many cases, e.g. vllm, this can be fairly high (based on max sequences), where in other cases without data parallelism or other cases with contention, you may wish to leave it at the default of 1.

`allow_external_egress=(True|False)` is a flag indicating if network connections should be blocked after the chute has finished running all on_startup(..) hooks (e.g. downloading model weights, which obviously require networking). This won't block local connections, e.g. if you use sglang or comfyui or other daemon and proxy requests from the chute, those will be allowed, but for example you won't be able to fetch remote assets if this is disabled.

By default, allow_external_egress is __true__ for all custom chutes and most templates, but __false__ for vllm, sglang, and embedding templates!! This means, for example, if you are running sglang/vllm for a vision language model such as qwen3-vl variants, you should add `allow_external_egress=True` to the `Chute(..)` constructor to allow `image_url`.

## 🧪 Local testing

If you'd like to test your image/chute before actually deploying onto the platform, you can build the images with `--local`, then run in dev mode:
```bash
chutes build llama1b:chute --local
```

Then, you can start a container with that image:
```bash
docker run --rm -it -e CHUTES_EXECUTION_CONTEXT=REMOTE -p 8000:8000 vllm:0.6.3 chutes run llama1b:chute --port 8000 --dev
```

Then, you can simply perform http requests to your instance.
```bash
curl -XPOST http://127.0.0.1:8000/chat_stream -H 'content-type: application/json' -d '{
  "model": "unsloth/Llama-3.2-1B-Instruct",
  "messages": [{"role": "user", "content": "Give me a spicy mayo recipe."}],
  "temperature": 0.7,
  "seed": 42,
  "max_tokens": 3,
  "stream": True,
  "logprobs": True,
}'
```

