Metadata-Version: 2.2
Name: tractorun
Version: 0.61.0
Summary: Run distributed training in TractoAI
Author: TractoAI team
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: attrs>=23.1.0
Requires-Dist: cattrs>=23.1.0
Requires-Dist: ytsaurus-client>=0.13.16
Requires-Dist: ytsaurus-yson>=0.4.5
Requires-Dist: PyYAML>=5.4
Provides-Extra: torch
Requires-Dist: torch~=2.3.0; extra == "torch"
Provides-Extra: jax
Requires-Dist: jax~=0.4.24; extra == "jax"
Provides-Extra: tests
Requires-Dist: pytest<9; extra == "tests"
Requires-Dist: pytest-xdist~=3.6.1; extra == "tests"
Requires-Dist: testcontainers-yt-local==0.14.0; extra == "tests"
Provides-Extra: dev
Requires-Dist: attrs~=23.1.0; extra == "dev"
Requires-Dist: cattrs~=23.1.0; extra == "dev"
Requires-Dist: ytsaurus-client~=0.13.16; extra == "dev"
Requires-Dist: ytsaurus-yson~=0.4.5; extra == "dev"
Requires-Dist: isort~=5.13.2; extra == "dev"
Requires-Dist: black~=24.4.2; extra == "dev"
Requires-Dist: ruff~=0.4.4; extra == "dev"
Requires-Dist: mypy~=1.10.0; extra == "dev"
Requires-Dist: mypy-extensions~=1.0.0; extra == "dev"
Requires-Dist: types-PyYAML~=6.0.12; extra == "dev"
Requires-Dist: bump-my-version~=0.18.3; extra == "dev"
Requires-Dist: generate-changelog~=0.10.0; extra == "dev"
Requires-Dist: twine~=5.1.0; extra == "dev"
Provides-Extra: examples
Requires-Dist: wandb; extra == "examples"
Requires-Dist: numpy~=1.26.4; extra == "examples"
Requires-Dist: pytorch-lightning~=2.2.5; extra == "examples"
Provides-Extra: tensorproxy
Requires-Dist: orbax-checkpoint==0.5.2; extra == "tensorproxy"
Requires-Dist: ytpath==0.0.3; extra == "tensorproxy"
Requires-Dist: tensorstore==0.1.56.post6; extra == "tensorproxy"
Requires-Dist: numpy~=1.26.4; extra == "tensorproxy"
Provides-Extra: all
Requires-Dist: torch~=2.3.0; extra == "all"
Requires-Dist: jax~=0.4.24; extra == "all"
Requires-Dist: pytest<9; extra == "all"
Requires-Dist: pytest-xdist~=3.6.1; extra == "all"
Requires-Dist: testcontainers-yt-local==0.14.0; extra == "all"
Requires-Dist: attrs~=23.1.0; extra == "all"
Requires-Dist: cattrs~=23.1.0; extra == "all"
Requires-Dist: ytsaurus-client~=0.13.16; extra == "all"
Requires-Dist: ytsaurus-yson~=0.4.5; extra == "all"
Requires-Dist: isort~=5.13.2; extra == "all"
Requires-Dist: black~=24.4.2; extra == "all"
Requires-Dist: ruff~=0.4.4; extra == "all"
Requires-Dist: mypy~=1.10.0; extra == "all"
Requires-Dist: mypy-extensions~=1.0.0; extra == "all"
Requires-Dist: types-PyYAML~=6.0.12; extra == "all"
Requires-Dist: bump-my-version~=0.18.3; extra == "all"
Requires-Dist: generate-changelog~=0.10.0; extra == "all"
Requires-Dist: twine~=5.1.0; extra == "all"
Requires-Dist: wandb; extra == "all"
Requires-Dist: numpy~=1.26.4; extra == "all"
Requires-Dist: pytorch-lightning~=2.2.5; extra == "all"

![img.png](https://raw.githubusercontent.com/tractoai/tractorun/refs/heads/main/docs/_static/img.png)

# 🚜 Tractorun

`Tractorun` is a powerful tool for distributed ML operations on the [Tracto.ai](https://tracto.ai/) platform. It helps manage and run workflows across multiple nodes with minimal changes to the user's code:
* Training and fine-tuning models. Use Tractorun to train models efficiently across multiple compute nodes.
* Offline batch inference. Perform fast and scalable model inference.
* Running arbitrary GPU operations, ideal for any computational task that requires distributed GPU resources.

## How it works

Built on top of [Tracto.ai](https://tracto.ai/), `Tractorun` coordinates distributed machine learning tasks. It ships with out-of-the-box integrations for PyTorch and Jax, and it can easily be used with any other training or inference framework.

Key advantages:
* No need to manage your cloud infrastructure, such as configuring a Kubernetes cluster or managing GPU and InfiniBand drivers. Tracto.ai solves all these infrastructure problems for you.
* No need to coordinate distributed processes. Tractorun handles it based on the training configuration: the number of nodes and GPUs used.

Key features:
* Simple distributed task setup: just specify the number of nodes and GPUs.
* Convenient ways to run and configure: CLI, YAML config, and Python SDK.
* A range of powerful capabilities, including [sidecars](https://github.com/tractoai/tractorun/blob/main/docs/options.md#sidecar) for auxiliary tasks and transparent [mounting](https://github.com/tractoai/tractorun/blob/main/docs/options.md#bind-local) of local files directly into distributed operations.
* Integration with the Tracto.ai platform: use datasets and checkpoints stored in the Tracto.ai storage, build pipelines with Tractorun, MapReduce, Clickhouse, Spark, and more.

# Getting started

To use these examples, you'll need a Tracto account. If you don't have one yet, please sign up at [tracto.ai](https://tracto.ai/).

Install tractorun into your Python 3 environment:

`pip install --upgrade tractorun`

Set `$YT_PROXY` to your Tracto.ai cluster address and `$YT_TOKEN` to your token, then configure the client:

```shell
mkdir ~/.yt
cat <<EOF > ~/.yt/config
{
  "proxy"={
    "url"="$YT_PROXY";
  };
  "token"="$YT_TOKEN";
}
EOF
```

# How to try

Run an example script:

```shell
tractorun \
    --yt-path "//tmp/$USER/tractorun_getting_started" \
    --bind-local './examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py:/lightning_mnist_ddp_script.py' \
    --bind-local-lib ./tractorun \
    --docker-image ghcr.io/tractoai/tractorun-examples-runtime:2025-02-25-18-24-08 \
    python3 /lightning_mnist_ddp_script.py
```

# How to run

## CLI

`tractorun --help`

or with yaml config

`tractorun --run-config-path config.yaml`

You can find relevant examples:
* CLI arguments [example](https://github.com/tractoai/tractorun/tree/main/examples/pytorch/lightning_mnist_ddp_script).
* YAML config [example](https://github.com/tractoai/tractorun/tree/main/examples/pytorch/lightning_mnist_ddp_script_config).
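As a sketch, a `config.yaml` roughly equivalent to the CLI invocation in "How to try" might look like the following. The key names here are assumptions for illustration only; consult the linked YAML config example for the authoritative schema:

```yaml
# Hypothetical config.yaml; key names are illustrative assumptions --
# check the linked YAML config example for the real schema.
yt_path: "//tmp/my_user/tractorun_getting_started"
docker_image: "ghcr.io/tractoai/tractorun-examples-runtime:2025-02-25-18-24-08"
bind_local:
  - source: ./lightning_mnist_ddp_script.py
    destination: /lightning_mnist_ddp_script.py
command:
  - python3
  - /lightning_mnist_ddp_script.py
```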

## Python SDK

The SDK is convenient to use from Jupyter notebooks for development purposes.

You can find a relevant example in [the repository](https://github.com/tractoai/tractorun/tree/main/examples/pytorch/lightning_mnist).

WARNING: to use the SDK, the local environment must match the remote Docker image on the Tracto.ai platform.
* This requirement is met in Jupyter notebooks on the Tracto.ai platform.
* For local use, it is recommended to run the code in the same container as specified in the `docker_image` parameter of `tractorun`.

# How to adapt code for tractorun

## CLI

1. Wrap all training/inference code into a function.
2. Initialize the environment and obtain a Toolbox via `tractorun.run.prepare_and_get_toolbox`.

An example of adapting the mnist training from the [PyTorch repository](https://github.com/pytorch/examples/blob/cdef4d43fb1a2c6c4349daa5080e4e8731c34569/mnist/mnist_simple/main.py): https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/cli
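The two steps above can be sketched as follows. This assumes the PyTorch backend; the backend choice and exact import paths are assumptions and should be checked against the linked example:

```python
# Sketch of CLI-mode adaptation; the backend choice (Tractorch) and the
# keyword argument are assumptions -- see the linked example for a
# working version.
from tractorun.backend.tractorch import Tractorch
from tractorun.run import prepare_and_get_toolbox

def train() -> None:
    toolbox = prepare_and_get_toolbox(backend=Tractorch())
    # ... original training/inference code, now using `toolbox` ...

if __name__ == "__main__":
    train()
```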

## SDK

1. Wrap all training/inference code into a function that takes a `toolbox: tractorun.toolbox.Toolbox` parameter.
2. Run this function with `tractorun.run.run`.

An example of adapting the mnist training from the [PyTorch repository](https://github.com/pytorch/examples/blob/cdef4d43fb1a2c6c4349daa5080e4e8731c34569/mnist/main.py): https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/sdk
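A minimal sketch of the SDK pattern is shown below. The `Mesh` parameters and keyword argument names are assumptions for illustration; see the linked example for a working version:

```python
# Sketch of SDK usage; Mesh parameters and keyword names are assumptions --
# consult the linked example for the real signatures.
from tractorun.backend.tractorch import Tractorch
from tractorun.mesh import Mesh
from tractorun.run import run
from tractorun.toolbox import Toolbox

def train(toolbox: Toolbox) -> None:
    # ... training/inference code; the toolbox gives access to yt_client,
    # checkpoint_manager, and other integrations ...
    pass

run(
    train,
    backend=Tractorch(),
    yt_path="//tmp/my_user/tractorun_demo",
    mesh=Mesh(node_count=1, process_per_node=1, gpu_per_process=0),
    docker_image="ghcr.io/tractoai/tractorun-examples-runtime:2025-02-25-18-24-08",
)
```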

# Features

## Toolbox

`tractorun.toolbox.Toolbox` provides extra integrations with the Tracto.ai platform:
* A preconfigured client via `toolbox.yt_client`
* Basic checkpointing via `toolbox.checkpoint_manager`
* Control over the operation description in the UI via `toolbox.description_manager`
* Access to coordination information via `toolbox.coordinator`

[Toolbox page](https://github.com/tractoai/tractorun/blob/main/docs/toolbox.md) provides an overview of all available toolbox components.

## Coordination

Tractorun always sets the following environment variables in each process:
* `MASTER_ADDR` - the address of the master node
* `MASTER_PORT` - the port of the master node
* `WORLD_SIZE` - the total number of processes
* `NODE_RANK` - the unique id of the current node (job in terms of Tracto.ai)
* `LOCAL_RANK` - the unique id of the current process on the current node
* `RANK` - the unique id of the current process across all nodes
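For instance, a script that doesn't use a framework integration can read these variables directly. In this sketch the fallback defaults (localhost, port 29500, single process) are assumptions added so the same script can also run standalone outside tractorun:

```python
import os

# Read the coordination variables tractorun sets in every process.
# The defaults are only a convenience for running outside tractorun
# (single process on localhost); the port default is an assumption.
def coordination_info() -> dict:
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "node_rank": int(os.environ.get("NODE_RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "rank": int(os.environ.get("RANK", "0")),
    }
```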

### Backends

Backends configure `tractorun` to work with a specific ML framework.

Tractorun supports multiple backends:
* [Tractorch](https://github.com/tractoai/tractorun/tree/main/tractorun/backend/tractorch) for PyTorch
  * [examples](https://github.com/tractoai/tractorun/tree/main/examples/pytorch)
* [Tractorax](https://github.com/tractoai/tractorun/tree/main/tractorun/backend/tractorax) for Jax
  * [examples](https://github.com/tractoai/tractorun/tree/main/examples/jax)
* [Generic](https://github.com/tractoai/tractorun/tree/main/tractorun/backend/generic)
  * non-specialized backend, can be used as a basis for other backends

[Backend page](https://github.com/tractoai/tractorun/blob/main/docs/backend.md) provides an overview of all available backends.

# Options and settings

The [Options reference](https://github.com/tractoai/tractorun/blob/main/docs/options.md) page provides an overview of all available `tractorun` options, explaining their purpose and usage. Options can be defined via:
* CLI parameters
* a YAML config
* Python SDK options

# How to enable logs

To enable logs, set the `YT_LOG_LEVEL` environment variable. The following levels are supported:
* `DEBUG`
* `INFO`
* `WARNING`
* `ERROR`
* `CRITICAL`

By default, `tractorun` doesn't write logs on the local host, but it does write logs inside the operation on the Tracto.ai platform at the `INFO` level. If `YT_LOG_LEVEL` is set, logs are also written to stderr on the local host.

# More information

* [Examples](https://github.com/tractoai/tractorun/tree/main/examples)
* [More examples in Jupyter Notebooks](https://github.com/tractoai/tracto-examples)
