Metadata-Version: 2.4
Name: instructlab-training
Version: 0.15.1
Summary: Training Library
Author-email: InstructLab <dev@instructlab.ai>
License: Apache-2.0
Project-URL: homepage, https://instructlab.ai
Project-URL: source, https://github.com/instructlab/training
Project-URL: issues, https://github.com/instructlab/training/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: License :: OSI Approved :: Apache Software License
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: packaging>=20.9
Requires-Dist: wheel>=0.43
Requires-Dist: pyyaml
Requires-Dist: py-cpuinfo
Requires-Dist: torch>=2.6.0
Requires-Dist: transformers>=4.57.1
Requires-Dist: datasets>=2.15.0
Requires-Dist: numba>=0.62.0
Requires-Dist: numpy>=1.26.4
Requires-Dist: rich
Requires-Dist: trl>=0.9.4
Requires-Dist: peft
Requires-Dist: pydantic>=2.7.0
Requires-Dist: aiofiles>=23.2.1
Provides-Extra: cuda
Requires-Dist: flash-attn>=2.4.0; extra == "cuda"
Requires-Dist: bitsandbytes>=0.43.1; extra == "cuda"
Requires-Dist: kernels>=0.9.0; extra == "cuda"
Requires-Dist: accelerate>=0.34.2; extra == "cuda"
Requires-Dist: liger-kernel>=0.5.10; extra == "cuda"
Requires-Dist: mamba-ssm[causal-conv1d]>=2.2.5; extra == "cuda"
Provides-Extra: rocm
Requires-Dist: flash-attn>=2.6.2; extra == "rocm"
Requires-Dist: accelerate>=0.34.2; extra == "rocm"
Requires-Dist: mamba-ssm[causal-conv1d]>=2.2.5; extra == "rocm"
Provides-Extra: hpu
Provides-Extra: deepspeed
Requires-Dist: deepspeed>=0.14.3; extra == "deepspeed"
Provides-Extra: mlflow
Requires-Dist: mlflow; extra == "mlflow"
Provides-Extra: wandb
Requires-Dist: wandb; extra == "wandb"
Provides-Extra: tensorboard
Requires-Dist: tensorboard; extra == "tensorboard"
Dynamic: license-file

# InstructLab Training Library

![Lint](https://github.com/instructlab/training/actions/workflows/lint.yml/badge.svg?branch=main)
![Build](https://github.com/instructlab/training/actions/workflows/pypi.yaml/badge.svg?branch=main)
![Release](https://img.shields.io/github/v/release/instructlab/training)
![License](https://img.shields.io/github/license/instructlab/training)

![`e2e-nvidia-l40s-x4.yml` on `main`](https://github.com/instructlab/training/actions/workflows/e2e-nvidia-l40s-x4.yml/badge.svg?branch=main)

## About the Library

The InstructLab Training library is an optimized model instruction-tuning library designed for messages-format data. It can efficiently fine-tune causal language models, and works with both base models and previously aligned models that have existing chat templates. This library was used to achieve the results found in [Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs](https://arxiv.org/abs/2412.13337).

To simplify the process of fine-tuning models with the [LAB
method](https://arxiv.org/abs/2403.01081), or for general use, this library provides a simple Pythonic training interface.

### Reasoning Content Support

The library now supports reasoning traces through the `reasoning_content` field in message samples. This enables training models that can handle both regular content and structured reasoning traces, making it ideal for training reasoning-capable models that can separate their thinking process from their final output.

## Usage and Guidance Sections

- [Installing](#installing-the-library)
  - [Additional Nvidia packages](#additional-nvidia-packages)
  - [Optional logging dependencies](#optional-logging-dependencies)
- [Using the library](#using-the-library)
- [Data format](#data-format)
  - [Reasoning content support](#reasoning-content-support-1)
- [Continual pretraining mode](#continual-pretraining-mode)
- [Documentation](#documentation)
- [Learning about the training arguments](#learning-about-training-arguments)
  - [`TrainingArgs`](#trainingargs)
  - [`DeepSpeedOptions`](#deepspeedoptions)
  - [`FSDPOptions`](#fsdpoptions)
  - [`LoraOptions`](#loraoptions)
- [Learning about `TorchrunArgs` arguments](#learning-about-torchrunargs-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Installing the library

To get started with the library, install it from PyPI:

```bash
pip install instructlab-training
```

Alternatively, clone this repository and install it in editable mode for development:

```bash
git clone https://github.com/instructlab/training.git
pip install -e ./training
```

### Additional NVIDIA packages

This library uses the `flash-attn` package, among others, which requires NVIDIA-specific CUDA tooling to be installed.
If you are using NVIDIA hardware with CUDA, you need to install the following additional dependencies.

Basic install

```bash
pip install .[cuda]
```

Editable install (development)

```bash
pip install -e .[cuda]
```

### Optional Logging Dependencies

The library supports optional logging backends for experiment tracking. Install the ones you need:

```bash
# MLflow logging
pip install mlflow

# Weights & Biases logging
pip install wandb

# TensorBoard logging
pip install tensorboard
```

For more details on configuring logging, see the [Logging Documentation](docs/logging.md).

## Using the library

See the `examples` directory for guided sample notebooks on library usage. The sections below provide additional detail on library options.

You can utilize this training library by importing the necessary items.

```py
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
    DeepSpeedOptions
)
```

You can then define various training arguments. They will serve as the parameters for your training runs. See:

- [Learning about the training argument](#learning-about-training-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Data format

The library expects training data in the messages format, where each sample contains a list of messages with different roles (user, assistant, system, etc.). Each message should have at minimum:

- `role`: The role of the message sender (e.g., "user", "assistant", "system")
- `content`: The main content of the message

### Reasoning content support

The library now supports an optional `reasoning_content` field in addition to the standard `content` field. This enables training models with structured reasoning traces. The `reasoning_content` field is particularly useful for:

- Training reasoning-capable models that can separate their thinking process from their output
- Supporting models that need to generate internal reasoning traces
- Enabling step-by-step reasoning in model responses

> **Note**: this is only supported for models whose chat templates use the DeepSeek R1-style parser. Models such as Phi-4 that lack a custom thought processor must still provide their reasoning traces in the `content` field.

**Example message structure with reasoning content:**

```json
{
  "messages": [
    {
      "role": "user",
      "content": "What is 15 * 23?"
    },
    {
      "role": "assistant",
      "reasoning_content": "I need to multiply 15 by 23. Let me break this down: 15 * 23 = 15 * (20 + 3) = 15 * 20 + 15 * 3 = 300 + 45 = 345",
      "content": "15 * 23 = 345"
    }
  ]
}
```
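
Datasets in this format are typically stored as JSON Lines, one sample per line. A minimal sketch of writing such a file with the standard library (the file name is illustrative):

```python
import json

# Two samples in the messages format; the second carries a reasoning trace.
samples = [
    {
        "messages": [
            {"role": "user", "content": "Hello! How are you?"},
            {"role": "assistant", "content": "I'm doing well, thanks!"},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "What is 15 * 23?"},
            {
                "role": "assistant",
                "reasoning_content": "15 * 23 = 15 * 20 + 15 * 3 = 300 + 45 = 345",
                "content": "15 * 23 = 345",
            },
        ]
    },
]

# Write one JSON object per line (.jsonl).
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

The resulting path can then be passed as `data_path` in `TrainingArgs`.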

## Continual pretraining mode

In addition to instruction tuning, the library can run document-style continual pretraining on raw text corpora.
Enable this by supplying a block size when invoking `main_ds.py`:

```bash
torchrun main_ds.py \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --data_path /data/documents.jsonl \
  --ckpt_output_dir ./checkpoints \
  --effective_batch_size 128 \
  --max_batch_len 60000 \
  --block-size 8192 \
  --document-column-name text  # optional, defaults to "document"
```

- `--block-size` (required) toggles continual pretraining and controls how many tokens are packed into each block.
- `--document-column-name` (optional) specifies which JSONL field contains the raw document text.

The same options are available programmatically via `TrainingArgs.pretraining_config`:

```python
from instructlab.training import TrainingArgs, PretrainingConfig

train_args = TrainingArgs(
    model_name_or_path="mistralai/Mistral-7B-v0.1",
    data_path="documents.jsonl",
    ckpt_output_dir="./checkpoints",
    max_seq_len=4096,
    max_batch_len=40000,
    effective_batch_size=128,
    pretraining_config=PretrainingConfig(
        block_size=2048,
        document_column_name="text",  # optional
    ),
)
```

When a pretraining config is provided, `process_documents_for_pretraining()` is invoked under the hood to tokenize raw documents before training.
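
Conceptually, continual pretraining tokenizes each document, concatenates the token streams, and slices them into fixed-size blocks. A toy sketch of that packing step (the whitespace "tokenizer" and the function name are illustrative, not the library's internals):

```python
def pack_into_blocks(documents, block_size):
    """Concatenate tokenized documents and slice them into fixed-size blocks.

    Uses a trivial whitespace tokenizer purely for illustration; the
    library uses the model's real tokenizer.
    """
    stream = []
    for doc in documents:
        stream.extend(doc.split())  # stand-in for real tokenization
    # Drop the trailing remainder that does not fill a whole block.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# 9 tokens total, sliced into 3 blocks of 3 tokens each.
blocks = pack_into_blocks(["the quick brown fox", "jumps over the lazy dog"], block_size=3)
```

In the real pipeline the `block_size` also bounds the per-sample sequence length seen by the model.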

**Standard message structure:**

```json
{
  "messages": [
    {
      "role": "user", 
      "content": "Hello! How are you?"
    },
    {
      "role": "assistant",
      "content": "Hello! I'm doing well, thank you for asking. How can I help you today?"
    }
  ]
}
```
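
A quick sanity check for the schema above can be written in a few lines; this validator is illustrative (the function name is ours, not part of the library):

```python
def validate_sample(sample):
    """Return a list of problems found in one messages-format sample."""
    problems = []
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["sample must contain a non-empty 'messages' list"]
    for i, msg in enumerate(messages):
        if "role" not in msg:
            problems.append(f"message {i}: missing 'role'")
        if "content" not in msg:
            problems.append(f"message {i}: missing 'content'")
    return problems

good = {"messages": [{"role": "user", "content": "Hi"}]}
bad = {"messages": [{"role": "user"}]}
```

Running such a check over a `.jsonl` file before training catches malformed samples early.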

### Important Notes

1. **Automatic reasoning content processing**: If `reasoning_content` exists in a message, it will always be processed and unmasked as long as the message role is targeted for unmasking. This ensures that reasoning traces are properly included in the training data.

2. **DeepSeek R1 Thinking Compatibility**: Models using the DeepSeek R1 thought processor (such as Qwen3) must supply their thinking traces in the `reasoning_content` field to be processed correctly. Failure to do so may result in improper handling of reasoning tokens and suboptimal training performance.

## Documentation

For detailed information about specific features:

- **[Reasoning Content Support](docs/reasoning_content.md)**: Comprehensive guide to using the `reasoning_content` field for training reasoning-capable models
- **[CI Documentation](docs/ci.md)**: Information about continuous integration processes
- **[Logging Documentation](docs/logging.md)**: Guide to logging configuration and usage

## Learning about training arguments

The `TrainingArgs` class provides most of the customization options
for training jobs. There are a number of options you can specify, such as setting
`DeepSpeed` config values or running a `LoRA` training job instead of a full fine-tune.

### `TrainingArgs`

| Field | Description |
| --- | --- |
| model_path | Either a reference to a HuggingFace repo or a path to a model saved in the HuggingFace format.  |
| data_path | A path to the `.jsonl` training dataset. This is expected to be in the messages format.  |
| ckpt_output_dir | Directory where trained model checkpoints will be saved. |
| data_output_dir | Directory where the processed training data is stored (post filtering/tokenization/masking) |
| max_seq_len | The maximum sequence length to be included in the training set. Samples exceeding this length will be dropped. |
| max_batch_len | Maximum number of tokens per GPU in each batch processed in a single step. Used as part of the multipack calculation. If you run into out-of-memory errors, try lowering this value, but not below `max_seq_len`. |
| num_epochs | Number of epochs to run through before stopping. |
| effective_batch_size | The number of samples in a batch to process before updating the model parameters. |
| save_samples | Number of samples the model should see before saving a checkpoint. Consider this to be the checkpoint save frequency. |
| learning_rate | How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size. |
| warmup_steps | The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to `learning_rate`. |
| random_seed | The random seed PyTorch will use. |
| mock_data | Whether to use mock, randomly generated data during training. For debugging purposes. |
| mock_data_len | Max length of a single mock data sample. Equivalent to `max_seq_len` but for mock data. |
| deepspeed_options | Config options to specify for the DeepSpeed optimizer. |
| lora | Options to specify if you intend to perform a LoRA train instead of a full fine-tune. |
| chat_tmpl_path | Specifies the chat template / special tokens for training. |
| checkpoint_at_epoch | Whether or not we should save a checkpoint at the end of each epoch. |
| fsdp_options | The settings for controlling FSDP when it's selected as the distributed backend. |
| distributed_backend | Specifies which distributed training backend to use. Supported options are "fsdp" and "deepspeed". |
| disable_flash_attn | Disables flash attention when set to true. This allows for training on older devices. |
| keep_last_checkpoint_only | When enabled, only the last checkpoint directory is kept and the previous checkpoint is overwritten each time. The checkpoint directory is named `last_epoch`. |
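
To build intuition for `max_batch_len`, the multipack idea can be sketched as greedy packing of per-sample token counts into batches that never exceed the token budget. This is a simplification of the library's actual algorithm, shown only for illustration:

```python
def greedy_pack(sample_lengths, max_batch_len):
    """Greedily pack sample token counts into batches under a token budget."""
    batches, current, current_tokens = [], [], 0
    for length in sorted(sample_lengths, reverse=True):
        # Close the current batch when the next sample would overflow it.
        if current and current_tokens + length > max_batch_len:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches

batches = greedy_pack([4096, 3000, 2000, 1500, 500], max_batch_len=6000)
```

This also shows why `max_batch_len` must stay at or above `max_seq_len`: otherwise the longest samples could never fit in any batch.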

### `DeepSpeedOptions`

This library currently supports only a few options in `DeepSpeedOptions`.
The default is to run with DeepSpeed, so these options currently
only let you customize aspects of the ZeRO stage 2 optimizer.

| Field | Description |
| --- | --- |
| cpu_offload_optimizer | Whether or not to do CPU offloading in DeepSpeed stage 2. |
| cpu_offload_optimizer_ratio | A floating-point value between 0 and 1. Specifies the ratio of parameters updated (i.e., the optimizer step) on the CPU side. |
| cpu_offload_optimizer_pin_memory | If true, offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. |
| save_samples | The number of samples to see before saving a DeepSpeed checkpoint. |

For more information about DeepSpeed, see [deepspeed.ai](https://www.deepspeed.ai/)

#### DeepSpeed with CPU Offloading

When using DeepSpeed with CPU offloading, you'll often hit an error indicating that the CPU Adam optimizer is missing. To resolve this, follow these steps:

**Rebuild DeepSpeed with CPUAdam**:

You'll need to rebuild DeepSpeed in order for the optimizer to be present:

```bash
# uninstall deepspeed & reinstall with the flags for installing CPUAdam
pip uninstall deepspeed
DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install deepspeed --no-deps
```

**Ensure `-lcurand` is linked correctly**:

A problem we commonly encounter is that the `-lcurand` library will not be found when
DeepSpeed recompiles. To resolve this, find the location of the `libcurand.so` file on your machine:

```bash
find / -name 'libcurand*.so*' 2>/dev/null
```

If `libcurand.so` is not present in `/usr/lib64`, add a symlink. For example:

```bash
sudo ln -s /usr/local/cuda/lib64/libcurand.so.10 /usr/lib64/libcurand.so
```

### `FSDPOptions`

As with DeepSpeed, we expose only a small number of FSDP parameters for you to modify.
They are listed below:

| Field | Description |
| --- | --- |
| cpu_offload_params | When set to true, offload parameters from the accelerator onto the CPU. This is an all-or-nothing option. |
| sharding_strategy | Specifies the model sharding strategy that FSDP should use. Valid options are:  `FULL_SHARD` (ZeRO-3), `HYBRID_SHARD` (ZeRO-3*), `SHARD_GRAD_OP` (ZeRO-2), and `NO_SHARD`. |

> [!NOTE]
> For `sharding_strategy` - Only `SHARD_GRAD_OP` has been extensively tested and is actively supported by this library.

### `LoraOptions`

LoRA options currently supported:

| Field | Description |
| --- | --- |
| rank | The rank parameter for LoRA training. |
| alpha | The alpha parameter for LoRA training. |
| dropout | The dropout rate for LoRA training. |
| target_modules | The list of target modules for LoRA training. |
| quantize_data_type | The data type for quantization in LoRA training. Valid options are `None` and `"nf4"` |

#### Example run with LoRA options

If you'd like to do a LoRA train, you can specify a LoRA
option to `TrainingArgs` via the `LoraOptions` object.

```python
from instructlab.training import LoraOptions, TrainingArgs

training_args = TrainingArgs(
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
    ),
    # ...
)
```
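
For intuition, LoRA learns a low-rank update `B @ A`, scaled by `alpha / rank`, that is added to the frozen base weight. A minimal numerical sketch (illustrative only; the library delegates the actual implementation to `peft`):

```python
import numpy as np

rng = np.random.default_rng(42)
d_out, d_in, rank, alpha = 8, 8, 4, 32

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(rank, d_in))    # trainable down-projection
B = np.zeros((d_out, rank))          # trainable up-projection, zero-initialized

# At initialization the adapter contributes nothing, so W_eff equals W.
W_eff = W + (alpha / rank) * (B @ A)
```

Only `A` and `B` (here `2 * rank * d_in` values) are trained, which is why low ranks such as 4 keep the trainable parameter count small.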

### Learning about `TorchrunArgs` arguments

When running the training script, we always invoke `torchrun`.

If you are running a single-GPU system or something that doesn't
otherwise require distributed training configuration, you can create a default object:

```python
run_training(
    torchrun_args=TorchrunArgs(),
    training_args=TrainingArgs(
        # ...
    ),
)
```

However, if you want to specify a more complex configuration,
the library currently supports all the options that [torchrun accepts
today](https://pytorch.org/docs/stable/elastic/run.html#definitions).

> [!NOTE]
> For more information about the `torchrun` arguments, please consult the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html#definitions).

#### Example training run with `TorchrunArgs` arguments

For example, on a two-machine system with 4 GPUs each (8 GPUs total), we would
specify the following `torchrun` config:

```python
import os

MASTER_ADDR = os.getenv('MASTER_ADDR')
MASTER_PORT = os.getenv('MASTER_PORT')
RDZV_ENDPOINT = f'{MASTER_ADDR}:{MASTER_PORT}'

# on machine 1
torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines 
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = RDZV_ENDPOINT
)

run_training(
    torchrun_args=torchrun_args,
    training_args=training_args
)
```

```python
import os

MASTER_ADDR = os.getenv('MASTER_ADDR')
MASTER_PORT = os.getenv('MASTER_PORT')
RDZV_ENDPOINT = f'{MASTER_ADDR}:{MASTER_PORT}'

# on machine 2
torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines 
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 1, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = RDZV_ENDPOINT
)

run_training(
    torchrun_args=torchrun_args,
    training_args=training_args
)
```

## Example training run with arguments

Define the training arguments which will serve as the
parameters for our training run:

```py
# define training-specific arguments
training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    random_seed = 42,
)
```

We'll also need to define the settings for running a multi-process job
via `torchrun`. To do this, create a `TorchrunArgs` object.

> [!TIP]
> For single-GPU jobs, simply set `nnodes = 1` and `nproc_per_node = 1`.

```py
torchrun_args = TorchrunArgs(
    nnodes = 1, # number of machines 
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)
```

Finally, you can just call `run_training` and this library will handle
the rest 🙂.

```py
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)
```
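
As a rough sanity check on these numbers: with `save_samples = 250000`, the number of checkpoints written is just the total samples seen divided by the save frequency. The dataset size below is a made-up assumption for illustration:

```python
dataset_size = 1_000_000          # hypothetical number of training samples
num_epochs = 10
effective_batch_size = 3840
save_samples = 250_000

total_samples = dataset_size * num_epochs
optimizer_steps = total_samples // effective_batch_size  # parameter updates
checkpoints = total_samples // save_samples              # checkpoint saves
```

Tuning `save_samples` relative to your dataset size keeps checkpoint storage predictable.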

## Example training with separate data pre-processing

If the machines in the example above share storage, users can pre-process the training dataset once and distribute it to each machine by making the following updates.

```python
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
    DeepSpeedOptions,
    DataProcessArgs,
    data_process as dp
)

training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    random_seed = 42,
    process_data = True,
)
...

data_process_args = DataProcessArgs(
    data_output_path = training_args.data_output_dir,
    model_path = training_args.model_path,
    data_path = training_args.data_path,
    max_seq_len = training_args.max_seq_len,
    chat_tmpl_path =  training_args.chat_tmpl_path
)

dp.main(data_process_args)
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)
```

## Environment variables

Below is a list of custom environment variables users can set in the training library.

1. `INSTRUCTLAB_NCCL_TIMEOUT_MS`: controls the NCCL timeout in milliseconds. Consider increasing it if you see FSDP-related NCCL errors.
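
An override like this is typically consumed by reading the variable with a fallback default. The sketch below is illustrative only; the default value and function name are made up, not the library's:

```python
import os

def nccl_timeout_ms(default_ms=600_000):
    """Read INSTRUCTLAB_NCCL_TIMEOUT_MS, falling back to a default (assumed)."""
    raw = os.environ.get("INSTRUCTLAB_NCCL_TIMEOUT_MS")
    return int(raw) if raw else default_ms

# Simulate a user exporting the variable before launching training.
os.environ["INSTRUCTLAB_NCCL_TIMEOUT_MS"] = "1200000"
timeout = nccl_timeout_ms()
```

In practice you would set the variable in your shell, e.g. `export INSTRUCTLAB_NCCL_TIMEOUT_MS=1200000`, before launching the training job.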

## Developer Certificate of Origin

When you make a contribution to InstructLab training, you implicitly agree to the Developer Certificate of Origin terms as set in `DCO.txt` at the root of this repository.
