Metadata-Version: 2.4
Name: mcap-data-loader
Version: 0.1.7
Summary: MCAP Data Loader
Author-email: OpenGHz <your.email@example.com>
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic
Requires-Dist: pydantic-settings
Requires-Dist: pydantic_yaml
Requires-Dist: numpy
Requires-Dist: more-itertools
Requires-Dist: toolz
Requires-Dist: cachetools
Requires-Dist: typing-extensions
Requires-Dist: flatbuffers
Requires-Dist: foxglove-schemas-flatbuffer
Requires-Dist: mcap
Requires-Dist: pymcap
Requires-Dist: av
Requires-Dist: PyTurboJPEG
Requires-Dist: natsort
Requires-Dist: array-api-compat
Requires-Dist: termcolor
Requires-Dist: send2trash>=1.8.3
Dynamic: license-file

<div align="center">

<h1>MCAP Data Loader</h1>

[![PyPI](https://img.shields.io/pypi/v/mcap-data-loader)](https://pypi.org/project/mcap-data-loader/)
[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/github/license/OpenGHz/MCAP-DataLoader)](LICENSE)

A Python library for loading and processing MCAP data files in a form suited to machine learning and robotics training pipelines.

</div>

## Features

- Dataset-style APIs for iterating MCAP data as episodes/samples
- Built-in statistics utilities (dataset-level and episode-level)
- Convenient access to topics and attachments
- Integration CLI for training with LeRobot using MCAP as the dataset backend

## Installation

Install from PyPI:

```bash
pip install mcap-data-loader
```

Or install from source:

```bash
git clone https://github.com/OpenGHz/MCAP-DataLoader.git --depth 1
cd MCAP-DataLoader
pip install -e .
```

## Quickstart (basic usage)

A basic example showing how to load MCAP files from a directory, inspect statistics, and iterate through episodes/samples:

```python
from mcap_data_loader.datasets.mcap_dataset import (
    McapFlatBuffersEpisodeDataset,
    McapFlatBuffersEpisodeDatasetConfig,
)
from pprint import pprint

dataset = McapFlatBuffersEpisodeDataset(
    McapFlatBuffersEpisodeDatasetConfig(
        data_root="data/example",
        # keys typically include topic names and optional special fields (e.g. "log_stamps")
        keys=["/follow/arm/joint_state/position", "log_stamps"],
    )
)

print(f"All files: {dataset.all_files}")
print(f"Dataset length: {len(dataset)}")

print("Dataset statistics:")
pprint(dataset.statistics())

for episode in dataset:
    print(f"Current file: {episode.config.data_root}")

    for sample in episode:
        print(f"Sample keys: {sample.keys()}")
        break

    print(f"Episode length: {len(episode)}")
    print(f"All topics: {episode.reader.all_topic_names()}")
    print(f"All attachments: {episode.reader.all_attachment_names()}")

    print("Episode statistics:")
    pprint(episode.statistics())
    print("----" * 10)
```
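
If you need whole-episode arrays rather than per-sample access, the samples can be stacked key by key. The snippet below is a minimal sketch, not part of the library: `stack_episode` is a hypothetical helper, and it assumes each sample behaves like a mapping from key to a fixed-shape array-like value. Plain dictionaries with made-up values stand in for real samples so the snippet runs without any MCAP files.

```python
import numpy as np


def stack_episode(samples, keys):
    """Stack per-sample values into one array per key (hypothetical helper)."""
    columns = {key: [] for key in keys}
    for sample in samples:
        for key in keys:
            columns[key].append(np.asarray(sample[key]))
    # np.stack requires every value under a key to share the same shape.
    return {key: np.stack(values) for key, values in columns.items()}


# Stand-in samples with invented values; a real episode would come from the dataset.
fake_episode = [
    {"/follow/arm/joint_state/position": [0.0, 0.1, 0.2], "log_stamps": 1_000},
    {"/follow/arm/joint_state/position": [0.1, 0.2, 0.3], "log_stamps": 2_000},
    {"/follow/arm/joint_state/position": [0.2, 0.3, 0.4], "log_stamps": 3_000},
]

arrays = stack_episode(
    fake_episode, ["/follow/arm/joint_state/position", "log_stamps"]
)
print(arrays["/follow/arm/joint_state/position"].shape)  # (3, 3)
print(arrays["log_stamps"].shape)  # (3,)
```

With a real dataset, you would pass an `episode` from the loop above in place of `fake_episode`, provided its samples support key lookup as assumed here.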

More examples and detailed usage can be found in the [examples](examples) directory.

## Integration with LeRobot training

MCAP Data Loader provides a CLI to train LeRobot models using MCAP data files. This allows you to use MCAP datasets directly as the training data source for LeRobot, without needing to convert them into a different format.

This feature requires LeRobot to be installed in your environment. Install it from PyPI (version 0.4.3 has been tested):

```bash
pip install lerobot
```

### Train with an MCAP dataset

Run:

```bash
mcap_lerobot_train -c configs/config.yaml
```

Recommended: place your config file under a `configs/` directory in your current working directory.

#### Configuration reference

The top level is the standard LeRobot configuration, with an additional `mcap` section for MCAP dataset loading settings:

```yaml
batch_size: 2
num_workers: 1
policy:
  type: act
  push_to_hub: false
  chunk_size: 2
  n_action_steps: 2

dataset:
  root: data
  repo_id: example
  streaming: true

mcap:
  states:
    - /follow/arm/joint_state/position
    - /follow/eef/joint_state/position
  actions:
    - /lead/arm/pose/position
    - /lead/arm/pose/orientation
  images:
    - /env_camera/color/image_raw
```

The topics listed under `states` and `actions` are loaded and concatenated, in order, to form the `observation.state` and `action` fields that LeRobot requires. These serve as the low-dimensional state and action inputs during training. Each topic listed under `images` is added under `observation.images`, with the first segment of the topic name used as the key suffix. In the example above, `/env_camera/color/image_raw` becomes `observation.images.env_camera`.
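
The mapping from topics to LeRobot fields can be sketched as follows. This is an illustration of the behaviour described above, not the library's actual implementation: `image_key` and `build_frame` are hypothetical names, and the vector sizes and image resolution are made up.

```python
import numpy as np


def image_key(topic: str) -> str:
    """Build a feature key from the first segment of an image topic name."""
    first_segment = topic.strip("/").split("/")[0]
    return f"observation.images.{first_segment}"


def build_frame(sample: dict, states: list, actions: list, images: list) -> dict:
    """Concatenate state/action topics and rename image topics (illustrative only)."""

    def concat(topics):
        return np.concatenate(
            [np.asarray(sample[t], dtype=np.float32).ravel() for t in topics]
        )

    frame = {"observation.state": concat(states), "action": concat(actions)}
    for topic in images:
        frame[image_key(topic)] = sample[topic]
    return frame


# One fake sample keyed by topic name; all values and dimensions are invented.
sample = {
    "/follow/arm/joint_state/position": [0.1] * 6,
    "/follow/eef/joint_state/position": [0.5],
    "/lead/arm/pose/position": [0.0, 0.1, 0.2],
    "/lead/arm/pose/orientation": [0.0, 0.0, 0.0, 1.0],
    "/env_camera/color/image_raw": np.zeros((480, 640, 3), dtype=np.uint8),
}

frame = build_frame(
    sample,
    states=["/follow/arm/joint_state/position", "/follow/eef/joint_state/position"],
    actions=["/lead/arm/pose/position", "/lead/arm/pose/orientation"],
    images=["/env_camera/color/image_raw"],
)
print(sorted(frame))
print(frame["observation.state"].shape, frame["action"].shape)  # (7,) (7,)
```

Because the topics are concatenated in list order, changing the order of `states` or `actions` in the config changes the layout of the resulting vectors.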

Notes:
- `dataset.root` and `dataset.repo_id` are reused to specify the MCAP dataset root directory and dataset name.
- Command-line overrides compatible with LeRobot are supported and take the highest priority (they override values in the config file). For example:
  ```bash
  mcap_lerobot_train -c configs/config.yaml --dataset.repo_id=example_task
  ```
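
The precedence rule for overrides can be pictured as a merge in which dotted command-line keys are applied on top of the values loaded from the config file. The sketch below only illustrates that ordering: the real argument parsing is handled by the CLI and LeRobot, `apply_overrides` is a hypothetical helper, and values are kept as plain strings for simplicity.

```python
from copy import deepcopy


def apply_overrides(config: dict, overrides: list) -> dict:
    """Apply `--a.b=value` style overrides on top of a nested config dict."""
    merged = deepcopy(config)  # leave the file-based config untouched
    for item in overrides:
        dotted_key, value = item.lstrip("-").split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value  # the command-line value wins over the file value
    return merged


# Values as they might be loaded from configs/config.yaml.
file_config = {"dataset": {"root": "data", "repo_id": "example", "streaming": True}}

merged = apply_overrides(file_config, ["--dataset.repo_id=example_task"])
print(merged["dataset"]["repo_id"])  # example_task
print(merged["dataset"]["root"])  # data
```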

### Train with LeRobot’s original dataset format

If you want to use LeRobot’s original data format (while still using this CLI), add `--ori`:

```bash
mcap_lerobot_train -c configs/ori.yaml --ori
```

Make sure the dataset path in your config points to the actual LeRobot dataset location.

### Help / supported CLI args

Show supported parameters:

```bash
mcap_lerobot_train -h
```

If the output is long, redirect to a file:

```bash
mcap_lerobot_train -h > lerobot_help.txt
```

## License

See [LICENSE](LICENSE).
