Metadata-Version: 2.4
Name: chunkformer
Version: 1.0.0
Summary: ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
Author-email: khanld <khanhld218@gmail.com>
Project-URL: Homepage, https://github.com/khanld/chunkformer
Project-URL: Documentation, https://github.com/khanld/chunkformer/blob/main/README.md
Project-URL: Repository, https://github.com/khanld/chunkformer
Project-URL: Bug Reports, https://github.com/khanld/chunkformer/issues
Project-URL: Paper, https://github.com/khanld/chunkformer/blob/main/docs/paper.pdf
Keywords: speech-recognition,asr,chunkformer,transformer,conformer,pytorch,long-form-audio,machine-learning,deep-learning
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: torchaudio>=0.9.0
Requires-Dist: transformers>=4.20.0
Requires-Dist: PyYAML>=5.4.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: jiwer>=2.3.0
Requires-Dist: colorama>=0.4.4
Requires-Dist: pydub>=0.25.0
Requires-Dist: huggingface_hub>=0.10.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0.0; extra == "dev"
Requires-Dist: black==25.9.0; extra == "dev"
Requires-Dist: flake8==6.0.0; extra == "dev"
Requires-Dist: flake8-pyproject>=1.2.0; extra == "dev"
Requires-Dist: isort==5.12.0; extra == "dev"
Requires-Dist: mypy==1.5.1; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Requires-Dist: types-PyYAML>=6.0.0; extra == "dev"
Requires-Dist: autopep8>=2.0.0; extra == "dev"
Requires-Dist: autoflake>=2.0.0; extra == "dev"
Dynamic: license-file

# ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
---

This repository contains the implementation and supplementary materials for our ICASSP 2025 paper, **"ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription"**. The paper was accepted with reviewer scores of **4/4/4**.

[![Ranked #1: Speech Recognition on Common Voice Vi](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20Common%20Voice%20Vi-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi)
[![Ranked #1: Speech Recognition on VIVOS](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20VIVOS-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-vivos)

- [`paper.pdf`](docs/paper.pdf): The ICASSP 2025 paper describing ChunkFormer.
- [`reviews.pdf`](docs/chunkformer_reviews.pdf): Reviewers' feedback from the ICASSP review process.
- [`rebuttal.pdf`](docs/rebuttal.pdf): Our rebuttal addressing reviewer concerns.

## Table of Contents
- [Introduction](#introduction)
- [Key Features](#key-features)
- [Installation](#installation)
  - [Install from PyPI (Recommended)](#option-1-install-from-pypi-recommended)
  - [Install from source](#option-2-install-from-source)
  - [Pretrained Models](#pretrained-models)
- [Usage](#usage)
  - [Python API Usage](#python-api-usage)
  - [Command Line Usage](#command-line-usage)
- [Training the Model](#training)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)

<a name = "introduction" ></a>
## Introduction
ChunkFormer is an ASR model designed for processing long audio inputs effectively on low-memory GPUs. It uses a **chunk-wise processing mechanism** with **relative right context** and employs the **Masked Batch technique** to minimize memory waste due to padding. The model is scalable, robust, and optimized for both streaming and non-streaming ASR scenarios.
![chunkformer_architecture](docs/chunkformer_architecture.png)
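
To make the chunk-wise idea concrete, here is a small sketch (our own illustration, not the library's internals) of how each chunk of frames attends to a bounded left and right context:

```python
def chunk_windows(num_frames: int, chunk_size: int, left: int, right: int):
    """For each chunk, return (context_start, chunk_start, chunk_end, context_end)."""
    windows = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        # Each chunk sees up to `left` frames before it and `right` frames after it.
        windows.append((max(0, start - left), start, end, min(num_frames, end + right)))
    return windows

# 200 frames with the default decoding configuration used later in this README
print(chunk_windows(200, 64, 128, 128))
# [(0, 0, 64, 192), (0, 64, 128, 200), (0, 128, 192, 200), (64, 192, 200, 200)]
```

Because each chunk only needs its own frames plus a fixed-size context window, peak memory stays bounded regardless of the total audio length.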

<a name = "key-features" ></a>
## Key Features
- **Transcribing Extremely Long Audio**: ChunkFormer can **transcribe audio recordings up to 16 hours** in length with results comparable to existing models. It is currently the first model capable of handling this duration.
- **Efficient Decoding on Low-Memory GPUs**: ChunkFormer can **handle long-form transcription on GPUs with limited memory** without losing context or introducing a mismatch with the training phase.
- **Masked Batching Technique**: ChunkFormer **eliminates the need for padding in batches with highly variable lengths**. For instance, **decoding a batch containing audio clips of 1 hour and 1 second costs only 1 hour + 1 second of computation and memory, instead of 2 hours with padding.**

| GPU Memory | Total Batch Duration (minutes) |
|---|---|
| 80GB | 980 |
| 24GB | 240 |
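
The savings from masked batching follow from simple arithmetic; the example batch above works out as:

```python
# Cost of a padded batch vs. a masked batch for one 1-hour clip and one 1-second clip.
durations = [3600, 1]  # clip lengths in seconds

padded_cost = max(durations) * len(durations)  # every clip is padded to the longest
masked_cost = sum(durations)                   # masked batching pays only for real audio

print(padded_cost)  # 7200
print(masked_cost)  # 3601
```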

<a name = "installation" ></a>
## Installation

### Option 1: Install from PyPI (Recommended)
```bash
pip install chunkformer
```

### Option 2: Install from source
```bash
# Clone the repository
git clone https://github.com/khanld/chunkformer.git
cd chunkformer

# Install in development mode
pip install -e .
```

### Pretrained Models
| Language | Model |
|----------|-------|
| Vietnamese  | [khanhld/chunkformer-large-vie](https://huggingface.co/khanhld/chunkformer-large-vie) |
| English   | [khanhld/chunkformer-large-en-libri-960h](https://huggingface.co/khanhld/chunkformer-large-en-libri-960h) |

<a name = "usage" ></a>
## Usage

### Python API Usage
```python
from chunkformer import ChunkFormerModel

# Load a pre-trained model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-large-vie")

# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True
)
print(transcription)

# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800  # Total batch duration in seconds
)

for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")

```

### Command Line Usage
After installation, you can use the command line interface:

#### Long-Form Audio Testing
To test the model with a single [long-form audio file](data/common_voice_vi_23397238.wav), run the command below. The audio file extensions `.mp3`, `.wav`, `.flac`, `.m4a`, and `.aac` are accepted:
```bash
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
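
If you need the timestamps programmatically, lines in this format can be parsed with a small helper (a sketch of our own; `parse_segments` is not part of the library):

```python
import re

# Matches lines like "[00:00:01.200] - [00:00:02.400]: transcription text"
LINE_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2}\.\d{3})\] - \[(\d{2}:\d{2}:\d{2}\.\d{3})\]: (.*)"
)

def parse_segments(output: str):
    """Parse decoder output into (start, end, text) tuples."""
    segments = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            segments.append(match.groups())
    return segments

example = "[00:00:01.200] - [00:00:02.400]: this is a transcription example"
print(parse_segments(example))
# [('00:00:01.200', '00:00:02.400', 'this is a transcription example')]
```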

#### Batch Transcription Testing
The [audio_list.tsv](data/audio_list.tsv) file must have at least one column named **wav**. Optionally, a column named **txt** can be included to compute the **Word Error Rate (WER)**. Output will be saved to the same file.
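
A minimal `audio_list.tsv` can be written with the standard `csv` module (file paths and transcripts below are placeholders):

```python
import csv

# Build the manifest: "wav" is required, "txt" is optional and enables WER scoring.
rows = [
    {"wav": "audio1.wav", "txt": "first reference transcript"},
    {"wav": "audio2.wav", "txt": "second reference transcript"},
]
with open("audio_list.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["wav", "txt"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```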

```bash
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --audio_list path/to/audio_list.tsv \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
WER: 0.1234
```
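
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words; the package lists `jiwer` as a dependency for this, but a minimal stdlib sketch of the metric looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

print(wer("this is a test", "this is test"))  # 0.25 (one deletion over four words)
```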

---

<a name = "training" ></a>
## Training
For training/finetuning ChunkFormer models, follow the implementation in this [WeNet PR](https://github.com/wenet-e2e/wenet/pull/2723).

### Setting Up Your Model for Inference
After training is complete, you need to prepare your model for use with this library. Follow these steps:

#### Step 1: Create Model Directory
```bash
mkdir my_chunkformer_model
cd my_chunkformer_model
```

#### Step 2: Copy Required Files
Copy the following files from your training output to the model directory:

1. **Model Checkpoint**
   ```bash
   # Supported formats: .pt, .ckpt, .bin.
   # WeNet uses .pt by default
   cp /path/to/your/final.pt pytorch_model.pt
   ```

2. **Training Configuration**
   ```bash
   cp /path/to/your/train.yaml config.yaml
   ```

3. **CMVN Statistics**
   ```bash
   cp /path/to/your/global_cmvn global_cmvn
   ```

4. **Vocabulary File (_units.txt)**:
   ```bash
   cp /path/to/your/_units.txt vocab.txt
   ```

#### Step 3: Verify Model Structure
Your model directory should look like this:
```
my_chunkformer_model/
├── pytorch_model.pt (or .ckpt/.bin)
├── config.yaml
├── global_cmvn
└── vocab.txt
```
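
Before loading, the layout can be sanity-checked with a short script (`check_model_dir` is our own helper, not part of the library):

```python
from pathlib import Path

def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of required files missing from the model directory."""
    root = Path(model_dir)
    missing = []
    # A checkpoint in any supported format must be present.
    if not any((root / f"pytorch_model{ext}").exists() for ext in (".pt", ".ckpt", ".bin")):
        missing.append("pytorch_model.pt (or .ckpt/.bin)")
    for name in ("config.yaml", "global_cmvn", "vocab.txt"):
        if not (root / name).exists():
            missing.append(name)
    return missing

print(check_model_dir("./my_chunkformer_model"))  # [] means the directory is complete
```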

#### Step 4: Test Your Local Model Directory
```python
import chunkformer
model = chunkformer.ChunkFormerModel.from_pretrained("./my_chunkformer_model")
result = model.endless_decode("test_audio.wav")
print(result)
```

<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}}

```

<a name = "acknowledgments" ></a>
## Acknowledgments
We would like to thank Zalo for providing the resources and support for training the model; this work was completed during the first author's tenure at Zalo.

This implementation is based on the WeNet framework. We extend our gratitude to the WeNet development team for providing an excellent foundation for speech recognition research and development.

---
