Metadata-Version: 2.1
Name: ffcv_pl
Version: 0.1.1
Summary: manage fast data loading with ffcv and pytorch lightning
Home-page: https://github.com/SerezD/ffcv_pytorch_lightning
Author: DSerez
License: MIT
Description-Content-Type: text/markdown

# FFCV Dataloader with Pytorch Lightning

FFCV is a fast dataloader for neural network training: https://github.com/libffcv/ffcv  

This repository documents all the steps to install and configure FFCV with pytorch-lightning.  
It also provides utility methods to quickly create, preprocess, and load datasets with *FFCV* and *pytorch-lightning*.

## Installation

### Dependencies

There are some known issues with the installation of the FFCV package.   
See, for instance, FFCV issues [#133](https://github.com/libffcv/ffcv/issues/133) 
and [#54](https://github.com/libffcv/ffcv/issues/54). 

The recommended way to install the dependencies is to use the provided `environment.yml` file:  
```
conda env create --file environment.yml
```
This should correctly create a conda environment named `ffcv-pl`.  

**If the above does not work**, then 
you can try installing the packages manually: 

1. create conda environment
    ```
    conda create --name ffcv-pl
    conda activate ffcv-pl
    ```

2. install pytorch according to [official website](https://pytorch.org/get-started/locally/)

    ```
    # in my environment the command is the following 
    conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
    ```

3. install ffcv dependencies 
    ```
    # can take a very long time, but should not create conflicts
    conda install cupy pkg-config compilers libjpeg-turbo opencv numba -c pytorch -c conda-forge
    ```

4. install ffcv and pytorch-lightning
    ```
    pip install ffcv
    pip install pytorch-lightning
    ```

### Package

Once the dependencies are installed, it is safe to install the package: 
```
pip install ffcv_pl
```

## Dataset Creation

You need to save your dataset in the FFCV format (`.beton`).  
See the official FFCV [docs](https://docs.ffcv.io/writing_datasets.html) for details.

This package supports different types of Datasets, listed in the `dataset` subpackage.
A quick example of how to create a dataset is provided in the `dataset_creation.py` script:

```
from ffcv_pl.ffcv_utils.generate_dataset import create_image_dataset

if __name__ == '__main__':

    # write dataset in ".beton" format
    test_folder = '/media/dserez/datasets/imagenet/test/'
    create_image_dataset(test_folder=test_folder) 
```

This call will create the file `/media/dserez/datasets/imagenet/test.beton`, 
loading images from the folder `/media/dserez/datasets/imagenet/test/`.
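The output path follows directly from the input folder name: the trailing slash is dropped and the `.beton` extension is appended. A minimal sketch of that naming convention (`beton_path` is a hypothetical helper for illustration, not part of the package):

```python
def beton_path(folder: str) -> str:
    # strip any trailing slash, then append the ffcv extension
    return folder.rstrip('/') + '.beton'

print(beton_path('/media/dserez/datasets/imagenet/test/'))
# /media/dserez/datasets/imagenet/test.beton
```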

Note that you can also pass train/validation folders, all in one call. 

## Dataloader and Datamodule

The provided datamodules merge the PL `LightningDataModule` with the FFCV `Loader` object.  
They are designed to be compatible with DDP/multiprocessing.  
See `main.py` for a complete example.  
Official FFCV [docs](https://docs.ffcv.io/making_dataloaders.html).

```
import pytorch_lightning as pl
import torch
from pytorch_lightning.strategies.ddp import DDPStrategy

from torch import nn
from torch.optim import Adam

from ffcv_pl.datasets.image import ImageDataModule


# define the LightningModule
class LitAutoEncoder(pl.LightningModule):

    def __init__(self):

        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(256 * 256 * 3, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 256 * 256 * 3))

    def training_step(self, batch, batch_idx):

        x, y = batch

        b, c, h, w = x.shape
        x = x.reshape(b, -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == '__main__':

    SEED = 1234

    pl.seed_everything(SEED, workers=True)

    dataset = 'cub2002011'
    image_size = 256
    batch_size = 16
    train_folder = f'/media/dserez/datasets/{dataset}/train.beton'
    val_folder = f'/media/dserez/datasets/{dataset}/test.beton'

    gpus = 2
    workers = 8

    # define model
    model = LitAutoEncoder()

    # trainer
    trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), deterministic=True,
                         accelerator='gpu', devices=gpus, num_nodes=1, max_epochs=5)

    # Note: set is_dist True if you are using DDP and more than one GPU
    data_module = ImageDataModule(train_folder, val_folder, val_folder, image_size, torch.float32, batch_size,
                                  num_workers=workers, is_dist=gpus > 1, seed=SEED)

    trainer.fit(model, data_module)
```
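A side note on the DDP setup above: with `is_dist=True` and two devices, each GPU process draws its own batches, so the effective global batch size per optimization step is the per-process `batch_size` times the number of devices (a general DDP property, not specific to this package):

```python
# per-process batch size and device count, as configured in the example above
batch_size = 16
gpus = 2

# under DDP, every process loads its own batches, so the global batch
# size per optimization step is the product of the two
effective_batch_size = batch_size * gpus
print(effective_batch_size)  # 32
```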

Each module in `ffcv_pl.datasets.*` contains a pair of classes (a Dataset and a Dataloader).

## Citations

1. Pytorch-Lightning:  
    Falcon, W., & The PyTorch Lightning team. (2019). PyTorch Lightning (Version 1.4) 
    [Computer software]. https://doi.org/10.5281/zenodo.3828935

2. FFCV: 
    ```
    @misc{leclerc2022ffcv,
        author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
        title = {{FFCV}: Accelerating Training by Removing Data Bottlenecks},
        year = {2022},
        howpublished = {\url{https://github.com/libffcv/ffcv/}},
        note = {commit xxxxxxx}
    }
    ```
