Metadata-Version: 2.1
Name: gpu-tester
Version: 1.2.0
Summary: A python template
Home-page: https://github.com/rom1504/gpu_tester
Author: Romain Beaumont
Author-email: romain.rom1@gmail.com
License: MIT
Keywords: machine learning
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.6
Description-Content-Type: text/markdown
Requires-Dist: fire
Requires-Dist: numpy
Requires-Dist: torch

# gpu_tester
[![pypi](https://img.shields.io/pypi/v/gpu_tester.svg)](https://pypi.python.org/pypi/gpu_tester)

GPU tester finds all your bad GPUs.

Works on Slurm.

Features:
* runs a forward pass on each GPU
* checks for GPUs returning incorrect results
* checks for GPUs failing due to ECC errors

Roadmap:
* sanity check forward speed
* sanity check broadcast speed

## Install

Create a venv:

```
python3 -m venv .env
source .env/bin/activate
pip install -U pip
```

Then:
```
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install gpu_tester
```

## Python examples

Check out this example to call gpu_tester as a library:
* [example.py](examples/example.py)

## Output

Output looks like this:

```
job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]
```
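
If you collect these rows in Python, a small helper can group the failing GPUs by node. This is a minimal sketch: the `[status, hostname, gpu index]` row layout is inferred from the sample output above rather than from a documented schema, and `failing_gpus_by_node` is a hypothetical helper, not part of gpu_tester.

```python
from collections import defaultdict

# Rows copied from the sample output above. The [status, hostname, gpu index]
# layout is an assumption inferred from that sample.
incorrect_results = []
gpu_errors = [["gpu_error", "compute-od-gpu-st-p4d-24xlarge-156", "3"]]


def failing_gpus_by_node(*row_lists):
    """Map each hostname to the sorted GPU indices reported as failing."""
    by_node = defaultdict(set)
    for rows in row_lists:
        for _status, hostname, gpu_index in rows:
            by_node[hostname].add(int(gpu_index))
    return {host: sorted(gpus) for host, gpus in sorted(by_node.items())}


bad_nodes = failing_gpus_by_node(incorrect_results, gpu_errors)
for host, gpus in bad_nodes.items():
    print(f"{host}: GPUs {gpus}")
```

The resulting hostnames are candidates for the `exclude` argument on the next run.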

## Recommended testing strategy

### Pair based strategy

The easiest way to quickly spot broken nodes is the pair-based strategy.
It runs many two-node jobs in parallel and finds which nodes can talk to each other.
Here is one example:
```
gpu_tester --nodes 2 --parallel-tests 50 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 45 --exclude 'gpu-st-p4d-24xlarge-[66]'
```

### All at once strategy

Once you have validated that this works, you may want to try the DDP strategy over all nodes, e.g.:
```
gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 300 --exclude 'gpu-st-p4d-24xlarge-[66]'
```

### Simple forward

If you only want to validate the forward functionality of the GPUs and not the communication between them, you may use:

```
gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "simple_forward" --job_timeout 50 --exclude 'gpu-st-p4d-24xlarge-[66]'
```


## API

This module exposes a single function `gpu_tester` which takes the same arguments as the command line tool:

* **cluster** the cluster. (default *slurm*)
* **job_name** slurm job name. (default *gpu_tester*)
* **partition** slurm partition. (default *compute-od-gpu*)
* **gpu_per_node** number of GPUs per node. (default *8*)
* **nodes** number of gpu nodes. (default *1*)
* **output_folder** the output folder. (default *None* which means current folder / results)
* **job_timeout** job timeout (default *150* seconds)
* **job_comment** optional comment arg given to slurm (default *None*)
* **job_account** optional account arg given to slurm (default *None*)
* **test_kind** simple_forward or ddp. simple_forward is a quick forward test; ddp uses PyTorch DDP to check the GPU interconnect. (default *simple_forward*)
* **parallel_tests** number of tests to run in parallel. Recommended with nodes == 2 to test pair by pair. (default *1*)
* **nodelist** node whitelist, example 'gpu-st-p4d-24xlarge-[66-67]' (default *None*)
* **exclude** node blacklist, example 'gpu-st-p4d-24xlarge-[66-67]' (default *None*)
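
As an illustration, the pair-based command line shown earlier could be expressed as a library call along these lines. This is a sketch under assumptions: the import path `from gpu_tester import gpu_tester` is inferred from the statement that the module exposes a single function of that name, and `run_pair_test` is a hypothetical wrapper. The argument names and values come from the list above and the pair-based example.

```python
# Keyword arguments mirroring the pair-based command line shown earlier.
pair_test_args = dict(
    cluster="slurm",
    partition="gpu",
    nodes=2,
    gpu_per_node=8,
    parallel_tests=50,
    test_kind="ddp",
    job_timeout=45,
    job_comment="laion",
    exclude="gpu-st-p4d-24xlarge-[66]",
)


def run_pair_test():
    """Submit the pair-based test (hypothetical wrapper, needs a Slurm cluster)."""
    # Imported lazily so the arguments can be inspected without the package.
    # The import path is an assumption, see the note above.
    from gpu_tester import gpu_tester

    gpu_tester(**pair_test_args)
```

Calling `run_pair_test()` submits real jobs, so run it only from a machine that can reach your Slurm cluster.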

## For development

Either locally, or in [gitpod](https://gitpod.io/#https://github.com/rom1504/gpu_tester) (do `export PIP_USER=false` there)

Set up a virtualenv:

```
python3 -m venv .env
source .env/bin/activate
pip install -e .
```

To run the tests, first install the test requirements:
```
pip install -r requirements-test.txt
```
Then:
```
make lint
make test
```

You can use `make black` to reformat the code.

Use `python -m pytest -x -s -v tests -k "dummy"` to run a specific test.


