Metadata-Version: 2.1
Name: tensor-parallel
Version: 1.0.18
Summary: Automatically shard your large model between multiple GPUs, works without torch.distributed
Home-page: https://github.com/BlackSamorez/tensor_parallel
Author: Andrei Panferov and Yaroslav Lisnyak
Author-email: yalisnyak@nes.com
Project-URL: Bug Tracker, https://github.com/BlackSamorez/tensor_parallel/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: torch (>=1.12)
Requires-Dist: transformers (>=4.25.1)
Provides-Extra: dev
Requires-Dist: pytest (==6.2.5) ; extra == 'dev'
Requires-Dist: pytest-forked ; extra == 'dev'
Requires-Dist: pytest-asyncio (==0.16.0) ; extra == 'dev'
Requires-Dist: accelerate (==0.15.0) ; extra == 'dev'
Requires-Dist: black (==22.3.0) ; extra == 'dev'
Requires-Dist: isort (==5.10.1) ; extra == 'dev'
Requires-Dist: psutil ; extra == 'dev'

# petals_local_parallel
YSDA project

```python
import torch, torch.nn as nn
model = nn.Sequential(nn.Embedding(1337, 64), nn.LayerNorm(64), nn.ReLU(), nn.Linear(64, 10))

from tensor_parallel import TensorParallel
model = TensorParallel(model, device_ids=['cuda:0', 'cuda:1'])

inputs = torch.randint(0, 1337, (1, 16), device='cuda:0')  # a batch of token ids
outputs = model(inputs)  # forward and backward work just like in the base model
```
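Since the wrapped module behaves like the base model, an ordinary PyTorch training step works unchanged. A minimal CPU-only sketch with the same toy architecture (no `tensor_parallel` required, illustration only):

```python
import torch
import torch.nn as nn

# Toy model: embedding -> layer norm -> ReLU -> per-token classifier head.
model = nn.Sequential(nn.Embedding(1337, 64), nn.LayerNorm(64), nn.ReLU(), nn.Linear(64, 10))

inputs = torch.randint(0, 1337, (4, 16))   # batch of 4 sequences, 16 token ids each
targets = torch.randint(0, 10, (4, 16))    # per-token class labels

logits = model(inputs)                      # shape: (4, 16, 10)
loss = nn.functional.cross_entropy(logits.permute(0, 2, 1), targets)
loss.backward()                             # gradients flow exactly as in the base model
```

On GPUs you would wrap `model` in `TensorParallel` first; the training step itself stays the same.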

## Benchmarking tutorial

You can run either the manual benchmark (```benchmark_manual.py```) or the automatic one (```markbench.py```).

#### Manual benchmark

The command-line arguments are:

```-d | do_backward``` -- whether to run backward passes

```-n | num_iter ```   -- number of iterations

```-s | seq_length```  -- sequence length

```-b | batch_size```  -- batch size

```-c | bloomconfig``` -- model name passed to ```BloomConfig.from_pretrained``` to specify the model you need

```CUDA_VISIBLE_DEVICES``` -- the GPUs to use

```nproc_per_node```       -- number of GPUs (one process per GPU)
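For illustration, the flags above could be parsed with ```argparse``` roughly like this. This is a hypothetical sketch, not the actual parser in the benchmark script; defaults and types may differ there.

```python
import argparse

# Hypothetical parser mirroring the documented flags; the real benchmark
# script may use different defaults and option handling.
parser = argparse.ArgumentParser(description="manual tensor-parallel benchmark (sketch)")
parser.add_argument("-d", "--do_backward", type=int, default=0,
                    help="whether to run backward passes (0/1)")
parser.add_argument("-n", "--num_iter", type=int, default=100,
                    help="number of iterations")
parser.add_argument("-s", "--seq_length", type=int, default=17,
                    help="sequence length")
parser.add_argument("-b", "--batch_size", type=int, default=16,
                    help="batch size")
parser.add_argument("-c", "--bloomconfig", type=str, default="bloom",
                    help="model name passed to BloomConfig.from_pretrained")

# Parsing the flags used in the example command below:
args = parser.parse_args(["-d", "0", "-n", "100", "-s", "17", "-b", "16", "-c", "bloom"])
```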

Don't forget to set correct gpu ids: ```export CUDA_DEVICE_ORDER=PCI_BUS_ID```

So the following command
``` CUDA_VISIBLE_DEVICES=4,5 torchrun --nproc_per_node 2 benchmark_manual.py -d 0 -n 100 -s 17 -b 16 -c bloom ```
will run the manual benchmark with no backward pass, 100 iterations, a sequence length of 17, a batch size of 16, and the "bloom" 176B model.


#### Auto benchmark

This one takes no command-line arguments; just run ```markbench.py```.

The script runs several experiments in a cycle. To see the parameters, check the experiment settings section in ```markbench.py```.
Models are tested both with and without backward passes. The results are printed for all ranks.
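The experiment cycle amounts to iterating over a parameter grid. A hypothetical sketch of that loop (the real grid lives in the experiment settings section of ```markbench.py```; these values are placeholders):

```python
import itertools

# Placeholder parameter grid; markbench.py defines its own values.
seq_lengths = [16, 64]
batch_sizes = [1, 8]
backward_modes = [False, True]

experiments = list(itertools.product(seq_lengths, batch_sizes, backward_modes))
for seq_len, batch_size, do_backward in experiments:
    # Each combination would be timed here, and every rank
    # would print its own results for this setting.
    pass
```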

#### TODO:

- Decide which models are too big for backward passes and skip them
- Decide what to do when one of the experiments fails



