Metadata-Version: 2.1
Name: engawa
Version: 0.1.3
Summary: 
Author: sobamchan
Author-email: oh.sore.sore.soutarou@gmail.com
Requires-Python: >=3.10.8,<4.0.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: click (>=8.1.3,<9.0.0)
Requires-Dist: datasets (>=2.8.0,<3.0.0)
Requires-Dist: nltk (>=3.8,<4.0)
Requires-Dist: pytorch-lightning (>=1.8.6,<2.0.0)
Requires-Dist: sentencepiece (>=0.1.97,<0.2.0)
Requires-Dist: sienna (>=0.1.5,<0.2.0)
Requires-Dist: tokenizers (>=0.13.2,<0.14.0)
Requires-Dist: transformers (>=4.25.1,<5.0.0)
Requires-Dist: wandb (>=0.13.7,<0.14.0)
Description-Content-Type: text/markdown

# engawa

**NOT YET FULLY TESTED**

A simple implementation for pre-training BART from scratch on your own corpus.


# Usage

I plan to make this pip-installable with CLI commands soon. For now, you need to clone the repository and run the scripts from it.

## Installation

```bash
git clone git@github.com:sobamchan/engawa.git && cd engawa
poetry install
```

## Build tokenizer

```bash
python engawa/tokenizer.py --data-path /path/to/train.txt --save-dir /path/to/save

# Check out the other options with
python engawa/tokenizer.py -h
```

## Pre-train BART

```bash
python engawa/train.py \
    --tokenizer-file /path/to/tokenizer.json \
    --train-file /path/to/train.txt \
    --val-file /path/to/val.txt \
    --default-root-dir /path/to/save/things

# Check out the other options with
python engawa/train.py -h
```
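
The training command above takes separate `--train-file` and `--val-file` paths. If you only have a single corpus file, a split along the following lines may help. This is a minimal sketch and not part of engawa: the `split_corpus` helper, its parameters, and the 5% default ratio are hypothetical, and it assumes the corpus is plain text with one example per line, which this README does not specify.

```python
import random
from pathlib import Path


def split_corpus(corpus_path, train_path, val_path, val_ratio=0.05, seed=0):
    """Shuffle the non-empty lines of a corpus and write train/val files.

    Hypothetical helper: assumes one training example per line.
    Returns the number of lines written to (train, val).
    """
    text = Path(corpus_path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]

    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(lines)

    # Hold out at least one line whenever the corpus has two or more lines.
    n_val = max(1, int(len(lines) * val_ratio)) if len(lines) > 1 else 0
    val, train = lines[:n_val], lines[n_val:]

    Path(train_path).write_text("".join(l + "\n" for l in train), encoding="utf-8")
    Path(val_path).write_text("".join(l + "\n" for l in val), encoding="utf-8")
    return len(train), len(val)
```

For example, `split_corpus("/path/to/corpus.txt", "/path/to/train.txt", "/path/to/val.txt")` would write the two files expected by the commands above.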

