Metadata-Version: 2.1
Name: dolma
Version: 0.6.4
Classifier: Development Status :: 3 - Alpha
Classifier: Typing :: Typed
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Dist: anyascii >=0.3.2
Requires-Dist: blingfire ==0.1.8
Requires-Dist: boto3
Requires-Dist: cached-path ==1.3.4
Requires-Dist: detect-secrets ==1.4.0
Requires-Dist: fasttext-wheel ==0.9.2
Requires-Dist: fsspec
Requires-Dist: msgspec >=0.14.2
Requires-Dist: nltk ==3.8.1
Requires-Dist: omegaconf >=2.3.0
Requires-Dist: presidio_analyzer ==2.2.32
Requires-Dist: pycld2 ==0.41
Requires-Dist: pyyaml
Requires-Dist: requests
Requires-Dist: rich
Requires-Dist: s3fs
Requires-Dist: smart-open
Requires-Dist: tokenizers >=0.13.3, <1.0.0
Requires-Dist: tqdm
Requires-Dist: uniseg
Requires-Dist: black >=22.6.0 ; extra == 'dev'
Requires-Dist: isort >=5.10.1 ; extra == 'dev'
Requires-Dist: mypy >=0.971 ; extra == 'dev'
Requires-Dist: pytest >=5.2 ; extra == 'dev'
Requires-Dist: ipython >=8.4.0 ; extra == 'dev'
Requires-Dist: autopep8 >=1.7.0 ; extra == 'dev'
Requires-Dist: flake8 >=5.0 ; extra == 'dev'
Requires-Dist: ipdb >=0.13.0 ; extra == 'dev'
Requires-Dist: flake8-pyi >=22.8.1 ; extra == 'dev'
Requires-Dist: Flake8-pyproject >=1.1.0 ; extra == 'dev'
Provides-Extra: dev
License-File: LICENSE
Summary: Data filters
Author-email: Allen Institute for Artificial Intelligence <contact@allenai.org>, Luca Soldaini <luca@soldaini.net>, Kyle Lo <kylel@allenai.org>, Rodney Kinney <rodneyk@allenai.org>, Aakanksha Naik <aakankshan@allenai.org>, Abhilasha Ravichander <abhilashar@allenai.org>, Akshita Bhagia <akshitab@allenai.org>, Dirk Groeneveld <dirkg@allenai.org>, Dustin Schwenk <dustins@allenai.org>, Ian Magnusson <ianm@allenai.org>, Khyathi Chandu <khyathic@allenai.org>
Maintainer-email: Allen Institute for Artificial Intelligence <contact@allenai.org>
License: Apache-2.0
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/allenai/dolma

# dolma

*Data to feed OLMo's Appetite*


<img alt="DOLMa logo. It's a watercolor of grape leaves with the word DOLMa in the top left." src="https://github.com/allenai/dolma/blob/main/res/logo.png?raw=true" width="256">

Data and tools for generating and inspecting OLMo pre-training data.

To get started, install dolma using [pip](https://pypi.org/project/dolma/).

```shell
pip install dolma
```

## Usage

The dolma CLI can be accessed using the `dolma` command. To see the available commands, use the `--help` flag.

```shell
dolma --help
```

At the moment, the CLI supports three commands: `tag`, `dedupe`, and `mix`.

For all commands, configurations can be specified on the command line or by passing a YAML or JSON file using the `-c` flag. For example:

```shell
dolma -c config.yaml dedupe --dedupe.name "test"
```
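The example above combines a configuration file with a dotted flag (`--dedupe.name`). As a rough illustration of how a dotted key can address a nested configuration value, here is a minimal sketch in plain Python. It is not dolma's own parsing code: the helper `set_dotted`, the `processes` key, and the assumption that a command-line value overrides the value from the file are all hypothetical.

```python
from typing import Any, Dict


def set_dotted(config: Dict[str, Any], dotted_key: str, value: Any) -> Dict[str, Any]:
    """Set a nested value addressed by a dotted key such as 'dedupe.name'.

    Hypothetical helper for illustration only; not part of dolma.
    """
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Create intermediate dictionaries on demand.
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Stand-in for the contents of config.yaml (keys are made up for this sketch).
file_config = {"dedupe": {"name": "from_file"}, "processes": 1}

# Assumption: a value given on the command line replaces the one from the file.
merged = set_dotted(file_config, "dedupe.name", "test")
```

Under these assumptions, `merged["dedupe"]["name"]` holds the command-line value while unrelated keys from the file are left untouched.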

### `dolma tag`

The tag command is used to run any of the built-in taggers on a set of documents. For example:

```shell
dolma tag \
    --experiment sample \
    --documents \
        's3://ai2-llm/pretraining-data/sources/common-crawl/test/v0/documents/**/*.json.gz' \
        's3://ai2-llm/pretraining-data/sources/common-crawl/test/v1/documents/*.json.gz' \
    --taggers random_number_v1 \
    --processes 2
```

This command will run the `random_number_v1` tagger on all documents in the specified S3 paths. The results will be written to the `s3://ai2-llm/pretraining-data/sources/common-crawl/test/v0/attributes/sample` and `s3://ai2-llm/pretraining-data/sources/common-crawl/test/v1/attributes/sample` paths.
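The relationship between input and output locations described above can be sketched as a small path transformation. This is only an illustration inferred from the example paths, not dolma's actual implementation; the function name `attributes_prefix` is hypothetical, and it assumes the input path contains a `/documents/` component.

```python
def attributes_prefix(documents_glob: str, experiment: str) -> str:
    """Derive the attributes location for a documents glob.

    Hypothetical helper: swaps everything from '/documents/' onward
    for '/attributes/<experiment>', mirroring the example paths.
    """
    head, sep, _tail = documents_glob.partition("/documents/")
    if not sep:
        raise ValueError(f"no '/documents/' component in {documents_glob!r}")
    return f"{head}/attributes/{experiment}"


source = "s3://ai2-llm/pretraining-data/sources/common-crawl/test/v0/documents/**/*.json.gz"
destination = attributes_prefix(source, "sample")
```

For the first path in the example command, `destination` matches the `v0/attributes/sample` location given above.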

### `dolma dedupe`

The dedupe command deduplicates a set of documents at the attribute level using a Bloom filter.
Example configurations are available in the `tests/config` directory. For example:

```shell
dolma dedupe -c tests/config/dedupe-paragraphs.json
```
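For readers unfamiliar with Bloom filters, the following is a minimal, self-contained sketch of the general technique applied to paragraphs. It is not dolma's implementation (which is not shown in this README): the class, its parameters, the SHA-256 based hashing, and the newline-based paragraph split are all assumptions made for illustration.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """Toy Bloom filter; sizes and hash scheme are illustrative choices."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive several bit positions from one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was probably seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def duplicate_paragraph_flags(text: str, bloom: BloomFilter) -> List[bool]:
    """Flag each newline-separated paragraph that the filter has already seen."""
    return [bloom.add(paragraph) for paragraph in text.split("\n")]


bloom = BloomFilter()
flags = duplicate_paragraph_flags("first\nsecond\nfirst", bloom)
```

A Bloom filter never misses a true duplicate but can occasionally flag a new item as seen; the false-positive rate depends on the number of bits, the number of hash functions, and how many items are inserted.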

### `dolma mix`

The mix command mixes documents from multiple sources, optionally filtering by attributes and/or performing string replacement. Example configurations are available in the `tests/config` directory. For example:

```shell
dolma mix -c tests/config/mixer.json
```
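Conceptually, mixing combines the three steps named above: reading several sources, dropping documents based on their attributes, and rewriting text. The sketch below illustrates that idea only. The document layout (`id`, `text`, `attributes`), the `score` attribute, and the function `mix_documents` are hypothetical and do not reflect dolma's mixer configuration format.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Document = Dict[str, Any]


def mix_documents(
    sources: Iterable[Iterable[Document]],
    keep: Callable[[Dict[str, Any]], bool],
    replacements: List[Tuple[str, str]],
) -> Iterator[Document]:
    """Yield documents from all sources that pass `keep`, with text rewritten."""
    for documents in sources:
        for doc in documents:
            # Filter on the document's attributes (hypothetical layout).
            if not keep(doc.get("attributes", {})):
                continue
            text = doc["text"]
            # Apply each (old, new) string replacement in order.
            for old, new in replacements:
                text = text.replace(old, new)
            yield {**doc, "text": text}


sources = [
    [{"id": "a", "text": "a colour chart", "attributes": {"score": 0.9}}],
    [{"id": "b", "text": "low quality", "attributes": {"score": 0.1}}],
]
mixed = list(
    mix_documents(
        sources,
        keep=lambda attrs: attrs.get("score", 0.0) >= 0.5,
        replacements=[("colour", "color")],
    )
)
```

Here only document `a` survives the filter, and its text is rewritten by the single replacement rule.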


## Development

Create a conda environment with Python >= 3.8. The example below uses Anaconda and Python 3.10.

```shell
conda create -n dolma python=3.10
```

After creating the environment, activate it and install the necessary tools using the included Makefile.

```shell
conda activate dolma
make setup
```

Once setup completes, restart your shell. Finally, to begin development, install the repository in editable mode using maturin.

```shell
make develop
```

To run tests, use the following command.

```shell
make test
```

You can run only the Python or Rust tests by calling `make test-python` or `make test-rust`, respectively.


## Citation

If you use this repository, please cite it as:

```bibtex
@software{dolma,
    author = {Soldaini, Luca and Lo, Kyle and Kinney, Rodney and Naik, Aakanksha and Ravichander, Abhilasha and Bhagia, Akshita and Groeneveld, Dirk and Schwenk, Dustin and Magnusson, Ian and Chandu, Khyathi},
    license = {{Apache-2.0}},
    title = {{DOLMa}},
    url = {https://github.com/allenai/dolma}
}
```

