Metadata-Version: 2.1
Name: smashed
Version: 0.15.1
Summary: Sequential MAppers for Sequences of HEterogeneous Dictionaries is a set of Python interfaces designed to apply transformations to samples in datasets, which are often implemented as sequences of dictionaries.
Author-email: Allen Institute for Artificial Intelligence <contact@allenai.org>, Luca Soldaini <luca@soldaini.net>
License: Apache-2.0
Project-URL: Homepage, https://github.com/allenai/smashed
Project-URL: Repository, https://github.com/allenai/smashed
Project-URL: Bug Tracker, https://github.com/allenai/smashed/issues
Keywords: mappers,pytorch,torch,huggingface,transformers,datasets,dict,dataset,pipeline,preprocessing,nlp,natural language processing,text,prompting
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch (>=1.9)
Requires-Dist: transformers (>=4.5)
Requires-Dist: necessary (>=0.3.1)
Requires-Dist: trouting (>=0.2.2)
Requires-Dist: ftfy (>=6.1.1)
Requires-Dist: platformdirs (>=2.5.0)
Requires-Dist: glom (>=21.0.0)
Provides-Extra: all
Requires-Dist: smashed[dev] ; extra == 'all'
Requires-Dist: smashed[datasets] ; extra == 'all'
Requires-Dist: smashed[torchdata] ; extra == 'all'
Requires-Dist: smashed[remote] ; extra == 'all'
Requires-Dist: smashed[prompting] ; extra == 'all'
Provides-Extra: datasets
Requires-Dist: datasets (>=2.4.0) ; extra == 'datasets'
Requires-Dist: dill (>=0.3.0) ; extra == 'datasets'
Provides-Extra: dev
Requires-Dist: springs (>=1.8.3) ; extra == 'dev'
Requires-Dist: black[jupyter] (>=21.12b0) ; extra == 'dev'
Requires-Dist: isort (>=5.8.0) ; extra == 'dev'
Requires-Dist: mypy (>=0.971) ; extra == 'dev'
Requires-Dist: pytest (>=5.2) ; extra == 'dev'
Requires-Dist: ipython (>=8.4.0) ; extra == 'dev'
Requires-Dist: autopep8 (>=1.7.0) ; extra == 'dev'
Requires-Dist: flake8 (>=5.0) ; extra == 'dev'
Requires-Dist: ipdb (>=0.13.0) ; extra == 'dev'
Requires-Dist: flake8-pyi (>=22.8.1) ; extra == 'dev'
Requires-Dist: Flake8-pyproject (>=1.1.0) ; extra == 'dev'
Provides-Extra: prompting
Requires-Dist: promptsource (>=0.2.3) ; extra == 'prompting'
Requires-Dist: blingfire (>=0.1.8) ; extra == 'prompting'
Requires-Dist: PyYAML (>=6.0.0) ; extra == 'prompting'
Provides-Extra: remote
Requires-Dist: smart-open (>=5.2.1) ; extra == 'remote'
Provides-Extra: torchdata
Requires-Dist: torch (>=1.12.1) ; extra == 'torchdata'
Requires-Dist: torchdata (>=0.4.1) ; extra == 'torchdata'

![Colorful logo of smashed. It is the word smashed written in a playful font that vaguely looks like pipes.](https://github.com/allenai/smashed/raw/main/resources/smashed.png)

**S**equential **MA**ppers for **S**equences of **HE**terogeneous **D**ictionaries is a set of Python interfaces designed to apply transformations to samples in datasets, which are often implemented as sequences of dictionaries. To start, run

```bash
pip install smashed
```

## Example of Usage

Mappers are initialized and then applied sequentially. In the following example, we create a sequence of mappers that are applied to samples, each of which contains a list of strings.
The mappers are responsible for the following operations:

1. Tokenize each sequence, cropping it to a maximum length if necessary.
2. Stride sequences together, up to a maximum total length or number of sequences.
3. Add padding symbols to sequences and attention masks.
4. Concatenate all sequences from a stride into a single sequence.

```python
import transformers
from smashed.mappers import (
    TokenizerMapper,
    MultiSequenceStriderMapper,
    TokensSequencesPaddingMapper,
    AttentionMaskSequencePaddingMapper,
    SequencesConcatenateMapper,
)

tokenizer = transformers.AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path='bert-base-uncased',
)

mappers = [
    TokenizerMapper(
        input_field='sentences',
        tokenizer=tokenizer,
        add_special_tokens=False,
        truncation=True,
        max_length=80
    ),
    MultiSequenceStriderMapper(
        max_stride_count=2,
        max_length=512,
        tokenizer=tokenizer,
        length_reference_field='input_ids'
    ),
    TokensSequencesPaddingMapper(
        tokenizer=tokenizer,
        input_field='input_ids'
    ),
    AttentionMaskSequencePaddingMapper(
        tokenizer=tokenizer,
        input_field='attention_mask'
    ),
    SequencesConcatenateMapper()
]

dataset = [
    {
        'sentences': [
            'This is a sentence.',
            'This is another sentence.',
            'Together, they make a paragraph.',
        ]
    },
    {
        'sentences': [
            'This sentence belongs to another sample',
            'Overall, the dataset is made of multiple samples.',
            'Each sample is made of multiple sentences.',
            'Samples might have a different number of sentences.',
            'And that is the story!',
        ]
    }
]

for mapper in mappers:
    dataset = mapper.map(dataset)

print(len(dataset))

# >>> 5

print(dataset[0])

# >>> {
#    'input_ids': [
#        101,
#        2023,
#        2003,
#        1037,
#        6251,
#        1012,
#        102,
#        2023,
#        2003,
#        2178,
#        6251,
#        1012,
#        102
#    ],
#    'attention_mask': [
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1
#    ]
# }
```

## Building a Pipeline

Mappers can also be composed into a pipeline using the `>>` (or `<<`) operator. For example, the code above can be rewritten as follows:

```python
pipeline = TokenizerMapper(
    input_field='sentences',
    tokenizer=tokenizer,
    add_special_tokens=False,
    truncation=True,
    max_length=80
) >> MultiSequenceStriderMapper(
    max_stride_count=2,
    max_length=512,
    tokenizer=tokenizer,
    length_reference_field='input_ids'
) >> TokensSequencesPaddingMapper(
    tokenizer=tokenizer,
    input_field='input_ids'
) >> AttentionMaskSequencePaddingMapper(
    tokenizer=tokenizer,
    input_field='attention_mask'
) >> SequencesConcatenateMapper()

dataset = ...

# apply the full pipeline to the dataset
pipeline.map(dataset)
```
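To illustrate how this kind of chaining can work, here is a minimal, self-contained sketch of a `>>` operator on mappers. This is a conceptual stand-in, not smashed's actual implementation; the `LowercaseMapper` and `StripMapper` classes are hypothetical toy mappers used only for the demonstration.

```python
# Minimal sketch (NOT smashed's actual implementation) of chaining mappers
# with >>: each mapper transforms a list of dicts, and >> builds a pipeline
# that applies the mappers in order.
class Mapper:
    def map(self, dataset):
        raise NotImplementedError

    def __rshift__(self, other):
        # self >> other creates a two-stage pipeline
        return _Pipeline([self, other])


class _Pipeline(Mapper):
    def __init__(self, mappers):
        self.mappers = mappers

    def __rshift__(self, other):
        # chaining onto an existing pipeline appends another stage
        return _Pipeline(self.mappers + [other])

    def map(self, dataset):
        for mapper in self.mappers:
            dataset = mapper.map(dataset)
        return dataset


# Two hypothetical toy mappers for demonstration purposes.
class LowercaseMapper(Mapper):
    def map(self, dataset):
        return [{'text': d['text'].lower()} for d in dataset]


class StripMapper(Mapper):
    def map(self, dataset):
        return [{'text': d['text'].strip()} for d in dataset]


pipeline = LowercaseMapper() >> StripMapper()
print(pipeline.map([{'text': '  Hello World  '}]))
# [{'text': 'hello world'}]
```

The key design point is that `>>` returns a new object that is itself a mapper, so pipelines compose with other mappers and with other pipelines transparently.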

## Dataset Interfaces Available

The initial version of SMASHED supports two dataset interfaces:

1. **`interfaces.simple.Dataset`**: A simple dataset representation that is just a list of Python dictionaries with some extra convenience methods to make it work with SMASHED. You can create a simple dataset by passing a list of dictionaries to `interfaces.simple.Dataset`.
2. **HuggingFace `datasets` library**. SMASHED mappers work with any dataset from HuggingFace, whether it is a regular or iterable dataset.
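For intuition, a simple dataset can be thought of as a plain list of dictionaries with a few convenience methods layered on top. The sketch below is a conceptual stand-in, not smashed's actual `interfaces.simple.Dataset` class, and the `keys` helper shown is hypothetical.

```python
# Conceptual sketch only -- NOT smashed's actual implementation. A "simple
# dataset" is just a list of dictionaries; the hypothetical helper below
# illustrates the kind of convenience such a wrapper can add.
class SimpleDataset(list):
    def keys(self):
        # union of field names across all samples (hypothetical helper)
        names = set()
        for sample in self:
            names.update(sample)
        return names


dataset = SimpleDataset([
    {'text': 'first sample'},
    {'text': 'second sample', 'label': 1},
])
print(sorted(dataset.keys()))  # ['label', 'text']
```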

## Developing SMASHED

To contribute to SMASHED, make sure to:

1. (If you are not part of AI2) Fork the repository on GitHub.
2. Clone it locally.
3. Create a new branch for the new feature.
4. Install development dependencies with `pip install -r dev-requirements.txt`.
5. Add your new mapper or feature.
6. Add unit tests.
7. Run tests, linting, and type checking from the root directory of the repo:
    1. *Style:* `black .` (Should format for you)
    2. *Style:* `flake8 .`  (Should return no error)
    3. *Style:* `isort .` (Should sort imports for you)
    4. *Static type check:* `mypy .` (Should return no error)
    5. *Tests:* `pytest -v --color=yes tests/` (Should return no error)
8. Commit, push, and create a pull request.
9. Tag `soldni` to review the PR.

### A note about versioning

SMASHED follows [Semantic Versioning](https://semver.org/). In short, this means that the version number is MAJOR.MINOR.PATCH, where:

- MAJOR version when you make incompatible API changes,
- MINOR version when you add functionality in a backwards compatible manner; adding a mapper typically falls under this category, and
- PATCH version when you make backwards compatible bug fixes.
