Metadata-Version: 2.1
Name: tinystoriesmodel
Version: 0.1.1
Summary: 
License: MIT
Author: Noa Nabeshima
Author-email: noanabeshima@gmail.com
Requires-Python: >=3.11,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: einops (>=0.8.0,<0.9.0)
Requires-Dist: huggingface-hub (>=0.23.4,<0.24.0)
Requires-Dist: numpy (>=1.26.4,<2.0.0)
Requires-Dist: torch (>=2.3.1,<3.0.0)
Requires-Dist: tqdm (>=4.66.4,<5.0.0)
Requires-Dist: transformers (>=4.41.2,<5.0.0)
Requires-Dist: unidecode (>=1.3.8,<2.0.0)
Description-Content-Type: text/markdown

See https://github.com/noanabeshima/tiny_model

TinyModel is a 44M-parameter model trained on [TinyStories V2](https://arxiv.org/abs/2305.07759) for mechanistic interpretability.

It has 4 layers, uses ReLU activations, and has no layernorms.

It was trained for 3 epochs on a [preprocessed version of TinyStoriesV2](https://huggingface.co/datasets/noanabeshima/TinyStoriesV2).
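
As a quick sanity check, you can count the model's parameters with standard PyTorch. This sketch assumes `TinyModel` is a `torch.nn.Module` (it is called like one in the usage example below); that assumption is not confirmed by this README.

```python
from tiny_model import TinyModel

lm = TinyModel()

# Assumption: TinyModel subclasses torch.nn.Module.
# Should print roughly 44M, matching the description above.
n_params = sum(p.numel() for p in lm.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```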


```python
from tiny_model import TinyModel, tokenizer

lm = TinyModel()

# For batched inference
tok_ids, attn_mask = tokenizer(['Once upon a time', 'In the forest'])
logprobs = lm(tok_ids)

# Or generate text directly
lm.generate('Once upon a time, Ada was happily walking through a magical forest with')

# To decode tok_ids you can use
tokenizer.decode(tok_ids)
```
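
As a hypothetical follow-up, assuming `lm` returns log-probabilities with shape `[batch, seq, vocab]` (a common convention, but not confirmed by this README), you could greedily extend a prompt by one token:

```python
import torch
from tiny_model import TinyModel, tokenizer

lm = TinyModel()
tok_ids, attn_mask = tokenizer(['Once upon a time'])
logprobs = lm(tok_ids)

# Assumption: logprobs has shape [batch, seq, vocab]. Take the argmax
# at the last position and append it to the prompt before decoding.
next_id = logprobs[:, -1, :].argmax(dim=-1, keepdim=True)
extended = torch.cat([tok_ids, next_id], dim=1)
print(tokenizer.decode(extended))
```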

Tokenization is done as follows (a sketch follows the list):
- The top 10K most frequent tokens under the GPT-NeoX tokenizer are selected and sorted by frequency.
- To tokenize a document, first tokenize it with the GPT-NeoX tokenizer, then replace any token not in the top 10K with a special [UNK] token id. All token ids are then mapped to lie between 1 and 10K, roughly ordered from most to least frequent.
- Finally, prepend the document with a [BEGIN] token id.
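
The sketch below illustrates this scheme. It is not the package's actual implementation: the toy corpus counts and the exact [UNK]/[BEGIN] ids are hypothetical stand-ins, and the helper name `tiny_tokenize` is made up; use `tiny_model.tokenizer` for the real mapping.

```python
from collections import Counter
from transformers import AutoTokenizer

neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Hypothetical stand-in for corpus-wide token counts; in practice these
# would come from the full TinyStories V2 training set.
toy_corpus = ["Once upon a time there was a tiny fox.", "In the forest it rained."]
counts = Counter(tid for doc in toy_corpus for tid in neox_tok(doc)["input_ids"])

K = 10_000  # vocabulary size described above
top_k = [tid for tid, _ in counts.most_common(K)]

# Map each kept GPT-NeoX id to 1..K, roughly most to least frequent.
neox_to_tiny = {tid: rank + 1 for rank, tid in enumerate(top_k)}

BEGIN_ID = 0             # hypothetical [BEGIN] id
UNK_ID = len(top_k) + 1  # hypothetical [UNK] id

def tiny_tokenize(doc: str) -> list[int]:
    # Tokenize with GPT-NeoX, remap to the small vocab, prepend [BEGIN].
    neox_ids = neox_tok(doc)["input_ids"]
    return [BEGIN_ID] + [neox_to_tiny.get(t, UNK_ID) for t in neox_ids]

print(tiny_tokenize("Once upon a time"))
```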


