Metadata-Version: 2.1
Name: tokeniser-py
Version: 0.1.0
Summary: A custom tokeniser with a 131,072-token vocabulary derived from 0.5B (val) and 1B (val+test) tokens in SlimPajama. Uses a novel token generation algorithm and a dynamic programming-based segmentation method for fast, interpretable tokenisation, which can also be used for tokenisation on custom token maps.
Home-page: https://github.com/Tasmay-Tibrewal
Author: Tasmay Pankaj Tibrewal
Author-email: keshavtibrewal2@gmail.com
License: MIT
Keywords: Tokens,Tokeniser,Tokenizer,LLMs,LMs,LLM,LM,Language Model,Language Models,Large Language Models,Large Language Model
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
License-File: LICENCE.txt

### 📄 `README.md`
# 🧠 Custom Tokeniser Library

A high-performance, fully custom tokeniser built from scratch — no BPE, no existing NLP tokenisation scheme. This tokeniser is based on a unique algorithm developed independently and trained on over **1 billion tokens** from the **SlimPajama dataset** (Val + Test), providing an efficient, interpretable, and extendable tokenisation pipeline.

## 🚀 What This Library Offers

- Tokeniser built on a vocabulary of **131,072 tokens**
- Two versions of vocab:
  - `0.5B`: Validation-only data
  - `1B`: Validation + Test data
- Token vocab built via a **custom algorithm** — no Byte Pair Encoding (BPE)
- Tokenisation logic includes:
  - Token lookup from pre-generated token map
  - Dynamic programming-based segmentation for out-of-vocab tokens
  - One-hot encoding (NumPy or PyTorch)
  - Visualisation utilities for tokens and token IDs
- Lightweight JSON format for token maps & token count maps
- Ready for integration into any LLM pre-tokenisation pipeline
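The dynamic programming-based segmentation mentioned above can be illustrated with a small sketch. The vocabulary, function name, and scoring rule (fewest tokens wins, with single characters as a fallback) are illustrative assumptions, not the library's actual algorithm:

```python
def dp_segment(text, vocab, max_len):
    """Split `text` into the fewest vocabulary tokens, falling back to
    single characters for out-of-vocab spans (illustrative sketch)."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] = fewest-token segmentation of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        # try every candidate last piece text[j:i], bounded by max token length
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if best[j] is not None and (piece in vocab or len(piece) == 1):
                cand = best[j] + [piece]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

vocab = {"token", "iser", "tok", "en"}
print(dp_segment("tokeniser", vocab, max_len=6))  # → ['token', 'iser']
```

The DP table guarantees a globally optimal split under the chosen objective, which is what makes the segmentation both fast and interpretable.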

> Note: Vocabulary files (chunked to under 2GB) are stored on [Hugging Face](https://huggingface.co/) rather than [GitHub](https://github.com/) due to LFS file-size constraints. Copies chunked to under 100MB are also available on [GitHub](https://github.com/).

## 📦 Installation
```bash
pip install tokeniser-py
```

## 🛠 Usage
```python
from tokeniser import Tokeniser

# Load the 1B (val + test) vocabulary with the ordered token map
t = Tokeniser(ln='1b', token_ordered=True)
tokens, count = t.tokenise("Your input text here.")  # tokens and their count
token_ids = t.token_ids(tokens)                      # map tokens to IDs
```

Use `t.one_hot_tokens(token_ids)` for NumPy-based one-hot encoding, or `op='torch'` for PyTorch.
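For reference, a one-hot encoding over token IDs has the following shape; this is a minimal NumPy sketch of the concept, not the library's `one_hot_tokens()` implementation:

```python
import numpy as np

def one_hot(token_ids, vocab_size):
    """Return a (len(token_ids), vocab_size) matrix with a single 1 per row."""
    out = np.zeros((len(token_ids), vocab_size), dtype=np.uint8)
    out[np.arange(len(token_ids)), token_ids] = 1
    return out

m = one_hot([2, 0, 1], vocab_size=4)
# Each row has exactly one 1, at the column given by that token's ID.
```

With the real vocabulary of 131,072 tokens, such matrices are large and sparse, so one-hot output is typically consumed immediately (e.g. fed to an embedding layer) rather than stored.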

## 📚 Data Sources

All token maps and token counts are generated from the [SlimPajama dataset](https://huggingface.co/datasets/cerebras/SlimPajama-627B) by Cerebras.

## 📁 Vocab Files
- `ordered_tokenizer_1b_val_test_data.json` — Ordered tokens (1B data)
- `unordered_tokenizer_1b_val_test_data.json` — Unordered tokens (1B)
- `count_tokenizer_1b_val_test_data.json` — Token counts (1B)
- (Similar structure for 0.5B val-only version)
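The vocab files are plain JSON, so they can be inspected without the library. The sketch below builds a tiny stand-in file with an assumed shape (token string mapped to integer ID); the real files and their exact schema live on Hugging Face:

```python
import json
import os
import tempfile

# Hypothetical miniature token map, standing in for the real 131,072-token file
toy_map = {"the": 0, "token": 1, "iser": 2}

path = os.path.join(tempfile.gettempdir(), "toy_token_map.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(toy_map, f)

# Reading it back gives a dict usable for lookup-based tokenisation
with open(path, encoding="utf-8") as f:
    token_map = json.load(f)

max_token_length = max(map(len, token_map))  # longest token string, here 5
```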

## 📌 Design Philosophy

This tokeniser was **built from scratch**, before studying existing algorithms such as BPE. It is designed with the intent to **understand, innovate**, and compare with existing solutions from first principles.

> Some parts may overlap with BPE/WordPiece in spirit — but the core algorithm was independently designed.

## 🤝 Contributions

Contributions are welcome: open an issue or pull request on GitHub.

## 📖 License

MIT License

### 📄 `CHANGELOG`
# 📦 Changelog

## [0.1.0] - 2025-03-22
### Added
- Initial release of custom tokeniser library
- Tokeniser class with support for:
  - `tokenise()` using DP segmentation
  - Custom token map and count map loading
  - One-hot encoding support (NumPy & PyTorch)
  - Token and token ID visualisation functions
  - `token_map()`, `token_count_map()`, `max_token_length()` accessors
- Full support for:
  - 0.5B val-only vocab
  - 1B val + test vocab
- JSON-based token and count maps from SlimPajama corpus

### Notes
- Built on top of a **custom token creation algorithm** not based on any standard BPE/WordPiece method
- SlimPajama dataset used for vocab extraction
- Token count files are optimised to stay under 2GB for compatibility with Git LFS (and Hugging Face storage)
