Metadata-Version: 2.4
Name: chatan
Version: 0.1.1
Summary: Create synthetic datasets with LLM generators and samplers
Project-URL: Documentation, https://github.com/cdreetz/chatan#readme
Project-URL: Issues, https://github.com/cdreetz/chatan/issues
Project-URL: Source, https://github.com/cdreetz/chatan
Author-email: Christian Reetz <cdreetz@gmail.com>
License-Expression: MIT
Keywords: dataset generation,llm,machine learning,synthetic data
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Requires-Dist: anthropic>=0.7.0
Requires-Dist: datasets>=2.0.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: openai>=1.0.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: pydantic>=2.0.0
Description-Content-Type: text/markdown

## Examples

Prompt a dataset

```
import chatan

gen = chatan.generator.client("YOUR_OPENAI_API_KEY")
ds = chatan.dataset("create a QA dataset for finetuning an LLM on pharmacology")
```

Create datasets with different data mixes

```
from chatan import dataset, generator, sample

gen = generator.client("YOUR_OPENAI_API_KEY")
# Or use Anthropic: generator.client("anthropic", "YOUR_ANTHROPIC_API_KEY")

# Task types mapped to a seed prompt for each
mix = {
    "implementation": "Can you implement a matmul kernel in Triton?",
    "conversion": "Convert this PyTorch model to Triton",
    "explanation": "What memory access optimizations are being used here?"
}

# {task} and {prompt} refer to the values of earlier columns in the same row
ds = dataset({
    "id": sample.uuid(),
    "task": sample.choice(mix),
    "prompt": gen("write a prompt for {task}"),
    "response": gen("write a response to {prompt}"),
})
```
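
The `{task}` and `{prompt}` placeholders above suggest that each column can refer to columns defined before it. The following is a minimal sketch of that idea in plain Python, not chatan's actual implementation: `fake_llm`, `build_row`, and the lambda-based schema are hypothetical names introduced only for illustration, and the LLM call is replaced by a stub that echoes its prompt.

```python
import random
import uuid


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it only echoes the prompt."""
    return f"<completion for: {prompt}>"


def build_row(schema: dict) -> dict:
    """Resolve columns in definition order so later ones can read earlier ones."""
    row = {}
    for name, spec in schema.items():
        # Each spec receives the partially built row and returns this column's value.
        row[name] = spec(row)
    return row


tasks = ["implementation", "conversion", "explanation"]

# Assumed schema shape: one callable per column (not chatan's real schema format).
schema = {
    "id": lambda row: str(uuid.uuid4()),
    "task": lambda row: random.choice(tasks),
    # str.format fills {task} / {prompt} from the columns resolved so far.
    "prompt": lambda row: fake_llm("write a prompt for {task}".format(**row)),
    "response": lambda row: fake_llm("write a response to {prompt}".format(**row)),
}

row = build_row(schema)
print(row)
```

Because Python dictionaries preserve insertion order, resolving columns top to bottom is enough for this sketch; a real library might instead build an explicit dependency graph.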

Augment datasets

```
from chatan import dataset, generator, sample
from datasets import load_dataset

gen = generator.client("YOUR_OPENAI_API_KEY")
hf_data = load_dataset("GPU_MODE/KernelBook")

ds = dataset({
    # Reuse the existing id when present, otherwise fall back to a fresh UUID
    "id": sample.from_dataset(hf_data, "id", default=sample.uuid()),
    # Sample an existing prompt and ask the generator for a variation of it
    "prompt": sample.from_dataset(hf_data, "prompt", aug=gen("provide a variation of this prompt")),
    "response": gen("write a response to {prompt}"),
})
```
