Metadata-Version: 2.4
Name: data-designer
Version: 0.1.5
Summary: General framework for synthetic data generation
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Requires-Python: >=3.10
Requires-Dist: anyascii<1.0,>=0.3.3
Requires-Dist: datasets>=4.0.0
Requires-Dist: duckdb==1.1.3
Requires-Dist: faker==20.1.0
Requires-Dist: httpx-retries>=0.4.2
Requires-Dist: httpx>=0.27.2
Requires-Dist: huggingface-hub>=0.34.4
Requires-Dist: jinja2<4,>=3.1.6
Requires-Dist: json-repair==0.48.0
Requires-Dist: jsonpath-rust-bindings>=1.0
Requires-Dist: litellm==1.73.6
Requires-Dist: lxml>=6.0.2
Requires-Dist: marko==2.1.2
Requires-Dist: networkx==3.0
Requires-Dist: numpy>=1.23.5
Requires-Dist: pandas>=1.5.3
Requires-Dist: prompt-toolkit>=3.0.0
Requires-Dist: pyarrow>=19.0.1
Requires-Dist: pydantic>=2.9.2
Requires-Dist: pydantic[email]>=2.9.2
Requires-Dist: pygments>=2.19.2
Requires-Dist: python-json-logger==2.0.7
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: requests<3,>=2.32.2
Requires-Dist: rich>=13.7.1
Requires-Dist: ruff==0.12.3
Requires-Dist: scipy>=1.11.0
Requires-Dist: sqlfluff==3.2.0
Requires-Dist: tiktoken>=0.8.0
Requires-Dist: typer>=0.12.0
Description-Content-Type: text/markdown

# 🎨 NeMo Data Designer

[![CI](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml/badge.svg)](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.10 - 3.13](https://img.shields.io/badge/🐍_Python-3.10_|_3.11_|_3.12_|_3.13-blue.svg)](https://www.python.org/downloads/) [![NeMo Microservices](https://img.shields.io/badge/NeMo-Microservices-76b900)](https://docs.nvidia.com/nemo/microservices/latest/index.html) [![Code](https://img.shields.io/badge/Code-Documentation-8A2BE2.svg)](https://nvidia-nemo.github.io/DataDesigner/)

**Generate high-quality synthetic datasets from scratch or using your own seed data.**

---

## Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

## What can you do with Data Designer?

- **Generate diverse data** using statistical samplers, LLMs, or existing seed datasets
- **Control relationships** between fields with dependency-aware generation
- **Validate quality** with built-in Python and SQL validators, plus custom local and remote validators
- **Score outputs** using LLM-as-a-judge for quality assessment
- **Iterate quickly** with preview mode before full-scale generation

---

## Quick Start

### 1. Install

```bash
pip install data-designer
```

Or install from source:

```bash
git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install
```

### 2. Set your API key

Get your API key from [build.nvidia.com](https://build.nvidia.com) or [OpenAI](https://platform.openai.com/api-keys):

```bash
export NVIDIA_API_KEY="your-api-key-here"
# Or use OpenAI
export OPENAI_API_KEY="your-openai-api-key-here"
```
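
If you want to confirm from Python that a key is visible to your process before generating data, a quick check along these lines may help. It only reads the two environment variables named above; the `configured_keys` helper is a hypothetical name for this sketch and is not part of Data Designer.

```python
import os

# Environment variable names taken from the export commands above.
KEY_NAMES = ("NVIDIA_API_KEY", "OPENAI_API_KEY")


def configured_keys(environ=os.environ) -> list[str]:
    """Return the names of the API key variables that are set and non-empty."""
    return [name for name in KEY_NAMES if environ.get(name)]


found = configured_keys()
if found:
    print("API key(s) found:", ", ".join(found))
else:
    print("No API key found; set NVIDIA_API_KEY or OPENAI_API_KEY first.")
```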

### 3. Start generating data!

```python
from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    SamplerColumnConfig,
    SamplerType,
)

# Initialize with default settings
data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate a customer review for each sampled product category
config_builder.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="""Write a brief product review for a {{ product_category }} item you recently purchased.""",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()
```
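
Conceptually, the `{{ product_category }}` placeholder in the prompt is filled in per record with the value sampled for that column. The sketch below imitates that behavior with the standard library only. It is an illustration under that assumption, not Data Designer's implementation: the `render_prompt` helper, the uniform `rng.choice` sampling, and the fixed seed are all hypothetical choices made for this example.

```python
import random
import re

CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books"]
PROMPT = "Write a brief product review for a {{ product_category }} item you recently purchased."


def render_prompt(template: str, record: dict[str, str]) -> str:
    """Replace each {{ name }} placeholder with the matching value from the record."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: record[m.group(1)], template)


rng = random.Random(0)  # fixed seed so repeated runs sample the same categories
records = [{"product_category": rng.choice(CATEGORIES)} for _ in range(3)]
prompts = [render_prompt(PROMPT, record) for record in records]

for prompt in prompts:
    print(prompt)
```

Each printed prompt names a concrete category, which is why the generated reviews vary with the sampled column rather than with the wording of a single fixed prompt.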

---

## What's next?

### 📚 Learn more

- **[Quick Start Guide](https://nvidia-nemo.github.io/DataDesigner/latest/quick-start/)** – Detailed walkthrough with more examples
- **[Tutorial Notebooks](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/)** – Step-by-step interactive tutorials
- **[Column Types](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/columns/)** – Explore samplers, LLM columns, validators, and more
- **[Validators](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/validators/)** – Learn how to validate generated data with Python, SQL, and remote validators
- **[Model Configuration](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/model-configs/)** – Configure custom models and providers
- **[Person Sampling](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/)** – Learn how to sample realistic person data with demographic attributes

### 🔧 Configure models via CLI

```bash
data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings
```

### 🤝 Get involved

- **[Contributing Guide](https://nvidia-nemo.github.io/DataDesigner/latest/CONTRIBUTING)** – Help improve Data Designer
- **[GitHub Issues](https://github.com/NVIDIA-NeMo/DataDesigner/issues)** – Report bugs or request features

---

## License

Apache License 2.0 – see [LICENSE](LICENSE) for details.

---

## Citation

If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:

```bibtex
@misc{nemo-data-designer,
  author = {{The NeMo Data Designer Team, NVIDIA}},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}
```
