Metadata-Version: 2.4
Name: mostlyai
Version: 4.0.4
Summary: Synthetic Data SDK
Project-URL: homepage, https://app.mostly.ai/
Project-URL: repository, https://github.com/mostly-ai/mostlyai
Project-URL: documentation, https://mostly-ai.github.io/mostlyai/
Author-email: MOSTLY AI <dev@mostly.ai>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Telecommunications Industry
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: adlfs<2024,>=2023.4.0
Requires-Dist: azure-storage-blob<13,>=12.16.0
Requires-Dist: cloudpathlib[azure,gs,s3]<0.18,>=0.17.0
Requires-Dist: cryptography<44,>=43.0.1
Requires-Dist: environs<10,>=9.5.0
Requires-Dist: fastparquet<2024,>=2023.4.0
Requires-Dist: filelock>=3.16.1
Requires-Dist: gcsfs<2024,>=2023.1.0
Requires-Dist: gputil<2,>=1.4.0
Requires-Dist: greenlet<4,>=3.1.1
Requires-Dist: gunicorn<24,>=23.0.0
Requires-Dist: httpx<0.28.0,>=0.25.0
Requires-Dist: joblib<2,>=1.2.0
Requires-Dist: networkx~=3.1
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pandas<3,>=1.5.3
Requires-Dist: psutil<6,>=5.9.5
Requires-Dist: pyarrow>=16.0.0
Requires-Dist: pycryptodomex<4,>=3.20.0
Requires-Dist: pydantic<3,>=2.4.2
Requires-Dist: pydot<2,>=1.4.2
Requires-Dist: requests<3,>=2.31.0
Requires-Dist: rich>=13.7.0
Requires-Dist: s3fs<2024,>=2023.1.0
Requires-Dist: schema<0.8,>=0.7.5
Requires-Dist: semantic-version<3,>=2.10.0
Requires-Dist: smart-open>=6.0.0
Requires-Dist: smart-open[azure,gcs,s3]<7,>=6.3.0
Requires-Dist: sqlalchemy<3,>=2.0.0
Requires-Dist: sshtunnel<0.5,>=0.4.0
Requires-Dist: typer<0.10,>=0.9.0
Requires-Dist: xlsxwriter<4,>=3.1.9
Requires-Dist: xxhash<4,>=3.2.0
Provides-Extra: databricks
Requires-Dist: databricks-sql-connector<4,>=3.2.0; extra == 'databricks'
Provides-Extra: googlebigquery
Requires-Dist: sqlalchemy-bigquery<2,>=1.6.1; extra == 'googlebigquery'
Provides-Extra: hive
Requires-Dist: impyla<0.20,>=0.19.0; extra == 'hive'
Requires-Dist: kerberos<2,>=1.3.1; extra == 'hive'
Requires-Dist: pyhive[hive-pure-sasl]<0.8,>=0.7.0; extra == 'hive'
Provides-Extra: local
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local'
Requires-Dist: mostlyai-engine==1.0.2; extra == 'local'
Requires-Dist: mostlyai-qa==1.5.1; extra == 'local'
Requires-Dist: python-multipart>=0.0.20; extra == 'local'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local'
Provides-Extra: local-cpu
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local-cpu'
Requires-Dist: mostlyai-engine[cpu]==1.0.2; extra == 'local-cpu'
Requires-Dist: mostlyai-qa==1.5.1; extra == 'local-cpu'
Requires-Dist: python-multipart>=0.0.20; extra == 'local-cpu'
Requires-Dist: torch>=2.5.1; extra == 'local-cpu'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local-cpu'
Provides-Extra: local-gpu
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local-gpu'
Requires-Dist: mostlyai-engine[gpu]==1.0.2; extra == 'local-gpu'
Requires-Dist: mostlyai-qa==1.5.1; extra == 'local-gpu'
Requires-Dist: python-multipart>=0.0.20; extra == 'local-gpu'
Requires-Dist: torch>=2.5.1; extra == 'local-gpu'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local-gpu'
Provides-Extra: mssql
Requires-Dist: pyodbc<6,>=5.1.0; extra == 'mssql'
Provides-Extra: mysql
Requires-Dist: mysql-connector-python<10,>=9.1.0; extra == 'mysql'
Provides-Extra: oracle
Requires-Dist: oracledb<3,>=2.2.1; extra == 'oracle'
Provides-Extra: postgres
Requires-Dist: psycopg2<3,>=2.9.4; extra == 'postgres'
Provides-Extra: snowflake
Requires-Dist: snowflake-sqlalchemy<2,>=1.6.1; extra == 'snowflake'
Description-Content-Type: text/markdown


# Synthetic Data SDK ✨

[![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai/) [![stats](https://pepy.tech/badge/mostlyai)](https://pypi.org/project/mostlyai/) ![license](https://img.shields.io/github/license/mostly-ai/mostlyai) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai)

[SDK Documentation](https://mostly-ai.github.io/mostlyai/) | [Platform Documentation](https://mostly.ai/docs) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/)

The official SDK of [MOSTLY AI](https://app.mostly.ai/), a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**.

- **Client mode** connects to a remote MOSTLY AI platform, where training and generation of synthetic data take place.
- **Local mode** trains and generates synthetic data locally on your own compute resources.
- Generators trained locally can easily be imported into a platform for further sharing.

## Overview

The SDK allows you to programmatically create, browse, and manage three key resources:

1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
2. **Synthetic Datasets** - Use a generator to create any number of synthetic samples tailored to your needs
3. **Connectors** - Connect to any data source within your organization for reading and writing data

| Intent                                        | Primitive                         | Documentation                                                                                                     |
|-----------------------------------------------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------|
| Train a Generator on tabular or language data | `g = mostly.train(config)`        | see [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train)       |
| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | see [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
| Live probe the generator on demand            | `df = mostly.probe(g, config)`    | see [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe)       |
| Connect to any data source within your org    | `c = mostly.connect(config)`      | see [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect)   |
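The `mostly.connect` primitive in the table above takes a config dict describing the data source. The following is only a rough sketch; the key names and the `'POSTGRES'` type string are assumptions for illustration, so consult the connectors documentation for the authoritative schema:

```python
# Hypothetical connector config for mostly.connect; key names and the
# 'POSTGRES' type value are assumptions -- check the connector docs
# for the exact schema supported by your SDK version.
connector_config = {
    'name': 'My Postgres',            # display name of the connector
    'type': 'POSTGRES',               # data source type
    'config': {                       # non-secret connection settings
        'host': 'db.example.com',
        'port': '5432',
        'database': 'analytics',
        'username': 'reader',
    },
    'secrets': {                      # credentials, kept separate
        'password': 'YOUR_PASSWORD',
    },
}
# c = mostly.connect(config=connector_config)  # requires an initialized MostlyAI client
```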

## Installation

**Client mode only**

```shell
pip install -U mostlyai
```

**Client + Local mode**

```shell
pip install -U 'mostlyai[local]'       # for CPU
#pip install -U 'mostlyai[local-gpu]'  # for GPU
```

NOTE: On Linux, installing `mostlyai[local]` requires adding `--extra-index-url https://download.pytorch.org/whl/cpu` to the `pip install` command.

**Optional Connectors**

Add any of the following extras to enable support for additional data connectors: `databricks`, `googlebigquery`, `hive`, `mssql`, `mysql`, `oracle`, `postgres`, `snowflake`.

E.g.
```shell
pip install -U 'mostlyai[local, databricks, snowflake]'
```

## Quick Start

For client mode, initialize the SDK with the `base_url` and an `api_key` obtained from your [account settings page](https://app.mostly.ai/settings/api-keys). For local mode, simply initialize with `local=True`.

```python
import pandas as pd
from mostlyai.sdk import MostlyAI

# load original data
repo_url = 'https://github.com/mostly-ai/public-demo-data'
df_original = pd.read_csv(f'{repo_url}/raw/dev/census/census.csv.gz')

# initialize the SDK in local or client mode
mostly = MostlyAI(local=True)
# mostly = MostlyAI(base_url='https://app.mostly.ai', api_key='YOUR_API_KEY')

# train a synthetic data generator
g = mostly.train(config={
        'name': 'US Census Income',          # name of the generator
        'tables': [{                         # provide list of table(s)
            'name': 'census',                # name of the table
            'data': df_original,             # the original data as pd.DataFrame
            'tabular_model_configuration': { # tabular model configuration (optional)
                'max_training_time': 1,      # - limit training time (in minutes)
                # model, max_epochs,..     # further model configurations (optional)
                'differential_privacy': {    # differential privacy configuration (optional)
                    'max_epsilon': 5.0,      # - max epsilon value, used as stopping criterion
                    'delta': 1e-5,           # - delta value
                }
            },
            # columns, keys, compute,..      # further table configurations (optional)
        }]
    },
    start=True,                              # start training immediately (default: True)
    wait=True,                               # wait for completion (default: True)
)
```

Once the generator has been trained, you can use it to generate synthetic data samples. Either via probing:

```python
# probe for some representative synthetic samples
df_samples = mostly.probe(g, size=100)
df_samples
```

or by creating a synthetic dataset entity for larger data volumes:

```python
# generate a large representative synthetic dataset
sd = mostly.generate(g, size=100_000)
df_synthetic = sd.data()
df_synthetic
```

or by conditionally probing / generating synthetic data:

```python
# create 100 seed records of 24-year-olds from Mexico
df_seed = pd.DataFrame({
    'age': [24] * 100,
    'native_country': ['Mexico'] * 100,
})
# conditionally probe, based on provided seed
df_samples = mostly.probe(g, seed=df_seed)
df_samples
```

## Key Features

- **Broad Data Support**
    - Mixed-type data (categorical, numerical, geospatial, text, etc.)
    - Single-table, multi-table, and time-series
- **Multiple Model Types**
    - TabularARGN for SOTA tabular performance
    - Fine-tune HuggingFace-based language models
    - Efficient LSTM for text synthesis from scratch
- **Advanced Training Options**
    - GPU/CPU support
    - Differential Privacy
    - Progress Monitoring
- **Automated Quality Assurance**
    - Quality metrics for fidelity and privacy
    - In-depth HTML reports for visual analysis
- **Flexible Sampling**
    - Up-sample to any data volumes
    - Conditional generation by any columns
    - Re-balance underrepresented segments
    - Context-aware data imputation
    - Statistical fairness controls
    - Rule-adherence via temperature
- **Seamless Integration**
    - Connect to external data sources (DBs, cloud storages)
    - Fully permissive open-source license

## Citation

Please consider citing our project if you find it useful:

```bibtex
@software{mostlyai,
    author = {{MOSTLY AI}},
    title = {{MOSTLY AI SDK}},
    url = {https://github.com/mostly-ai/mostlyai},
    year = {2025}
}
```
