Metadata-Version: 2.4
Name: aetherdialect
Version: 0.1.2
Summary: Deterministic, validation-first Text-to-SQL system for business databases
Author-email: Akul Ameya <akul.ameya@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/akul-ameya/aetherdialect
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: jsonschema<5,>=4.0
Requires-Dist: openai<3,>=2.0.0
Requires-Dist: platformdirs<5,>=2.0.0
Requires-Dist: python-dotenv<2,>=1.0.0
Requires-Dist: SQLAlchemy<3,>=2.0
Provides-Extra: databricks
Requires-Dist: sqlglot<30,>=29.0; extra == "databricks"
Requires-Dist: pyspark<4,>=3.3; extra == "databricks"
Requires-Dist: databricks-sql-connector<4,>=3.0; extra == "databricks"
Requires-Dist: databricks-sqlalchemy<3,>=2.0; extra == "databricks"
Provides-Extra: postgresql
Requires-Dist: psycopg2-binary<3,>=2.9; extra == "postgresql"
Requires-Dist: pglast<8,>=5.0; extra == "postgresql"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: vulture<3,>=2.11; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: pre-commit>=3.0; extra == "dev"
Requires-Dist: black<25,>=24; extra == "dev"
Requires-Dist: docformatter<2,>=1.7; extra == "dev"
Dynamic: license-file

# Deterministic, validation-first Text-to-SQL for business databases

Questions resolve more reliably when you state analytical intent explicitly—entities, grain, filters, time scope, and ordering—instead of leaving those details implied.

## Installation

```bash
pip install aetherdialect
pip install "aetherdialect[postgresql]"
pip install "aetherdialect[databricks]"
pip install "aetherdialect[postgresql,databricks]"
```

Requires Python ≥ 3.10 and either an [OpenAI API key](https://platform.openai.com/api-keys) or Azure OpenAI credentials.

| Extra        | Brings in                                                                                        | Use when              |
| ------------ | ------------------------------------------------------------------------------------------------ | --------------------- |
| (base)       | **SQLAlchemy** (shared introspection / execution interface)                                      | Always installed      |
| `postgresql` | PostgreSQL driver (`psycopg2-binary`), **`pglast`**                                              | `engine="postgresql"` |
| `databricks` | Databricks SQL connector (preferred), PySpark (fallback), **`databricks-sqlalchemy`**, `sqlglot` | `engine="databricks"` |

**SQL parsing for validation:** PostgreSQL uses **`pglast`** for structural AST checks (join pairs, CTE bodies, `ast_validate`). Databricks / Spark SQL uses **`sqlglot`** with the **Spark** dialect.
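
For a feel of that parse step, here is a minimal standalone sketch using both parsers directly; the library wraps them behind its own `ast_validate`, so this is illustrative rather than its internal code:

```python
# Minimal sketch of dialect parsing -- illustrative, not the library's
# internal ast_validate implementation.
import pglast
import sqlglot

sql = "SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id"

# PostgreSQL: pglast builds a genuine Postgres parse tree; malformed SQL
# raises pglast.parser.ParseError.
try:
    pglast.parse_sql(sql)
    print("pglast: parse ok")
except pglast.parser.ParseError as exc:
    print(f"pglast rejected: {exc}")

# Databricks / Spark SQL: sqlglot parses with the Spark dialect; malformed
# SQL raises sqlglot.errors.ParseError.
try:
    sqlglot.parse_one(sql, read="spark")
    print("sqlglot: parse ok")
except sqlglot.errors.ParseError as exc:
    print(f"sqlglot rejected: {exc}")
```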

---

## Quickstart

```python
from text2sql import Text2SQL

t2s = Text2SQL(
    engine="postgresql",
    host="localhost",
    database="mydb",
    password="secret",
    openai_api_key="sk-...",
)

t2s.run_interactive()
```

Constructor options, credentials (`set_openai_api_key`, `set_azure_openai_api_key`, `set_env`), modes, and the full API are in **[USAGE.md](USAGE.md)**. Pass **`artifacts_dir=`** to put the per-connection cache under a root you choose; otherwise it lives under the platform user-data directory (see USAGE.md).
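
For example, to pin the cache under a path you manage (the directory value here is illustrative):

```python
# Keep the per-connection cache under your own root instead of the
# platform user-data directory. The path below is just an example.
t2s = Text2SQL(
    engine="postgresql",
    host="localhost",
    database="mydb",
    password="secret",
    openai_api_key="sk-...",
    artifacts_dir="/srv/myapp/text2sql-artifacts",
)
```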

**Interactive, two ways:** **`run_interactive()`** is a stdin loop. For your own UI or protocol, use **`Text2SQL.pipeline_session()`** and drive the returned **`PipelineSession`** step by step: one natural-language question is a **turn** that may yield several **`SessionEvent`** objects (prompt / answer / …) until **`done`** is true. Types and methods are documented in **[USAGE.md](USAGE.md)**.
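
A minimal sketch of that programmatic loop; only `pipeline_session()`, `PipelineSession`, `SessionEvent`, and `done` are named above, so the turn-submission method and event fields below are assumptions; check USAGE.md for the real signatures:

```python
# Sketch only: submit(), events, kind, and text are assumed names; see
# USAGE.md for the actual PipelineSession API.
session = t2s.pipeline_session()

turn = session.submit("total revenue by region last quarter")  # assumed method
for event in turn.events:          # SessionEvent objects for this turn
    if event.kind == "prompt":     # pipeline is asking the caller something
        print("needs input:", event.text)
    elif event.kind == "answer":   # SQL / result payload
        print(event.text)
print("turn finished:", turn.done)
```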

---

## What this is

A **validation-first** text-to-SQL layer for **PostgreSQL** and **Databricks**. It targets **stable business schemas** and **repeated analytical questions**, not open-ended “any SQL” generation.

- Natural language is turned into a **structured intent** (tables, select expressions, filters, grouping, ordering, optional CTEs) that is **shared across dialects**; dialect-specific SQL is produced later (a sketch of one possible intent shape follows this list).
- **Templates** store previously accepted query patterns; **negative memory** records rejections so bad shapes are less likely to repeat.
- LLM calls run at **temperature 0**; for the same inputs and schema state, behavior is **repeatable**.
- **Bounded LLM use**: strong paths reuse templates or deterministic structure before asking the model for SQL.
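
A hand-written sketch of what such an intent might look like; the field names are illustrative, not the library's actual schema:

```python
# Illustrative only -- these field names are not the library's real
# intent schema, just the kind of structure described above.
intent = {
    "tables": ["orders", "customers"],
    "select": [
        {"column": "customers.region"},
        {"aggregate": "SUM", "column": "orders.amount", "alias": "revenue"},
    ],
    "filters": [{"column": "orders.status", "op": "=", "value": "shipped"}],
    "date_window": {"column": "orders.created_at", "unit": "quarter", "offset": -1},
    "group_by": ["customers.region"],
    "order_by": [{"expr": "revenue", "dir": "DESC"}],
    "limit": 5,
}
```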

---

## Philosophy

- **Determinism over creativity** — prefer a correct, boring plan to a novel one.
- **Correct joins over clever SQL** — join paths come from **foreign keys** and precomputed paths, not free-form guessing.
- **Validate before execute** — schema checks, intent consistency, join shape, and dialect-safe **read-only** `SELECT` rules.
- **Minimal LLM surface** — parse intent, resolve ambiguous joins when needed, repair SQL; everything else is rules and stores.
- **Safe defaults for non-analysts** — narrow allowed SQL; you still choose **database credentials** (read-only vs write-capable is outside this library).

---

## What it supports (at a glance)

**Backends**

- PostgreSQL
- Databricks

**Schema**

- Load from **live introspection** (primary).
- Optional **`CREATE TABLE` file** as a supplement or fallback (especially when the driver cannot reach all metadata).
- Cached schema snapshot per connection fingerprint so restarts avoid re-reflecting unchanged databases.
- **Table roles** (e.g. fact vs dimension), **column roles** (measure, categorical, temporal, identifier, etc.), **filter / aggregation / HAVING** allowances per column, **value domains** from profiling — all assigned when the graph is built (reflection, DDL, profiling, and optional notes).
- Optional **human notes** (plain text), via `Text2SQL(..., notes_file=...)` (see **[USAGE.md](USAGE.md)**): merged when the graph is built or when notes change; if the cache already contains notes and you omit `notes_file` on a later run, cached roles and hints are kept. A sketch follows this list.
- Optional **deny lists** for tables or columns so they stay out of prompts and can be stripped from intents.
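
A sketch of wiring notes in; `notes_file` is the documented option, while the note content shown is invented for illustration:

```python
# notes_file is a documented option (see USAGE.md). The note text below is
# made up for illustration; notes are merged when the graph is built.
#
# schema_notes.txt:
#   orders.amount is gross revenue in USD.
#   customers.segment values: "smb", "mid", "enterprise".
t2s = Text2SQL(
    engine="postgresql",
    host="localhost",
    database="mydb",
    password="secret",
    openai_api_key="sk-...",
    notes_file="schema_notes.txt",
)
```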

**Intent / SQL shape (analytical subset)**

- **Queries:** `SELECT` only (enforced with pattern checks and dialect parsing). **CTEs** reuse the same intent model as the outer query.
- **Joins:** related tables are wired using the schema’s relationships; when more than one valid path could link the same tables, one coherent path is chosen for the whole query, and join style follows table roles (e.g. `LEFT` toward dimensions where that fits). Self-joins use **CTEs** instead of repeating the same base table in one `FROM` chain.
- **Select list:** bare columns, **aggregates** (`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, etc.), **arithmetic and string expressions** where the schema allows them, **`DISTINCT`**, and **scalar functions** subject to column metadata.
- **Filters / boolean logic:** comparisons, **`AND` / `OR`**, **`IN`**, **`LIKE`**; **`ILIKE` / `NOT ILIKE` on PostgreSQL only** (intent stays dialect-agnostic; SQL rendering differs). Null / boolean value normalization in the repair chain.
- **`BETWEEN`** in intent is **decomposed** into a pair of comparison predicates (`>=` and `<=`).
- **Grouping / ordering:** **`GROUP BY`**, **`HAVING`** (aggregate-aware), **`ORDER BY`**, **`LIMIT`**; rules tie **grain** (row-level vs grouped) to aggregates and grouped columns (a rendered example follows this list).
- **Dates:** structured **`date_window`** (anchor unit + offset) and **date-difference** filters between columns where supported.
- **Windows:** `ROW_NUMBER`, `RANK`, `DENSE_RANK`, windowed `SUM` / `AVG`, `LAG`, `LEAD`, `FIRST_VALUE`, and `LAST_VALUE` in select lists (main query and CTEs).
- **`CASE` / `WHEN`:** only in the **select list** in the intent model (not in `WHERE` / `HAVING`).
- **Arrays / lists:** membership-style filters; SQL uses dialect-appropriate forms; optional **UNNEST / EXPLODE-style** expansion in CTE select lists for typed array columns.
- **Metadata:** **UNIQUE** (and related constraints) are captured when reflection or DDL exposes them, and used for ranking “human readable” identifiers.
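
For a feel of the rendered output, a question like “top 5 regions by revenue last quarter” lands on SQL of roughly this shape on PostgreSQL (illustrative, not a captured run; note the fact-to-dimension `LEFT JOIN` and the date window rendered as a decomposed predicate pair):

```python
# Illustrative PostgreSQL rendering, not captured library output.
example_sql = """
SELECT c.region, SUM(o.amount) AS revenue
FROM orders AS o
LEFT JOIN customers AS c ON o.customer_id = c.customer_id
WHERE o.created_at >= date_trunc('quarter', now()) - interval '3 months'
  AND o.created_at <  date_trunc('quarter', now())
GROUP BY c.region
ORDER BY revenue DESC
LIMIT 5
"""
```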

**Operational modes**

- **Interactive** — ask questions, accept or reject, export results; via **`run_interactive()`** or a programmatic **`PipelineSession`** (see Quickstart above and **[USAGE.md](USAGE.md)**).
- **Coverage simulator** — seed questions → gold intents → **deterministic expansion** (many operators, deduplicated) → validate/execute → NL question generation for new templates.
- **QSim** — reproducible synthetic questions from schema and profiles (seeded randomness).

---

## What it is not

- Not a **full SQL** or **stored-procedure** generator: no `UNION`/`INTERSECT`/`EXCEPT`, no correlated subqueries, no `EXISTS`, no `LATERAL`, no **DML/DDL**, no arbitrary **RIGHT/FULL OUTER** join policy in the constrained path.
- Not a substitute for **database security**: use credentials with **least privilege** (`SELECT` and `EXPLAIN`).
- Not **schema-agnostic**: quality depends on **FKs**, sensible types, and optional notes for domain language.

---

## How a question becomes SQL

1. **Template match** — if a trusted pattern fits, reuse parameterized SQL (often **no** SQL LLM call).
2. **Intent parse** — structured intent from the question + schema summary, then a long **deterministic repair chain**.
3. **Join resolution** — choose among valid **FK paths** for the intent’s tables; disambiguation may use the LLM when multiple paths tie.
4. **SQL generation & validation** — deterministic skeleton, injected joins, LLM fill/repair **under constraints**, then **semantic validation**, **`SELECT`-only / forbidden-pattern checks**, **dialect AST** validation (**`pglast`** or **`sqlglot`**), and **`EXPLAIN`**.
5. **Execute** (where the mode allows) and **learn** — accept → promote template trust; reject → record a negative pattern.
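
A hypothetical trace of one question through those stages (names and details invented for illustration):

```python
# Hypothetical trace -- invented for illustration, not real log output.
# Q: "revenue by region last quarter"
# 1. template match      -> no trusted template fits; fall through
# 2. intent parse        -> orders + customers, SUM(amount), group by region,
#                           date_window(quarter, -1); repair chain normalizes values
# 3. join resolution     -> single FK path orders.customer_id -> customers.customer_id
# 4. generate + validate -> skeleton + injected joins, pglast AST ok, EXPLAIN ok
# 5. user accepts        -> template stored, trust raised
```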

---

## Validation (layers)

- **Safety / shape** — `SELECT`-only enforcement, configurable **forbidden SQL** substrings, then **dialect `ast_validate`** (**`pglast`** or **`sqlglot`**). **`EXPLAIN`** is used as an extra executability check.
- **Schema vs intent** — tables/columns/CTEs, **selectability**, access and sensitivity policy, window / CASE / array shapes, filter and HAVING ops per column, aggregate roles in select/HAVING/ORDER BY, scalar function typing, filter value types vs column types, null/date-window/date-diff rules.
- **Semantic consistency** — grouped queries require proper aggregation; contradictory filters and impossible HAVING clauses are rejected; **grain** alignment is enforced (see the example after this list).
- **Joins** — paths must match the FK graph; guarded path avoids ad-hoc join guessing.
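
One concrete example of the semantic layer at work (illustrative, not actual library output): a grouped query that selects a bare, non-grouped, non-aggregated column is rejected before anything reaches the database.

```python
# Illustrative grain check, not the library's actual messages.
rejected = "SELECT region, amount FROM sales GROUP BY region"       # bare 'amount': wrong grain
accepted = "SELECT region, SUM(amount) FROM sales GROUP BY region"  # grain-consistent
```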

---

## Learning and reuse

- **Accepted templates** — intent fingerprint, parameterized SQL, optional example question, and a **trust** score that rises with validation and falls with rejection (a sketch of the record shape follows this list).
- **Rejected templates** — categorized failures so similar bad intents are discouraged.
- Persistence is under a **per-connection artifact directory** (see **[USAGE.md](USAGE.md)**); you can back it up or reset it by removing that directory.
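
A sketch of what an accepted-template record conceptually holds; field names and values are illustrative, not the real storage format:

```python
# Illustrative record shape -- not the library's actual storage format.
template = {
    "fingerprint": "orders+customers|SUM(amount)|group:region|window:quarter",
    "sql": "SELECT c.region, SUM(o.amount) ... WHERE o.created_at >= :start ...",
    "example_question": "revenue by region last quarter",
    "trust": 0.8,  # rises with successful validation, falls with rejection
}
```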

---

## Coverage simulator (brief)

1. Parse each **seed** line into a gold intent.
2. **Expand** with a fixed set of **deterministic operators** (filters, aggregates, joins, time windows, numeric transforms, distinct/limit/OR-groups, expressions, **window variants**, etc.), **deduplicated** across depths.
3. Resolve joins once per table set where possible; validate and **execute** as a gate.
4. One LLM step to produce a **natural language question** from the SQL, with a **realism** filter.

The simulator loads at most **500** seed lines per run (a fixed internal cap; larger files are truncated). **`estimate_simulator_costs`** uses the same loader, prints **rough** per-phase upper-bound lines to stdout, and returns **`None`**.
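
Usage might look like the following; the function itself is documented above, but the import path and argument shown are assumptions (see USAGE.md for the real signature):

```python
# Hypothetical import path and argument; see USAGE.md for the real signature.
from text2sql import estimate_simulator_costs

estimate_simulator_costs("seeds.txt")  # prints rough per-phase upper bounds; returns None
```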

---

## QSim (brief)

Generates **reproducible** question lists from the schema and profiled values. Same seed → same output. Use for regression-style testing or dataset building.

---

## When to use it

**Good fit:** star/snowflake-style models, clear FKs, repeated BI-style questions, PostgreSQL or Databricks.

**Poor fit:** schemas without relationships, heavy procedural logic, or expectations of arbitrary SQL features outside the supported analytical subset.
