Metadata-Version: 2.4
Name: aetherdialect
Version: 0.1.0
Summary: Deterministic, validation-first Text-to-SQL system for business databases
Author-email: Akul Ameya <akul.ameya@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/akul-ameya/aetherdialect
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: jsonschema<5,>=4.0
Requires-Dist: openai<3,>=2.0.0
Requires-Dist: sqlglot<30,>=29.0
Requires-Dist: platformdirs<5,>=2.0.0
Requires-Dist: python-dotenv<2,>=1.0.0
Provides-Extra: databricks
Requires-Dist: pyspark<4,>=3.3; extra == "databricks"
Requires-Dist: databricks-sql-connector<4,>=3.0; extra == "databricks"
Provides-Extra: postgresql
Requires-Dist: SQLAlchemy<3,>=2.0; extra == "postgresql"
Requires-Dist: psycopg2-binary<3,>=2.9; extra == "postgresql"
Requires-Dist: pglast<8,>=5.0; extra == "postgresql"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: vulture<3,>=2.11; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: pre-commit>=3.0; extra == "dev"
Requires-Dist: black<25,>=24; extra == "dev"
Requires-Dist: docformatter<2,>=1.7; extra == "dev"
Dynamic: license-file

# Deterministic, validation-first Text-to-SQL for business databases

## Installation

```bash
pip install aetherdialect
pip install "aetherdialect[postgresql]"
pip install "aetherdialect[databricks]"
pip install "aetherdialect[postgresql,databricks]"
```

Requires Python ≥ 3.10 and an [OpenAI API key](https://platform.openai.com/api-keys).

| Extra        | Brings in                                      | Use when                    |
| ------------ | ---------------------------------------------- | --------------------------- |
| `postgresql` | SQLAlchemy, PostgreSQL driver, `pglast`        | `engine="postgresql"`       |
| `databricks` | PySpark, Databricks SQL connector              | `engine="databricks"`       |

**SQL parsing for validation:** PostgreSQL uses **`pglast`** for structural AST checks (join pairs, CTE bodies, `ast_validate`). Databricks / Spark SQL uses **`sqlglot`** with the **Spark** dialect. The base package already depends on `sqlglot`; `pglast` is installed with the `postgresql` extra.
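
For intuition, the parser-level check is small in either library. A minimal sketch of a single-`SELECT` gate built on those two parsers; the function names are ours, not this package's API:

```python
# Minimal sketch of dialect AST validation; function names are illustrative.
import sqlglot
from sqlglot import exp


def spark_is_single_select(sql: str) -> bool:
    """Parse with sqlglot's Spark dialect and require one SELECT statement."""
    try:
        tree = sqlglot.parse_one(sql, read="spark")
    except sqlglot.errors.ParseError:
        return False
    return isinstance(tree, exp.Select)


def postgres_is_single_select(sql: str) -> bool:
    """Same check via pglast (installed with the `postgresql` extra)."""
    from pglast import ast, parse_sql
    from pglast.parser import ParseError

    try:
        statements = parse_sql(sql)
    except ParseError:
        return False
    return len(statements) == 1 and isinstance(statements[0].stmt, ast.SelectStmt)
```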

---

## Quickstart

```python
from aetherdialect import Text2SQL

t2s = Text2SQL(
    engine="postgresql",
    host="localhost",
    database="mydb",
    password="secret",
    openai_api_key="sk-...",
)

t2s.run_interactive()
```

Constructor options, modes, optional files, and the full method list are in **[USAGE.md](USAGE.md)**.

---

## What this is

A **validation-first** text-to-SQL layer for **PostgreSQL** and **Databricks**. It targets **stable business schemas** and **repeated analytical questions**, not open-ended “any SQL” generation.

- Natural language is turned into a **structured intent** (tables, select expressions, filters, grouping, ordering, optional CTEs) that is **shared across dialects**; dialect-specific SQL is produced later (a sketch of one such intent follows this list).
- **Templates** store previously accepted query patterns; **negative memory** records rejections so bad shapes are less likely to repeat.
- LLM calls run at **temperature 0**; for the same inputs and schema state, behavior is **repeatable**.
- **Bounded LLM use**: strong paths reuse templates or deterministic structure before asking the model for SQL.
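
For intuition, an intent for a question like "total revenue by region over the last quarter" might look roughly like this. The field names are illustrative, not the package's actual intent schema:

```python
# Illustrative intent shape only; the real intent model may differ.
intent = {
    "tables": ["orders", "regions"],
    "select": [
        {"column": "regions.name"},
        {"aggregate": "SUM", "column": "orders.amount", "alias": "total_revenue"},
    ],
    "filters": [
        {"date_window": {"column": "orders.created_at", "unit": "quarter", "offset": -1}},
    ],
    "group_by": ["regions.name"],
    "order_by": [{"expr": "total_revenue", "direction": "DESC"}],
    "limit": 10,
}
```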

---

## Philosophy

- **Determinism over creativity** — prefer a correct, boring plan to a novel one.
- **Correct joins over clever SQL** — join paths come from **foreign keys** and precomputed paths, not free-form guessing.
- **Validate before execute** — schema checks, intent consistency, join shape, and dialect-safe **read-only** `SELECT` rules.
- **Minimal LLM surface** — parse intent, resolve ambiguous joins when needed, generate or repair SQL; everything else is rules and stores.
- **Safe defaults for non-analysts** — narrow allowed SQL; you still choose **database credentials** (read-only vs write-capable is outside this library).

---

## What it supports (at a glance)

**Backends**

- PostgreSQL via SQLAlchemy.
- Databricks via Unity Catalog introspection (with optional DDL file fallback when the catalog is empty).

**Schema**

- Load from **live introspection** (primary).
- Optional **`CREATE TABLE` file** as a supplement or fallback (especially when the driver cannot surface all metadata).
- Cached schema snapshot per connection fingerprint so restarts avoid re-reflecting unchanged databases.
- **Table roles** (e.g. fact vs dimension), **column roles** (measure, categorical, temporal, identifier, etc.), **filter / aggregation / HAVING** allowances per column, **value domains** from profiling.
- Optional **human notes** (plain text) fed once when the schema graph is built: richer **descriptions**, **roles**, and optional **sensitivity** labels — without renaming tables or inventing columns (a hypothetical example follows this list).
- Optional **deny lists** for tables or columns so they stay out of prompts and can be stripped from intents.
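
The notes are free-form plain text; as a purely hypothetical example of the kind of guidance they can carry (table and column names invented):

```text
orders: fact table. One row per order line, not per order.
orders.amount: measure. Gross revenue in USD, before refunds.
customers.tax_id: sensitive. Exclude from results shown to analysts.
regions: dimension. Business users say "territory" for regions.name.
```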

**Intent / SQL shape (analytical subset)**

- **Queries:** `SELECT` only (enforced with pattern checks and dialect parsing). **CTEs** reuse the same intent model as the outer query.
- **Joins:** only along **FK-backed paths**; join type for injected joins is chosen **deterministically** from table roles (e.g. dimension side as `LEFT` where applicable).
- **Select list:** bare columns, **aggregates** (`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, etc.), **arithmetic and string expressions** where the schema allows them, **`DISTINCT`**, and **scalar functions** subject to column metadata.
- **Filters / boolean logic:** comparisons, **`AND` / `OR`**, **`IN`**, **`LIKE`**; **`ILIKE` / `NOT ILIKE` on PostgreSQL only** (intent stays dialect-agnostic; SQL rendering differs). Null / boolean value normalization in the repair chain.
- **`BETWEEN`** in intent is **decomposed** into a pair of comparable predicates (sketched after this list).
- **Grouping / ordering:** **`GROUP BY`**, **`HAVING`** (aggregate-aware), **`ORDER BY`**, **`LIMIT`**; rules tie **grain** (row-level vs grouped) to aggregates and grouped columns.
- **Dates:** structured **`date_window`** (anchor unit + offset) and **date-difference** filters between columns where supported.
- **Windows:** `ROW_NUMBER`, `RANK`, `DENSE_RANK`, and windowed `SUM` / `AVG` on select columns (main query and CTEs).
- **`CASE` / `WHEN`:** only in the **select list** in the intent model (not in `WHERE` / `HAVING`).
- **Arrays / lists:** membership-style filters; SQL uses dialect-appropriate forms; optional **UNNEST / EXPLODE-style** expansion in CTE select lists for typed array columns.
- **Metadata:** **UNIQUE** (and related constraints) when reflection or DDL exposes them, used to rank human-readable identifiers.
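
The `BETWEEN` decomposition mentioned above is mechanical. A sketch, with invented filter-dictionary fields rather than the package's real intent model:

```python
# Sketch of BETWEEN decomposition into two comparable predicates;
# the filter dictionaries here are illustrative.
def decompose_between(column: str, low, high) -> list[dict]:
    return [
        {"column": column, "op": ">=", "value": low},
        {"column": column, "op": "<=", "value": high},
    ]


print(decompose_between("orders.amount", 100, 500))
# [{'column': 'orders.amount', 'op': '>=', 'value': 100},
#  {'column': 'orders.amount', 'op': '<=', 'value': 500}]
```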

**Operational modes**

- **Interactive** — ask questions, accept or reject answers, and export results.
- **Coverage simulator** — seed questions → gold intents → **deterministic expansion** (many operators, deduplicated) → validate/execute → NL question generation for new templates.
- **QSim** — reproducible synthetic questions from schema and profiles (seeded randomness).

---

## What it is not

- Not a **full SQL** or **stored-procedure** generator: no `UNION`/`INTERSECT`/`EXCEPT`, no correlated subqueries, no `EXISTS`, no `LATERAL`, no **DML/DDL**, no arbitrary **RIGHT/FULL OUTER** join policy in the constrained path.
- Not a substitute for **database security**: use credentials with **least privilege** (`SELECT` / `EXPLAIN` as you prefer).
- Not **schema-agnostic**: quality depends on **FKs**, sensible types, and optional notes for domain language.

---

## How a question becomes SQL

1. **Template match** — if a trusted pattern fits, reuse parameterized SQL (often **no** SQL LLM call).
2. **Intent parse** — structured intent from the question + schema summary, then a long **deterministic repair chain** (see below).
3. **Join resolution** — choose among valid **FK paths** for the intent’s tables; disambiguation may use the LLM when multiple paths tie.
4. **SQL generation & validation** — deterministic skeleton, injected joins, LLM fill/repair **under constraints**, then **semantic validation**, **`SELECT`-only / forbidden-pattern checks**, **dialect AST** validation (**`pglast`** or **`sqlglot`**), and optionally **`EXPLAIN`** when an engine is available.
5. **Execute** (where the mode allows) and **learn** — accept → promote template trust; reject → record negative pattern.
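
In code terms, the control flow above reduces to something like the following. Every helper here is an illustrative stub standing in for the package's real components, not its API:

```python
# High-level sketch of the question-to-SQL pipeline; all helpers are stubs.
def match_template(question: str):
    """Step 1: look up a trusted template. Stub: nothing matches."""
    return None


def parse_intent(question: str) -> dict:
    """Step 2: structured intent from the LLM at temperature 0 (stubbed)."""
    return {"tables": ["orders"], "select": [{"aggregate": "COUNT"}]}


def run_repair_chain(intent: dict) -> dict:
    """Deterministic repair passes, applied in a fixed order (stubbed)."""
    return intent


def resolve_fk_joins(intent: dict) -> list:
    """Step 3: choose among valid FK-backed join paths (stubbed)."""
    return []


def generate_sql(intent: dict, joins: list) -> str:
    """Step 4: deterministic skeleton plus constrained LLM fill (stubbed)."""
    return "SELECT COUNT(*) FROM orders"


def validate_sql(sql: str) -> None:
    """Semantic checks, SELECT-only rules, dialect AST validation (stubbed)."""
    assert sql.lstrip().upper().startswith("SELECT")


def answer(question: str) -> str:
    template = match_template(question)
    if template is not None:
        return template  # often no SQL LLM call at all
    intent = run_repair_chain(parse_intent(question))
    sql = generate_sql(intent, resolve_fk_joins(intent))
    validate_sql(sql)  # step 5 in practice: execute, then learn from feedback
    return sql
```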

---

## Validation (layers)

- **Safety / shape** — `SELECT`-only enforcement, configurable **forbidden SQL** substrings, then **dialect `ast_validate`** (`pglast` on PostgreSQL, **`sqlglot` (Spark)** on Databricks). When a live **engine** is passed in, **`EXPLAIN`** can be used as an extra executability check (sketched after this list).
- **Schema vs intent** — tables/columns/CTEs, **selectability**, access and sensitivity policy, window / CASE / array shapes, filter and HAVING ops per column, aggregate roles in select/HAVING/ORDER BY, scalar function typing, filter value types vs column types, null/date-window/date-diff rules.
- **Semantic consistency** — grouped queries require proper aggregation; contradictions and impossible HAVING; **grain** alignment (one of several cross-checks).
- **Joins** — paths must match the FK graph; guarded path avoids ad-hoc join guessing.
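
A minimal sketch of that `EXPLAIN` check, assuming a live SQLAlchemy engine and nothing about this package's internals:

```python
# Sketch of an EXPLAIN-based executability gate; not the package's code.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:secret@localhost/mydb")


def explains_cleanly(sql: str) -> bool:
    """Ask the database to plan the query without executing it."""
    try:
        with engine.connect() as conn:
            conn.execute(text(f"EXPLAIN {sql}"))
        return True
    except Exception:
        return False
```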

## Deterministic repairs (after intent parse)

Applied in order (high level):

- `COUNT(*)` normalization
- CTE naming and output aliases
- qualify CTE outputs
- sanitize table names
- grain rules for grouped CTE usage and **grain consistency**
- strip redundant `GROUP BY`
- normalize filters/HAVING
- null-equality fixes
- strip join-condition leakage into filters
- per-CTE sort order
- simplify expressions
- `IN` value normalization
- date-diff classification and raw-value fixes
- **`BETWEEN` → paired predicates**
- auto-repair filters/HAVING
- strip impossible HAVING
- FK filter type repair
- filter value case / enum alignment
- boolean and null filter values
- expand FK selects to descriptive columns
- deduplicate contradictory filters
- redundant PK re-qualification
- **window**, **CASE**, and **array** intent repairs
- sensitivity and access policy enforcement on the intent
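
One pass from the chain above, the null-equality fix, as a sketch (the filter dictionaries are illustrative, not the package's model):

```python
# Illustrative null-equality repair pass: `col = NULL` never matches in SQL,
# so (in)equality against NULL is rewritten to IS [NOT] NULL.
def repair_null_equality(filters: list[dict]) -> list[dict]:
    repaired = []
    for f in filters:
        if f.get("value") is None and f.get("op") in ("=", "=="):
            repaired.append({"column": f["column"], "op": "IS NULL"})
        elif f.get("value") is None and f.get("op") in ("!=", "<>"):
            repaired.append({"column": f["column"], "op": "IS NOT NULL"})
        else:
            repaired.append(f)
    return repaired
```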

---

## Learning and reuse

- **Accepted templates** — intent fingerprint, parameterized SQL, optional example question, **trust** that rises with validation and falls with rejection (a sketch follows this list).
- **Rejected templates** — categorized failures so similar bad intents are discouraged.
- Persistence is under a **per-connection artifact directory** (see USAGE); you can back it up or reset it by removing that directory.
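
A sketch of what such a record could look like; the field names and trust arithmetic are invented for illustration, not the package's storage format:

```python
# Hypothetical template record with a trust score.
from dataclasses import dataclass


@dataclass
class StoredTemplate:
    intent_fingerprint: str
    parameterized_sql: str
    example_question: str | None = None
    trust: float = 0.5  # rises on validated acceptance, falls on rejection

    def on_accept(self) -> None:
        self.trust = min(1.0, self.trust + 0.1)

    def on_reject(self) -> None:
        self.trust = max(0.0, self.trust - 0.2)
```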

---

## Coverage simulator (brief)

1. Parse each **seed** line into a gold intent.
2. **Expand** with a fixed set of **deterministic operators** (filters, aggregates, joins, time windows, numeric transforms, distinct/limit/OR-groups, expressions, **window variants**, etc.), **deduplicated** across depths.
3. Resolve joins once per table set where possible; validate and **execute** as a gate.
4. One LLM step to produce a **natural language question** from the SQL, with a **realism** filter.

The simulator loads at most **500** seed lines per run (fixed internal cap; larger files are truncated). **`estimate_simulator_costs`** returns **rough** LLM-call and execution estimates from a seed file and schema stats; it uses the same loader, so its seed count reflects the same cap.
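
Only the method name above comes from this package; the call shape below is an assumption, so check **[USAGE.md](USAGE.md)** for the real signature:

```python
# Assumed call shape for the cost estimator; verify against USAGE.md.
estimate = t2s.estimate_simulator_costs("seeds.txt")  # t2s from the Quickstart
print(estimate)  # rough LLM-call and execution counts before a full run
```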

---

## QSim (brief)

Generates **reproducible** question lists from the schema and profiled values. Same seed → same output. Use for regression-style testing or dataset building.

---

## When to use it

**Good fit:** star/snowflake-style models, clear FKs, repeated BI-style questions, PostgreSQL or Databricks.

**Poor fit:** schemas without relationships, heavy procedural logic, or expectations of arbitrary SQL features outside the supported analytical subset.
