Metadata-Version: 2.4
Name: tval-cli
Version: 0.3.0
Summary: Table data validator for pre-analysis data validation
Project-URL: Homepage, https://github.com/b-trout/tval
Project-URL: Repository, https://github.com/b-trout/tval
Project-URL: Issues, https://github.com/b-trout/tval/issues
Author: b-trout
License-Expression: MIT
License-File: LICENSE
Keywords: csv,data-quality,duckdb,excel,parquet,validation
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: chardet<7,>=6.0.0.post1
Requires-Dist: duckdb<2,>=1.4.4
Requires-Dist: jinja2<4,>=3.1.6
Requires-Dist: pydantic<3,>=2.12.5
Requires-Dist: pyyaml<7,>=6.0.3
Provides-Extra: dev
Requires-Dist: mypy<2,>=1.19.1; extra == 'dev'
Requires-Dist: pre-commit<5,>=4.5.1; extra == 'dev'
Requires-Dist: pytest<10,>=9.0.2; extra == 'dev'
Requires-Dist: ruff<1,>=0.15.4; extra == 'dev'
Requires-Dist: types-chardet<6,>=5.0.4.6; extra == 'dev'
Requires-Dist: types-pyyaml<7,>=6.0.12.20250915; extra == 'dev'
Description-Content-Type: text/markdown

# tval

**Table data schema validator** - Validate CSV, Excel, and Parquet files against YAML-defined schemas using DuckDB.

[![PyPI version](https://img.shields.io/pypi/v/tval-cli)](https://pypi.org/project/tval-cli/)
![Python >= 3.10](https://img.shields.io/badge/Python-%3E%3D3.10-blue)
![License: MIT](https://img.shields.io/badge/License-MIT-green)
![CI](https://img.shields.io/badge/CI-passing-brightgreen)

## Quick Start

```bash
pip install tval-cli           # Install tval
tval init                     # Scaffold project skeleton
# Place YAML schemas in tval/schema/ and data files in tval/data/
tval run                      # Run validation and generate report
```

Open `tval/output/report.html` in your browser to review the results.

---

## Table of Contents

- [1. Purpose](#1-purpose)
  - [1.1 What tval Does](#11-what-tval-does)
  - [1.2 Problems It Solves](#12-problems-it-solves)
  - [1.3 How It Works](#13-how-it-works)
  - [1.4 Supported Formats](#14-supported-formats)
- [2. Project Layout](#2-project-layout)
- [3. User Guide](#3-user-guide)
  - [3.1 Prerequisites](#31-prerequisites)
  - [3.2 Installation](#32-installation)
  - [3.3 Initialize a Project](#33-initialize-a-project)
  - [3.4 Define Table Schemas](#34-define-table-schemas)
  - [3.5 Configure config.yaml](#35-configure-configyaml)
  - [3.6 Run Validation](#36-run-validation)
  - [3.7 Define Relations (Optional)](#37-define-relations-optional)
  - [3.8 Understanding the HTML Report](#38-understanding-the-html-report)
  - [3.9 Parquet Export](#39-parquet-export)
  - [3.10 Docker Usage](#310-docker-usage)
- [4. Developer Guide](#4-developer-guide)
  - [4.1 Development Setup](#41-development-setup)
  - [4.2 Code Quality](#42-code-quality)
  - [4.3 Running Tests](#43-running-tests)
  - [4.4 CI Pipeline](#44-ci-pipeline)
  - [4.5 Architecture Overview](#45-architecture-overview)
- [License](#license)

---

## 1. Purpose

### 1.1 What tval Does

tval is a CLI tool that validates tabular data files (CSV, Excel, Parquet) against YAML schema definitions. It checks data types, missing values, allowed values, custom rules, and cross-table relationships — then generates a single HTML report with all results.

### 1.2 Problems It Solves

- **Inconsistent validation** — When multiple analysts manually inspect data, coverage varies. tval enforces a single set of declarative rules.
- **Manual effort** — Hand-checking row counts, allowed values, and aggregation totals is tedious. tval automates the entire process.

### 1.3 How It Works

```
config.yaml          schema/*.yaml       relations.yaml (optional)
     |                     |                     |
     v                     v                     |
  Load config        Validate schemas            |
     |                     |                     |
     +--------+------------+                     |
              |                                  |
              v                                  |
     Load data files (CSV / XLSX / Parquet)      |
              |                                  |
              v                                  |
     Run validation checks                       |
              |                                  |
              v                                  |
     Run relation checks  <----------------------+
              |
              v
     Compute column statistics
              |
              v
     Export to Parquet (optional)
              |
              v
     Generate HTML report
```

### 1.4 Supported Formats

| Format  | Extensions   | Notes                                          |
|---------|--------------|-------------------------------------------------|
| CSV     | `.csv`       | Character encoding auto-detected via `chardet`  |
| Excel   | `.xlsx`      | `.xls` is **not** supported                     |
| Parquet | `.parquet`   | Read directly by DuckDB                         |

---

## 2. Project Layout

After running `tval init`, your project will have this structure:

```
your-project/
├── tval/
│   ├── config.yaml       # Validation configuration
│   ├── schema/           # Table definition YAML files (one per table)
│   ├── data/             # Source data files (CSV, XLSX, Parquet)
│   └── output/           # Generated reports and exports
├── ...                   # Your other project files
└── .gitignore            # tval/data/ and tval/output/ are auto-added
```

For the repository source layout and module details, see [Section 4.5 Architecture Overview](#45-architecture-overview).

---

## 3. User Guide

### 3.1 Prerequisites

- **Python >= 3.10**
- **pip** or [**uv**](https://github.com/astral-sh/uv) (recommended)

### 3.2 Installation

Using uv (recommended):

```bash
uv pip install tval-cli
```

Using pip:

```bash
pip install tval-cli
```

For development (editable install):

```bash
git clone https://github.com/b-trout/tval.git
cd tval
pip install -e ".[dev]"
```

### 3.3 Initialize a Project

```bash
tval init
```

Output:

```
Created tval/
Appended tval/data/, tval/output/ to .gitignore

Next steps:
  1. Add table definition YAML files to tval/schema/
  2. Place your data files in tval/data/
  3. Run validation with: tval run
```

To specify a different target directory:

```bash
tval init --dir ./my-project
```

### 3.4 Define Table Schemas

Each table is defined by a YAML file in the `schema/` directory. Below is a fully annotated example:

```yaml
# --- Table metadata ---
table:
  name: orders                    # Table name used internally
  description: Orders table       # Human-readable description
  source_dir: ./data/orders/      # Directory containing data files (relative to config.yaml)

# --- Column definitions ---
columns:
  - name: order_id                # Column name (must match data file headers)
    logical_name: Order ID        # Display name for reports
    type: INTEGER                 # Data type (INTEGER, VARCHAR, DOUBLE, DATE, etc.)
    not_null: true                # true = blank/missing values are not allowed

  - name: user_id
    logical_name: User ID
    type: INTEGER
    not_null: true

  - name: amount
    logical_name: Amount
    type: DOUBLE
    not_null: true
    min: 0                            # Minimum allowed value (numeric columns only)
    max: 1000000                      # Maximum allowed value (numeric columns only)

  - name: status
    logical_name: Status
    type: VARCHAR
    not_null: true
    allowed_values:               # Only these values are accepted; anything else is flagged
      - pending
      - shipped
      - cancelled

  - name: order_date
    logical_name: Order Date
    type: DATE
    not_null: true
    format: "%Y-%m-%d"           # Expected date format (for DATE, TIMESTAMP, TIME columns)

# --- Table-level rules ---
table_constraints:
  # Columns that uniquely identify each row (no duplicates allowed)
  primary_key:
    columns: [order_id]

  # Column combinations where duplicate values are not allowed
  unique:
    - columns: [order_id, user_id]

  # Columns whose values must exist in another table
  foreign_keys:
    - columns: [user_id]           # Column in this table
      references:
        table: users               # The other table (must have its own YAML)
        columns: [user_id]         # Matching column in the other table

  # Custom validation rules written in SQL
  checks:
    - description: Amount must be non-negative
      query: "SELECT COUNT(*) FROM {table} WHERE amount < 0"
      expect_zero: true            # true = pass when no rows match (i.e., no violations found)

  # Declarative row-level conditions (auto-generates SQL checks)
  row_conditions:
    - description: Amount must not exceed 10x order_id
      condition: "amount <= order_id * 10"

  # Aggregation-level validation rules (shown in a separate report section)
  aggregation_checks: []

# --- Export settings (optional) ---
export:
  partition_by: []                 # Split exported Parquet files by these columns
```
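The foreign key in the example above points at a `users` table, which must have its own schema file in the same `schema/` directory. A minimal sketch of what that companion file could look like (the table and column values here are illustrative):

```yaml
# schema/users.yaml -- hypothetical companion schema for the FK example above
table:
  name: users
  description: Users table
  source_dir: ./data/users/

columns:
  - name: user_id
    logical_name: User ID
    type: INTEGER
    not_null: true

table_constraints:
  primary_key:
    columns: [user_id]
```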

#### Column Definition Reference

| Field            | Type       | Required | Description                                             |
|------------------|------------|----------|---------------------------------------------------------|
| `name`           | `string`   | Yes      | Column name matching data file headers                  |
| `logical_name`   | `string`   | Yes      | Human-readable name for reports                         |
| `type`           | `string`   | Yes      | Data type (e.g. `INTEGER`, `VARCHAR`, `DOUBLE`, `DATE`) |
| `not_null`       | `bool`     | Yes      | `true` = blank/missing values are not allowed           |
| `description`    | `string`   | No       | Optional description                                    |
| `allowed_values` | `string[]` | No       | List of accepted values; other values are flagged as NG |
| `min`            | `number`   | No       | Minimum allowed value (numeric columns only); NULL values are excluded from the check |
| `max`            | `number`   | No       | Maximum allowed value (numeric columns only); must be >= `min` when both are specified |
| `format`         | `string`   | No       | Expected date/time format (e.g. `"%Y-%m-%d"`) for DATE/TIMESTAMP/TIME columns |

#### Table Constraints Reference

| Rule                 | What it does                                                               |
|----------------------|----------------------------------------------------------------------------|
| `primary_key`        | Ensures each row is uniquely identified — no duplicate values allowed in these columns |
| `unique`             | Ensures a combination of columns has no duplicate values                   |
| `foreign_keys`       | Ensures values in these columns exist in the referenced table (e.g. every `user_id` in orders must exist in the users table) |
| `checks`             | Custom SQL queries to validate data (see below)                           |
| `row_conditions`     | Declarative row-level conditions — each `condition` is a SQL boolean expression that must be true for every row |
| `aggregation_checks` | Same as `checks`, but results appear in a separate section of the report  |

#### User-Defined Checks

Write custom validation rules using SQL. Use `{table}` as a placeholder for the table name:

```yaml
checks:
  # Flag rows where amount is negative
  - description: Amount must be non-negative
    query: "SELECT COUNT(*) FROM {table} WHERE amount < 0"
    expect_zero: true     # true  → pass when the query returns 0 (no violations)
                          # false → pass when the query returns non-zero

  # Ensure the table has at least one row
  - description: Table must not be empty
    query: "SELECT COUNT(*) FROM {table}"
    expect_zero: false

  # Check with parameterized allowed values
  - description: Status must be valid
    query: "SELECT COUNT(*) FROM {table} WHERE status NOT IN (SELECT UNNEST(?::VARCHAR[]))"
    expect_zero: true
    params: [["pending", "shipped", "cancelled"]]
```

> **Tip:** `expect_zero: true` means "this query counts violations — pass when zero violations are found."
> `expect_zero: false` means "this query counts expected rows — pass when at least one row is found."
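That rule reduces to a one-line decision. A minimal sketch of the pass/fail logic in Python (an illustration, not tval's actual implementation):

```python
def check_passes(count: int, expect_zero: bool) -> bool:
    """Decide whether a user-defined check passes.

    count: the single value returned by the check's COUNT(*) query.
    expect_zero=True  -> the query counts violations; pass when count == 0.
    expect_zero=False -> the query counts expected rows; pass when count > 0.
    """
    return count == 0 if expect_zero else count > 0

print(check_passes(0, expect_zero=True))    # no violations found -> True
print(check_passes(5, expect_zero=False))   # table has rows -> True
print(check_passes(0, expect_zero=False))   # empty table fails the non-empty check -> False
```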

#### Export Configuration

| Field          | Type       | Default | Description                                       |
|----------------|------------|---------|---------------------------------------------------|
| `partition_by` | `string[]` | `[]`    | Column names to split Parquet output files by      |

### 3.5 Configure config.yaml

The `config.yaml` file controls the validation pipeline. All paths are resolved **relative to the directory containing config.yaml**.

```yaml
database_path: ./tval/work.duckdb           # DuckDB file path (must end in .duckdb)
schema_dir: ./tval/schema                    # Directory containing schema YAML files
output_path: ./tval/output/report.html       # HTML report output path
encoding_confidence_threshold: 0.8           # Minimum confidence for CSV encoding detection (0.0-1.0)
# relations_path: ./tval/relations.yaml      # Optional: inter-table relation definitions
```

| Field                          | Type    | Default | Description                                             |
|--------------------------------|---------|---------|---------------------------------------------------------|
| `database_path`                | `string`| -       | Path to the DuckDB database file (`.duckdb` extension required) |
| `schema_dir`                   | `string`| -       | Directory containing table schema YAML files            |
| `output_path`                  | `string`| -       | Output path for the generated HTML report               |
| `encoding_confidence_threshold`| `float` | `0.8`   | Minimum confidence from `chardet` to trust detected CSV encoding |
| `relations_path`               | `string`| -       | Optional path to `relations.yaml` for cardinality validation |
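To illustrate how the threshold is applied: `chardet` reports a detected encoding together with a confidence score between 0.0 and 1.0, and a detection below the threshold is not trusted. A minimal sketch of that decision in Python (the helper name and the fallback encoding are illustrative, not tval's actual code):

```python
def choose_encoding(detection: dict, threshold: float = 0.8, fallback: str = "utf-8") -> str:
    """Pick an encoding from a chardet-style result: {"encoding": ..., "confidence": ...}.

    Trust the detected encoding only when its confidence meets the threshold;
    otherwise fall back to a default (the fallback here is an assumption).
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding and confidence >= threshold:
        return encoding
    return fallback

print(choose_encoding({"encoding": "SHIFT_JIS", "confidence": 0.95}))  # SHIFT_JIS
print(choose_encoding({"encoding": "ISO-8859-1", "confidence": 0.4}))  # utf-8
```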

### 3.6 Run Validation

```bash
# Auto-discover config (searches ./tval/config.yaml then ./config.yaml)
tval run

# Specify config path explicitly
tval run --config path/to/config.yaml

# Run validation and export to Parquet
tval run --export
```

#### CLI Reference

| Command    | Option      | Default            | Description                                  |
|------------|-------------|--------------------|----------------------------------------------|
| `tval init`| `--dir`     | `./tval`           | Target directory for project skeleton        |
| `tval run` | `--config`  | Auto-discover      | Path to `config.yaml`                        |
| `tval run` | `--export`  | Disabled           | Export to Parquet if all validations pass     |

### 3.7 Define Relations (Optional)

Relations let you verify how tables are connected. For example, you can check that every `user_id` in the orders table actually exists in the users table.

To enable relation validation, create a `relations.yaml` file and reference it in `config.yaml`:

```yaml
# config.yaml
relations_path: ./tval/relations.yaml
```

```yaml
# relations.yaml
relations:
  # "One user can have many orders"
  - name: users-orders
    cardinality: "1:N"
    from:
      table: users
      columns: [user_id]
    to:
      table: orders
      columns: [user_id]

  # "One order can have many line items"
  - name: orders-order_details
    cardinality: "1:N"
    from:
      table: orders
      columns: [order_id]
    to:
      table: order_details
      columns: [order_id]
```

#### Supported Cardinalities

| Cardinality | Meaning | What is checked | Checks run |
|-------------|---------|-----------------|------------|
| `1:1` | One row in each table matches exactly one row in the other (e.g. user ↔ profile) | No duplicates on either side + every value exists in both tables | 4 |
| `1:N` | One row on the from-side can match many rows on the to-side (e.g. user → orders) | No duplicates on the from-side + every to-side value exists in the from table | 2 |
| `N:1` | Many rows on the from-side match one row on the to-side (e.g. orders → user) | No duplicates on the to-side + every from-side value exists in the to table | 2 |
| `N:N` | Many-to-many (e.g. students ↔ courses) | Every value on each side exists in the other table | 2 |

#### Relation Definition Reference

| Field         | Type       | Required | Description                                           |
|---------------|------------|----------|-------------------------------------------------------|
| `name`        | `string`   | Yes      | Human-readable name for the relation                  |
| `cardinality` | `string`   | Yes      | One of `1:1`, `1:N`, `N:1`, `N:N`                    |
| `from.table`  | `string`   | Yes      | Table name (must match a schema YAML definition)      |
| `from.columns`| `string[]` | Yes      | Column(s) on the from-side of the relation            |
| `to.table`    | `string`   | Yes      | Table name (must match a schema YAML definition)      |
| `to.columns`  | `string[]` | Yes      | Column(s) on the to-side of the relation              |

> **Note:** If either table has data loading errors, all checks for that relation are marked `SKIPPED` (since the data is incomplete). Blank (NULL) values are excluded from cross-table existence checks.
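The checks in the table above come down to two kinds of set logic: uniqueness on one side and existence in the other table. A simplified sketch for the `1:N` case with single-column keys (an illustration, not tval's actual implementation):

```python
def check_one_to_many(from_keys: list, to_keys: list) -> dict:
    """Sketch of a 1:N cardinality check over single-column keys.

    NULLs (None) are excluded, mirroring the note above. Two checks:
      1. the from-side must have no duplicate keys;
      2. every to-side key must exist on the from-side.
    """
    from_vals = [k for k in from_keys if k is not None]
    to_vals = [k for k in to_keys if k is not None]
    return {
        "from_side_unique": len(from_vals) == len(set(from_vals)),
        "to_values_exist": set(to_vals) <= set(from_vals),
    }

users = [1, 2, 3]
orders = [1, 1, 2, None]   # the NULL user_id is ignored by the existence check
print(check_one_to_many(users, orders))
# {'from_side_unique': True, 'to_values_exist': True}
```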

### 3.8 Understanding the HTML Report

The generated HTML report contains the following sections:

| Section                            | Description                                                         |
|------------------------------------|---------------------------------------------------------------------|
| **Summary**                        | Total tables, OK count, NG count                                    |
| **Load Results** (per table)       | Per-file load status; errors are displayed with messages             |
| **Logic Validation** (per table)   | Results of allowed-value checks and user-defined `checks`           |
| **Aggregation** (per table)        | Results of `aggregation_checks` (reported separately)               |
| **Statistics** (per table)         | Column profiles: count, nulls, unique, mean, std, min, max, percentiles |
| **Export** (per table)             | Parquet export status (only when `--export` is used)                |
| **Relation Cardinality Validation**| Cross-table relation check results (only when `relations_path` is configured) |

#### Status Definitions

| Status      | Icon | Meaning                                                          |
|-------------|------|------------------------------------------------------------------|
| **OK**      | ✅   | Check passed — no issues found                                  |
| **NG**      | ❌   | Check failed — data does not meet the expected rule (e.g. duplicates found, invalid values, constraint violation) |
| **ERROR**   | ❌   | Check could not run — typically caused by a bug in the SQL query or an internal error. Review the error message and fix the query |
| **SKIPPED** | ⚠️   | Check was not executed — this happens when data files failed to load. Fix the load errors first, then re-run to see these check results |

### 3.9 Parquet Export

Parquet export is triggered by the `--export` flag and follows an **all-or-nothing** rule:

- If **all tables** have `OK` status **and** all relation checks pass (or are skipped), every table is exported to Parquet.
- If **any table** has `NG` status or any relation check fails, all exports are marked `SKIPPED`.

Export output is written to `<output_path_parent>/parquet/<table_name>/`.

When `partition_by` is set in the schema's `export` section, DuckDB writes Hive-style partitioned Parquet files:

```yaml
export:
  partition_by: [region, year]
```

This produces a directory structure like:

```
parquet/orders/region=US/year=2024/data_0.parquet
parquet/orders/region=JP/year=2024/data_0.parquet
```
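These directories follow the usual Hive convention of one `column=value` segment per partition column. A small sketch of how such a path is composed (illustrative only; DuckDB builds the paths itself during export):

```python
from pathlib import PurePosixPath

def partition_path(base: str, table: str, partition_by: list, row: dict) -> str:
    """Compose a Hive-style partition directory for one row's partition values."""
    segments = [f"{col}={row[col]}" for col in partition_by]
    return str(PurePosixPath(base, table, *segments))

row = {"region": "US", "year": 2024}
print(partition_path("parquet", "orders", ["region", "year"], row))
# parquet/orders/region=US/year=2024
```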

### 3.10 Docker Usage

A development container is provided via the `Dockerfile`:

```bash
docker build -t tval-dev .
docker run -it -v "$(pwd)":/home/dev/workspace tval-dev
```

The container includes Python 3.12, uv, all project dependencies, and pre-commit hooks pre-installed.

---

## 4. Developer Guide

### 4.1 Development Setup

```bash
git clone https://github.com/b-trout/tval.git
cd tval
uv sync --extra dev
uv run pre-commit install
```

### 4.2 Code Quality

All tool configuration is centralized in `pyproject.toml`.

```bash
uv run ruff check src/ tests/     # Lint (pycodestyle, pyflakes, isort)
uv run ruff format --check src/ tests/  # Format check
uv run mypy src/                   # Type check (strict mode)
```

Pre-commit hooks run `ruff check`, `ruff format`, and `mypy` automatically on each commit.

### 4.3 Running Tests

```bash
uv run pytest tests/ -v
```

Tests use **real DuckDB instances and real data files** (no mocking):

| Test File              | Scope                                               |
|------------------------|-----------------------------------------------------|
| `test_parser.py`       | YAML schema parsing and Pydantic validation          |
| `test_builder.py`      | DDL generation and FK dependency ordering            |
| `test_loader.py`       | CSV/XLSX/Parquet file loading into DuckDB            |
| `test_checker.py`      | Validation check execution and error handling        |
| `test_profiler.py`     | Column statistics computation and error handling     |
| `test_exporter.py`     | Parquet export with partitioning                     |
| `test_reporter.py`     | HTML report generation and status aggregation        |
| `test_relation.py`     | Relation cardinality validation (1:1, 1:N, N:1, N:N) |
| `test_integration.py`  | End-to-end pipeline validation                       |

### 4.4 CI Pipeline

GitHub Actions runs on every pull request to `main`, testing against a Python version matrix:

| Step            | Command                              |
|-----------------|--------------------------------------|
| Ruff check      | `uv run ruff check src/ tests/`      |
| Ruff format     | `uv run ruff format --check src/ tests/` |
| Mypy            | `uv run mypy src/`                   |
| Pytest          | `uv run pytest tests/ -v`            |

Matrix: **Python 3.10** and **Python 3.12** on `ubuntu-latest`.

### 4.5 Architecture Overview

#### Repository Layout

```
tval/
├── src/tval/
│   ├── __init__.py          # Package marker
│   ├── cli.py               # CLI entry point (argparse)
│   ├── init.py              # tval init scaffolding
│   ├── main.py              # Pipeline orchestration
│   ├── parser.py            # YAML schema → Pydantic models
│   ├── builder.py           # DDL generation + topological sort
│   ├── loader.py            # CSV/XLSX/Parquet → DuckDB INSERT
│   ├── checker.py           # Validation check execution
│   ├── relation.py          # Inter-table relation cardinality validation
│   ├── profiler.py          # Column statistics computation
│   ├── exporter.py          # Parquet export with partitioning
│   ├── reporter.py          # HTML report generation
│   ├── logger.py            # Structured JSON logging
│   └── templates/
│       └── report.html.j2   # Jinja2 HTML report template
├── tests/
├── docs/
├── pyproject.toml           # Build config, dependencies, tool settings
├── Dockerfile               # Development container
├── .pre-commit-config.yaml  # Pre-commit hook configuration
└── .github/workflows/
    └── ci.yml               # GitHub Actions CI pipeline
```

#### Module Responsibilities

| Module       | Responsibility                                                      |
|--------------|---------------------------------------------------------------------|
| `cli.py`     | Parse CLI arguments, dispatch to `init` or `run`                    |
| `init.py`    | Scaffold project directories, config, and `.gitignore` entries      |
| `main.py`    | Orchestrate the full validation pipeline end-to-end                 |
| `parser.py`  | Define Pydantic models and load/validate YAML schema files          |
| `builder.py` | Generate CREATE TABLE SQL, resolve FK dependency order (topo sort)  |
| `loader.py`  | Load CSV/XLSX/Parquet files into DuckDB with encoding detection     |
| `checker.py` | Execute allowed-value, user-defined, and aggregation checks         |
| `relation.py`| Validate inter-table relationship cardinalities (1:1, 1:N, N:1, N:N) |
| `profiler.py`| Compute column statistics (count, nulls, unique, mean, percentiles) |
| `exporter.py`| Export tables to Parquet with optional Hive partitioning            |
| `reporter.py`| Render HTML report from Jinja2 template                             |
| `logger.py`  | Provide structured JSON logging                                     |

#### Pipeline Flow

```
cli.py
 ├── init.py                (tval init)
 └── main.py                (tval run)
      ├── parser.py
      ├── builder.py
      ├── loader.py
      ├── checker.py
      ├── relation.py
      ├── profiler.py
      ├── exporter.py
      └── reporter.py

All modules --> logger.py
```

#### Key Design Decisions

- **SQL injection prevention** - All identifiers pass through `quote_identifier()` in `builder.py`, which validates against a strict regex and wraps in double quotes.
- **Connection separation** - Data loading uses a read-write connection (`conn_rw`); checks, profiling, and export use a read-only connection (`conn_ro`).
- **Dependency ordering** - Foreign key relationships are resolved via topological sort before table creation.
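Both techniques can be sketched briefly. Neither snippet is tval's actual code; the identifier pattern is illustrative, and the dependency ordering uses the standard library's `graphlib`:

```python
import re
from graphlib import TopologicalSorter

def quote_identifier(name: str) -> str:
    """Allow-list an identifier against a strict pattern, then double-quote it.

    The pattern here is an assumption; the point is validation by allow-listing
    rather than escaping.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid identifier: {name!r}")
    return f'"{name}"'

# FK dependency ordering: create referenced tables before referencing ones.
# Mapping: table -> set of tables it references via foreign keys.
deps = {"orders": {"users"}, "order_details": {"orders"}, "users": set()}
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['users', 'orders', 'order_details']
```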

For detailed design documentation, see [docs/DESIGN.md](docs/DESIGN.md).

---

## License

MIT
