Metadata-Version: 2.4
Name: truthound
Version: 1.2.6
Summary: Zero-Configuration Data Quality Framework Powered by Polars
Project-URL: Homepage, https://github.com/seadonggyun4/Truthound
Project-URL: Repository, https://github.com/seadonggyun4/Truthound
Project-URL: Issues, https://github.com/seadonggyun4/Truthound/issues
Author-email: seadonggyun4 <seadonggyun4@gmail.com>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: data-masking,data-quality,data-validation,pii-detection,polars
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: polars>=1.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.12.0
Provides-Extra: all
Requires-Dist: aiokafka>=0.9.0; extra == 'all'
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'all'
Requires-Dist: boto3>=1.26.0; extra == 'all'
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'all'
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'all'
Requires-Dist: jinja2>=3.0.0; extra == 'all'
Requires-Dist: motor>=3.0.0; extra == 'all'
Requires-Dist: pandas>=2.0.0; extra == 'all'
Requires-Dist: reflex>=0.4.0; extra == 'all'
Requires-Dist: scikit-learn>=1.3.0; extra == 'all'
Requires-Dist: scipy>=1.10.0; extra == 'all'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'all'
Requires-Dist: weasyprint>=60.0; extra == 'all'
Requires-Dist: xxhash>=3.4.0; extra == 'all'
Provides-Extra: anomaly
Requires-Dist: scikit-learn>=1.3.0; extra == 'anomaly'
Requires-Dist: scipy>=1.10.0; extra == 'anomaly'
Provides-Extra: async-datasources
Requires-Dist: aiokafka>=0.9.0; extra == 'async-datasources'
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'async-datasources'
Requires-Dist: motor>=3.0.0; extra == 'async-datasources'
Provides-Extra: azure
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'azure'
Provides-Extra: dashboard
Requires-Dist: reflex>=0.4.0; extra == 'dashboard'
Provides-Extra: database
Requires-Dist: sqlalchemy>=2.0.0; extra == 'database'
Provides-Extra: dev
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: pandas>=2.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Requires-Dist: scikit-learn>=1.3.0; extra == 'dev'
Requires-Dist: scipy>=1.10.0; extra == 'dev'
Provides-Extra: drift
Requires-Dist: scipy>=1.10.0; extra == 'drift'
Provides-Extra: elasticsearch
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'elasticsearch'
Provides-Extra: gcs
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'gcs'
Provides-Extra: kafka
Requires-Dist: aiokafka>=0.9.0; extra == 'kafka'
Provides-Extra: mongodb
Requires-Dist: motor>=3.0.0; extra == 'mongodb'
Provides-Extra: nosql
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'nosql'
Requires-Dist: motor>=3.0.0; extra == 'nosql'
Provides-Extra: pdf
Requires-Dist: weasyprint>=60.0; extra == 'pdf'
Provides-Extra: perf
Requires-Dist: xxhash>=3.4.0; extra == 'perf'
Provides-Extra: reports
Requires-Dist: jinja2>=3.0.0; extra == 'reports'
Provides-Extra: s3
Requires-Dist: boto3>=1.26.0; extra == 's3'
Provides-Extra: stores
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'stores'
Requires-Dist: boto3>=1.26.0; extra == 'stores'
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'stores'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'stores'
Provides-Extra: streaming
Requires-Dist: aiokafka>=0.9.0; extra == 'streaming'
Description-Content-Type: text/markdown

<div align="center">
  <img width="500px" alt="Truthound Banner" src="https://raw.githubusercontent.com/seadonggyun4/Truthound/main/docs/assets/truthound_banner.png" />
</div>

<h1 align="center">Truthound</h1>

<p align="center">
  <strong>Zero-Configuration Data Quality Framework Powered by Polars</strong>
</p>

<p align="center">
  <em>Sniffs out bad data</em>
</p>

> **Alpha Release**: This framework is currently in alpha stage. APIs may change without notice. We are continuously improving core features and expanding ecosystem integrations.

---

## Abstract

<img width="300" height="300" alt="Truthound_icon" src="https://github.com/user-attachments/assets/90d9e806-8895-45ec-97dc-f8300da4d997" />

Truthound is a data quality validation framework built on Polars, a Rust-based DataFrame library. The framework provides zero-configuration validation through automatic schema inference and supports a wide range of validation scenarios from basic schema checks to statistical drift detection.

[![PyPI version](https://img.shields.io/pypi/v/truthound.svg)](https://pypi.org/project/truthound/)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-orange.svg)](https://opensource.org/licenses/Apache-2.0)
[![Powered by Polars](https://img.shields.io/badge/Powered%20by-Polars-2563EB?logo=polars&logoColor=white)](https://pola.rs/)
[![Awesome](https://awesome.re/badge.svg)](https://github.com/ddotta/awesome-polars)
[![Downloads](https://img.shields.io/pepy/dt/truthound?color=brightgreen)](https://pepy.tech/project/truthound)

**Documentation**: [https://truthound.netlify.app](https://truthound.netlify.app/)

**Related Projects** (Alpha)

> **Note**  
> These projects are under active development and depend on a Truthound API that is not yet finalized.  
> Because the API contract is subject to change, production use or direct integration into real-world projects is discouraged at this stage.

| Project | Description | Status |
|---------|-------------|--------|
| [truthound-orchestration](https://github.com/seadonggyun4/truthound-orchestration) | Workflow integration for Airflow, Dagster, Prefect, and dbt | Alpha |
| [truthound-dashboard](https://github.com/seadonggyun4/truthound-dashboard) | Web-based data quality monitoring dashboard | Alpha |

---

## Implementation Status

### Verified Metrics

| Metric | Value | Notes |
|--------|-------|-------|
| Test Cases | 8,259 | Collected via pytest |
| Validators | 264 | Validator classes available |
| Validator Categories | 28 | Distinct subdirectories |

### Core Features

| Feature | Description |
|---------|-------------|
| **Zero Configuration** | Automatic schema inference with fingerprint-based caching (xxhash) |
| **264 Validators** | 28 categories including schema, completeness, uniqueness, distribution, drift, anomaly |
| **Polars LazyFrame** | Native Polars operations, expression-based batch execution, single collect() optimization |
| **DAG Parallel Execution** | Dependency-aware orchestration with 3 execution strategies (Sequential, Parallel, Adaptive) |
| **Custom Validator SDK** | `@custom_validator` decorator, `ValidatorBuilder` fluent API, testing utilities, 7 templates |
| **i18n Error Messages** | 7 languages (EN, KO, JA, ZH, DE, FR, ES) with message catalogs |
| **Privacy Compliance** | GDPR, CCPA, LGPD, PIPEDA, APPI support with PII detection and masking |
| **ReDoS Protection** | Regex safety analysis, ML-based prediction (sklearn), CVE database, RE2 engine support |
| **Distributed Timeout** | Deadline propagation, cascading timeout, graceful degradation |
| **Enterprise Sampling** | Block, multi-stage, column-aware, progressive sampling for 100M+ rows |
| **Performance Profiling** | Validator-level timing, memory, throughput metrics with Prometheus export |
| **File Format Support** | CSV, JSON, Parquet, NDJSON, JSONL |
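
The zero-configuration flow caches inferred schemas keyed by a content fingerprint (xxhash, per the table above). A minimal sketch of the idea, using `hashlib.sha256` as a dependency-free stand-in for xxhash; the cache structure and function names here are illustrative, not Truthound's internals:

```python
import hashlib
import json

# Illustrative fingerprint-keyed schema cache. Truthound uses xxhash for
# speed; sha256 stands in here to keep the sketch dependency-free.
_schema_cache: dict[str, dict[str, str]] = {}

def fingerprint(rows: list[dict]) -> str:
    """Hash a canonical serialization of the data sample."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def infer_schema(rows: list[dict]) -> dict[str, str]:
    """Map column name -> inferred type name, cached by fingerprint
    so repeated validation of unchanged data skips re-inference."""
    key = fingerprint(rows)
    if key not in _schema_cache:
        _schema_cache[key] = {col: type(val).__name__ for col, val in rows[0].items()}
    return _schema_cache[key]
```

A second call on byte-identical data returns the cached schema object instead of re-inferring, which is what makes repeated `check()` runs on unchanged files cheap.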

---

## Quick Start

### Installation

```bash
pip install truthound

# With optional features (quotes keep shells like zsh from expanding brackets)
pip install "truthound[all]"
```

### Python API

```python
import truthound as th

# Basic validation
report = th.check("data.csv")

# Parallel validation (DAG-based execution)
report = th.check("data.csv", parallel=True, max_workers=4)

# Schema-based validation
schema = th.learn("baseline.csv")
report = th.check("new_data.csv", schema=schema)

# Drift detection
drift = th.compare("train.csv", "production.csv")

# PII scanning and masking
pii_report = th.scan(df)
masked_df = th.mask(df, strategy="hash")

# Statistical profiling
profile = th.profile("data.csv")
```

### CLI

```bash
truthound check data.csv                    # Validate
truthound check data.csv --strict           # CI/CD mode
truthound compare baseline.csv current.csv  # Drift detection
truthound scan data.csv                     # PII scanning
truthound auto-profile data.csv -o profile.json  # Profiling

# Code scaffolding
truthound new validator my_validator        # Create validator
truthound new reporter json_export          # Create reporter
truthound new plugin my_plugin              # Create plugin
```

### Data Sources

| Interface | Supported Sources |
|-----------|-------------------|
| **CLI** | File formats only: CSV, JSON, Parquet, NDJSON, JSONL |
| **Python API** | All sources: Files, DataFrames, SQL databases, Spark, Cloud DW |

```python
# Python API - SQL databases, Spark, Cloud DW
from truthound.datasources import BigQueryDataSource

source = BigQueryDataSource(
    table="users",
    project="my-project",
    dataset="analytics",
)
report = th.check(source=source)
```

> **Note**: Spark, BigQuery, Snowflake, Redshift, Databricks, and other database sources require the Python API with the `source=` parameter. The CLI only supports file-based inputs.

---

## Validator Categories

The following validator categories are implemented:

| Category | Description |
|----------|-------------|
| schema | Column structure, types, relationships |
| completeness | Null detection, required fields |
| uniqueness | Duplicates, primary keys, composite keys |
| distribution | Range, outliers, statistical tests |
| string | Regex, email, URL, JSON validation |
| datetime | Format, range, sequence validation |
| aggregate | Mean, median, sum constraints |
| cross_table | Multi-table relationships |
| multi_column | Column comparisons, conditional logic |
| query | SQL/Polars expression validation |
| table | Row count, freshness, metadata |
| geospatial | Coordinates, bounding boxes |
| drift | KS, PSI, Chi-square, Wasserstein |
| anomaly | IQR, Z-score, Isolation Forest, LOF |
| business_rule | Luhn, IBAN, VAT, ISBN validation |
| localization | Korean, Japanese, Chinese identifiers |
| ml_feature | Leakage detection, correlation |
| profiling | Cardinality, entropy, frequency |
| referential | Foreign keys, orphan records |
| timeseries | Gaps, seasonality, trend detection |
| privacy | PII detection and compliance rules |
| security | SQL injection prevention, ReDoS protection |
| sdk | Custom validator development tools |
| timeout | Distributed timeout management |
| i18n | Internationalized error messages |
| streaming | Streaming data validation |
| memory | Memory-aware processing |
| optimization | Validator execution optimization |

---

## Data Sources

| Category | Sources |
|----------|---------|
| DataFrame | Polars, Pandas, PySpark |
| Core SQL | PostgreSQL, MySQL, SQLite |
| Cloud DW | BigQuery, Snowflake, Redshift, Databricks |
| Enterprise | Oracle, SQL Server |
| File | CSV, Parquet, JSON, NDJSON |

---

## Streaming Support

### Protocol-based Adapters

Kafka and Kinesis adapters implement `IStreamSource`/`IStreamSink` protocols with async operations:

| Adapter | Library | Features |
|---------|---------|----------|
| KafkaAdapter | aiokafka | Consumer groups, partition management, SASL/SSL authentication |
| KinesisAdapter | aiobotocore | Multi-shard consumption, enhanced fan-out, checkpointing |

### StreamingSource Pattern

File-based and cloud streaming sources use `StreamingSource` base class:

| Source | Description |
|--------|-------------|
| ParquetSource | Parquet file streaming |
| CSVSource | CSV file streaming |
| JSONLSource | JSON Lines streaming |
| ArrowIPCSource | Arrow IPC format |
| ArrowFlightSource | Arrow Flight protocol |
| PubSubSource | Google Cloud Pub/Sub (requires google-cloud-pubsub) |

Note: Kafka and Kinesis use protocol-based adapters with `aiokafka` and `aiobotocore`. Pub/Sub uses the `StreamingSource` pattern with synchronous operations.

---

## Enterprise Features

### Auto-Profiling

| Feature | Description |
|---------|-------------|
| **Statistical Profiling** | Column-level statistics, distribution analysis, missing value detection |
| **Pattern Detection** | Email, phone, credit card, custom regex patterns |
| **Rule Generation** | Auto-generate validation rules from profile with `Suite.execute()` |
| **Distributed Processing** | Spark, Dask, Ray, Local backends |
| **Incremental Scheduling** | Cron, interval, data change triggers |
| **Schema Evolution** | Change detection, compatibility analysis (FULL, BACKWARD, FORWARD, NONE) |
| **Unified Resilience** | Circuit breaker, retry, bulkhead, rate limiter with fluent builder |

### Data Docs (HTML Reports)

| Feature | Description |
|---------|-------------|
| **6 Built-in Themes** | Default, Light, Dark, Minimal, Modern, Professional |
| **4 Chart Libraries** | ApexCharts, Chart.js, Plotly.js, SVG (zero-dependency) |
| **15 Languages** | EN, KO, JA, ZH, DE, FR, ES, PT, IT, RU, AR, TH, VI, ID, TR |
| **White-labeling** | Enterprise themes with custom branding, logo, colors |
| **PDF Export** | Chunked rendering, parallel processing for large reports |
| **Report Versioning** | 4 strategies with diff and rollback support |

### Plugin Architecture

| Feature | Description |
|---------|-------------|
| **Security Sandbox** | NoOp, Process, Container isolation engines |
| **Plugin Signing** | HMAC, RSA, Ed25519 algorithms with trust store |
| **6 Security Presets** | DEVELOPMENT, TESTING, STANDARD, ENTERPRISE, STRICT, AIRGAPPED |
| **Version Constraints** | Semver-based (^, ~, >=, <, ranges) |
| **Dependency Management** | Graph-based resolution, cycle detection, topological sort |
| **Hot Reload** | File watching, graceful reload with rollback |
| **Documentation** | AST-based extraction, Markdown/HTML/JSON renderers |
| **CLI Extension** | Entry point based plugin system (`truthound.cli` group) |

### Checkpoint & CI/CD

| Feature | Description |
|---------|-------------|
| **Saga Pattern** | 8 compensation strategies (Backward, Forward, Semantic, Pivot, etc.) |
| **12 CI Platforms** | GitHub Actions, GitLab CI, Jenkins, CircleCI, Azure DevOps, etc. |
| **9 Notification Providers** | Slack, Email, PagerDuty, GitHub, Webhook, Teams, OpsGenie, Discord, Telegram |
| **GitHub OIDC** | 30+ claims parsing, AWS/GCP/Azure/Vault credential exchange |
| **4 Distributed Backends** | Local, Celery, Ray, Kubernetes |
| **Circuit Breaker** | 3 states (CLOSED, OPEN, HALF_OPEN), failure detection strategies |
| **Idempotency** | Request fingerprinting, duplicate detection, TTL expiration |

### Advanced Notifications

| Feature | Description |
|---------|-------------|
| **Rule-based Routing** | Python expression + Jinja2 engine, 11 built-in rules, combinators (AllOf, AnyOf, NotRule) |
| **Deduplication** | InMemory/Redis Streams backends, 4 window strategies (Sliding, Tumbling, Session, Adaptive) |
| **Rate Limiting** | Token Bucket, Fixed/Sliding Window, 5 throttler types with builder pattern |
| **Escalation Policies** | Multi-level escalation, state machine, 3 storage backends (InMemory, Redis, SQLite) |
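
Token Bucket is one of the listed throttler types; the core mechanic is a bucket refilled at a fixed rate and drained per event, which allows short bursts up to capacity while enforcing the average rate. A minimal sketch (illustrative; Truthound exposes this behind a builder pattern rather than a bare class):

```python
import time

class TokenBucket:
    """Minimal token-bucket throttler sketch."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start full
        self.clock = clock            # injectable for testing
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        # Refill lazily based on elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1, capacity=2`, two notifications fire back-to-back (the burst), the third is throttled, and the bucket earns another token after a second of idle time.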

### ML & Lineage

| Feature | Description |
|---------|-------------|
| **6 Anomaly Algorithms** | Isolation Forest, LOF, One-Class SVM, DBSCAN, Statistical, Autoencoders |
| **4 Drift Algorithms** | KS Test, Chi-Square, PSI, Jensen-Shannon Divergence |
| **ML Model Monitoring** | Performance metrics, quality metrics, drift detection, alerting |
| **Lineage Graph** | DAG-based dependency tracking, column-level lineage, impact analysis |
| **4 Visualization Renderers** | D3.js, Cytoscape.js, Graphviz, Mermaid |
| **OpenLineage Integration** | Industry-standard lineage events, run lifecycle management |
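
PSI, one of the drift algorithms listed above, bins a baseline and a comparison sample and sums `(p - q) * ln(p / q)` over the bins; a common rule of thumb reads PSI < 0.1 as stable and > 0.25 as significant drift. A sketch under equal-width binning (Truthound's binning strategy may differ):

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index over equal-width bins derived from
    the expected (baseline) sample. Higher means more drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against constant columns

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # epsilon floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical samples score 0; a shifted distribution pushes mass into bins that were near-empty in the baseline, which is what drives the score up.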

### Storage Features

| Feature | Description |
|---------|-------------|
| **Cloud Storage** | S3, GCS, Azure Blob backends with connection pooling |
| **Versioning** | 4 strategies (Incremental, Semantic, Timestamp, GitLike) |
| **Retention** | 6 policies (Time, Count, Size, Status, Tag, Composite) |
| **Tiering** | Hot/Warm/Cold/Archive with 5 migration policies |
| **Caching** | LRU, LFU, TTL backends with 4 cache modes |
| **Replication** | Sync/Async/Semi-Sync cross-region with conflict resolution |
| **Backpressure** | 6 strategies with circuit breaker and monitoring |
| **Batch Optimization** | Memory-aware buffer, async batch writer |

### Enterprise Infrastructure

| Component | Features |
|-----------|----------|
| **Logging** | JSON format, correlation IDs, ELK/Loki/Fluentd integration, async logging |
| **Metrics** | Prometheus counters, gauges, histograms, HTTP endpoint, push gateway |
| **Config** | Environment profiles (dev/staging/prod), Vault/AWS Secrets integration, hot reload |
| **Audit** | Full operation trail, Elasticsearch/S3/Kafka storage, compliance reporting (SOC2/GDPR/HIPAA) |
| **Encryption** | AES-256-GCM, ChaCha20-Poly1305, field-level encryption, Cloud KMS (AWS/GCP/Azure/Vault) |

---

## Installation Options

```bash
# Core installation
pip install truthound

# Feature-specific extras
pip install "truthound[drift]"      # Drift detection (scipy)
pip install "truthound[anomaly]"    # Anomaly detection (scikit-learn, scipy)
pip install "truthound[pdf]"        # PDF export (weasyprint)
pip install "truthound[reports]"    # HTML reports (jinja2)
pip install "truthound[perf]"       # Faster fingerprint hashing (xxhash)
pip install "truthound[dashboard]"  # Web dashboard (reflex)

# Data source and storage extras
pip install "truthound[database]"       # SQL databases via SQLAlchemy
pip install "truthound[s3]"             # Amazon S3 (boto3)
pip install "truthound[gcs]"            # Google Cloud Storage
pip install "truthound[azure]"          # Azure Blob Storage
pip install "truthound[kafka]"          # Kafka streaming (aiokafka)
pip install "truthound[mongodb]"        # MongoDB (motor)
pip install "truthound[elasticsearch]"  # Elasticsearch
pip install "truthound[stores]"         # All cloud storage + SQL backends

# Full installation
pip install "truthound[all]"
```

---

## Requirements

- Python 3.11+
- Polars 1.x
- PyYAML
- Rich
- Typer

---

## Development

```bash
git clone https://github.com/seadonggyun4/Truthound.git
cd Truthound
pip install hatch
hatch env create
hatch run test
```

---

## References

1. Polars Documentation. https://pola.rs/
2. Kolmogorov, A. N. (1933). "Sulla determinazione empirica di una legge di distribuzione"
3. Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest"
4. Breunig, M. M., et al. (2000). "LOF: Identifying Density-Based Local Outliers"

---

## License

Apache License 2.0

---

## Acknowledgments

Built with Polars, Rich, Typer, scikit-learn, and SciPy.
