Metadata-Version: 2.4
Name: tableai
Version: 0.2.4
Summary: AI toolkit for tabular data — auto EDA, data profiling, anomaly detection, and smart transformations on DataFrames.
Project-URL: Homepage, https://www.nrl.ai
Project-URL: Repository, https://github.com/vietanhdev/tableai
Project-URL: Issues, https://github.com/vietanhdev/tableai/issues
Author-email: Viet-Anh Nguyen <vietanh.dev@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: anomaly-detection,data-cleaning,data-profiling,eda,pandas,tabular-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.8
Requires-Dist: click>=8.0
Requires-Dist: numpy
Requires-Dist: pandas
Provides-Extra: all
Requires-Dist: anyllm; extra == 'all'
Requires-Dist: matplotlib>=3.5; extra == 'all'
Requires-Dist: pytest-cov>=4.0; extra == 'all'
Requires-Dist: pytest>=7.0; extra == 'all'
Requires-Dist: scikit-learn>=1.0; extra == 'all'
Requires-Dist: seaborn>=0.12; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: llm
Requires-Dist: anyllm; extra == 'llm'
Provides-Extra: ml
Requires-Dist: scikit-learn>=1.0; extra == 'ml'
Provides-Extra: viz
Requires-Dist: matplotlib>=3.5; extra == 'viz'
Requires-Dist: seaborn>=0.12; extra == 'viz'
Description-Content-Type: text/markdown

<h1 align="center">tableai</h1>
<p align="center"><em>Profile, clean, and query tabular data with one-liners — plus natural-language DataFrame analysis.</em></p>

<p align="center">
<img src="https://img.shields.io/pypi/v/tableai.svg" alt="PyPI">
<img src="https://img.shields.io/pypi/pyversions/tableai.svg" alt="Python">
<img src="https://img.shields.io/pypi/l/tableai.svg" alt="License">
</p>

**tableai** is a toolkit for making sense of DataFrames fast. Profile any DataFrame and get column types, null counts, descriptive statistics, correlations, and a data-quality score. Clean it with a single call that imputes missing values, drops duplicates, and clips outliers. Detect anomalies with IQR or Isolation Forest. Get rule-based natural-language insights — or ask questions in plain English and have `anyllm` generate the pandas code for you.

Built by [Viet-Anh Nguyen](https://github.com/vietanhdev) at [NRL.ai](https://www.nrl.ai).

## Why tableai?

- **One-liner API** — `tableai.profile(df)` gives you everything in one call
- **Plugin architecture** — Register custom profilers, cleaners, and anomaly detectors
- **Local-first** — All core features work without any cloud or LLM call
- **Minimal core deps** — `pandas` and `numpy`; sklearn and anyllm are optional
- **Production-ready** — Structured dataclass results, JSON export, reproducible

## Installation

```bash
pip install tableai
```

For optional features:

```bash
pip install tableai[ml]        # Isolation Forest + KMeans clustering (scikit-learn)
pip install tableai[llm]       # NL querying via anyllm
pip install tableai[viz]       # plots via matplotlib + seaborn
pip install tableai[all]       # everything
```

**Python 3.8+ supported** (tested on 3.8, 3.9, 3.10, 3.11, 3.12, 3.13)

## Quick Start

```python
import tableai
import pandas as pd

df = pd.read_csv("sales.csv")

# 1. Profile the DataFrame (dtypes, nulls, stats, correlations, quality score)
report = tableai.profile(df)
print(report.quality_score)              # 0.0 - 1.0
print(report.nulls)                      # per-column null counts
print(report.correlations.head())        # top correlated pairs

# 2. Clean the DataFrame (impute, dedupe, clip outliers)
clean = tableai.clean(df, impute=True, dedupe=True, clip_outliers=True)

# 3. Detect anomalies (IQR by default, Isolation Forest if sklearn installed)
anomalies = tableai.anomalies(df, method="iqr")
print(f"{len(anomalies)} anomalous rows")

# 4. Rule-based insights
for insight in tableai.insights(df):
    print("-", insight)

# 5. Natural-language querying (requires tableai[llm] + anyllm)
result = tableai.ask(df, "what is the average revenue by region?")
print(result)
```

## Models & Methods

### Profiling

- **Dtype detection** — numeric / categorical / datetime / text / boolean / ID
- **Null analysis** — per-column null counts, percentages, and null patterns
- **Descriptive statistics** — mean, std, min, 25/50/75 percentiles, max, skew, kurtosis
- **Cardinality** — unique counts and top-K value frequencies
- **Correlation matrix** — Pearson for numerics, Cramér's V for categoricals
- **Duplicate detection** — exact and near-duplicate row counts
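The categorical association measure listed above can be sketched in plain pandas/numpy. This is an illustrative implementation of Cramér's V, not tableai's internal code:

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series: 0 = independent, 1 = perfectly associated."""
    table = pd.crosstab(x, y).to_numpy()
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / n               # counts expected under independence
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1               # normalizer from the smaller dimension
    return float(np.sqrt(chi2 / (n * k)))

x = pd.Series(["a", "a", "b", "b"])
print(cramers_v(x, x))  # a series against itself -> 1.0
```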

### Cleaning

Configurable pipeline applied in order:

1. **Drop constant columns** — zero variance
2. **Impute** — `median` for numerics, `mode` for categoricals (configurable)
3. **Deduplicate** — drop exact-duplicate rows
4. **Clip outliers** — IQR method (`[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`)
5. **Type coercion** — auto-convert date-like strings to datetime
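Step 4 (outlier clipping) can be sketched in plain pandas; `clip_iqr` here is a hypothetical helper for illustration, not part of the tableai API:

```python
import pandas as pd

def clip_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip a numeric series to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

s = pd.Series([1, 2, 3, 4, 100])
print(clip_iqr(s).tolist())  # Q1=2, Q3=4, IQR=2 -> bounds [-1, 7] -> [1, 2, 3, 4, 7]
```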

### Anomaly detection

| Method | Algorithm | Notes |
|---|---|---|
| `iqr` (default) | 1.5 x IQR per numeric column | Zero deps |
| `zscore` | `\|z\| > 3` per numeric column | Zero deps |
| `isolation_forest` | sklearn `IsolationForest` | Needs `tableai[ml]` |
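The `zscore` method is simple enough to sketch without tableai (the function name below is illustrative, not the library's API):

```python
import pandas as pd

def zscore_anomalies(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Return rows where any numeric column has |z| > threshold."""
    num = df.select_dtypes(include="number")
    z = (num - num.mean()) / num.std(ddof=0)   # population z-scores per column
    return df[(z.abs() > threshold).any(axis=1)]

df = pd.DataFrame({"x": [1, 2] * 5 + [100]})
print(zscore_anomalies(df))  # flags the row containing 100
```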

### Data quality score

Weighted average (0.0 - 1.0) of four sub-scores:

- **Completeness** (35%) — `1 - null_ratio`
- **Uniqueness** (20%) — ratio of distinct rows
- **Consistency** (20%) — fraction of columns with a dominant dtype
- **Validity** (25%) — fraction of values inside expected ranges / formats
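Two of the sub-scores above are easy to sketch in plain pandas. These are illustrative definitions, not tableai's internal code:

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """1 minus the overall null ratio across all cells."""
    return 1.0 - float(df.isna().to_numpy().mean())

def uniqueness(df: pd.DataFrame) -> float:
    """Ratio of distinct rows to total rows."""
    return len(df.drop_duplicates()) / len(df)

df = pd.DataFrame({"a": [1, None, 3, 3], "b": [1, 2, 3, 3]})
print(completeness(df))  # 1 null out of 8 cells -> 0.875
print(uniqueness(df))    # 3 distinct of 4 rows  -> 0.75
```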

### Insights (rule-based NL)

Pattern-driven natural-language observations, for example:

- `"Column 'age' has 23.4% missing values"`
- `"'price' and 'quantity' are strongly positively correlated (r=0.87)"`
- `"Column 'id' appears to be a unique identifier"`
- `"12 rows are exact duplicates"`

### Natural-language querying (optional)

`tableai.ask(df, "…")` (also exposed as `tableai.query`) uses [anyllm](https://pypi.org/project/anyllm) to generate pandas code for your question, executes it in a sandboxed namespace, and returns the result. It works with any local or cloud LLM that anyllm supports, and falls back to keyword matching when no LLM is available.
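The "sandboxed namespace" idea can be sketched as follows. This is a heavily simplified illustration of the general technique (executing generated code with restricted builtins), not tableai's actual sandbox:

```python
import pandas as pd

def run_generated(code: str, df: pd.DataFrame):
    """Execute generated pandas code with no builtins available,
    exposing only `df` and `pd`, and return its `result` variable."""
    namespace = {"df": df, "pd": pd}
    exec(code, {"__builtins__": {}}, namespace)
    return namespace.get("result")

df = pd.DataFrame({"region": ["N", "S", "N"], "revenue": [10, 20, 30]})
print(run_generated("result = df.groupby('region')['revenue'].mean()", df))
```

Note that stripping `__builtins__` blocks casual misuse but is not a security boundary; a production sandbox needs stronger isolation.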

## API Reference

| Function | Purpose |
|---|---|
| `tableai.profile(df)` | Returns `ProfileReport` dataclass |
| `tableai.clean(df, **opts)` | Returns a cleaned DataFrame |
| `tableai.anomalies(df, method="iqr")` | Returns rows flagged as anomalous |
| `tableai.quality_score(df)` | Returns float 0.0 - 1.0 |
| `tableai.insights(df)` | Returns `list[str]` of NL insights |
| `tableai.ask(df, question, model=None)` | NL query via LLM |
| `tableai.compare(df1, df2)` | Diff two DataFrames (schema + data) |

## CLI Usage

```bash
tableai profile data.csv --out report.json
tableai clean data.csv --out clean.csv
tableai anomalies data.csv --method isolation_forest
tableai ask data.csv "average sales by region"
tableai quality data.csv
```

## Examples

### Full profiling report to JSON

```python
import tableai, pandas as pd

df = pd.read_csv("customers.csv")
report = tableai.profile(df)
report.to_json("customers_report.json")
print(f"Quality: {report.quality_score:.2f}")
```

### Custom cleaning pipeline

```python
import tableai

clean = tableai.clean(
    df,
    impute_numeric="median",
    impute_categorical="mode",
    dedupe=True,
    clip_outliers=True,
    drop_constant=True,
)
```

### Ask questions in English (with Ollama)

```python
import tableai

# Uses anyllm; defaults to Ollama if running locally
answer = tableai.ask(df, "which customer spent the most last quarter?",
                     model="llama3.1:8b")
print(answer)
```

## License

MIT (c) Viet-Anh Nguyen
