Metadata-Version: 2.4
Name: summarystatpkg
Version: 0.1.0
Summary: A smart data-profiling library for pandas DataFrames — basic and advanced column metadata, heterogeneity detection, null-pattern analysis, and categorical correlation discovery.
Author: Subhajit Bhattacharyya
License: MIT
Project-URL: Homepage, https://github.com/yourusername/summarystatpkg
Project-URL: Issues, https://github.com/yourusername/summarystatpkg/issues
Keywords: data profiling,summary statistics,pandas,EDA,metadata,data quality
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: pandas>=1.5
Requires-Dist: numpy>=1.23
Requires-Dist: scikit-learn>=1.2

# summarystatpkg

A smart **data-profiling library** for pandas DataFrames. It goes well beyond `.describe()`: it detects datetime columns, analyses null patterns, flags heterogeneous columns, clusters mixed-format values, and discovers categorical correlations automatically.

---

## Installation

```bash
pip install summarystatpkg
```

---

## Quick Start

```python
import pandas as pd
from summarystatpkg import csv_metadata, advanced_csv_metadata

df = pd.read_csv("your_file.csv")

# ── Basic profiling ──────────────────────────────────────────
basic = csv_metadata(df)
# Returns a list of dicts, one per column

# ── Advanced profiling ───────────────────────────────────────
advanced = advanced_csv_metadata(df)
# Returns:
#   advanced["columnMetadata"]      → per-column analysis
#   advanced["possibleCorrelation"] → detected column relationships
```

---

## What Each Function Does

### `csv_metadata(df)`
A basic column scanner. For every column it returns a dict with these fields:

| Field | Description |
|---|---|
| `name` | Column name |
| `data_type` | pandas dtype |
| `notnullpercentage` | % of non-null rows |
| `uniquepercentage` | % unique values |
| `top_5_value_counts` | Most frequent values |
| `mean_value_count` | Mean of the values (numeric columns) or mean value frequency (object columns) |
| `std_dev_value_count` | Standard deviation of the same quantity |
| `max_value` / `min_value` | Range (numeric columns only) |
| `isdatetime` | Whether the column looks like a datetime |
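
To make a few of these fields concrete, here is a minimal sketch of how they could be computed with plain pandas. The helper name `basic_column_profile` is hypothetical, and the exact definitions (for example, whether `uniquepercentage` is taken over all rows or only non-null rows) are assumptions, so the library's own numbers may differ.

```python
import pandas as pd


def basic_column_profile(series: pd.Series) -> dict:
    """Hypothetical helper: approximate a few csv_metadata fields for one column."""
    total = len(series)
    non_null = series.notna().sum()
    profile = {
        "name": series.name,
        "data_type": str(series.dtype),
        # Share of rows that hold a value, as a percentage
        "notnullpercentage": 100.0 * non_null / total if total else 0.0,
        # Assumption: distinct non-null values relative to all rows
        "uniquepercentage": 100.0 * series.nunique() / total if total else 0.0,
        # value_counts() sorts by frequency, so head(5) is the top five
        "top_5_value_counts": series.value_counts().head(5).to_dict(),
    }
    # Range is only meaningful for numeric columns
    if pd.api.types.is_numeric_dtype(series):
        profile["max_value"] = series.max()
        profile["min_value"] = series.min()
    return profile


profile = basic_column_profile(pd.Series(["US", "UK", "US", None], name="country"))
```

For this four-row column, three rows are non-null and two distinct values appear, so the sketch reports 75% non-null and 50% unique.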

---

### `advanced_csv_metadata(df)`
A smart profiler. To keep runtime down it analyses at most 2 000 rows. It runs the following analyses, the first three per column and the last across column pairs:

**Null pattern analysis** — are non-null values clustered or periodic?
```python
{
  "has_clusters": True,
  "periodic_pattern": False,
  "common_gap": None
}
```
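
One way to arrive at a result of this shape is to look at the spacing between missing rows. The sketch below is an illustration only: the helper name `null_gap_summary` is hypothetical, and the definitions of "clustered" (adjacent missing rows) and "periodic" (all gaps identical) are assumptions rather than the library's documented rules.

```python
from collections import Counter

import pandas as pd


def null_gap_summary(series: pd.Series) -> dict:
    """Hypothetical sketch: summarise the spacing between null rows."""
    # 0-based row positions where the value is missing
    null_positions = [i for i, is_null in enumerate(series.isna()) if is_null]
    # Distance from each null row to the next null row
    gaps = [b - a for a, b in zip(null_positions, null_positions[1:])]
    if not gaps:
        return {"has_clusters": False, "periodic_pattern": False, "common_gap": None}
    gap, count = Counter(gaps).most_common(1)[0]
    # Assumption: "clustered" means at least one pair of adjacent null rows
    has_clusters = 1 in gaps
    # Assumption: "periodic" means two or more gaps, all equal and larger than 1
    periodic = len(gaps) >= 2 and count == len(gaps) and gap > 1
    return {
        "has_clusters": has_clusters,
        "periodic_pattern": periodic,
        "common_gap": gap if periodic else None,
    }


periodic_result = null_gap_summary(pd.Series([1, None, 3, 4, None, 6, 7, None]))
clustered_result = null_gap_summary(pd.Series([None, None, 3, 4]))
```

In the first series the nulls sit at rows 1, 4 and 7, so every gap is 3 and the pattern counts as periodic. In the second, the two nulls are adjacent, which this sketch treats as a cluster.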

**Heterogeneity detection** — entropy + frequency variance score (0–1).
Scores above 0.5 trigger structural clustering.
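
As a rough illustration of the entropy half of such a score, the sketch below computes a normalized Shannon entropy over value frequencies. It is not the library's formula: how entropy is combined with frequency variance is not described here, and the function name `normalized_entropy` is hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(values) -> float:
    """Shannon entropy of the value frequencies, scaled to the range 0 to 1."""
    # Count occurrences of each value, ignoring None entries
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0  # empty or constant column: no diversity
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Divide by the maximum entropy achievable with this many distinct values
    return entropy / math.log2(len(counts))


constant = normalized_entropy(["a", "a", "a"])
uniform = normalized_entropy(["a", "b", "c", "d"])
skewed = normalized_entropy(["a", "a", "a", "b"])
```

A constant column scores 0, a column where every value is equally frequent scores 1, and skewed distributions fall in between.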

**Structural clustering** — for heterogeneous columns, values are profiled
on 12 character-level features and clustered with KMeans. Returns stratified
sample values per cluster:
```python
{
  "cluster_0": {
    "sample_values": ["john@example.com", "alice@corp.io"],
    "dominant_features": ["len_11_20", "has_at"]
  },
  "cluster_1": {
    "sample_values": ["N/A", "unknown"],
    "dominant_features": ["len_0_10"]
  }
}
```
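
The general technique can be sketched as follows, assuming a much smaller feature set than the 12 features the library uses. The `char_features` helper, the five features it extracts, the choice of two clusters and the fixed random seed are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def char_features(value: str) -> list:
    """Hypothetical character-level features for one string value."""
    n = len(value)
    return [
        n,                                                     # raw length
        int("@" in value),                                     # contains an at-sign
        int("." in value),                                     # contains a dot
        sum(ch.isdigit() for ch in value) / n if n else 0.0,   # share of digits
        sum(ch.isalpha() for ch in value) / n if n else 0.0,   # share of letters
    ]


values = ["john@example.com", "alice@corp.io", "bob@site.org", "N/A", "unknown", "none"]
X = np.array([char_features(v) for v in values], dtype=float)

# Two clusters and a fixed seed are assumptions made for a reproducible demo
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Group the original values by their assigned cluster label
clusters = {}
for value, label in zip(values, labels):
    clusters.setdefault(f"cluster_{label}", []).append(value)
```

With these inputs the email-like strings and the short placeholder strings end up in separate clusters, mainly because of their length difference. Which group receives label 0 or 1 is arbitrary.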

**Correlation detection** — scans all categorical column pairs for
`one-to-one` or `many-to-one` relationships:
```python
{
  "country_code->country_name": "one-to-one",
  "store_id->region": "many-to-one"
}
```
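
A minimal sketch of how one column pair could be classified, assuming that `A->B` is "one-to-one" when each value of A maps to exactly one value of B and vice versa, and "many-to-one" when several values of A share a value of B. The helper `classify_relationship` is hypothetical, and the library's own scan may apply extra filters.

```python
import pandas as pd


def classify_relationship(df: pd.DataFrame, left: str, right: str):
    """Hypothetical sketch: label the mapping left -> right, or return None."""
    # Keep each distinct (left, right) combination once, ignoring missing values
    pairs = df[[left, right]].dropna().drop_duplicates()
    if pairs.empty:
        return None
    # A repeated left value means it maps to more than one right value
    if pairs[left].duplicated().any():
        return None
    # A repeated right value means several left values share it
    if pairs[right].duplicated().any():
        return "many-to-one"
    return "one-to-one"


df = pd.DataFrame({
    "store_id": ["s1", "s2", "s3", "s4"],
    "region": ["north", "north", "south", "south"],
    "country": ["US", "UK", "US", "DE"],
    "country_name": ["United States", "United Kingdom", "United States", "Germany"],
})

store_to_region = classify_relationship(df, "store_id", "region")
country_to_name = classify_relationship(df, "country", "country_name")
region_to_store = classify_relationship(df, "region", "store_id")
```

Here `store_id -> region` is many-to-one, `country -> country_name` is one-to-one, and `region -> store_id` yields `None` because a region maps to more than one store.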

---

### Individual utility functions

```python
from summarystatpkg import (
    is_datetime_column,        # series → bool
    null_clustering_analysis,  # (df, col) → dict
    entropy_based_detection,   # (df, col) → float  [0–1]
    feature_based_clustering,  # (df, col) → dict
    detect_correlations_optimized,  # df → dict
)
```

---

## Example Output

```python
import pandas as pd
from summarystatpkg import advanced_csv_metadata

df = pd.DataFrame({
    "email":   ["a@b.com", "c@d.org", None, "e@f.net"],
    "country": ["US", "UK", "US", "DE"],
    "country_name": ["United States", "United Kingdom", "United States", "Germany"],
    "score":   [10, 20, 30, 40],
})

result = advanced_csv_metadata(df)
print(result["possibleCorrelation"])
# {'country->country_name': 'one-to-one'}
```

---

## Requirements

- Python ≥ 3.9
- pandas ≥ 1.5
- numpy ≥ 1.23
- scikit-learn ≥ 1.2

---

## License

MIT
