Metadata-Version: 2.4
Name: autoclean-dataframe
Version: 1.0.1
Summary: Automatic, configurable data cleansing for pandas DataFrames
Author: autoclean-dataframe contributors
License: MIT
Project-URL: Homepage, https://github.com/yuuichieguchi/autoclean-dataframe
Project-URL: Documentation, https://github.com/yuuichieguchi/autoclean-dataframe#readme
Project-URL: Repository, https://github.com/yuuichieguchi/autoclean-dataframe.git
Project-URL: Issues, https://github.com/yuuichieguchi/autoclean-dataframe/issues
Keywords: data-cleaning,pandas,data-quality,etl
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pandas>=1.5.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"

# autoclean-dataframe

A Python library for **automatic, configurable data cleansing** of pandas DataFrames with detailed reporting. Clean messy tabular data quickly using declarative configuration.

## Features

- **Declarative Configuration**: Define cleaning rules using Python dicts, Pydantic models, or YAML/JSON files
- **Comprehensive Cleaning Operations**:
  - Missing value imputation (mean, median, mode, constant, forward/backward fill)
  - Type conversion with intelligent error handling
  - Whitespace and text normalization (strip, case conversion)
  - Categorical normalization and value validation
  - PII masking (email, phone, custom patterns)
  - Outlier detection and handling (flag, remove, or clip)
  - Duplicate removal and empty row/column handling
- **Smart Defaults**: Quick `auto_clean()` function for common scenarios
- **Detailed Reporting**: Track all changes with human-readable summaries and machine-parseable JSON
- **Configurable I/O**: Load/save configurations from YAML or JSON files
- **Type-Safe**: Full type hints with Pydantic validation
- **Testable**: Immutable operations return new DataFrames

## Installation

```bash
pip install autoclean-dataframe
```

## Quick Start

### 1. Automatic Cleaning (Easiest)

```python
from autoclean_dataframe import auto_clean

# Apply smart defaults: remove duplicates, infer types, detect outliers
df_clean, report = auto_clean(df)
print(report)
```

### 2. Programmatic Configuration

```python
from autoclean_dataframe import (
    clean_dataframe,
    DataCleanConfig,
    GeneralCleanConfig,
    ColumnConfig,
    TypeConversionConfig,
    MissingValueConfig,
)

config = DataCleanConfig(
    general=GeneralCleanConfig(
        remove_duplicates=True,
        drop_fully_empty_rows=True,
    ),
    columns={
        "age": ColumnConfig(
            column_name="age",
            type_conversion=TypeConversionConfig(target_type="int"),
            missing_values=MissingValueConfig(strategy="mean"),
        ),
        "email": ColumnConfig(
            column_name="email",
            strip_whitespace=True,
            to_lowercase=True,
        ),
    }
)

df_clean, report = clean_dataframe(df, config)
print(report)
```

### 3. YAML Configuration

Create `config.yaml`:

```yaml
general:
  remove_duplicates: true
  drop_fully_empty_rows: true

columns:
  age:
    column_name: age
    type_conversion:
      target_type: int
    missing_values:
      strategy: mean

  email:
    column_name: email
    strip_whitespace: true
    to_lowercase: true
    pii:
      pii_type: email
```

Then in Python:

```python
from autoclean_dataframe import load_config, clean_dataframe

config = load_config("config.yaml")
df_clean, report = clean_dataframe(df, config)
```

## Core Features

### 1. Missing Value Handling

```python
MissingValueConfig(
    strategy="mean",        # "mean", "median", "mode", "constant", "forward_fill", "backward_fill", "drop_row", "none"
    constant_value=0,       # For strategy="constant"
    threshold=0.5,          # Drop column if missing % > threshold
)
```
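As an illustration of the `"mean"` strategy, here is a minimal pandas sketch of the same idea, independent of the library's internals:

```python
import pandas as pd

# "mean" imputation: replace missing values with the mean of the
# non-missing values in the column (here, mean of [1.0, 3.0] = 2.0).
s = pd.Series([1.0, None, 3.0, None])
filled = s.fillna(s.mean())
print(filled.tolist())  # [1.0, 2.0, 3.0, 2.0]
```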

### 2. Type Conversion

```python
TypeConversionConfig(
    target_type="int",           # "int", "float", "str", "bool", "datetime", "category", "none"
    datetime_format="%Y-%m-%d",  # For datetime conversion
    strict=False,                # If True, raise on conversion failure; else coerce to NaN
    infer_type=False,            # Auto-detect type if not specified
)
```

### 3. Outlier Detection

```python
OutlierConfig(
    method="iqr",           # "iqr", "zscore", "none"
    action="flag",          # "flag", "remove", "clip", "none"
    iqr_multiplier=1.5,     # Q1 - k*IQR, Q3 + k*IQR
    zscore_threshold=3.0,   # |z| > threshold
)
```
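The IQR rule flags values outside `[Q1 - k*IQR, Q3 + k*IQR]`; a standalone pandas sketch of that bound check (not the library's internal code):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# k corresponds to iqr_multiplier above
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)
print(s[is_outlier].tolist())  # [100]
```

With `action="clip"` the analogous operation would be `s.clip(lower, upper)`; with `"remove"`, dropping the flagged rows.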

### 4. PII Masking

```python
PiiConfig(
    pii_type="email",       # "email", "phone", "ssn", "credit_card", "custom", "none"
    custom_pattern=r"\d{3}-\d{2}-\d{4}",  # For pii_type="custom"
    mask_char="*",          # Character to use for masking
)
```

Masked outputs:
- Email: `john@example.com` → `***@***.com`
- Phone: `555-123-4567` → `***-***-4567` (keeps last 4 digits)
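A rough regex sketch of the email masking shown above (an illustration, not the library's actual implementation):

```python
import re

def mask_email(value: str) -> str:
    # Mask the local part and domain name, keep the trailing domain suffix.
    m = re.match(r"^[^@]+@[^.]+(\..+)$", value)
    return f"***@***{m.group(1)}" if m else value

print(mask_email("john@example.com"))  # ***@***.com
```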

### 5. Text Normalization

```python
ColumnConfig(
    column_name="name",
    strip_whitespace=True,      # Remove leading/trailing spaces
    to_lowercase=True,          # Convert to lowercase
    to_uppercase=False,         # Convert to uppercase (mutually exclusive with to_lowercase)
)
```

### 6. Categorical Validation

```python
ColumnConfig(
    column_name="status",
    allowed_values=["active", "inactive", "pending"],  # Restrict to these values
)
```
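One plausible way such validation can behave (an assumption, sketched in plain pandas): values outside the allowed set become NaN and are counted in the report.

```python
import pandas as pd

allowed = ["active", "inactive", "pending"]
s = pd.Series(["active", "archived", "pending"])
# Keep allowed values; replace everything else with NaN.
validated = s.where(s.isin(allowed))
print(int(validated.isna().sum()))  # 1 value failed validation
```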

### 7. General Cleaning

```python
GeneralCleanConfig(
    drop_fully_empty_rows=True,      # Drop rows where ALL values are NaN
    drop_fully_empty_columns=True,   # Drop columns where ALL values are NaN
    remove_duplicates=True,          # Remove duplicate rows
    normalize_unicode=False,         # Normalize to NFC form
    infer_dtypes=False,              # Auto-detect column types
)
```

## Cleaning Report

The cleaning pipeline returns a `CleanReport` object with detailed information:

```python
df_clean, report = clean_dataframe(df, config)

# Print human-readable summary
print(report)

# Export to JSON or a plain dict
json_str = report.to_json()
report_dict = report.to_dict()

# Save to file
from autoclean_dataframe import save_report
save_report(report, "report.json")
save_report(report, "report.txt")
```

Report includes:
- Row/column counts before and after
- Per-column change summaries
- Count of specific operations (type conversions, imputations, outliers removed, etc.)
- Warnings and errors encountered

Example:

```
======================================================================
DATA CLEANING REPORT
======================================================================
Timestamp: 2024-01-15T10:30:45.123456

OVERVIEW
----------------------------------------------------------------------
Rows before: 100
Rows after:  95
Rows removed: 5
Columns before: 10
Columns after:  10

Duplicate rows removed: 2

COLUMN CHANGES
----------------------------------------------------------------------

age:
  - Missing values handled: 3
  - Type conversions: 97
  - Outliers detected: 2
  - Outliers clipped: 2

email:
  - Whitespace stripped: 5
  - PII values masked: 100

======================================================================
```

## Examples

See the `examples/` directory for complete examples:

1. **simple_usage.py**: Basic cleaning operations
2. **yaml_config_usage.py**: Using YAML configuration files
3. **config_example.yaml**: Annotated example configuration

Run examples:

```bash
cd examples
python3 simple_usage.py
python3 yaml_config_usage.py
```

## Configuration Schema

Full Pydantic model schema:

```python
DataCleanConfig(
    general: GeneralCleanConfig = GeneralCleanConfig(),
    columns: Dict[str, ColumnConfig] = {},
    preserve_index: bool = True,
    verbose: bool = False,
)

GeneralCleanConfig(
    drop_fully_empty_rows: bool = False,
    drop_fully_empty_columns: bool = False,
    remove_duplicates: bool = False,
    normalize_unicode: bool = False,
    infer_dtypes: bool = False,
)

ColumnConfig(
    column_name: str,
    strip_whitespace: bool = False,
    to_lowercase: bool = False,
    to_uppercase: bool = False,
    type_conversion: Optional[TypeConversionConfig] = None,
    missing_values: Optional[MissingValueConfig] = None,
    outliers: Optional[OutlierConfig] = None,
    pii: Optional[PiiConfig] = None,
    allowed_values: Optional[List[Any]] = None,
)
```

## API Reference

### Main Functions

```python
# Apply cleaning with config
clean_dataframe(df: pd.DataFrame, config: DataCleanConfig) -> Tuple[pd.DataFrame, CleanReport]

# Apply smart defaults
auto_clean(df: pd.DataFrame, verbose: bool = False) -> Tuple[pd.DataFrame, CleanReport]
```

### Configuration & I/O

```python
# Load config from file
load_config(path: Union[str, Path]) -> DataCleanConfig

# Save config to file
save_config(config: DataCleanConfig, path: Union[str, Path], format: str = "yaml") -> None

# Save report to file
save_report(report: CleanReport, path: Union[str, Path], format: str = "json") -> None

# Config serialization
config_to_dict(config: DataCleanConfig) -> Dict[str, Any]
config_to_yaml(config: DataCleanConfig) -> str
config_to_json(config: DataCleanConfig) -> str
```

### Types & Enums

```python
ColumnType = {"numeric", "categorical", "datetime", "text", "unknown"}
ImputationMethod = {"mean", "median", "mode", "forward_fill", "backward_fill", "constant", "drop_row", "none"}
OutlierMethod = {"iqr", "zscore", "none"}
OutlierAction = {"remove", "clip", "flag", "none"}
PiiType = {"email", "phone", "ssn", "credit_card", "custom", "none"}
```

## Exceptions

```python
AutocleanException              # Base exception
ConfigValidationError           # Configuration validation failed
DataValidationError             # Input DataFrame validation failed
TypeConversionError             # Type conversion failed
OutlierDetectionError           # Outlier detection failed
ReportExportError               # Report serialization failed
```
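Because the specific errors derive from `AutocleanException`, a single catch-all handler works; sketched here with stand-in classes so the snippet runs without the package installed:

```python
# Stand-ins mirroring the hierarchy above (the real classes live in the package).
class AutocleanException(Exception):
    """Base exception."""

class TypeConversionError(AutocleanException):
    """Type conversion failed."""

try:
    raise TypeConversionError("could not convert 'abc' to int")
except AutocleanException as exc:
    # Catches any library-specific error in one place.
    print(f"cleaning failed: {exc}")
```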

## Design Principles

1. **Immutable by Default**: Always returns new DataFrames, never modifies input
2. **Fail-Safe**: Coerces conversion failures to NaN by default, tracks issues in report
3. **Explicit Over Implicit**: Conservative defaults, requires explicit configuration
4. **Traceable**: Every change tracked and reported
5. **Type-Safe**: Full type hints, Pydantic validation

## Common Workflows

### Clean CSV with Smart Defaults

```python
import pandas as pd
from autoclean_dataframe import auto_clean, save_report

# Load and clean
df = pd.read_csv("messy_data.csv")
df_clean, report = auto_clean(df, verbose=True)

# Save results
df_clean.to_csv("clean_data.csv", index=False)
save_report(report, "report.json")
```

### Type Inference and Conversion

```python
from autoclean_dataframe import auto_clean

# Auto-detect types and convert
df_clean, report = auto_clean(df)

# Check what was inferred
for col in df_clean.columns:
    print(f"{col}: {df_clean[col].dtype}")
```

### Handle Missing Values

```python
config = DataCleanConfig(
    columns={
        "numeric_col": ColumnConfig(
            column_name="numeric_col",
            missing_values=MissingValueConfig(strategy="median"),
        ),
        "categorical_col": ColumnConfig(
            column_name="categorical_col",
            missing_values=MissingValueConfig(
                strategy="constant",
                constant_value="unknown",
            ),
        ),
    }
)
df_clean, report = clean_dataframe(df, config)
```

### Detect and Remove Outliers

```python
from autoclean_dataframe import OutlierConfig, OutlierMethod, OutlierAction

config = DataCleanConfig(
    columns={
        "measurement": ColumnConfig(
            column_name="measurement",
            outliers=OutlierConfig(
                method=OutlierMethod.IQR,
                action=OutlierAction.REMOVE,
                iqr_multiplier=1.5,
            ),
        )
    }
)
df_clean, report = clean_dataframe(df, config)
print(f"Rows removed: {report.rows_removed}")
```

## Performance Notes

- **Memory**: Always creates a copy of the DataFrame (immutable design)
- **Speed**: Optimized for typical data sizes (up to millions of rows)
- **Scaling**: Linear time complexity for most operations

For datasets beyond that range (tens of millions of rows or more), consider:
- Processing in chunks
- Using more targeted configurations (fewer columns)
- Disabling expensive operations (outlier detection)
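A sketch of the chunked approach in plain pandas; the cleaning step here is a stand-in (with the library installed you would call `clean_dataframe` on each chunk instead):

```python
import io
import pandas as pd

# In practice this would be a path to a large CSV file.
csv_data = io.StringIO("a,b\n1,x\n1,x\n2,y\n3,z\n")

cleaned = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Stand-in cleaning step; per-chunk work keeps peak memory bounded.
    cleaned.append(chunk.drop_duplicates())

# Duplicates can span chunk boundaries, so deduplicate once more after concat.
df_clean = pd.concat(cleaned, ignore_index=True).drop_duplicates()
print(len(df_clean))  # 3
```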

## Testing

Run the test suite:

```bash
pip install pytest pytest-cov
pytest tests/ -v
```

Coverage: >80% of codebase

## Contributing

Contributions welcome! Areas for enhancement:
- Additional PII pattern types
- Custom outlier detection methods
- Integration with Dask for larger-than-memory data
- Web API for cleaning service

## License

MIT License

## Project Status

This is a beta release (v1.0.1). The core API is stable but may still evolve. Please report issues on GitHub.

## See Also

- **pandas**: Data manipulation
- **pydantic**: Configuration validation
- **great-expectations**: More advanced data validation
- **pandas-profiling**: Data profiling and analysis
