Metadata-Version: 2.4
Name: batch-analytics
Version: 0.3.16
Summary: PySpark batch analytics: Extract, Transform, Stage, and analytical modules (linear regression, correlation, PCA, t-test).
Author: Litewave Analytics Team
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pyspark<3.6,>=3.4
Requires-Dist: numpy>=1.19.0
Requires-Dist: scipy>=1.5.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Provides-Extra: ttest
Requires-Dist: scipy>=1.5.0; extra == "ttest"
Provides-Extra: s3
Requires-Dist: boto3>=1.28; extra == "s3"
Provides-Extra: clickhouse
Requires-Dist: clickhouse-connect<0.9,>=0.7; python_version < "3.9" and extra == "clickhouse"
Requires-Dist: clickhouse-connect>=0.7; python_version >= "3.9" and extra == "clickhouse"
Provides-Extra: output
Requires-Dist: boto3>=1.28; extra == "output"
Requires-Dist: clickhouse-connect<0.9,>=0.7; python_version < "3.9" and extra == "output"
Requires-Dist: clickhouse-connect>=0.7; python_version >= "3.9" and extra == "output"
Provides-Extra: autogluon
Requires-Dist: autogluon<2.0,>=1.0; extra == "autogluon"
Requires-Dist: pandas>=1.3.0; extra == "autogluon"
Requires-Dist: boto3>=1.28; extra == "autogluon"
Requires-Dist: clickhouse-connect<0.9,>=0.7; python_version < "3.9" and extra == "autogluon"
Requires-Dist: clickhouse-connect>=0.7; python_version >= "3.9" and extra == "autogluon"
Requires-Dist: pyarrow>=10.0.0; python_version >= "3.8" and extra == "autogluon"
Provides-Extra: full
Requires-Dist: scipy>=1.5.0; extra == "full"
Requires-Dist: boto3>=1.28; extra == "full"
Requires-Dist: clickhouse-connect<0.9,>=0.7; python_version < "3.9" and extra == "full"
Requires-Dist: clickhouse-connect>=0.7; python_version >= "3.9" and extra == "full"
Requires-Dist: autogluon<2.0,>=1.0; extra == "full"
Requires-Dist: pyarrow>=10.0.0; python_version >= "3.8" and extra == "full"

# Batch Analytics

PySpark-based analytics pipeline for ClickHouse data: **Extract** → **Transform** → **Stage** → **Analytics**. Designed to run as the main application inside a Spark driver container (invoked by `analytics_runners` via SparkApplication CRD).
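The stages are chained by the job runner in order. A minimal sketch of that flow (function names here are illustrative stand-ins, not the package's actual API; the real entry point is `batch_analytics.job_runner`):

```python
# Illustrative Extract -> Transform -> Stage -> Analytics chain.
# All names below are hypothetical; they only show the shape of the pipeline.

def extract(source):
    # Read raw rows (in the real job: from ClickHouse via Spark)
    return [{"x": v, "y": 2 * v} for v in source]

def transform(rows):
    # Clean and derive columns; drop unusable rows
    return [r for r in rows if r["x"] is not None]

def stage(rows):
    # Persist to a staging table and return a handle (here: the rows themselves)
    return rows

def run_analytics(rows, modules):
    # Dispatch the staged data to each requested analytical module
    return {m: f"ran on {len(rows)} rows" for m in modules}

staged = stage(transform(extract(range(5))))
print(run_analytics(staged, ["lr", "corr"]))
```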

## Bundle contents

Only the files required for the batch analytics job runner:

```
analytics/
├── pyproject.toml
├── requirements.txt          # core + scipy + boto3 + clickhouse-connect (single-file install)
├── requirements-batch.txt    # includes requirements.txt
├── README.md
└── src/
    └── batch_analytics/
        ├── __init__.py
        ├── __main__.py        # python -m batch_analytics
        ├── job_runner.py      # Entry point
        ├── config.py
        ├── extract.py
        ├── transform.py
        ├── log.py
        ├── README.md
        └── analytics/
            ├── __init__.py
            ├── linear_regression.py
            ├── correlation.py
            ├── pca_clustering.py
            └── t_test.py
```

## Install

```bash
pip install -e .
# Or install the pinned runtime dependencies first, then the package in editable mode:
pip install -r requirements.txt && pip install -e .
# A PyPI install pulls in numpy and scipy (t-test) by default; extras: s3, clickhouse, output, full
pip install "batch-analytics[full]"
```

## Run

```bash
# Via module
python -m batch_analytics

# Via CLI (after pip install -e .); runs the full pipeline by default
batch-analytics

# Analytics only (from staged ClickHouse table)
batch-analytics --from-stage --modules lr corr pca ttest
```

## Configuration

See `src/batch_analytics/README.md` for environment variables and usage.

## Docker image

For Spark on Kubernetes, build an image that includes this package and exposes `job_runner.py` at the path used by `mainApplicationFile` (e.g. `local:///opt/analytics/job_runner.py`).
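A minimal Dockerfile sketch (the base image, tag, and copy step are assumptions; match the Spark version to your cluster, within the supported 3.4–3.6 range):

```dockerfile
# Assumed Spark-with-Python base image; pick the tag matching your cluster
FROM apache/spark-py:v3.4.0

USER root
WORKDIR /opt/analytics

# Install the package and its runtime dependencies
COPY . .
RUN pip install --no-cache-dir -r requirements.txt \
 && pip install --no-cache-dir .

# Place the entry point at the path referenced by mainApplicationFile,
# e.g. local:///opt/analytics/job_runner.py
RUN cp src/batch_analytics/job_runner.py /opt/analytics/job_runner.py
```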
