Metadata-Version: 2.4
Name: xelytics-core
Version: 0.2.2
Summary: Pure analytics engine for statistical analysis and insight generation
Author: Xelytics Team
License: MIT
Project-URL: Homepage, https://xelytics.live
Project-URL: Quick Start Notebook, https://colab.research.google.com/drive/1d1gN5Ip9-p7ojbogxVRRc3QIK6Gq6m9o?usp=sharing
Project-URL: End to End Notebook, https://colab.research.google.com/drive/1zQuBfquU9Zk-UuiX5wE6VmhrX2DqZ1MX?usp=sharing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: pandas>=2.1.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: scipy>=1.11.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: statsmodels>=0.14.0
Requires-Dist: pingouin>=0.5.3
Requires-Dist: plotly>=5.17.0
Requires-Dist: matplotlib>=3.7.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: redis>=5.0.0
Provides-Extra: llm
Requires-Dist: openai>=1.6.0; extra == "llm"
Requires-Dist: groq>=0.4.0; extra == "llm"
Requires-Dist: httpx>=0.25.0; extra == "llm"
Provides-Extra: advanced
Requires-Dist: ruptures>=1.1.8; extra == "advanced"
Requires-Dist: pmdarima>=2.0.4; extra == "advanced"
Provides-Extra: connectors
Requires-Dist: psycopg2-binary>=2.9.0; extra == "connectors"
Requires-Dist: pymysql>=1.1.0; extra == "connectors"
Requires-Dist: snowflake-connector-python>=3.0.0; extra == "connectors"
Requires-Dist: sqlalchemy>=2.0.0; extra == "connectors"
Requires-Dist: openpyxl>=3.1.0; extra == "connectors"
Requires-Dist: pyarrow>=14.0.0; extra == "connectors"
Requires-Dist: boto3>=1.34.0; extra == "connectors"
Requires-Dist: azure-storage-blob>=12.19.0; extra == "connectors"
Requires-Dist: google-cloud-storage>=2.10.0; extra == "connectors"
Requires-Dist: google-cloud-bigquery>=3.11.0; extra == "connectors"
Requires-Dist: pandas-gbq>=0.19.0; extra == "connectors"
Provides-Extra: export
Requires-Dist: jinja2>=3.1.0; extra == "export"
Requires-Dist: weasyprint>=60.0; extra == "export"
Requires-Dist: python-pptx>=0.6.23; extra == "export"
Requires-Dist: nbformat>=5.7.0; extra == "export"
Requires-Dist: kaleido>=0.2.1; extra == "export"
Provides-Extra: large-data
Requires-Dist: dask[dataframe]>=2024.1.0; extra == "large-data"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: black>=23.11.0; extra == "dev"
Requires-Dist: mypy>=1.7.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"

# Xelytics-Core: Complete Analytics Engine

**Enterprise-grade pure analytics engine for automated statistical analysis, time series forecasting, clustering, and insight generation.**

[![Version](https://img.shields.io/badge/version-0.2.2-blue)](CHANGELOG.md)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue)](pyproject.toml)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
[![Status](https://img.shields.io/badge/status-beta-blue)](CHANGELOG.md)

> **Status**: Phases 1–3 complete ✅ | Foundation, Time Series Analysis, and Clustering fully implemented and tested.

---

## What It Does

Xelytics-Core is a **zero-configuration analytics engine** that analyzes your data and produces professional insights, statistical tests, interactive visualizations, and predictions — all with a single function call.

**One-line analysis:**
```python
from xelytics import analyze
import pandas as pd

df = pd.read_csv("data.csv")
result = analyze(df)  # That's it!

for insight in result.insights:
    print(f"📊 {insight.title}: {insight.description}")
```

**Output includes:**
- ✅ 50+ statistical tests (parametric & non-parametric)
- ✅ Time series decomposition & forecasting (ARIMA, Exponential Smoothing)
- ✅ Anomaly detection & change point detection
- ✅ Clustering analysis (K-Means, DBSCAN, Hierarchical)
- ✅ Interactive Plotly visualizations
- ✅ Human-readable insights (with optional LLM narration)
- ✅ Professional HTML, PDF, PowerPoint, and Jupyter reports

---

## Core Principles

| Principle | Meaning |
|---|---|
| **Zero Configuration** | Works out-of-the-box with sensible defaults; optional parameters for advanced use |
| **Pure Analytics** | No HTTP, no databases, no authentication—just data in, results out |
| **Type-Safe** | All inputs and outputs are typed dataclasses with IDE autocomplete |
| **Deterministic** | Identical inputs always produce identical outputs |
| **Backward Compatible** | v0.1.0 code runs unchanged in v0.2.0+ |
| **Extensible** | Custom pipelines, connectors, exporters, and LLM providers |
| **Production-Ready** | Parallel execution, result caching, error handling, comprehensive testing |
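
The determinism guarantee comes down to seeding every stochastic step (sampling, K-Means initialization) with a fixed `random_seed`. A stdlib sketch of the mechanism — illustrative, not Xelytics internals:

```python
import random

def seeded_sample(data, k, seed=42):
    """Draw k items reproducibly: same seed + same data -> same sample."""
    rng = random.Random(seed)  # isolated RNG, unaffected by global state
    return rng.sample(data, k)

rows = list(range(100))
assert seeded_sample(rows, 5) == seeded_sample(rows, 5)  # always holds
```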

---

## 🚀 Quick Start (5 Minutes)

### Installation

```bash
# Minimal install
pip install -e .

# With all features
pip install -e ".[advanced,connectors,export,llm,dev]"
```

### Basic Analysis

```python
from xelytics import analyze
import pandas as pd

# Load your data
df = pd.read_csv("sales.csv")

# Run comprehensive analysis in one line
result = analyze(df)

# Explore results
print(f"Rows: {result.summary.row_count}")
print(f"Tests executed: {result.metadata.tests_executed}")

# View key findings
for insight in result.insights[:3]:
    print(f"  • {insight.title}")
    
# Export as JSON
import json
with open("analysis.json", "w") as f:
    json.dump(result.to_dict(), f, indent=2)
```

**Output:**
```
Rows: 1000
Tests executed: 47
  • Significant correlation detected: revenue vs. marketing_spend
  • Outliers detected in customer_age column
  • Data shows strong seasonality
```

---

## 📖 Comprehensive Features & Usage

### 1️⃣ Statistical Analysis

Automatically runs relevant statistical tests based on data types and distributions.

#### Basic Usage

```python
from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    significance_level=0.05,
    enable_llm_insights=False,
    max_visualizations=15,
)

result = analyze(df, config=config)

# View statistical tests
for test in result.statistics:
    print(f"{test.test_name}:")
    print(f"  p-value: {test.p_value:.4f}")
    print(f"  Significant: {test.significant}")
    print(f"  Effect size: {test.effect_size.value:.3f}")
```

#### Advanced: Custom Analysis Plan

```python
# Define which columns to analyze
config = AnalysisConfig(
    include_columns=["age", "income", "purchase_frequency"],
    exclude_columns=["customer_id", "timestamp"],
    categorical_max_categories=50,  # Skip columns with >50 unique values
)

result = analyze(df, config=config)
```

**Statistics Covered:**
- ✅ Descriptive: mean, median, variance, skewness, kurtosis
- ✅ t-tests, ANOVA, Welch's test, Mann-Whitney U, Kruskal-Wallis
- ✅ Correlation: Pearson, Spearman, Kendall Tau
- ✅ Chi-square tests for categorical associations
- ✅ Effect sizes: Cohen's d, Cramér's V, Eta-squared
- ✅ Assumption checks: Normality (Shapiro-Wilk), Homogeneity of variance (Levene)
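
Effect sizes are plain arithmetic on top of the test statistics; Cohen's d, for example, is the standardized mean difference. A stdlib sketch of the pooled-SD formula (illustrative, not the library's implementation):

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

treatment = [5.1, 5.5, 4.9, 5.3, 5.0]
control = [4.2, 4.5, 4.1, 4.4, 4.3]
print(f"d = {cohens_d(treatment, control):.2f}")  # d = 4.22 (very large effect)
```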

---

### 2️⃣ Time Series Analysis (NEW in v0.2.0)

Complete time series toolkit: detection, decomposition, forecasting, anomalies.

#### Time Series Detection

```python
from xelytics import analyze, AnalysisConfig

# Option 1: Auto-detect time series columns
config = AnalysisConfig(enable_time_series=True)
result = analyze(df, config=config)

# Option 2: Specify datetime column
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
)
result = analyze(df, config=config)

# Check which columns were detected as time series
for ts in result.time_series_analysis:
    print(f"{ts.column_name}:")
    print(f"  Type: {ts.series_type.value}")
    print(f"  Frequency: {ts.frequency}")
    print(f"  Has trend: {ts.has_trend}")
    print(f"  Has seasonality: {ts.has_seasonality}")
    if ts.has_seasonality:
        print(f"  Seasonal period: {ts.seasonal_period}")
```

#### Time Series Decomposition

```python
# Automatically decompose into trend, seasonal, residual
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    decomposition_method="additive",  # or "multiplicative", "stl"
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.decomposition:
        print(f"{ts.column_name} decomposition:")
        print(f"  Trend strength: {ts.decomposition.trend_strength:.3f}")
        print(f"  Seasonal strength: {ts.decomposition.seasonal_strength:.3f}")
```
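
Conceptually, additive decomposition subtracts a moving-average trend and averages what remains at each position in the cycle. A minimal stdlib sketch, assuming an odd seasonal period (the packaged decomposition is more general — even periods, STL):

```python
from statistics import mean

def additive_decompose(series, period):
    """Sketch of additive decomposition (odd period): trend via a centered
    moving average over one full cycle, seasonal via per-position means."""
    half = period // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        trend[i] = mean(series[i - half:i + half + 1])
    by_pos = {}
    for i, t in enumerate(trend):
        if t is not None:
            by_pos.setdefault(i % period, []).append(series[i] - t)
    seasonal = {p: mean(v) for p, v in by_pos.items()}
    return trend, seasonal

pattern = [2, -1, 0, 1, -2]                       # mean-zero, period 5
series = [0.5 * i + pattern[i % 5] for i in range(40)]
trend, seasonal = additive_decompose(series, period=5)
print(seasonal)  # recovers the seasonal offset at each cycle position
```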

#### Forecasting

```python
# ARIMA and Exponential Smoothing forecasting
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    forecast_periods=30,  # Forecast next 30 periods
    forecast_methods=["arima", "exponential_smoothing"],
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.forecasts:
        print(f"\n{ts.column_name} - Next 30 periods forecast:")
        for forecast in ts.forecasts[:5]:  # Show first 5
            print(f"  Period {forecast.period}: {forecast.value:.2f} "
                  f"(95% CI: {forecast.lower_bound:.2f}-{forecast.upper_bound:.2f})")
```
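
For intuition, simple exponential smoothing is a one-line recurrence; a stdlib sketch with a fixed `alpha` (the packaged forecaster also estimates the smoothing parameters and returns confidence intervals):

```python
def simple_exp_smoothing(series, alpha, horizon):
    """Simple exponential smoothing: the level is an alpha-weighted blend of
    the newest observation and the previous level; the forecast is flat."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return [level] * horizon

history = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
forecast = simple_exp_smoothing(history, alpha=0.3, horizon=3)
print([round(v, 2) for v in forecast])  # [131.44, 131.44, 131.44]
```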

#### Anomaly Detection

```python
# Multiple detection methods: Z-score, IQR, MAD, Isolation Forest
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,  # 95th percentile threshold
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.anomalies:
        print(f"\n{ts.column_name} - Anomalies detected:")
        for anomaly in ts.anomalies[:3]:
            print(f"  Index {anomaly.index}: {anomaly.value:.2f} "
                  f"(severity: {anomaly.severity}, confidence: {anomaly.confidence:.2f})")
```
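
The simplest of these methods, the z-score detector, fits in a few lines of stdlib Python; the others follow the same flag-when-past-threshold pattern (this is a sketch, not the library's code):

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Flag points whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [(i, v, abs(v - mu) / sigma)
            for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 9, 95, 10, 11, 10]
for idx, value, z in zscore_anomalies(data, threshold=2.5):
    print(f"index {idx}: value={value} (z={z:.1f})")  # flags the 95
```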

#### Change Point Detection

```python
# Detect structural breaks (CUSUM algorithm)
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    detect_change_points=True,
    change_point_sensitivity=0.05,
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.change_points:
        print(f"\n{ts.column_name} - Change points:")
        for cp in ts.change_points:
            print(f"  At index {cp.index}: magnitude={cp.magnitude:.2f}, "
                  f"confidence={cp.confidence:.2f}")
```
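
The CUSUM idea behind this detector is compact enough to sketch in stdlib Python (the calibration window and fixed threshold here are simplifications, and `first_change_point` is a hypothetical name, not the library API):

```python
from statistics import mean, stdev

def first_change_point(values, baseline_n=10, threshold=5.0, drift=0.5):
    """CUSUM sketch: calibrate mean/sd on an initial window, accumulate
    drift-corrected standardized deviations in both directions, and report
    the first index where either cumulative sum crosses the threshold."""
    mu, sigma = mean(values[:baseline_n]), stdev(values[:baseline_n])
    pos = neg = 0.0
    for i, v in enumerate(values[baseline_n:], start=baseline_n):
        z = (v - mu) / sigma
        pos = max(0.0, pos + z - drift)
        neg = max(0.0, neg - z - drift)
        if pos > threshold or neg > threshold:
            return i
    return None

steady = [1.0 + (0.1 if i % 2 else -0.1) for i in range(30)]
shifted = [4.0 + (0.1 if i % 2 else -0.1) for i in range(30)]
print(first_change_point(steady + shifted))  # 30: first index after the shift
```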

---

### 3️⃣ Clustering & Segmentation (NEW in v0.2.0)

Unsupervised learning for customer segmentation, market clustering, etc.

#### Basic Clustering

```python
from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="auto",  # auto, kmeans, dbscan, hierarchical
    max_clusters=8,
    exclude_columns=["customer_id", "name"],
)
result = analyze(df, config=config)

# View clusters
print(f"Algorithm used: {result.clusters[0].algorithm}")
for cluster in result.clusters:
    print(f"\nCluster {cluster.cluster_id}:")
    print(f"  Size: {cluster.size} members ({cluster.size/result.summary.row_count*100:.1f}%)")
    print(f"  Silhouette score: {cluster.silhouette_score:.3f}")
    print(f"  Profile: {cluster.profile}")
```

#### K-Means (with Automatic K Selection)

```python
# K-Means tries multiple K values and picks the best
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=10,
    k_selection_method="elbow",  # elbow, silhouette, gap_statistic
)
result = analyze(df, config=config)

# View metrics for each K
for cluster in result.clusters:
    print(f"K={cluster.algorithm_params['n_clusters']}: "
          f"silhouette={cluster.silhouette_score:.3f}")
```

#### DBSCAN (Density-Based)

```python
# DBSCAN finds natural clusters and noise points
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="dbscan",
    dbscan_eps=0.5,  # Auto-estimated if not provided
    dbscan_min_samples=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    noise_label = "Noise" if cluster.cluster_id == -1 else f"Cluster {cluster.cluster_id}"
    print(f"{noise_label}: {cluster.size} points")
```

#### Hierarchical Clustering

```python
# Produces dendrograms and tree-based clusters
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="hierarchical",
    hierarchical_linkage="ward",  # ward, complete, average, single
    max_clusters=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    print(f"Cluster {cluster.cluster_id}: {cluster.size} members")
```

---

### 4️⃣ Data Connectors (NEW in v0.2.0)

Analyze data directly from databases and cloud storage—no manual data export needed.

#### PostgreSQL

```python
import os

from xelytics.connectors import connect_to_source
from xelytics import analyze

connector = connect_to_source(
    source_type="postgresql",
    host="db.example.com",
    database="analytics",
    user="analyst",
    password=os.getenv("DB_PASSWORD"),
    port=5432,
)

try:
    connector.connect()
    df = connector.query("""
        SELECT customer_id, age, income, purchase_count, lifetime_value
        FROM customers
        WHERE signup_year >= 2023
    """)
finally:
    connector.disconnect()

result = analyze(df)
```

#### MySQL / MariaDB

```python
connector = connect_to_source(
    source_type="mysql",
    host="db.example.com",
    database="analytics",
    user="analyst",
    password=os.getenv("DB_PASSWORD"),
)

df = connector.query("SELECT * FROM sales_data WHERE year = 2025")
result = analyze(df)
```

#### SQLite

```python
connector = connect_to_source(
    source_type="sqlite",
    database="/path/to/analytics.db",
)

df = connector.query("SELECT * FROM daily_metrics")
result = analyze(df)
```

#### BigQuery

```python
connector = connect_to_source(
    source_type="bigquery",
    project_id="my-project",
    credentials_path="/path/to/service-account.json",
)

df = connector.query("""
    SELECT * FROM `my-project.dataset.events`
    WHERE event_date >= '2025-01-01'
    LIMIT 100000
""")
result = analyze(df)
```

#### Snowflake

```python
connector = connect_to_source(
    source_type="snowflake",
    account="xy12345",
    warehouse="COMPUTE",
    database="ANALYTICS",
    schema="PUBLIC",
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASSWORD"),
)

df = connector.query("SELECT * FROM CUSTOMER_DATA")
result = analyze(df)
```

#### S3 / Cloud Storage

```python
# Amazon S3
connector = connect_to_source(
    source_type="s3",
    bucket="my-analytics-bucket",
    key="data/sales.parquet",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("AWS_SECRET_KEY"),
)
df = connector.query()  # Returns DataFrame
result = analyze(df)

# Azure Blob Storage
connector = connect_to_source(
    source_type="azure_blob",
    container_name="data",
    blob_name="sales.csv",
    connection_string=os.getenv("AZURE_CONN_STRING"),
)
df = connector.query()
result = analyze(df)

# Google Cloud Storage
connector = connect_to_source(
    source_type="gcs",
    bucket="my-bucket",
    key="data/sales.csv",
    credentials_path="/path/to/gcp-key.json",
)
df = connector.query()
result = analyze(df)
```

---

### 5️⃣ Report Generation (NEW in v0.2.0)

Generate professional, interactive reports in multiple formats.

#### HTML Report

```python
from xelytics import analyze
from xelytics.export import HTMLReportGenerator

result = analyze(df)

generator = HTMLReportGenerator(
    theme="light",  # light or dark
    logo_text="ACME Corp",
    company_name="ACME Analytics",
)

html = generator.generate(
    result,
    title="Q1 2025 Sales Analysis",
    author="Data Science Team",
    include_raw_data=False,  # Don't embed full dataset
)

with open("report.html", "w") as f:
    f.write(html)

# Open in a browser (os.startfile is Windows-only; webbrowser is portable)
import webbrowser
webbrowser.open("report.html")
```

#### PDF Report

```python
from xelytics.export import generate_pdf_report

pdf_bytes = generate_pdf_report(
    result,
    title="Q1 2025 Sales Analysis",
    author="Data Science Team",
    orientation="portrait",  # or "landscape"
)

with open("report.pdf", "wb") as f:
    f.write(pdf_bytes)
```

#### PowerPoint Presentation

```python
from xelytics.export import generate_pptx_report

pptx = generate_pptx_report(
    result,
    title="Q1 2025 Sales Analysis",
    author="Data Science Team",
    theme="office",  # office, modern, minimal
    include_speaker_notes=True,
)

pptx.save("report.pptx")
```

#### Jupyter Notebook

```python
from xelytics.export import generate_notebook

notebook = generate_notebook(
    result,
    title="Q1 2025 Sales Analysis",
    include_code_cells=True,
    include_raw_result=True,
)

import json

with open("analysis.ipynb", "w") as f:
    json.dump(notebook, f)
```

#### JSON Export

```python
import json

# For programmatic access or storage
with open("analysis.json", "w") as f:
    json.dump(result.to_dict(), f, indent=2)

# Later, reconstruct from JSON
from xelytics.schemas.outputs import AnalysisResult
with open("analysis.json") as f:
    data = json.load(f)
    result = AnalysisResult(**data)
```
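
The round trip works because results are dataclasses whose fields map one-to-one onto JSON keys; the same pattern in miniature (the `Insight` fields here are illustrative, not the real schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Insight:                      # illustrative stand-in for a result type
    title: str
    severity: str
    p_value: float

original = Insight("Outliers detected in unit_price", "warning", 0.003)
payload = json.dumps(asdict(original))       # -> plain JSON string
restored = Insight(**json.loads(payload))    # keyword-unpack to rebuild
assert restored == original
```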

---

### 6️⃣ Custom Pipelines (NEW in v0.2.0)

Pre-process data with custom steps before analysis.

```python
from xelytics.pipeline import Pipeline, normalize, pca, remove_outliers, correlation_analysis
from xelytics import analyze

# Build a custom pipeline
pipeline = Pipeline([
    remove_outliers(method="iqr", threshold=1.5),
    normalize(method="minmax"),
    pca(n_components=10),
    correlation_analysis(threshold=0.7),
])

# Apply before analysis
df_processed = pipeline.fit_transform(df)
result = analyze(df_processed)

# Or use in AnalysisConfig
config = AnalysisConfig(
    run_custom_pipeline=True,
    custom_pipeline=pipeline,
)
result = analyze(df, config=config)
```

---

### 7️⃣ Caching (NEW in v0.2.0)

Speed up repeated analyses on the same data.

#### File-Based Cache

```python
from xelytics import analyze, AnalysisConfig
from xelytics.cache import FileCache

cache = FileCache(cache_dir="./cache")

config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)

# First run: takes full time
result1 = analyze(df, config=config)

# Subsequent runs on same data: instant
result2 = analyze(df, config=config)  # Retrieved from cache!
```

#### Redis Cache (Distributed)

```python
from xelytics.cache import RedisCache

cache = RedisCache(host="localhost", port=6379, db=0, ttl=3600)

config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)

result = analyze(df, config=config)
```

#### Clear Cache

```python
from xelytics.cache import clear_cache

# Clear all caches
clear_cache(pattern="*")

# Clear specific patterns
clear_cache(pattern="stats:*")  # Only clear stats caches
```
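
Both backends hinge on a deterministic key derived from the data and the configuration; a stdlib sketch of the general idea (the `stats:` prefix mirrors the `clear_cache` pattern above, but this is not Xelytics' actual key scheme):

```python
import hashlib
import json

def cache_key(data_bytes, config):
    """Content-addressed key: identical data + identical config -> identical
    key, so a repeated analysis can be answered from cache."""
    h = hashlib.sha256()
    h.update(data_bytes)
    h.update(json.dumps(config, sort_keys=True).encode())
    return "stats:" + h.hexdigest()[:16]

csv = b"age,income\n34,52000\n41,61000\n"
cfg = {"significance_level": 0.05, "enable_clustering": True}
assert cache_key(csv, cfg) == cache_key(csv, cfg)               # cache hit
assert cache_key(csv, {**cfg, "max_clusters": 5}) != cache_key(csv, cfg)
```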

---

### 8️⃣ CLI (Command-Line Interface)

Analyze without writing Python code.

```bash
# Basic analysis - outputs JSON
xelytics analyze data.csv

# Save to file
xelytics analyze data.csv --output results.json

# Set parameters
xelytics analyze data.csv \
  --format=json \
  --alpha 0.01 \
  --no-llm \
  --max-visualizations 20 \
  --datetime-column "date"

# Time series analysis
xelytics analyze data.csv \
  --enable-time-series \
  --datetime-column "date" \
  --forecast-periods 30

# Clustering
xelytics analyze data.csv \
  --enable-clustering \
  --clustering-algorithm kmeans \
  --max-clusters 5

# Show version
xelytics --version

# Help
xelytics --help
```

---

### 9️⃣ LLM Integration (Optional)

Enhance insights with AI narration.

```python
from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",  # openai, groq, or local
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

result = analyze(df, config=config)

# Insights now include AI-generated descriptions
for insight in result.insights:
    print(f"{insight.title}")
    print(f"  📝 {insight.narrative}")  # AI-generated explanation
```

#### Multiple LLM Providers

```python
# OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# Groq (fast, open source)
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="groq",
    llm_model="mixtral-8x7b",
    llm_api_key=os.getenv("GROQ_API_KEY"),
)

# Azure OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="azure",
    llm_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    llm_api_key=os.getenv("AZURE_OPENAI_KEY"),
)
```

---

### 🔟 Large Dataset Support

Analyze datasets with millions of rows.

```python
from xelytics import analyze, AnalysisConfig

# Auto-sample if > 1M rows
config = AnalysisConfig(
    sampling_strategy="auto",
    max_rows=1_000_000,
    # Parallel execution
    parallel_execution=True,
    max_workers=4,
)

# Or force stratified sampling instead
config = AnalysisConfig(
    sampling_strategy="stratified",
    sample_size=100_000,
)

result = analyze(df, config=config)
```

#### Chunked Processing for Very Large Files

```python
from xelytics.engine import analyze_large_dataset

# Process 10M row file without loading into memory
result = analyze_large_dataset(
    source="huge_sales_data.csv",
    chunksize=50_000,
    sample_size=100_000,  # Take a sample for full analysis
    config=AnalysisConfig(),
)
```
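
Chunked processing relies on accumulators that update one pass at a time; a stdlib sketch of the standard technique (Welford's online algorithm), illustrative rather than the engine's internals:

```python
def streaming_stats(chunks):
    """Welford's online algorithm: mean and variance over chunked input
    without ever holding the full dataset in memory."""
    n, mu, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            n += 1
            delta = x - mu
            mu += delta / n
            m2 += delta * (x - mu)
    return n, mu, m2 / (n - 1)      # count, mean, sample variance

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]  # stand-in for CSV chunks
n, mu, var = streaming_stats(chunks)
print(n, mu, var)  # 6 3.5 3.5
```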

---

## ⚙️ Configuration Reference

```python
from xelytics import AnalysisConfig

config = AnalysisConfig(
    # General
    significance_level=0.05,
    mode="automated",  # automated or semi-automated
    
    # Columns
    include_columns=None,  # [list] Include only these columns
    exclude_columns=None,  # [list] Exclude these columns
    datetime_column=None,  # [str] Column name for time series
    
    # Time Series
    enable_time_series=False,
    decomposition_method="additive",  # additive, multiplicative, stl
    forecast_periods=0,
    forecast_methods=["arima", "exponential_smoothing"],
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,
    detect_change_points=False,
    
    # Clustering
    enable_clustering=False,
    clustering_algorithm="auto",  # auto, kmeans, dbscan, hierarchical
    max_clusters=10,
    k_selection_method="elbow",
    
    # Performance
    parallel_execution=True,
    max_workers=4,
    sampling_strategy="auto",
    max_rows=1_000_000,
    
    # Caching
    enable_caching=False,
    cache_backend=None,
    
    # Reporting
    max_visualizations=15,
    run_custom_pipeline=False,
    custom_pipeline=None,
    
    # LLM
    enable_llm_insights=False,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=None,
    
    # Other
    random_seed=42,
    verbose=True,
)
```

---

## 🏭 End-to-End Workflow Example

Complete analysis pipeline from data to report:

```python
#!/usr/bin/env python3
"""Complete analysis workflow."""

import pandas as pd
import os
from datetime import datetime
from xelytics import analyze, AnalysisConfig
from xelytics.export import HTMLReportGenerator, generate_pdf_report
from xelytics.connectors import connect_to_source

# 1. LOAD DATA
print("📁 Loading data...")
connector = connect_to_source(
    source_type="postgresql",
    host="db.example.com",
    database="sales",
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)

try:
    connector.connect()
    df = connector.query("""
        SELECT 
            order_id, customer_id, order_date,
            product_category, quantity, unit_price, total_amount,
            customer_age, customer_region, is_returning_customer
        FROM orders
        WHERE order_date >= '2024-01-01'
    """)
    print(f"✓ Loaded {len(df):,} rows")
finally:
    connector.disconnect()

# 2. CONFIGURE ANALYSIS
print("\n⚙️  Configuring analysis...")
config = AnalysisConfig(
    significance_level=0.05,
    
    # Time series
    enable_time_series=True,
    datetime_column="order_date",
    forecast_periods=30,
    
    # Clustering
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=5,
    
    # Performance
    parallel_execution=True,
    max_workers=4,
    
    # Cache for later
    enable_caching=True,
    
    # Reporting
    max_visualizations=20,
    enable_llm_insights=True,
    llm_provider="openai",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# 3. RUN ANALYSIS
print("\n🔍 Running analysis...")
result = analyze(df, config=config)

# 4. EXPLORE RESULTS
print(f"\n✓ Analysis complete in {result.metadata.execution_time_ms}ms")
print(f"  • Tests: {result.metadata.tests_executed}")
print(f"  • Visualizations: {len(result.visualizations)}")
print(f"  • Insights: {len(result.insights)}")
print(f"  • Time series analyzed: {len(result.time_series_analysis)}")
print(f"  • Clusters: {len(result.clusters)}")

print("\n📊 Key Insights:")
for i, insight in enumerate(result.insights[:5], 1):
    print(f"  {i}. {insight.title}")
    if hasattr(insight, 'narrative'):
        print(f"     {insight.narrative[:100]}...")

# 5. GENERATE REPORTS
print("\n📄 Generating reports...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# HTML Report
html_generator = HTMLReportGenerator(
    theme="light",
    logo_text="Sales Analytics",
    company_name="ACME Corp"
)
html = html_generator.generate(
    result,
    title="Sales Analysis Report",
    author="Data Science Team"
)
html_path = f"reports/sales_analysis_{timestamp}.html"
os.makedirs("reports", exist_ok=True)
with open(html_path, "w") as f:
    f.write(html)
print(f"  ✓ HTML: {html_path}")

# PDF Report
pdf_bytes = generate_pdf_report(
    result,
    title="Sales Analysis Report",
    author="Data Science Team"
)
pdf_path = f"reports/sales_analysis_{timestamp}.pdf"
with open(pdf_path, "wb") as f:
    f.write(pdf_bytes)
print(f"  ✓ PDF:  {pdf_path}")

# JSON Export
json_path = f"reports/sales_analysis_{timestamp}.json"
import json
with open(json_path, "w") as f:
    json.dump(result.to_dict(), f, indent=2)
print(f"  ✓ JSON: {json_path}")

print("\n✅ Analysis complete!")
print(f"Reports saved to: {os.path.abspath('reports')}")
```

**Output:**
```
📁 Loading data...
✓ Loaded 150,432 rows

⚙️  Configuring analysis...

🔍 Running analysis...

✓ Analysis complete in 3421ms
  • Tests: 47
  • Visualizations: 18
  • Insights: 12
  • Time series analyzed: 2
  • Clusters: 5

📊 Key Insights:
  1. Significant correlation detected: total_amount vs. customer_age
  2. Strong seasonality in Q4 sales
  3. Customer segmentation: 5 distinct groups identified
  4. Outliers detected in unit_price column
  5. Increasing trend in repeat customer rate

📄 Generating reports...
  ✓ HTML: reports/sales_analysis_20250307_143021.html
  ✓ PDF:  reports/sales_analysis_20250307_143021.pdf
  ✓ JSON: reports/sales_analysis_20250307_143021.json

✅ Analysis complete!
Reports saved to: /home/user/reports
```

---

## 📈 Performance & Scaling

| Dataset Size | Processing Time | Max Parallel Tasks |
|---|---|---|
| **10K rows** | 1–2 seconds | 3 |
| **100K rows** | 5–10 seconds | 4 |
| **1M rows** | 30–60 seconds | 4 |
| **10M rows** | 3–5 minutes | 4 (chunked) |
| **100M rows** | 10–30 minutes | 4 (chunked + sampled) |

**Optimization Strategies:**
- ✅ Automatic sampling for datasets > 1M rows
- ✅ Parallel execution (4 workers by default)
- ✅ Result caching (file or Redis)
- ✅ Progress callbacks for long-running analyses
- ✅ Memory-aware warnings (logs warning if > 1GB)
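
Single-pass sampling is what keeps the row cap cheap on huge inputs; a stdlib sketch of reservoir sampling, one standard way to do it (not necessarily how Xelytics samples):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Reservoir sampling: a uniform size-k sample from a stream of unknown
    length, in one pass and O(k) memory."""
    rng = random.Random(seed)       # seeded, so sampling stays deterministic
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x       # replace with decreasing probability
    return sample

print(reservoir_sample(range(1_000_000), k=5))
```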

---

## 📊 Feature Comparison

| Feature | v0.1.0 | v0.2.0 | 
|---|:---:|:---:|
| **Statistical Analysis** | ✅ | ✅ |
| Automated test selection | ✅ | ✅ |
| Effect size calculation | ✅ | ✅ |
| Assumption checking | ✅ | ✅ |
| **Time Series (NEW)** | — | ✅ |
| Detection & decomposition | — | ✅ |
| ARIMA & ES forecasting | — | ✅ |
| Anomaly detection | — | ✅ |
| Change point detection | — | ✅ |
| **Clustering (NEW)** | — | ✅ |
| K-Means | — | ✅ |
| DBSCAN | — | ✅ |
| Hierarchical | — | ✅ |
| Cluster profiling | — | ✅ |
| **Performance (NEW)** | — | ✅ |
| Parallel execution | — | ✅ |
| Result caching | — | ✅ |
| Sampling strategies | — | ✅ |
| Chunked processing | — | ✅ |
| **Connectors (NEW)** | — | ✅ |
| PostgreSQL | — | ✅ |
| MySQL/MariaDB | — | ✅ |
| SQLite | — | ✅ |
| BigQuery | — | ✅ |
| Snowflake | — | ✅ |
| S3/Azure/GCS | — | ✅ |
| **Export (NEW)** | — | ✅ |
| HTML reports | — | ✅ |
| PDF export | — | ✅ |
| PowerPoint slides | — | ✅ |
| Jupyter notebooks | — | ✅ |
| JSON export | — | ✅ |
| **Other Features** | | |
| Data profiling | ✅ | ✅ |
| Rule-based insights | ✅ | ✅ |
| LLM narration | ✅ | ✅ |
| Custom pipelines | — | ✅ |
| Progress callbacks | — | ✅ |
| CLI interface | — | ✅ |
| Backward compatible | — | ✅ |

---

## 🔧 Installation & Setup

### System Requirements

- **Python:** 3.9, 3.10, 3.11, 3.12
- **OS:** Linux, macOS, Windows
- **RAM:** 2GB minimum; 8GB+ recommended for large datasets

### Basic Installation

```bash
# Minimal (core features only)
pip install -e .

# Development
pip install -e ".[dev]"

# Production (all features)
pip install -e ".[advanced,connectors,export,llm]"

# Everything (including dev tools)
pip install -e ".[advanced,connectors,export,llm,dev]"
```

### Verify Installation

```bash
python -c "from xelytics import analyze; print('✓ Xelytics installed')"

# Check version
python -c "import xelytics; print(xelytics.__version__)"

# Test CLI
xelytics --version
```

---

## 📚 Documentation

Full documentation is available in the `docs/` folder:

| Topic | Location |
|---|---|
| **🚀 Installation** | [docs/installation.md](docs/installation.md) |
| **📖 Quick Start** | [docs/quickstart.md](docs/quickstart.md) |
| **📊 Statistical Analysis** | [docs/guides/01_basic_analysis.md](docs/guides/01_basic_analysis.md) |
| **⏱️ Time Series** | [docs/guides/02_time_series.md](docs/guides/02_time_series.md) |
| **🎯 Clustering** | [docs/guides/03_clustering.md](docs/guides/03_clustering.md) |
| **⚡ Performance** | [docs/guides/04_performance.md](docs/guides/04_performance.md) |
| **🔗 Connectors** | [docs/guides/05_connectors.md](docs/guides/05_connectors.md) |
| **📄 Export & Reports** | [docs/guides/06_export_reports.md](docs/guides/06_export_reports.md) |
| **🛠️ Custom Pipelines** | [docs/guides/07_custom_pipelines.md](docs/guides/07_custom_pipelines.md) |
| **💻 CLI Guide** | [docs/guides/08_cli.md](docs/guides/08_cli.md) |
| **🔍 API Reference** | [docs/api/](docs/api/) |
| **📋 Examples** | [examples/](examples/) |
| **📜 Migration Guide** | [docs/migration/v01_to_v02.md](docs/migration/v01_to_v02.md) |
| **📑 API Contract** | [API_CONTRACT.md](API_CONTRACT.md) |
| **📝 Comprehensive Docs** | [COMPREHENSIVE_DOCUMENTATION.md](COMPREHENSIVE_DOCUMENTATION.md) |

---

## 🛠️ Development

### Setup Development Environment

```bash
# Clone repository
git clone https://github.com/xelytics/xelytics-core.git
cd xelytics-core

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install dev dependencies
pip install -e ".[dev,advanced,connectors,export]"
```

### Running Tests

```bash
# All tests
pytest tests/ -v

# Specific test file
pytest tests/test_clustering.py -v

# Tests matching pattern
pytest tests/ -k "test_kmeans" -v

# With coverage report
pytest tests/ --cov=xelytics --cov-report=html

# Only unit tests (exclude slow integration tests)
pytest tests/ -m "not integration" -v

# Only fast tests
pytest tests/ -m "not slow" -v
```

### Code Formatting & Linting

```bash
# Format code with Black
black xelytics/ tests/ examples/

# Check formatting
black --check xelytics/ tests/

# Lint with Ruff
ruff check xelytics/ tests/ --fix

# Type checking with mypy
mypy xelytics/
```

### Building & Publishing

```bash
# Build package
pip install build
python -m build

# Validate the built distributions
pip install twine
python -m twine check dist/*

# Publish to PyPI (requires credentials)
python -m twine upload dist/*
```

---

## 🧪 Testing & Quality Assurance

**Test Coverage:** 85%+ (307 tests)

**Test Categories:**

| Category | Count | Status |
|---|---|---|
| Unit Tests | 200+ | ✅ Passing |
| Integration Tests | 50+ | ✅ Passing |
| Performance Tests | 20+ | ✅ Passing |
| Backward Compatibility Tests | 8 | ✅ Passing (v0.1.0 code works in v0.2.0) |
| Example Scripts | 5 | ✅ Working |

**Key Test Suites:**
- ✅ `test_core.py` - Data ingestion, profiling, feature detection
- ✅ `test_clustering.py` - K-Means, DBSCAN, Hierarchical
- ✅ `test_timeseries_advanced.py` - Decomposition, forecasting, anomalies
- ✅ `test_stats.py` - Statistical tests, effect sizes, assumptions
- ✅ `test_connectors_integration.py` - Database connectivity
- ✅ `test_export.py` - HTML, PDF, PowerPoint, notebook export
- ✅ `test_caching.py` - File and Redis caching
- ✅ `test_v02_backward_compatibility.py` - v0.1.0 compatibility

**Run Full Test Suite:**

```bash
# Quick run (excludes slow tests)
pytest tests/ -m "not slow" --tb=short

# Full run (includes slow + integration)
pytest tests/ -v --tb=short

# With coverage
pytest tests/ --cov=xelytics --cov-report=term-missing
```

---

## 🏗️ Architecture

### System Design

```
┌─────────────────────────────────┐
│    Public API Layer             │
│  analyze() / AnalysisConfig     │
└──────────────┬──────────────────┘
               │
┌──────────────▼──────────────────┐
│    Data Ingestion Layer         │
│  Connectors, DataFrames, Files  │
└──────────────┬──────────────────┘
               │
┌──────────────▼──────────────────┐
│    Processing Core              │
│  Type Detection, Sampling       │
│  Feature Detection, Profiling   │
└──────────────┬──────────────────┘
               │
       ┌───────┴─────────┬──────────────┐
       │                 │              │
   ┌───▼────┐  ┌────────▼──┐  ┌───────▼──┐
   │ Stats  │  │ TimeSeries│  │Clustering│
   │Engine  │  │ Engine    │  │ Engine   │
   └───┬────┘  └────────┬──┘  └───────┬──┘
       │                │              │
       └────────┬───────┴──────────────┘
                │
      ┌─────────▼──────────┐
      │  Visualization &   │
      │  Insight Generator │
      └─────────┬──────────┘
                │
      ┌─────────▼──────────┐
      │  Export Layer      │
      │  HTML/PDF/PPTX/etc │
      └────────────────────┘
```
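The diagram reads as a linear pipeline: ingestion feeds the processing core, which fans out to the analysis engines, whose results are merged for visualization and export. A minimal stdlib sketch of that control flow (all function names here are illustrative, not the actual xelytics internals):

```python
# Illustrative sketch of the layered flow above; every name is hypothetical.
def ingest(source):
    # Data Ingestion Layer: normalize any source into rows
    return list(source)

def profile(rows):
    # Processing Core: derive simple metadata about the data
    return {"n_rows": len(rows)}

def stats_engine(rows):
    # One of the three engines (Stats / TimeSeries / Clustering)
    return {"mean": sum(rows) / len(rows)}

def run_analysis(source):
    rows = ingest(source)
    # Fan out to the engines (only one shown), then merge the results
    # into a single object handed to the visualization/export layers.
    return {"profile": profile(rows), "stats": stats_engine(rows)}

print(run_analysis([1, 2, 3, 4]))
```

The real `analyze()` adds type detection, sampling, and parallel engine execution on top of this basic shape.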

### Module Breakdown

```
xelytics-core/
├── xelytics/
│   ├── __init__.py               # Public API
│   ├── engine.py                 # Main analyze() function
│   ├── exceptions.py             # Exception hierarchy
│   │
│   ├── core/                     # Data pipeline
│   │   ├── ingestion.py          # Type detection, validation
│   │   ├── profiler.py           # Column statistics
│   │   ├── features.py           # Feature detection
│   │   └── chunked.py            # Large dataset processing
│   │
│   ├── stats/                    # Statistical analysis
│   │   ├── engine.py             # Test selection & execution
│   │   ├── planner.py            # Analysis planning
│   │   └── ...
│   │
│   ├── timeseries/               # Time series (v0.2.0)
│   │   ├── detector.py           # Series detection
│   │   ├── decomposition.py      # Trend/seasonal separation
│   │   ├── forecasting.py        # ARIMA/ExpSmoothing
│   │   ├── anomaly.py            # Anomaly detection
│   │   └── change_points.py      # Change point detection
│   │
│   ├── clustering/               # Clustering (v0.2.0)
│   │   ├── kmeans.py             # K-Means
│   │   ├── dbscan.py             # DBSCAN
│   │   ├── hierarchical.py       # Hierarchical clustering
│   │   └── profiler.py           # Cluster profiling
│   │
│   ├── connectors/               # Data sources (v0.2.0)
│   │   ├── postgres.py           # PostgreSQL
│   │   ├── mysql.py              # MySQL/MariaDB
│   │   ├── database.py           # Base SQL class
│   │   ├── s3.py                 # AWS S3
│   │   ├── cloud.py              # Azure/GCS
│   │   └── ...
│   │
│   ├── export/                   # Report generation (v0.2.0)
│   │   ├── html.py               # HTML reports
│   │   ├── pdf.py                # PDF export
│   │   ├── pptx.py               # PowerPoint slides
│   │   ├── notebook.py           # Jupyter notebooks
│   │   └── ...
│   │
│   ├── cache/                    # Caching (v0.2.0)
│   │   ├── base.py               # Cache interface
│   │   ├── file.py               # File-based cache
│   │   └── redis.py              # Redis cache
│   │
│   ├── pipeline/                 # Custom pipelines (v0.2.0)
│   │   ├── __init__.py           # Pipeline class
│   │   └── steps.py              # Pre-built steps
│   │
│   ├── llm/                      # LLM integration
│   │   ├── openai.py             # OpenAI provider
│   │   ├── groq.py               # Groq provider
│   │   └── base.py               # Provider interface
│   │
│   ├── viz/                      # Visualizations
│   │   ├── generator.py          # Plotly spec generation
│   │   └── themes.py             # Color schemes
│   │
│   ├── insights/                 # Insight generation
│   │   ├── rules.py              # Rule-based insights
│   │   └── templates.py          # Insight templates
│   │
│   ├── schemas/                  # Type definitions
│   │   ├── config.py             # AnalysisConfig
│   │   └── outputs.py            # AnalysisResult & schemas
│   │
│   └── cli/                      # Command-line interface
│       └── main.py               # CLI entry point
│
├── tests/                        # 300+ tests
│   ├── test_core.py
│   ├── test_clustering.py
│   ├── test_timeseries_*.py
│   ├── test_connectors_integration.py
│   ├── test_export.py
│   └── ...
│
├── examples/                     # Example scripts
│   ├── quickstart.py
│   ├── forecasting_demo.py
│   └── ...
│
├── docs/                         # Full documentation
│   ├── guides/                   # Step-by-step guides
│   ├── api/                      # API reference
│   └── examples/                 # Example notebooks
│
└── pyproject.toml                # Dependencies & config
```

---

## 📋 API Classes & Functions

### Core Classes

```python
# Main entry point
from xelytics import analyze, AnalysisConfig, AnalysisResult

# Configuration
config = AnalysisConfig(...)

# Run analysis
result: AnalysisResult = analyze(df, config=config)

# Access results
result.summary              # DatasetSummary
result.statistics           # List[StatisticalTestResult]
result.visualizations       # List[VisualizationSpec]
result.insights             # List[Insight]
result.time_series_analysis # List[TimeSeriesResult]
result.clusters             # List[ClusterResult]
result.metadata             # RunMetadata
```

### Data Source Connectors

```python
from xelytics.connectors import connect_to_source

connector = connect_to_source(source_type="postgresql", ...)
df = connector.query("SELECT * FROM table")
```
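The connector pattern (connect once, then query into a tabular result) can be illustrated with Python's built-in `sqlite3`; this standalone sketch uses only the stdlib, whereas the xelytics connectors wrap each backend and return pandas DataFrames:

```python
import sqlite3

# Stand-alone illustration of the connect-then-query pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 95.5)],
)
rows = conn.execute("SELECT region, amount FROM sales ORDER BY region").fetchall()
print(rows)  # [('north', 120.0), ('south', 95.5)]
conn.close()
```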

### Export Functions

```python
from xelytics.export import (
    HTMLReportGenerator,
    generate_pdf_report,
    generate_pptx_report,
    generate_notebook,
)
```
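Report generators of this kind render analysis results into a template. A minimal stdlib sketch of the idea using `string.Template` (the package depends on jinja2, so the real `HTMLReportGenerator` presumably uses richer templates and result objects):

```python
from string import Template

# Minimal sketch of template-based HTML export; names are illustrative.
PAGE = Template(
    "<html><body><h1>$title</h1><p>Rows analyzed: $n_rows</p></body></html>"
)

def render_report(title, n_rows):
    return PAGE.substitute(title=title, n_rows=n_rows)

html = render_report("Sales Analysis", 1000)
print(html)
```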

### Caching

```python
from xelytics.cache import FileCache, RedisCache, get_cache, clear_cache

cache = get_cache("file", cache_dir="./cache")
```
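A file-backed cache like `FileCache` typically hashes the key into a filename and serializes the value to disk. A hedged stdlib sketch of that pattern (not the actual xelytics implementation):

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

class TinyFileCache:
    """Illustrative file cache: key -> hashed filename -> pickled value."""

    def __init__(self, cache_dir):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        # Hash the key so arbitrary strings map to safe filenames
        return self.dir / (hashlib.sha256(key.encode()).hexdigest() + ".pkl")

    def set(self, key, value):
        self._path(key).write_bytes(pickle.dumps(value))

    def get(self, key, default=None):
        p = self._path(key)
        return pickle.loads(p.read_bytes()) if p.exists() else default

cache = TinyFileCache(tempfile.mkdtemp())
cache.set("profile:v1", {"n_rows": 100})
print(cache.get("profile:v1"))  # {'n_rows': 100}
```

A Redis backend swaps the file read/write for `GET`/`SET` against a server while keeping the same interface.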

### Time Series

```python
from xelytics.timeseries import (
    analyze_time_series,
    decompose_time_series,
    forecast_time_series,
    detect_anomalies,
    detect_change_points,
)
```
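Of these, anomaly detection is the easiest to sketch: a common baseline flags points more than k standard deviations from the series mean. A stdlib illustration of that baseline (the actual `detect_anomalies` likely supports more methods and parameters):

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag indices whose z-score exceeds the threshold (illustrative baseline)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # constant series: nothing is anomalous
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]

series = [10, 11, 9, 10, 12, 10, 11, 50, 10, 9]
print(zscore_anomalies(series, threshold=2.0))  # [7]
```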

### Clustering

```python
from xelytics.clustering import (
    analyze_clusters,
    cluster_kmeans,
    cluster_dbscan,
    cluster_hierarchical,
    profile_clusters,
)
```
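K-Means itself is compact enough to sketch: alternate between assigning points to the nearest center and recomputing each center as its cluster mean. A toy 1-D stdlib illustration (`cluster_kmeans` presumably wraps scikit-learn and adds scaling and selection of k):

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D K-Means: assign each point to its nearest center, then update."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute centers as cluster means; keep the old center if empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans_1d(data, centers=[0.0, 5.0]))  # [2.0, 11.0]
```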

---

## 🤝 Contributing

We welcome contributions! Here's how you can help:

### Reporting Issues

1. Check [existing issues](https://github.com/xelytics/xelytics-core/issues)
2. Create new issue with:
   - Descriptive title
   - Steps to reproduce
   - Expected vs actual behavior
   - Environment info (Python version, OS, xelytics version)

### Submitting Changes

1. Fork the repository
2. Create branch: `git checkout -b feature/my-feature`
3. Make changes and add tests
4. Format code: `black xelytics/ tests/`
5. Run tests: `pytest tests/`
6. Commit: `git commit -am 'Add my feature'`
7. Push: `git push origin feature/my-feature`
8. Create Pull Request

### Code Standards

- **Style:** Black formatting, 100-char line length
- **Types:** Type hints for all functions
- **Tests:** Each feature needs tests (85%+ coverage target)
- **Docs:** Docstrings for all public functions

---

## 📄 Changelog

### v0.2.0-alpha.1 (February 2026) — Current

**Phases Completed:**
- ✅ **Phase 1:** Foundation & backward compatibility
- ✅ **Phase 2:** Time series analysis
- ✅ **Phase 3:** Clustering & profiling

**Key Features Added:**
- Time series: detection, decomposition, forecasting, anomalies, change points
- Clustering: K-Means, DBSCAN, Hierarchical, profiling
- Connectors: PostgreSQL, MySQL, SQLite, BigQuery, Snowflake, S3, Azure, GCS
- Export: HTML, PDF, PowerPoint, Jupyter notebooks
- Performance: Parallel execution, caching, sampling, chunked processing
- CLI and custom pipelines

**v0.1.0 → v0.2.0 Compatibility:**
✅ 100% backward compatible — All v0.1.0 code works unchanged

See [CHANGELOG.md](CHANGELOG.md) for full history and [API_CONTRACT.md](API_CONTRACT.md) for versioning policy.

---

## 🎓 Learning Resources

- **API Documentation:** See [docs/api/](docs/api/)
- **Quick Start:** [docs/quickstart.md](docs/quickstart.md)
- **Example Scripts:** [examples/](examples/)
- **GitHub Discussions:** Ask questions in [GitHub Discussions](https://github.com/xelytics/xelytics-core/discussions)
- **Issues:** Report bugs in [GitHub Issues](https://github.com/xelytics/xelytics-core/issues)

---

## 📞 Support

| Channel | Purpose |
|---|---|
| 📖 **Documentation** | How-to guides, API reference, examples |
| 💬 **GitHub Discussions** | Q&A, feature ideas, best practices |
| 🐛 **GitHub Issues** | Bug reports, feature requests |
| 📧 **Email** | contact@xelytics.io |

---

## 📜 License

MIT License — see [LICENSE](LICENSE) for details.

```
Copyright (c) 2026 Xelytics Team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
```

---

## 🙏 Acknowledgments

Built with ❤️ using:
- [pandas](https://pandas.pydata.org/) — Data manipulation
- [scikit-learn](https://scikit-learn.org/) — Machine learning
- [statsmodels](https://www.statsmodels.org/) — Statistical modeling
- [plotly](https://plotly.com/) — Interactive visualizations
- [pingouin](https://pingouin-stats.org/) — Statistical functions

---

## 📊 Project Status

| Component | v0.1.0 | v0.2.0 | Status |
|---|---|---|---|
| Core Analytics | Beta | Beta | ✅ Stable |
| Time Series | — | Beta | ✅ Working |
| Clustering | — | Beta | ✅ Working |
| Connectors | — | Beta | ✅ Working |
| Export | — | Beta | ✅ Working |
| CLI | — | Beta | ✅ Working |

**Next Milestones:**
- v0.2.1: Bug fixes, performance improvements
- v0.3.0: Advanced forecasting (Prophet), deep learning integration
- v1.0.0: API stabilization, user feedback incorporation

---

## ⭐ Star this repository if you find it useful!

Questions? [Open an issue](https://github.com/xelytics/xelytics-core/issues) or [start a discussion](https://github.com/xelytics/xelytics-core/discussions).
