Metadata-Version: 2.3
Name: pyquery-polars
Version: 2.5.0
Summary: Enterprise-grade Headless ETL Engine with Interactive UI
Keywords: PyQuery,Polars,ETL,Big Data,Excel,Power BI,Automation,Analytics,Audit
Author: Shan
Author-email: Shan <tksudharshan@gmail.com>
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Office/Business :: Financial :: Spreadsheet
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: MacOS
Classifier: Natural Language :: English
Requires-Dist: polars>=1.0.0
Requires-Dist: streamlit>=1.30.0
Requires-Dist: fastapi>=0.109.0
Requires-Dist: uvicorn>=0.25.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: rich>=13.7.0
Requires-Dist: questionary>=2.0.0
Requires-Dist: xlsxwriter>=3.1.0
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: connectorx>=0.3.3
Requires-Dist: fastexcel>=0.16.0
Requires-Dist: python-multipart>=0.0.20
Requires-Dist: matplotlib>=3.9.4
Requires-Dist: seaborn>=0.13.2
Requires-Dist: plotly>=6.5.0
Requires-Dist: statsmodels>=0.14.6
Requires-Dist: scikit-learn>=1.6.1
Requires-Dist: scipy>=1.13.1
Requires-Dist: chardet>=5.2.0
Requires-Python: >=3.9
Project-URL: Changelog, https://github.com/tks18/pyquery/releases
Project-URL: Documentation, https://github.com/tks18/pyquery#readme
Project-URL: Homepage, https://github.com/tks18/pyquery
Project-URL: Issues, https://github.com/tks18/pyquery/issues
Project-URL: Repository, https://github.com/tks18/pyquery
Description-Content-Type: text/markdown

<div align="center">

# ⚡ PyQuery: The Main Character of Data Stacks 💫

### _ETL. EDA. ML. SQL. IDE._

[![Execution](https://img.shields.io/badge/Execution-Lazy_Execution_Enabled-6A0DAD?style=for-the-badge)](#)
[![Mode](https://img.shields.io/badge/Mode-Power_User_Ready-8B0000?style=for-the-badge)](#)
[![Privacy](https://img.shields.io/badge/Privacy-Local_First_No_Cloud-2E8B57?style=for-the-badge)](#)
[![Design](https://img.shields.io/badge/Design-Premium_UX-%238A2BE2?style=for-the-badge)](#)
[![Stack](https://img.shields.io/badge/Stack-Full_IDE_Included_💻-007ACC?style=for-the-badge)](#)

[![PyPI Version](https://img.shields.io/pypi/v/pyquery-polars.svg?color=4CAF50&logo=python&logoColor=white)](https://pypi.org/project/pyquery-polars/)
[![Python Versions](https://img.shields.io/pypi/pyversions/pyquery-polars.svg?color=blue)](https://pypi.org/project/pyquery-polars/)
[![License](https://img.shields.io/github/license/tks18/pyquery.svg?color=orange)](LICENSE)

![Rows of Data](https://i.giphy.com/sRFEa8lbeC7zbcIZZR.webp)

**Stop letting Pandas hold you back. The single-threaded era is over.** 🚩

**PyQuery** is a local-first data operating system that **auto-heals broken CSVs**, includes a **native Code Editor**, and processes **100GB+ files** without breaking a sweat. ⚡

[Feature Request](https://github.com/tks18/pyquery/issues) · [Report Bug](https://github.com/tks18/pyquery/issues)

</div>

---

## 🧠 TL;DR (For the goldfish attention spans)

> **New in recent drops:**
> PyQuery now boots like an operating system, adapts its CLI theme based on your mood, fixes broken CSV encodings automatically, and can isolate bad files so one cursed spreadsheet doesn’t take your entire pipeline hostage.

1. **Install it:** `pip install pyquery-polars` (Don't be basic).
2. **Run it:** `pyquery ui` (Visuals) or `pyquery run` (Speedrun).
3. **The Flex:** It's a local-first, privacy-focused engine that eats Excel sheets and CSVs for breakfast using Rust. It has built-in ML, Monte Carlo sims, and SQL. It's basically a Data Scientist in a box.

---

## ⛩️ The Awakening (Lore)

Long ago, the Data World was mid. Analysts lived in fear of the `MemoryError`. They bowed before the single-threaded tyranny of the Old Gods (Pandas). They accepted their fate of freezing screens, crashing kernels, and waiting 4 hours for a simple `groupby`.

**But I refused.**

From the depths of the Rusty abyss, **PyQuery** has awakened. I am not just an ETL tool anymore. I am the entire war room. I am here to **obliterate** your bottlenecks and **ratio** your old benchmarks.

While they study the blade (Excel), I studied the **Lazy Frame**.
While they manage memory, I **devour** it.
While they draw primitive charts, I **simulate the future**.

The age of waiting is over. **Total Domination** is the only metric that matters.

**Welcome to your Villain Arc.** 👹

---

## 🖥️ The Main Character CLI (Boot Sequence Update)

This is not a command line.  
This is a **startup ritual**.

Every time PyQuery boots, it behaves like a data OS coming online.

### ⚡ Adaptive Theme Engine

- The CLI dynamically switches **color gradients, borders, and mood** based on the selected boot mode:
  - **Cyberpunk** (default neon main-character energy)
  - **Rustacean** (pure Polars lore)
  - **Matrix** (hacker-core, green text supremacy)
  - **Villain Arc** (purple & gold, no mercy)
- Each theme announces itself during startup. You _feel_ it before you run anything.

### 📟 Sequential Boot Logs

- Real-time kernel-style logs with cinematic pacing:
  - Timestamped steps
  - Module icons (`⚡ Engine`, `💾 IO`, `🧠 Planner`)
- It doesn’t say “loading”…  
  It **declares intent**.

Your terminal doesn’t just start PyQuery.  
It **witnesses it**.

### 🧩 Focused UI, Zero Distractions (Modal Upgrade)

Sidebars are for tourists.  
PyQuery now loads data through **dedicated modal dialogs** — because loading data is a moment, not a side quest.

- **Dialog-Based Loading**: File, SQL, and API loaders now open in focused modals.
- **Blazing-Fast & Optimistic**:
  - The dialog opens **instantly**. No waiting for file scans.
  - **Lazy Preview**: We scan 100k+ files without freezing the UI. Preview only when you ask.
  - **Recent Paths**: Caches your last successful loads. We remember so you don't have to.
- **Dynamic Layouts**: The UI adapts intelligently:
  - No sheet selector unless it’s Excel.
  - No base-folder input unless you’re using pattern mode.
- **Preview Before Commit**:
  - See **matched files** before loading.
  - Preview **Excel sheets** before importing.

You don’t guess anymore.  
You **confirm with intent**.

---

## 🧾 PyQuery vs. Power Query: The Roast

We don't usually punch down, but you asked for it.

| Feature             | ⚡ **PyQuery** (The Chad)                                      | 🐢 **Power Query** (The Virgin)                                             |
| ------------------- | -------------------------------------------------------------- | --------------------------------------------------------------------------- |
| **Speed**           | **Rust-Powered.** Processes millions of rows before you blink. | **Single-Threaded.** Spends 20 mins saying "Loading Data..." just to crash. |
| **Language**        | **Python/SQL/Polars.** The languages of gods.                  | **M-Code.** A language invented to punish humanity.                         |
| **AI/ML**           | **Built-in.** Random Forests, Clustering, & Monte Carlo Sims.  | **Non-existent.** You need a generic "AI Plugin" that costs extra.          |
| **Vibe**            | **Dark Mode CLI & Streamlit.** Cyberpunk aesthetic.            | **Corporate Grey.** It sucks the soul out of your body.                     |
| **Price**           | **Free & Open Source.**                                        | **Requires an Office 365 License** (Subscription L).                        |
| **Boot Experience** | **Cinematic CLI with Themes & Logs**                           | **Static Spinner of Doom**                                                  |
| **Broken CSVs**     | **Auto-healed at ingest**                                      | **Crashes silently**                                                        |
| **One Bad File**    | **Isolated & corrected**                                       | **Pipeline dead**                                                           |

---

## 💪 The Flex (Why We Are Him)

We built an empire so you can rule yours. This isn't just software; it's a lifestyle.

### 🎯 EDA: The Crystal Ball (Expanded)

> _"Most tools describe the past. PyQuery predicts the future."_

We dropped a massive update. EDA is no longer just "looking at data". It's **hunting**. Check the **EDA Field Manual** for the full grimoire, but here's the tea:

#### 1. 🧬 Dataset DNA & Health Check

We scan your data's soul.

- **Missing Cells**: We don't just count nulls; we judge them. (<1% is excellent, >10% is sloppy).
- **Cardinality Checks**: Instantly know if a column is categorical or continuous.
- **Duplicate Detection**: We find the clones and eliminate them.
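
A minimal sketch of what a health check like this might compute, using only the standard library. The function name `health_report`, the sample rows, and the "acceptable" label for the 1-10% band are assumptions for illustration; PyQuery's own implementation runs on Polars and may differ.

```python
from collections import Counter


def health_report(rows, columns):
    """Summarise missing cells, per-column cardinality, and duplicate rows."""
    total_cells = len(rows) * len(columns)
    missing = sum(1 for row in rows for col in columns if row.get(col) is None)
    missing_pct = 100.0 * missing / total_cells if total_cells else 0.0

    # Thresholds come from the bullet above; the middle band is an assumed label.
    if missing_pct < 1:
        verdict = "excellent"
    elif missing_pct > 10:
        verdict = "sloppy"
    else:
        verdict = "acceptable"

    # Cardinality: number of distinct non-null values per column.
    cardinality = {
        col: len({row[col] for row in rows if row.get(col) is not None})
        for col in columns
    }

    # A row counts as a duplicate when its full tuple of values was already seen.
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    return {
        "missing_pct": missing_pct,
        "verdict": verdict,
        "cardinality": cardinality,
        "duplicates": duplicates,
    }


rows = [
    {"city": "Pune", "amount": 10},
    {"city": "Pune", "amount": 10},
    {"city": "Delhi", "amount": None},
    {"city": "Delhi", "amount": 20},
]
report = health_report(rows, ["city", "amount"])
```

A low cardinality relative to the row count hints at a categorical column; a cardinality close to the row count hints at a continuous one or an ID.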

#### 2. 🚀 The Action Engine (ML Strategist)

- **Strategic Brief**: A "Top 3 Insights" card that ranks every signal in your data. It whispers: _"The money is here."_
- **Automated Drivers**: It finds the hidden variables controlling your target.
  - _"Why is Churn high? It's not Price. It's Customer Support Wait Time > 5m."_ -> **Boom. Solved.**

- **Correlation Matrix**: We calculate Pearson, Cramer’s V, and F-Tests automatically. We know the relationships better than you know your own situationship.
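
For the curious, here is a rough standard-library sketch of two of the measures named above: Pearson for numeric pairs and Cramér's V for categorical pairs. The function names and sample data are assumptions, the F-test is left out, and this is not PyQuery's internal code.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def cramers_v(a, b):
    """Cramér's V (no bias correction) between two categorical sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    row_totals, col_totals = Counter(a), Counter(b)

    # Chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


linear = pearson([1, 2, 3, 4], [2, 4, 6, 8])
linked = cramers_v(["a", "a", "b", "b"], ["p", "p", "q", "q"])
unlinked = cramers_v(["a", "a", "b", "b"], ["p", "q", "p", "q"])
```

Pearson ranges from -1 to 1, while Cramér's V ranges from 0 (no association) to 1 (one column fully determines the other).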

#### 3. 🧪 ML Laboratory (The Brain)

- **Auto-Pilot Mode**: Trains an army of models (Random Forest, Lasso, Ridge) to find the best fit. You sit back and look busy.
- **Clustering (Unsupervised Rizz)**:
  - **Elbow Plots & Silhouette Scores**: We optimize so you don't have to guess.
  - **Cluster DNA**: We name the segments for you. "Cluster 1 = High Spend, Low Age."

- **Explainable Anomalies**: Uses Isolation Forests to catch the weirdos and fraudsters almost instantly. We provide a **Contextual Profiler** to tell you _why_ they are weird.

#### 4. 🎮 Decision Simulator (The Time Machine)

- **"What-If" Sliders**: Change variables in real-time. "If I raise Price by 10% and lower ad spend, do I still profit?"
- **Monte Carlo Sims**: Run 1,000+ simulations with Normal/Uniform distributions. We don't guess; we calculate the probability of your success.
- **Waterfall Analysis**: The Model breaks down exactly _why_ the prediction changed.
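
To make the Monte Carlo idea concrete, here is a small sketch using only the standard library. The profit formula, the distribution parameters, and the function names are invented for illustration; in PyQuery the simulated quantity would come from your own model.

```python
import random


def simulate_profit(n_runs=1000, seed=42):
    """Draw n_runs scenarios and return the simulated profit for each one."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    unit_cost, fixed_cost = 12.0, 5000.0  # assumed constants
    profits = []
    for _ in range(n_runs):
        price = rng.gauss(20.0, 2.0)    # Normal distribution: mean 20, std dev 2
        units = rng.uniform(800, 1200)  # Uniform distribution between 800 and 1200
        profits.append((price - unit_cost) * units - fixed_cost)
    return profits


def probability_of_profit(profits):
    """Share of simulated scenarios that end above break-even."""
    return sum(p > 0 for p in profits) / len(profits)


profits = simulate_profit()
chance = probability_of_profit(profits)
```

The output is a probability rather than a single forecast: the fraction of simulated scenarios in which the profit stays positive.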

#### 5. 📈 Time Series & Visuals That Don't Miss

- **Holt-Winters Forecasting**: We predict the future with confidence intervals.
- **Decomposition**: We split your data into Trend, Seasonality, and Noise.
- **Cohort Comparison (Volcano Plots)**: Visualizing "Effect Size" vs "Significance." We bring the science.

#### 6. 💻 The Integrated IDE (Code is Power)

For those who speak the language of the gods (Python/SQL), we built a **React-based Code Editor** right inside the UI. _(Adapted from [streamlit-code-editor](https://github.com/bouzidanas/streamlit-code-editor))_

- **Embedded Ace Editor**: Syntax highlighting, line numbers, and active line focus. It feels like VS Code, but it lives in your browser.
- **Intelligent Auto-Completions**:
  - **Context-Aware**: It knows your data. It suggests `pl`, `np`, `math`.
  - **Polars Snippets**: Type `col` -> get `col("name")`. Type `when` -> get `when().then().otherwise()`.
  - **SQL Knowledge**: It suggests keywords and functions while you type query logic.
- **Sandboxed Custom Scripts**:
  - **AST-Validated Security**: We parse your code *before* execution.
  - **Blocked**: `import os`, private attributes, system calls.
  - **Allowed**: `numpy`, `scipy`, `sklearn`. Pure math and logic only.
  - **The Contract**: Define `pyquery_transform(lf)` and return a LazyFrame. We handle the rest.
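
A minimal sketch of how AST validation of a custom script can work, using Python's `ast` module. The allow-list, the blocked builtin names, and the `validate_script` helper are assumptions for illustration, not PyQuery's actual rule set. A static check like this is one layer of defence and not a full sandbox on its own.

```python
import ast

# Assumed allow-list: the modules named above plus math and polars.
ALLOWED_IMPORTS = {"numpy", "scipy", "sklearn", "math", "polars"}
# Assumed deny-list of builtins that reach outside "pure math and logic".
BLOCKED_NAMES = {"eval", "exec", "open", "compile", "__import__"}


def validate_script(source):
    """Return a list of problems; an empty list means the script passed."""
    problems = []
    tree = ast.parse(source)  # raises SyntaxError on unparsable code
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"blocked import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root not in ALLOWED_IMPORTS:
                problems.append(f"blocked import: {node.module}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            problems.append(f"private attribute: {node.attr}")
        elif isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            problems.append(f"blocked name: {node.id}")

    # The contract: a top-level function named pyquery_transform must exist.
    has_entry = any(
        isinstance(node, ast.FunctionDef) and node.name == "pyquery_transform"
        for node in tree.body
    )
    if not has_entry:
        problems.append("missing pyquery_transform(lf)")
    return problems


good = "import numpy as np\n\ndef pyquery_transform(lf):\n    return lf\n"
bad = "import os\n\ndef pyquery_transform(lf):\n    os.system('ls')\n    return lf.__class__\n"
good_problems = validate_script(good)
bad_problems = validate_script(bad)
```

Because the code is only parsed and never run during validation, a rejected script has no chance to do damage.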

---

### 🧪 SQL Lab: The Codex (God Mode)

For when the GUI is too easy and you want to flex your raw SQL skills. This isn't SQLite. This is **High-Performance Lazy SQL**.

- **Zero-Lag Querying**: Run `SELECT *` on a **50GB file**? It pulls a preview instantly. The engine effectively cheats physics.
- **Cross-Dataset Joins**: Join your `sales.csv` with `targets.xlsx` using standard SQL. We bridge the gap.
- **Materialize**: Execute a complex query, then save it as a new dataset to continue the torture.
- **Direct Export**: SQL results straight to Parquet/Excel.

---

### 🧹 The Forge (Ruthless ETL)

🧠 **Backend I/O That Actually Understands Real-World Data.** Real data is cursed. We planned for that.

- **🧬 Advanced Auto-Encoding Detection & Healer**: PyQuery scans the first bytes of every CSV, detects its encoding, and heals the file before it can raise a `UnicodeDecodeError`.

  - **Stream-Based Healing**: Processes multi-GB files in 4MB chunks. Memory usage stays flat.
  - **Sanitization**: Automatically strips `Null Bytes` (`\x00`), normalizes newlines, and replaces garbage bytes.
  - Supports UTF-8, CP1252, and Latin-1, which limits the blast radius of bad vendor data.

- **🧩 Mixed-Encoding Folder Handling**: If a folder contains files with _different encodings_, PyQuery detects it and switches strategy automatically.

  - One bad vendor file will not ruin your entire ingest.
  - We isolate. We adapt. We continue.

- **📂 Recursive Folder Globbing (Upgraded)**: Patterns like `data/**/*.csv` now work even when:

  - Schemas differ slightly
  - Encodings are inconsistent
  - Headers are misaligned

- **🔍 Advanced File Filtering (Precision Strikes)**: We didn’t stop at globbing. We built a **filtering engine**.

  - **Multiple Filter Types**:
    - `Glob`, `Regex`, `Contains`, `Not Contains`, `Exact`, `Is Not`
  - **Scoped Filtering**:
    - Apply rules to just the **filename** or the **full path**
  - **Stackable Logic**:
    - Example:
      - Must contain `sales`
      - Must NOT contain `backup`
      - Must match regex `\d{4}`

  This is surgical file selection.  
  No more loading junk and cleaning later.

- **📊 Excel Handling That Respects Your Sanity**: Excel is chaotic. PyQuery now meets it on its own terms.

  - **Multi-Sheet Selection**:
    - Load one sheet, many sheets, or only the ones that matter.
  - **Template-Based Sheet Mapping**:
    - When loading `*.xlsx` via glob:
      - Pick a **base file**
      - Preview its sheets
      - Apply that selection across all matching files
  - **Sheet Name Filtering**:
    - Regex-powered selection like:
      - `Sheet.*`
      - `Q[1-4]_Data`

  Excel stops being unpredictable.  
  It becomes… tolerable.

- **✨ Source Awareness & Cleanliness (The Receipts)**: When you merge files, PyQuery remembers where they came from.

  - **Source Metadata Injection (Optional)**:
    - Automatically add:
      - `__source_path__`
      - `__source_name__`
    - Every row carries its origin story.
  - **Auto Type Inference (Magic Wand Restored)**:
    - Samples your data on load
    - Infers correct dtypes
    - Instantly appends a **Clean & Cast** step to your recipe

  Lineage. Cleanliness. Control.  
  No more “where did this row come from?”

- **✨ Auto-Typecast (The Magic Button)**: One click scans rows and forcibly converts `Strings` to `Int`, `Float`, or `Date`. It uses regex heuristics to crush inconsistency.

- **🎭 PII Incinerator**: Detects and obfuscates credit cards and social security numbers. Secrets remain secret.

- **🩹 Smart Impute**: We fill the voids. Forward fill, backward fill, median, or specific value injection. No null survives.

- **💥 Explode & Coalesce**: Flatten lists and merge columns like a boss.
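
The stream-based encoding healer described at the top of this list can be sketched with the standard library's incremental decoders. The helper names, the 64KB detection sample, and the try-each-candidate strategy are assumptions for illustration; the real healer may detect encodings differently.

```python
import codecs
import tempfile
from pathlib import Path

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB chunks, as described above
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def detect_encoding(sample: bytes) -> str:
    """Return the first candidate that decodes the sample without errors."""
    for name in CANDIDATES:
        decoder = codecs.getincrementaldecoder(name)()
        try:
            # final=False tolerates a multi-byte character cut off at the sample edge.
            decoder.decode(sample, final=False)
            return name
        except UnicodeDecodeError:
            continue
    return "latin-1"  # fallback; latin-1 accepts every byte


def heal_csv(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Re-encode a CSV to clean UTF-8 one chunk at a time; return the source encoding."""
    with open(src_path, "rb") as src:
        encoding = detect_encoding(src.read(64 * 1024))
        src.seek(0)
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
        carry = ""  # holds a trailing "\r" so "\r\n" is never split across chunks
        with open(dst_path, "w", encoding="utf-8", newline="") as dst:
            while True:
                chunk = src.read(chunk_size)
                text = carry + decoder.decode(chunk, final=not chunk)
                text = text.replace("\x00", "")  # strip null bytes
                carry = ""
                if chunk and text.endswith("\r"):
                    text, carry = text[:-1], "\r"
                dst.write(text.replace("\r\n", "\n").replace("\r", "\n"))
                if not chunk:
                    break
    return encoding


with tempfile.TemporaryDirectory() as tmp:
    raw, clean = Path(tmp) / "vendor.csv", Path(tmp) / "vendor_utf8.csv"
    raw.write_bytes(b"name,city\r\nJos\xe9,Pa\x00ris\r\n")
    used = heal_csv(raw, clean, chunk_size=5)  # tiny chunks to exercise the boundaries
    healed = clean.read_bytes().decode("utf-8")
```

Memory stays flat because only one chunk (plus at most one carried character) is held at a time, regardless of file size.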

---

## 🗺️ The Roadmap (Manifesting Destiny) 🔮

We aren't stopping here. We are aiming for the moon. 🚀

### Phase 1: Native App Supremacy (Rust + Tauri) 🦀

The browser has limits (RAM, Sandbox). The Native App will have **none**.

- **Hardware Acceleration**: GPU-accelerated plotting. 10 Million points at 144Hz.
- **Zero-Copy**: Access 100GB files directly from disk without loading _anything_.
- **Vibe**: Dark mode by default. OLED black.

### Phase 2: Big Data Devourer ☁️

- **Cloud Connectors**: S3, GCS, Azure. We drink their milkshakes.
- **Distributed Compute**: If one machine isn't enough, we take them all.

---

## 🧾 The Receipts (Benchmarks)

We don't post without proof. We mog the competition.

| Metric            | 🐼 Pandas (Legacy)       | ⚡ PyQuery (Polars)     | The Diff       |
| ----------------- | ------------------------ | ----------------------- | -------------- |
| **Load 10GB CSV** | `MemoryError` (Crash) 💥 | **0.2s** (Lazy Scan) ⚡ | **Infinite**   |
| **Filter Rows**   | 15.4s (Slow)             | **0.5s** (Parallel)     | **30x Faster** |
| **Group By**      | 45s (Painful)            | **2.1s** (Instant)      | **20x Faster** |
| **RAM Usage**     | 12GB+ (Bloated)          | **500MB** (Lean)        | **95% Less**   |

_Benchmarks run on a standard dev laptop. Results may vary but the vibe remains consistent._

---

## 🧠 The Tech Stack (Forbidden Knowledge) 🐐

This isn't just a library. It's a weapon system.

### 1. 🌊 The "Infinite Stream" Glitch (Lazy Execution)

The Old Gods (Pandas) are **Eager**. They try to swallow the ocean (RAM) whole. They choke.
**PyQuery is Lazy.** It waits. It plans.

- **Scan**: "It's a 100GB file. Interesting."
- **Plan**: Filters, joins, math. Nothing executes until the final blow.
- **Stream**: Data flows in chunks. Process. Write. Destroy.
- **Result**: Processing 100GB on a MacBook Air. The laws of physics are optional.

### 2. ⚙️ File-Level Execution Control (The Missing Piece)

Most engines think in **datasets**.  
PyQuery can think in **files**.

#### 🧨 Individual File Processing Mode

- New execution branch: `process_individual=True`
- Forces the engine to **load files one-by-one instead of bulk scanning**

#### Why this matters (a lot):

- One corrupted CSV no longer nukes the entire pipeline.
- We get a **pre-concatenation window** where PyQuery can:
  - Standardize headers
  - Fix schemas
  - Apply targeted cleaning rules
- Bad files are fixed or isolated before they touch the rest.

This is how PyQuery survives **enterprise-grade mess**.

Bulk scan when you can.  
Precision strike when you must.
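
The idea behind file-by-file loading can be sketched with the standard library alone. Everything here (`load_individually`, `normalize_header`, the quarantine list) is a hypothetical illustration of the pattern, not what `process_individual=True` does internally.

```python
import csv
import tempfile
from pathlib import Path


def normalize_header(name):
    """Standardize a header: trim whitespace, lowercase, spaces to underscores."""
    return name.strip().lower().replace(" ", "_")


def load_individually(paths):
    """Load files one by one; bad files are quarantined instead of killing the run."""
    rows, quarantined = [], []
    for path in paths:
        name = Path(path).name
        try:
            with open(path, newline="", encoding="utf-8") as handle:
                reader = csv.DictReader(handle)
                if not reader.fieldnames:
                    raise ValueError("empty file")
                file_rows = []
                for record in reader:
                    if None in record:
                        raise ValueError("row has more fields than the header")
                    # Pre-concatenation window: fix headers and tag the origin.
                    clean = {normalize_header(key): value for key, value in record.items()}
                    clean["__source_name__"] = name
                    file_rows.append(clean)
        except (csv.Error, ValueError) as exc:  # UnicodeDecodeError is a ValueError
            quarantined.append((name, type(exc).__name__))
            continue
        rows.extend(file_rows)  # only fully readable files reach the merged result
    return rows, quarantined


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.csv").write_text("Region, Amount\nN,100\n", encoding="utf-8")
    (folder / "b.csv").write_text("region,amount\nS,250\n", encoding="utf-8")
    (folder / "bad.csv").write_bytes(b"region,amount\n\xff\xfe,1\n")
    rows, quarantined = load_individually(sorted(folder.glob("*.csv")))
```

The trade-off is speed: a bulk scan is faster, while per-file loading buys a chance to repair or isolate each file before concatenation.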

### 3. 🚀 Streaming I/O Architecture (Enterprise Mode)

We rewired the backend for scale — real scale.

- **True Streaming File Discovery**:
  - File loading now uses generators and lazy iteration.
  - You can point PyQuery at **hundreds of thousands of files** without a crash.
- **Partial Globbing Optimization**:
  - Simple text filters are automatically converted into filesystem-level globs.
  - Python never even _sees_ irrelevant files.

The result:

- Faster discovery
- Lower memory pressure
- Enterprise-grade robustness

This engine does not flinch.
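
A rough sketch of generator-based discovery with stacked filters, using `pathlib` and `re`. The `discover` and `contains_to_glob` helpers and their parameters are hypothetical names for illustration; they mirror the filter types described in The Forge rather than PyQuery's real signatures.

```python
import re
import tempfile
from pathlib import Path


def discover(root, pattern="**/*.csv", contains=(), not_contains=(), regex=None, scope="name"):
    """Lazily yield files under root that pass every stacked filter."""
    compiled = re.compile(regex) if regex else None
    for path in Path(root).glob(pattern):  # Path.glob yields matches lazily
        if not path.is_file():
            continue
        # Scoped filtering: test the filename only, or the full path.
        text = path.name if scope == "name" else str(path)
        if not all(part in text for part in contains):
            continue
        if any(part in text for part in not_contains):
            continue
        if compiled and not compiled.search(text):
            continue
        yield path


def contains_to_glob(text, extension="csv"):
    """Fold a simple 'contains' filter into the glob (assumes no glob metacharacters)."""
    return f"**/*{text}*.{extension}"


with tempfile.TemporaryDirectory() as tmp:
    for rel in ("2024/sales_2024.csv", "2024/sales_backup_2024.csv",
                "misc/notes.csv", "sales_final.csv"):
        file_path = Path(tmp) / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch()

    # Must contain "sales" (pushed into the glob), must NOT contain "backup",
    # and must match the regex \d{4}.
    matches = sorted(
        p.name
        for p in discover(tmp, contains_to_glob("sales"),
                          not_contains=["backup"], regex=r"\d{4}")
    )
```

Pushing the `contains` rule into the glob means files like `notes.csv` are never handed to the Python-side filters at all.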

### 4. 🛡️ Type Safety (Absolute Order)

Python is dynamic (chaotic). PyQuery imposes **Order**.

- Every step is backed by a **Pydantic Model**.
- If a `String` tries to infiltrate a `Float` column, it is terminated **before** execution.
- There are no runtime surprises. Only calculated victories.
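
As a sketch of the pattern, here is how a Pydantic model can reject bad step parameters before any data is touched. `ClipParams` and `validate_step` are hypothetical names invented for this example and are not part of PyQuery's real parameter models.

```python
from pydantic import BaseModel, Field, ValidationError


class ClipParams(BaseModel):
    """Hypothetical parameters for a 'clip values to a range' step."""

    column: str = Field(min_length=1)
    lower: float
    upper: float


def validate_step(raw):
    """Return (params, None) on success or (None, error_count) on failure."""
    try:
        return ClipParams(**raw), None
    except ValidationError as exc:
        return None, len(exc.errors())


ok_params, ok_errors = validate_step({"column": "amount", "lower": 0, "upper": 100})
bad_params, bad_errors = validate_step({"column": "amount", "lower": "ten", "upper": 100})
```

Note that Pydantic's default mode coerces numeric strings such as `"10"` into floats; a non-numeric string like `"ten"` is rejected outright.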

---

## 🎮 Choose Your Fighter (5 Paths to Power)

We don't limit you. Dominate however you choose.

### 📦 Installation

```bash
pip install pyquery-polars
```

### 1. 🌊 The GUI (God Mode)

For when you want to click things, see pretty charts, and feel like a data scientist in a sci-fi movie.

- **Visual Recipe Builder**: Nodes and edges of pure logic.
- **Native File Picker**: Accesses the local filesystem directly. No barriers.

```bash
pyquery ui
# Launches the Web App on localhost:8501 🚀
```

### 2. 💻 The Interactive CLI (Shadow Mode)

For when you operate in the dark. ☕ This isn't a command line. It's a cockpit.

- **Dynamic Menus**: Use arrow keys to select transforms.
- **Rich Tables**: Beautiful, colorful ASCII dataframes.

```bash
pyquery interactive
# Enter the Matrix. 🕶️
```

### 3. 🤖 The API (Headless Beast)

Building a machine? Run PyQuery as the engine.

- **Swagger Docs**: Auto-generated at `/docs`.
- **Async**: Fire and forget jobs via `POST /recipes/run`.

```bash
pyquery api
# Serving high-performance ETL over HTTP at localhost:8000 📡
```

### 4. ⚡ The Batch Runner (Speedrun)

For automation. No interface. Just speed.

```bash
pyquery run -s input.csv -r recipe.json -o output.parquet
# Task complete. ⚡
```

### 5. 🧙‍♂️ The Sorcerer (Python SDK)

For the developers who want to weave PyQuery into their own code.

```python
from pyquery_polars.backend.engine import PyQueryEngine
# Full programmatic control over the recipe engine.
# You are the architect now.
```

> **Pro Tip:**  
> The CLI now supports multiple boot personalities (Boot Modes).  
> If you see purple and gold, you’re in **Villain Arc mode** 👹.

---

## 🧰 The Loadout (Arsenal)

Packed with every tool needed to clear the map.

| Category      | The Tools                                            | Why it slaps                            |
| ------------- | ---------------------------------------------------- | --------------------------------------- |
| **Cleaning**  | `Fill Nulls`, `Mask PII`, `Smart Extract`, `Regex`   | Turns garbage data into gold. ✨        |
| **Analytics** | `Rolling Agg`, `Time Bin`, `Rank`, `Diff`, `Z-Score` | High-frequency trading vibes. 📈        |
| **Combining** | `Smart Join`, `Concat`, `Pivot`, `Unpivot`           | Merge datasets without the headache. 🤝 |
| **Math**      | `Log`, `Exp`, `Clip`, `Date Offset`                  | For the scientific girlies. 👩‍🔬          |
| **Text**      | `Slice`, `Case`, `Replace`, `One-Hot`                | String manipulation on steroids. 💪     |
| **I/O**       | `CSV`, `Parquet`, `Excel`, `JSON`, `IPC`             | Speaks every language. 🗣️               |

> PyQuery doesn’t just scale to enterprise data.  
> It assumes enterprise chaos — and plans accordingly.

---

## 🧑‍💻 Join the Cult (Developer Guide)

You want to contribute? Good. We need strong allies.

### The Blooding (Adding a Transform) 🖐️

#### Backend Implementation

1. **Define Params**: Create a Pydantic model (`src/pyquery_polars/core/params.py`).
2. **Backend Logic**: Write a pure polars function (`src/pyquery_polars/backend/transforms/`).
3. **Register**: Add your step to `register_all_steps()` in `src/pyquery_polars/backend/engine/registry.py`.

#### Frontend Implementation

1. **Frontend Renderer**: Create a Renderer Function (`src/pyquery_polars/frontend/steps/`).
2. **Register**: Add your step to `register_frontend()` in `src/pyquery_polars/frontend/registry_init.py`.

It appears in the CLI, API, and UI **automatically**. 🤯

```python
# Only certified ballers contribute code.
# Are you up for it?
```

---

## 📜 License

**GPL-3.0**. Open source forever. 💖

---

<div align="center">

_Made with ☕, 🦀 (Rust), and 💖 by [Sudharshan TK](https://github.com/tks18)_

</div>
