Metadata-Version: 2.4
Name: rowbase
Version: 0.2.0
Summary: Rowbase SDK — declare data pipelines as Python functions
Author-email: Rowbase Team <team@rowbase.com>
License-Expression: LicenseRef-Proprietary
Keywords: data,etl,pipelines,polars,sdk
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: fastexcel>=0.19
Requires-Dist: httpx>=0.28
Requires-Dist: polars>=1.38
Requires-Dist: pyarrow>=23.0
Requires-Dist: pydantic>=2.12
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=14.0
Requires-Dist: typer>=0.24
Requires-Dist: xlsxwriter>=3.2
Provides-Extra: dev
Requires-Dist: mypy>=1.19; extra == 'dev'
Requires-Dist: pytest>=9.0; extra == 'dev'
Requires-Dist: ruff>=0.15; extra == 'dev'
Description-Content-Type: text/markdown

# Rowbase SDK

**Status:** Alpha  
**Python:** 3.12+  
**Core Dependencies:** Polars, Pydantic, Typer

---

## Overview

Rowbase SDK is a Python library for **Rowbase engineers** to author data pipelines. These pipelines are then deployed and made available to non-technical customers, who simply upload their data and receive cleaned results.

**Key positioning:**
- **Agentic authoring** — Rowbase engineers write pipelines (often with AI assistance)
- **Invisible to users** — Customers never see the code; they just upload files and get results
- **B2B data cleaning** — Customers are non-technical business users who need clean data

## Who Uses This

| User | Use Case |
|------|----------|
| Rowbase Engineers | Write pipelines using the SDK, deploy via CLI |
| End Customers | Upload data via web UI, download cleaned results |

The customer never interacts with the SDK directly.

## Installation

```bash
pip install rowbase
# or: uv add rowbase
```

## Quick Start

```python
import polars as pl

from rowbase import pipeline, source, dataset

@pipeline
def orders_etl():
    # Declare input sources
    orders = source("orders", columns=["order_id", "customer", "amount", "status"])
    customers = source("customers", columns=["customer_id", "name", "email"])
    
    # Transform: filter to completed orders
    @dataset(name="completed_orders", data_from=orders)
    def filter_completed(df):
        return df.filter(pl.col("status") == "completed")
    
    # Transform: enrich with customer info
    @dataset(name="enriched_orders", data_from=[filter_completed, customers])
    def enrich(orders_df, customers_df):
        return orders_df.join(
            customers_df, 
            left_on="customer", 
            right_on="customer_id"
        )
    
    # Publish these datasets
    yield filter_completed
    yield enrich
```

## Authoring Workflow

1. **Understand customer data** — What does their raw data look like?
2. **Define sources** — What columns/format do they upload?
3. **Write transformations** — Use Polars to clean/transform
4. **Test locally** — Use `rowbase run` to test
5. **Deploy** — Use `rowbase push` to deploy to the platform (see the command sketch after this list)
6. **Customer uses** — Customer uploads files via web UI
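
Steps 1 through 3 happen in your editor; steps 4 and 5 map onto the CLI roughly as follows. The project name is illustrative, and this sketch assumes `rowbase init` scaffolds a directory of the same name:

```bash
rowbase init orders-etl        # scaffold a new pipeline project
cd orders-etl
rowbase run                    # execute the pipeline locally against test files
rowbase push --api-key rb_xxx  # deploy a new version to the platform
```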

## Core Concepts

### @pipeline

Decorator that marks a function as a pipeline. The function should be a generator: its body declares sources and datasets, then yields the datasets to publish.

```python
@pipeline
def my_pipeline():
    # ... sources and datasets ...
    yield published_dataset
```

### source()

Declares an input data source; it is called as a function inside the pipeline body, not used as a decorator. Returns a `SourceHandle` that datasets consume via `data_from`.

```python
orders = source(
    name="orders",
    columns=["order_id", "customer", "amount"],
    description="Raw orders from Shopify",
    reader_options={"separator": ",", "has_header": True},
    optional=False
)
```

Supported reader options (see the example after this list):
- `sheet_name` - Excel sheet name or index
- `skip_rows` - Rows to skip before header
- `has_header` - Whether file has header row
- `separator` - CSV separator
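
For example, a source backed by an Excel upload can target a named sheet and skip title rows. A minimal sketch using only the options above; the sheet, column, and source names are illustrative:

```python
from rowbase import source

inventory = source(
    name="inventory",
    columns=["sku", "quantity", "warehouse"],
    description="Monthly inventory export",
    # Read the "Q1" sheet, skipping two title rows above the header
    reader_options={"sheet_name": "Q1", "skip_rows": 2, "has_header": True},
)
```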

### @dataset

Declares a transformation function. The decorated function receives one DataFrame per `data_from` input and returns a DataFrame.

```python
@dataset(
    name="cleaned_data",
    data_from=source_handle,
    schema=MyPydanticModel,
    on_schema_error="fail",
    description="Cleaned and validated data",
    metadata=True
)
def transform(df):
    return df.filter(pl.col("amount") > 0)
```
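
The `schema` argument takes a Pydantic model describing the expected output rows. A minimal sketch of such a model; the field names and constraints are illustrative, not part of the SDK:

```python
from pydantic import BaseModel, Field


class CleanedRow(BaseModel):
    # One field per expected output column
    order_id: str
    amount: float = Field(gt=0)
```

Passed as `schema=CleanedRow`, output rows are validated against the model. Judging by the option name, `on_schema_error="fail"` aborts the run when a row fails validation; the exact behavior is platform-defined.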

## CLI Commands

```bash
rowbase --help

# Initialize a new pipeline project
rowbase init my-pipeline

# Run locally for testing
rowbase run

# Deploy pipeline to Rowbase
rowbase push

# List datasets
rowbase data list
```

## Deployment

To deploy a pipeline:

```bash
rowbase push --api-key rb_xxx
```

This uploads your pipeline code to the Rowbase API, creating a new version that customers can use.

## Customer Experience

Once deployed, customers see the pipeline in their dashboard:

1. **Select pipeline** — Choose "Orders ETL" from the list
2. **Upload files** — Upload `orders.csv` and `customers.csv`
3. **Submit** — Click "Run"
4. **Wait** — See progress (pending → running → completed)
5. **Download** — Get `enriched_orders.csv` with clean data

## Example: Data Cleaning Pipeline

```python
import polars as pl

from rowbase import pipeline, source, dataset

@pipeline
def clean_customer_data():
    customers = source("customers")
    
    @dataset(name="valid_emails", data_from=customers)
    def filter_valid(df):
        # Remove rows with invalid emails
        return df.filter(pl.col("email").str.contains("@"))
    
    @dataset(name="normalized_phones", data_from=valid_emails)
    def normalize_phones(df):
        # Strip non-digit characters from phone numbers
        return df.with_columns(
            pl.col("phone").str.replace_all(r"[^0-9]", "").alias("phone")
        )
    
    yield filter_valid
    yield normalized_phones
```

## Dependencies

- **polars** - DataFrame operations
- **pydantic** - Schema validation
- **typer** - CLI framework
- **pyarrow** - Parquet support
- **xlsxwriter** - Excel support
- **fastexcel** - Fast Excel parsing

## License

Proprietary - All rights reserved
