Metadata-Version: 2.3
Name: polymo
Version: 0.8.4
Summary: Declarative REST API ingestion for PySpark
Keywords: spark,pyspark,rest,api,ingestion,data-engineering,etl,http
Author: Daniel Tom
Author-email: Daniel Tom <d.e.tom89@gmail.com>
License: BSD-3
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Database
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Typing :: Typed
Requires-Dist: fsspec>=2025.9.0
Requires-Dist: httpx>=0.26
Requires-Dist: jinja2>=3.1.6
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: pyspark>=4 ; extra == 'benchmark'
Requires-Dist: pyarrow>=13 ; extra == 'benchmark'
Requires-Dist: httpx>=0.26 ; extra == 'benchmark'
Requires-Dist: notebook>=7.4.7 ; extra == 'benchmark'
Requires-Dist: fastapi>=0.110 ; extra == 'benchmark'
Requires-Dist: uvicorn>=0.24 ; extra == 'benchmark'
Requires-Dist: fastapi>=0.110 ; extra == 'builder'
Requires-Dist: uvicorn>=0.24 ; extra == 'builder'
Requires-Dist: jinja2>=3.1.6 ; extra == 'builder'
Requires-Dist: pyspark>=4 ; extra == 'builder'
Requires-Dist: pyarrow>=13 ; extra == 'builder'
Requires-Dist: pyspark>=4 ; extra == 'smoke'
Requires-Dist: pyarrow>=13 ; extra == 'smoke'
Requires-Python: >=3.10
Project-URL: Documentation, https://dan1elt0m.github.io/polymo/
Project-URL: Homepage, https://github.com/dan1elt0m/polymo
Project-URL: Issues, https://github.com/dan1elt0m/polymo/issues
Project-URL: Repository, https://github.com/dan1elt0m/polymo.git
Provides-Extra: benchmark
Provides-Extra: builder
Provides-Extra: smoke
Description-Content-Type: text/markdown

<p align="center">
  <img src="builder-ui/public/logo.png" alt="Polymo" width="220">
</p>

<p align="center">
    <em>Turn REST APIs into Spark DataFrames with just a YAML file
</em>
</p>

<p align="center">
  <a href="https://github.com/dan1elt0m/polymo/actions/workflows/test.yml"><img alt="test" src="https://github.com/dan1elt0m/polymo/actions/workflows/test.yml/badge.svg"></a>
  <a href="https://github.com/dan1elt0m/polymo/actions/workflows/gh-pages.yml"><img alt="Publish Docs" src="https://github.com/dan1elt0m/polymo/actions/workflows/gh-pages.yml/badge.svg"></a>
  <img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/polymo">
  <img alt="PyPI - Status" src="https://img.shields.io/pypi/status/polymo">
</p>

# Welcome to Polymo

Polymo makes it super easy to ingest APIs with Pyspark. You only need to define a YAML file.

My vision is that API ingestion doesn't need heavy, third party tools or hard to maintain custom code.
The heck, you don't even need Pyspark skills.


<!-- Centered clickable screenshot -->
<p align="center">
  <a href="docs/ui.png">
    <img src="docs/ui.png" alt="Polymo Builder UI - connector preview screen" width="860">
  </a>
</p>

## How does it work?

Define a config file manually or use the recommended, lightweight builder UI. 
Once you are happy with your config, all you need to do is register the Polymo reader and tell Spark where to find the config:

```python
from pyspark.sql import SparkSession
from polymo import ApiReader

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ApiReader)

df = (
    spark.read.format("polymo")
    .option("config_path", "./config.yml")  # YAML you saved from the Builder
    .option("token", "YOUR_TOKEN")  # Only if the API needs one
    .load()
)

df.show()
```

Structured Streaming works out of the box aswell:

```python
stream_df = (
    spark.readStream.format("polymo")
    .option("config_path", "./config.yml")
    .option("stream_batch_size", 100)
    .option("stream_progress_path", "/tmp/polymo-progress.json")
    .load()
)

query = stream_df.writeStream.format("memory").outputMode("append").queryName("polymo")
query.show()

```

Does it perform? Polymo can read in batches (pages in parallel) and therefore is much faster than row based solutions like UDFs.


It's still early days, but Polymo already supports a lot of features!

- Various Authentication options
- Many Pagination  patterns, plus automatic partition-aware reading when totals are exposed.
- Several partitioning stategies for parallel Spark reads.
- Incremental sync support with cursor parameters, JSON state files on local or remote storage, optional memory caching, and overrideable state keys.
- Schema controls that auto-infer types or accept Spark SQL schemas, along with record selectors, filtering expressions, and schema-based casting for nested responses.
- Structured Streaming compatibility with `spark.readStream`, tunable batch sizing, durable progress tracking, and a streaming smoke test mode.
- Error handling through configurable retry counts, status code lists, timeout handling, and exponential backoff settings.
- Jinja templating of query parameters gives you a ton of flexibility


## How to start?
Locally you probably want to install polymo with the UI: 

```bash
pip install "polymo[builder]"
```

This comes with UI deps such as pyspark

Running Polymo on an existing cluster in for instance databricks doesnt require these deps.
In that case, just install the bare minimum depa with
```bash
pip install polymo
```

## Launch the builder UI 

```bash 
polymo builder
```

#### (Optional) Run the Builder in Docker

```bash
docker compose up --build builder
```

- The service listens on port `8000`; open <http://localhost:8000> once Uvicorn reports it is running.

## Where to Next
Read the docs [here](https://dan1elt0m.github.io/polymo/)

Contributions and early feedback welcome!
