Metadata-Version: 2.1
Name: spark-board
Version: 1.0.0
Summary: Interactive visualization of Spark jobs
Home-page: https://github.com/alijdens/spark-board
Author: Axel Lijdens, Ezequiel Werner
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# `spark-board`: interactive PySpark dataframes visualization

`spark-board` provides an interactive way to analize PySpark data frame execution plans as a static website displaying the transformations DAG.

Check out the [examples](https://alijdens.github.io/spark-board/) for a quick overview of the features (and the corresponding examples source code [here](https://github.com/alijdens/spark-board/tree/v1.0.0/tests/examples/)).

If you intend to develop `spark-board` or run from source, check out the [documentation](https://github.com/alijdens/spark-board/tree/v1.0.0/docs).

## Usage

`spark-board` takes a PySpark data frame and inspects the operations to build the DAG. This usually is the final step of a PySpark script, right before writing it to disk.

### Install `spark-board`

```shell
pip install spark-board
```

### Run `spark-board`

```python
from spark_board.html import dump_dataframe, DefaultSettings

# get the PySpark data frame that will be displayed
df = ...

dump_dataframe(
    df=df,
    output_dir="./spark_board_output",
    overwrite=True,  # overwrite output_dir if it already exists
    default_settings=DefaultSettings(),  # override default settings if desired
)
```

and that's it! `spark-board` will generate a static website in the defined `output_dir` folder. You can now serve the website using any web server and inspect the operations.

You can check out the available default settings [here](https://github.com/alijdens/spark-board/blob/main/spark_board/default_settings.py).

### Serving

`spark-board` is intended to be a live documentation of PySpark scripts. Because of this, it's advisable to run it every time the source code is updated. For example, `spark-board` can be run as part of a CI pipeline and the generated website uploaded to a static website hosting service, like Github or Gitlab pages (we actually do this to [update and serve the examples](https://github.com/alijdens/spark-board/tree/v1.0.0/.github/workflows/deploy-examples.yml) in this repository).
