Metadata-Version: 2.1
Name: ddeutil-workflow
Version: 0.0.6
Summary: Data Developer & Engineer Workflow Utility Objects
Author-email: ddeutils <korawich.anu@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/ddeutils/ddeutil-workflow/
Project-URL: Source Code, https://github.com/ddeutils/ddeutil-workflow/
Keywords: data,workflow,utility,pipeline
Classifier: Topic :: Utilities
Classifier: Natural Language :: English
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fmtutil
Requires-Dist: ddeutil-io
Requires-Dist: python-dotenv==1.0.1
Provides-Extra: app
Requires-Dist: fastapi==0.112.0; extra == "app"
Requires-Dist: apscheduler[sqlalchemy]==3.10.4; extra == "app"

# Workflow

[![test](https://github.com/ddeutils/ddeutil-workflow/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/ddeutils/ddeutil-workflow/actions/workflows/tests.yml)
[![python support version](https://img.shields.io/pypi/pyversions/ddeutil-workflow)](https://pypi.org/project/ddeutil-workflow/)
[![size](https://img.shields.io/github/languages/code-size/ddeutils/ddeutil-workflow)](https://github.com/ddeutils/ddeutil-workflow)
[![gh license](https://img.shields.io/github/license/ddeutils/ddeutil-workflow)](https://github.com/ddeutils/ddeutil-workflow/blob/main/LICENSE)

**Table of Contents**:

- [Installation](#installation)
- [Getting Started](#getting-started)
- [Core Features](#core-features)
  - [On](#on)
  - [Pipeline](#pipeline)
- [Usage](#usage)
  - [Python & Bash](#python--bash)
  - [Hook (EL)](#hook-extract--load)
  - [Hook (T)](#hook-transform)
- [Configuration](#configuration)
- [Deployment](#deployment)

This **Workflow** objects was created for easy to make a simple metadata
driven pipeline that able to **ETL, T, EL, or ELT** by `.yaml` file.

I think we should not create the multiple pipeline per use-case if we able to
write some dynamic pipeline that just change the input parameters per use-case
instead. This way we can handle a lot of pipelines in our orgs with metadata only.
It called **Metadata Driven**.

Next, we should get some monitoring tools for manage logging that return from
pipeline running. Because it not show us what is a use-case that running data
pipeline.

> [!NOTE]
> _Disclaimer_: I inspire the dynamic statement from the GitHub Action `.yml` files
> and all of config file from several data orchestration framework tools from my
> experience on Data Engineer.

## Installation

```shell
pip install ddeutil-workflow
```

This project need `ddeutil-io` extension namespace packages. If you want to install
this package with application add-ons, you should add `app` in installation;

```shell
pip install ddeutil-workflow[app]
```

## Getting Started

The first step, you should start create the connections and datasets for In and
Out of you data that want to use in pipeline of workflow. Some of this component
is similar component of the **Airflow** because I like it orchestration concepts.

The main feature of this project is the `Pipeline` object that can call any
registries function. The pipeline can handle everything that you want to do, it
will passing parameters and catching the output for re-use it to next step.

> [!IMPORTANT]
> In the future of this project, I will drop the connection and dataset to
> dynamic registries instead of main features because it have a lot of maintain
> vendor codes and deps. (I do not have time to handle this features)

### On

The **On** is schedule object.

```yaml
on_every_5_min:
  type: on.On
  cron: "*/5 * * * *"
```

```python
from ddeutil.workflow.on import On

schedule = On.from_loader(name='on_every_5_min', externals={})
assert '*/5 * * * *' == str(schedule.cronjob)

cron_iter = schedule.generate('2022-01-01 00:00:00')
assert '2022-01-01 00:05:00' f"{cron_iter.next:%Y-%m-%d %H:%M:%S}"
assert '2022-01-01 00:10:00' f"{cron_iter.next:%Y-%m-%d %H:%M:%S}"
assert '2022-01-01 00:15:00' f"{cron_iter.next:%Y-%m-%d %H:%M:%S}"
assert '2022-01-01 00:20:00' f"{cron_iter.next:%Y-%m-%d %H:%M:%S}"
```

### Pipeline

The **Pipeline** object that is the core feature of this project.

```yaml
run_py_local:
  type: ddeutil.workflow.pipeline.Pipeline
  on: 'on_every_5_min'
  params:
    author-run:
      type: str
    run-date:
      type: datetime
```

```python
from ddeutil.workflow.pipeline import Pipeline

pipe = Pipeline.from_loader(name='run_py_local', externals={})
pipe.execute(params={'author-run': 'Local Workflow', 'run-date': '2024-01-01'})
```

> [!NOTE]
> The above parameter use short declarative statement. You can pass a parameter
> type to the key of a parameter name.
> ```yaml
> params:
>   author-run: str
>   run-date: datetime
> ```
>
> And for the type, you can remove `ddeutil.workflow` prefix because we can find
> it by looping search from `WORKFLOW_CORE_REGISTRY` value.

## Usage

This is examples that use workflow file for running common Data Engineering
use-case.

> [!IMPORTANT]
> I recommend you to use `task` stage for all actions that you want to do with
> pipeline object.

### Python & Bash

```yaml
run_py_local:
  type: pipeline.Pipeline
  params:
    author-run: str
    run-date: datetime
  jobs:
    first-job:
      stages:
        - name: "Printing Information"
          id: define-func
          run: |
            x = '${{ params.author-run }}'
            print(f'Hello {x}')

            def echo(name: str):
              print(f'Hello {name}')

        - name: "Run Sequence and use var from Above"
          vars:
            x: ${{ params.author-run }}
          run: |
            print(f'Receive x from above with {x}')
            # Change x value
            x: int = 1

        - name: "Call Function"
          vars:
            echo: ${{ stages.define-func.outputs.echo }}
          run: |
            echo('Caller')
    second-job:
      stages:
        - name: "Echo Bash Script"
          id: shell-echo
          bash: |
            echo "Hello World from Shell"
```

```python
from ddeutil.workflow.pipeline import Pipeline

pipe = Pipeline.from_loader(name='run_py_local', externals={})
pipe.execute(params={'author-run': 'Local Workflow', 'run-date': '2024-01-01'})
```

```shell
> Hello Local Workflow
> Receive x from above with Local Workflow
> Hello Caller
> Hello World from Shell
```

### Hook (Extract & Load)

```yaml
pipe_el_pg_to_lake:
  type: pipeline.Pipeline
  params:
    run-date: datetime
    author-email: str
  jobs:
    extract-load:
      stages:
        - name: "Extract Load from Postgres to Lake"
          id: extract-load
          uses: tasks/postgres-to-delta@polars
          with:
            source:
              conn: conn_postgres_url
              query: |
                select * from ${{ params.name }}
                where update_date = '${{ params.datetime }}'
            sink:
              conn: conn_az_lake
              endpoint: "/${{ params.name }}"
```

### Hook (Transform)

```yaml
pipeline_hook_mssql_proc:
  type: pipeline.Pipeline
  params:
    run_date: datetime
    sp_name: str
    source_name: str
    target_name: str
  jobs:
    transform:
      stages:
        - name: "Transform Data in MS SQL Server"
          id: transform
          uses: tasks/mssql-proc@odbc
          with:
            exec: ${{ params.sp_name }}
            params:
              run_mode: "T"
              run_date: ${{ params.run_date }}
              source: ${{ params.source_name }}
              target: ${{ params.target_name }}
```

## Configuration

```bash
export WORKFLOW_ROOT_PATH=.
export WORKFLOW_CORE_REGISTRY=ddeutil.workflow,tests.utils
export WORKFLOW_CORE_PATH_CONF=conf
```

Application config:

```bash
export WORKFLOW_APP_DB_URL=postgresql+asyncpg://user:pass@localhost:5432/schedule
export WORKFLOW_APP_INTERVAL=10
```

## Deployment

This package able to run as a application service for receive manual trigger
from the master node via RestAPI.

> [!WARNING]
> This feature do not start yet because I still research and find the best tool
> to use it provision an app service, like `starlette`, `fastapi`, `apscheduler`.

```shell
(venv) $ workflow start -p 7070
```
