Metadata-Version: 2.1
Name: cloudsdp
Version: 0.1.6
Summary: 
Home-page: https://github.com/nvn-nil/CloudSDP
Keywords: cloud,data,processing,extraction,transformation,ingestion,ETL,BigQuery,Google Cloud,data pipeline
Author: Naveen Anil
Author-email: naveenms01@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: deepdiff (>=6.3.1,<7.0.0)
Requires-Dist: google-cloud-bigquery (>=3.11.4,<4.0.0)
Requires-Dist: google-cloud-run (>=0.9.1,<0.10.0)
Project-URL: Documentation, https://nvn-nil.github.io/CloudSDP/
Project-URL: Repository, https://github.com/nvn-nil/CloudSDP
Description-Content-Type: text/markdown

# CloudSDP Library

CloudSDP is a library designed to simplify creating and managing serverless data pipelines between Google Cloud Run and Google BigQuery. It aims to provide a developer-friendly interface for extracting data from various sources, transforming it, and loading it into BigQuery tables, all on serverless infrastructure.

## Features

WIP:

- **Data Extraction and Ingestion**: Extract data from various sources, convert it into a common format, and ingest it into BigQuery tables.

TODO:

- **Data Transformation**: Perform data transformations, such as cleaning, enrichment, and normalization, before loading into BigQuery.
- **Scheduled Jobs and Triggers**: Schedule data pipeline jobs based on time triggers using Cloud Scheduler.
- **Data Pipeline Workflow**: Define and orchestrate data pipeline workflows with configurable execution order and dependencies.
- **Conflict Resolution and Error Handling**: Implement conflict resolution strategies and error handling mechanisms for reliable data processing.
- **Monitoring and Logging**: Monitor job progress, resource utilization, and performance metrics using integrated logging and monitoring tools.
- **Documentation and Examples**: Comprehensive documentation and code examples to guide developers in using the library effectively.

## Installation

Install the library using pip:

`pip install cloudsdp`

Or, install the library using poetry:

`poetry add cloudsdp`
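
To confirm that the install worked, a check along these lines may help. It is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, not part of cloudsdp.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed cloudsdp version, or None if the package is missing.
print(installed_version("cloudsdp"))
```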

## QuickStart

### Data Ingestion

#### Create a dataset, ingest data, and clean up

From a list of Python dicts:

```py
from cloudsdp.api.bigquery import BigQuery

PROJECT_NAME = "project_name"


def main():
    bq = BigQuery(PROJECT_NAME)
    dataset_name = "dataset_1"
    table_name = "table_1"

    # Rows to ingest: one dict per row, keyed by column name.
    data = [{"name": "Someone", "age": 29}, {"name": "Something", "age": 22}]

    # Table schema: one entry per column.
    data_schema = [
        {"name": "name", "type": "STRING", "mode": "REQUIRED"},
        {"name": "age", "type": "INTEGER", "mode": "REQUIRED"},
    ]

    bq.create_dataset(dataset_name)
    bq.create_table(table_name, data_schema, dataset_name)

    # Ingest the rows and report any errors that come back.
    errors = bq.ingest_rows_json(data, dataset_name, table_name)
    if errors:
        # str() keeps the join safe even if the errors are not plain strings.
        print("Errors", ";".join(str(error) for error in errors))

    # Clean up: delete the dataset along with the tables it contains.
    bq.delete_dataset(dataset_name, delete_contents=True, not_found_ok=True)


if __name__ == "__main__":
    main()
```
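
Because both columns in the schema above are `REQUIRED`, it can be useful to check rows locally before sending them to BigQuery. The following is a minimal sketch of such a check; `validate_rows` and `ACCEPTED_TYPES` are hypothetical names introduced for illustration and are not part of cloudsdp. The type mapping covers only a handful of BigQuery types and is an assumption.

```python
# Hypothetical pre-flight check; this helper is not part of cloudsdp.
# Maps BigQuery column types to the Python types accepted for them here.
ACCEPTED_TYPES = {
    "STRING": (str,),
    "INTEGER": (int,),
    "FLOAT": (int, float),
    "BOOLEAN": (bool,),
}


def validate_rows(rows, schema):
    """Return a list of problems found; an empty list means every row passed."""
    problems = []
    for index, row in enumerate(rows):
        for field in schema:
            name = field["name"]
            value = row.get(name)
            if value is None:
                if field.get("mode") == "REQUIRED":
                    problems.append(f"row {index}: missing required field '{name}'")
                continue
            accepted = ACCEPTED_TYPES.get(field["type"])
            if accepted is None:
                continue  # Unknown type: leave the check to BigQuery.
            # bool is a subclass of int, so reject it explicitly for numeric columns.
            is_stray_bool = isinstance(value, bool) and bool not in accepted
            if is_stray_bool or not isinstance(value, accepted):
                problems.append(
                    f"row {index}: field '{name}' expects {field['type']}, "
                    f"got {type(value).__name__}"
                )
    return problems


data = [{"name": "Someone", "age": 29}, {"name": "Something", "age": 22}]
data_schema = [
    {"name": "name", "type": "STRING", "mode": "REQUIRED"},
    {"name": "age", "type": "INTEGER", "mode": "REQUIRED"},
]
print(validate_rows(data, data_schema))
```

Running the check before `ingest_rows_json` surfaces malformed rows without a round trip to BigQuery. The errors returned by the ingestion call still need to be handled.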

From CSV files stored in Google Cloud Storage (GCS):

```py
from cloudsdp.api.bigquery import BigQuery

PROJECT_NAME = "project_name"


def main():
    bq = BigQuery(PROJECT_NAME)
    dataset_name = "dataset_1"
    table_name = "table_1"

    # Table schema: one entry per column, matching the columns in the CSV files.
    data_schema = [
        {"name": "name", "type": "STRING", "mode": "REQUIRED"},
        {"name": "age", "type": "INTEGER", "mode": "REQUIRED"},
    ]

    bq.create_dataset(dataset_name)
    bq.create_table(table_name, data_schema, dataset_name)

    # CSV files to load, given as gs:// URIs.
    csv_uris = ["gs://mybucket/name_age_data_1.csv", "gs://mybucket/name_age_data_2.csv"]

    # skip_leading_rows=1 skips the header row of each file.
    result = bq.ingest_csvs_from_cloud_bucket(
        csv_uris, dataset_name, table_name, skip_leading_rows=1, autodetect_schema=False, timeout=120
    )
    print(result)

    # Clean up: delete the dataset along with the tables it contains.
    bq.delete_dataset(dataset_name, delete_contents=True, not_found_ok=True)


if __name__ == "__main__":
    main()
```
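
The example above assumes each CSV file starts with a single header row (hence `skip_leading_rows=1`) and that its columns follow the table schema. As a rough sketch of how such a file could be produced, the helper below serializes rows to CSV text in schema order using only the standard library. `rows_to_csv` is a hypothetical name and not part of cloudsdp; uploading the result to a bucket is out of scope here.

```python
import csv
import io


def rows_to_csv(rows, schema):
    """Serialize rows to CSV text with one header row and columns in schema order."""
    fieldnames = [field["name"] for field in schema]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()  # The single header row that skip_leading_rows=1 would skip.
    writer.writerows(rows)
    return buffer.getvalue()


data_schema = [
    {"name": "name", "type": "STRING", "mode": "REQUIRED"},
    {"name": "age", "type": "INTEGER", "mode": "REQUIRED"},
]
rows = [{"name": "Someone", "age": 29}, {"name": "Something", "age": 22}]
print(rows_to_csv(rows, data_schema))
```

Deriving the column order from the schema keeps the file and the table aligned, which matters if columns are matched by position when schema autodetection is turned off.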

