Metadata-Version: 2.1
Name: do-data-utils
Version: 2.1.0
Summary: Functionalities to interact with Google and Azure, and clean data
Home-page: https://github.com/anuponwa/do-data-utils
Author: Anupong Wannakrairot
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Database :: Database Engines/Servers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: databricks-sdk==0.36.0
Requires-Dist: databricks-sql-connector==3.6.0
Requires-Dist: google==3.0.0
Requires-Dist: google-api-core==2.21.0
Requires-Dist: google-auth==2.35.0
Requires-Dist: google-cloud==0.34.0
Requires-Dist: google-cloud-bigquery==3.26.0
Requires-Dist: google-cloud-core==2.4.1
Requires-Dist: google-cloud-secret-manager==2.21.0
Requires-Dist: google-cloud-storage==2.18.2
Requires-Dist: google-crc32c==1.6.0
Requires-Dist: pandas
Requires-Dist: openpyxl==3.1.5
Requires-Dist: XlsxWriter==3.2.0

# do-data-utils

This package provides utilities for connecting to different cloud sources (Google Cloud and Azure/Databricks) and functions for cleaning data.
Package repo on PyPI: [do-data-utils - PyPI](https://pypi.org/project/do-data-utils/)

## Installation

### Commands

To install the latest released version, run:
```bash
pip install do-data-utils
```
To pin a specific version, for example:
```bash
pip install do-data-utils==2.1.0
```

### Install in requirements.txt

You can also pin the package in your `requirements.txt` and install it with `pip install -r requirements.txt`.
```text
# requirements.txt

do-data-utils==2.1.0
```

## Available Subpackages
- `google` – Utilities for Google Cloud Platform.
- `azure` – Utilities for Azure services.
- `pathutils` – Utilities related to paths.

For a full list of functions, see the [overview documentation](docs/overview.md).


## Example Usage

The typical workflow revolves around a few steps:
1. Keep the service account JSON secrets (for cloud services) in GCP Secret Manager.
2. Keep a local JSON key file that can access GCP Secret Manager.
3. Retrieve the secret for the cloud platform you want to interact with from Secret Manager.
4. Do your stuff...


### Google

#### GCS
##### Download

```python
import json

from do_data_utils.google import get_secret, gcs_to_df


# Load secret key and get the secret to access GCS
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

secret = get_secret(secret_info, project_id='my-secret-project-id', secret_id='gcs-secret-id-dev')

# Download a csv file to DataFrame
gcspath = 'gs://my-ai-bucket/my-path-to-csv.csv'
df = gcs_to_df(gcspath, secret, polars=False)
```
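The `gcspath` argument follows the standard `gs://bucket/key` form. A small stdlib-only helper (hypothetical, not part of this package) shows how such a path splits into a bucket name and an object key:

```python
# Hypothetical helper, not part of do-data-utils: split a gs:// URI
# into its bucket name and object key using only the standard library.
from urllib.parse import urlparse


def split_gcs_path(gcspath: str) -> tuple[str, str]:
    parsed = urlparse(gcspath)
    if parsed.scheme != 'gs':
        raise ValueError(f'Not a GCS path: {gcspath!r}')
    # netloc holds the bucket; path starts with '/', so strip it for the key
    return parsed.netloc, parsed.path.lstrip('/')


bucket, key = split_gcs_path('gs://my-ai-bucket/my-path-to-csv.csv')
# bucket == 'my-ai-bucket', key == 'my-path-to-csv.csv'
```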


```python
import json

from do_data_utils.google import get_secret, gcs_to_dict


# Load secret key and get the secret to access GCS
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

secret = get_secret(secret_info, project_id='my-secret-project-id', secret_id='gcs-secret-id-dev')

# Download the content from GCS
gcspath = 'gs://my-ai-bucket/my-path-to-json.json'
my_dict = gcs_to_dict(gcspath, secret=secret)
```

##### Upload
```python
import json

from do_data_utils.google import get_secret, dict_to_json_gcs


# Load secret key and get the secret to access GCS
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

secret = get_secret(secret_info, project_id='my-secret-project-id', secret_id='gcs-secret-id-dev')

my_setting_dict = {
    'param1': 'abc',
    'param2': 'xyz',
}

gcspath = 'gs://my-bucket/my-path-to-json.json'
dict_to_json_gcs(dict_data=my_setting_dict, gcspath=gcspath, secret=secret)
```
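Presumably, `dict_to_json_gcs` serializes the dictionary to JSON before uploading; the stdlib equivalent of the payload would look like this (an assumption about the upload format, not the package's documented behavior):

```python
import json

my_setting_dict = {
    'param1': 'abc',
    'param2': 'xyz',
}

# What the uploaded JSON body would contain, assuming a plain json.dumps
payload = json.dumps(my_setting_dict)
# payload == '{"param1": "abc", "param2": "xyz"}'
```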

#### GBQ

```python
import json

from do_data_utils.google import get_secret, gbq_to_df


# Load the secret-manager key and get the secret to access BigQuery
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

secret = get_secret(secret_info, project_id='my-secret-project-id', secret_id='gbq-secret-id-dev')

# Query
query = 'select * from `my-project.my-dataset.my-table`'
df = gbq_to_df(query, secret, polars=False)
```


### Azure/Databricks

```python
import json

from do_data_utils.azure import databricks_to_df
from do_data_utils.google import get_secret


# Load the secret-manager key and get the secret to access Databricks
with open('secrets/secret-manager-key.json', 'r') as f:
    secret_info = json.load(f)

secret = get_secret(secret_info, project_id='my-secret-project-id', secret_id='databricks-secret-id-dev')

# Download from Databricks sql
query = 'select * from datadev.dsplayground.my_table'
df = databricks_to_df(query, secret, polars=False)
```

### Path utils
```python
from do_data_utils.pathutils import add_project_root

# Adds your root folder to sys.path,
# so you can do imports from the root directory
add_project_root(levels_up=1)
```
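A rough sketch of what a helper like `add_project_root` likely does (an assumption based on its description, not the package's actual implementation):

```python
import sys
from pathlib import Path


def add_project_root_sketch(levels_up: int = 1) -> str:
    # Resolve the directory `levels_up` levels above the current working
    # directory and prepend it to sys.path, so that modules at the project
    # root become importable. Hypothetical stand-in for add_project_root.
    root = str(Path.cwd().resolve().parents[levels_up - 1])
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```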
