Metadata-Version: 2.1
Name: databricks_optimize
Version: 0.2.2
Summary: Databricks utility to identify which column to use for z-ordering and partitioning.
Author-email: Marco Lombardi <m.lombardi@live.it>
License: MIT
Project-URL: Homepage, https://github.com/LombMarc/databricks-optimize
Project-URL: Documentation, https://github.com/LombMarc/databricks-optimize#readme
Project-URL: Repository, https://github.com/LombMarc/databricks-optimize
Project-URL: Issues, https://github.com/lombmarc/databricks-optimize/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyspark>=3.0.0
Requires-Dist: pytest>=8.3.5

# Databricks optimize utility
Library to generate more easily optimize statement based on existing data.

## Build from source
Build the wheel file is enough to run
```
python -m build --wheel
```

## Table optimizer
This object can be initialized with a table object name (with three level namespace "catalog.schema.table"), the method pre_optimization will run some statistics to identify the best column to run z-order (column with highest cardinality) and to re-partition (this depends on the size of the table and the size of the generated files partition). This is used to define and print the optimization query. Then to run the optimization itself we have two additional method, run_optimize and run_partition, these two can be executed even on their own without the need to run the pre_optimization, in this case the method will run the statistics and the generation separately.


```
from dbks_optimize.optimizer import TableOptimizer

opt = TableOptimizer(spark,"testing.bakehouse.sales_customers",force_partition_on_col='customerID')

opt.pre_optimization() #print out the optimization statement

opt.run_optimizer() #execute optimize on the table

opt.run_partition() #execute partitioning query (it clone i a new table and overwrite the existing one)
```

## Schema optimizer
This object leverage the existing class for single table object and iterate over list of table available computing table statistics.
pre_optimize is used to generate the optimize statement, while to run real optimization we can use run_db_optimization.

```
from dbks_optimize.optimizer import SchemaOptimizer

opt = SchemaOptimizer(spark,'testing.bakehouse')

opt.pre_optimization()

opt.run_db_optimization()
```

## Catalog optimizer
Using the same logic iterate over all schemas and for each on on all tables to generate the statement.
```
from dbks_optimize.optimizer import CatalogOptimizer

opt = CatalogOptimizer(spark,'testing')

opt.pre_optimization()

opt.run_catalog_optimization()
```

