Metadata-Version: 2.1
Name: tdprepview
Version: 1.4.1
Summary: Python package that creates data preparation pipelines as views written in Teradata SQL
Author: Martin Hillebrand
Author-email: martin.hillebrand@teradata.com
Keywords: tdprepview,teradata,database,preprocessing,data engineering,data science
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Database
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: AUTHORS.md
Requires-Dist: teradataml>=17.10.00.02
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scikit-learn>=1.3.0
Provides-Extra: plot
Requires-Dist: plotly>=5.0; extra == "plot"
Requires-Dist: seaborn>=0.11; extra == "plot"

![Logo](https://raw.githubusercontent.com/martinhillebrand/tdprepview/main/media/tdprepview_logo.png)


# tdprepview

Python package that creates data preparation pipelines as views written in Teradata SQL.

## Installation


* `pip install tdprepview`

## Features

* `Pipeline` class that allows creating in-database preprocessing pipelines
* Several preprocessor functions
* API similar to `sklearn.pipeline.Pipeline`
* Pipelines can be saved and loaded as dict/JSON
* Pipelines can be created automatically based on data types and distributions

![Preprocessors](https://raw.githubusercontent.com/martinhillebrand/tdprepview/main/media/supportedpreprocessors_v131.png)


## Quickstart

### 1. setting up the pipeline

```python
import teradataml as tdml
database_credentials = {
    # ...
}
DbCon = tdml.create_context(**database_credentials)

myschema, table_train, table_score = "...", "...", "..."
DF_train = tdml.DataFrame(tdml.in_schema(myschema,table_train))
DF_score = tdml.DataFrame(tdml.in_schema(myschema,table_score))

from tdprepview import Pipeline
from tdprepview import (
    StandardScaler, IterativeImputer, PCA, DecisionTreeBinning )

steps = [
    ({"pattern":"^floatfeat_[1-5]$"}, [StandardScaler(), IterativeImputer()], {"suffix":"_imputed"}),
    ({"suffix":"_imputed"}, PCA(n_components=2), {"prefix":"mypca_"}),
    ({"prefix":"mypca_"}, DecisionTreeBinning(target_var="target_class", no_bins=3))
]

mypipeline = Pipeline(steps=steps)

```

### 2.a transform DataFrames directly
This is suitable if you are working in a sandbox Jupyter notebook:
```python
# fit() calculates all necessary statistics in DB
mypipeline.fit(DF=DF_train)
# transform generates SQL syntax based on the chosen Preprocessors and the statistics from fit()
DF_train_transf = mypipeline.transform(DF_train)
DF_score_transf = mypipeline.transform(DF_score)
#... go on with some modelling (e.g. sklearn, TD_DecisionForest,...) 
# and scoring (via BYOM, TD_DecisionForestPredict, ...)
```

### 2.b inspect produced queries
If you want to check the SQL code that is generated by tdprepview:
```python
mypipeline.fit(DF=DF_train)
query_train = mypipeline.transform(DF_train, return_type = "str")
print(query_train)
```
The output (note how columns that are not part of the pipeline are simply forwarded):
```sql
WITH preprocessing_steps AS
(
    SELECT
    row_id AS c_i_35,
    floatfeat_1 AS c_i_36,
    floatfeat_2 AS c_i_37,
    floatfeat_3 AS c_i_38,
    target_class AS c_i_39,
    ZEROIFNULL( (( c_i_36 ) - -0.2331302 ) / NULLIF( 1.747159 , 0) ) AS c_i_40,
    ZEROIFNULL( (( c_i_37 ) - 0.03895576 ) / NULLIF( 1.722347 , 0) ) AS c_i_41,
    ZEROIFNULL( (( c_i_38 ) - 0.2363556 ) / NULLIF( 2.808312 , 0) ) AS c_i_42,
    -1.118451e-08  +  (-0.07714142) * (COALESCE(c_i_41, 1.610355e-09))  +  (-0.1758817) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_43,
    4.261288e-09  +  (-0.0431946) * (COALESCE(c_i_40, -1.045776e-08))  +  (0.6412595) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_44,
    -7.079888e-09  +  (-0.118112) * (COALESCE(c_i_40, -1.045776e-08))  +  (0.624912) * (COALESCE(c_i_41, 1.610355e-09)) AS c_i_45,
    (0.2604757) * (c_i_43)  +  (-0.681657) * (c_i_44)  +  (-0.6837369) * (c_i_45) AS c_i_46,
    (-0.1047098) * (c_i_43)  +  (0.6840609) * (c_i_44)  +  (-0.7218702) * (c_i_45) AS c_i_47,
    CASE     WHEN c_i_46 < -2.0 THEN 0     WHEN c_i_46 < -1.236351 THEN 1     WHEN c_i_46 < -1.182989 THEN 2     ELSE 3 END AS c_i_48,
    CASE     WHEN c_i_47 < -2.0 THEN 0     WHEN c_i_47 < -0.3139175 THEN 1     WHEN c_i_47 < 0.2286314 THEN 2     ELSE 3 END AS c_i_49
    FROM
        <input_schema>.<input_table_view> t
)

SELECT
    c_i_35 AS row_id,
    c_i_48 AS mypca_pc_1,
    c_i_49 AS mypca_pc_2,
    c_i_39 AS target_class
FROM
preprocessing_steps t
```
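The `ZEROIFNULL( (x - mean) / NULLIF(std, 0) )` expressions above are plain standardization, `(x - mean) / std`, with `NULLIF` guarding against a zero standard deviation. A minimal pure-Python check of the same formula (illustrative only; the sample values are made up):

```python
import numpy as np

# standard scaling, as emitted in the SQL above:
# ZEROIFNULL( (x - mean) / NULLIF(std, 0) )
x = np.array([1.0, 2.0, 3.0, 4.0])
mean, std = x.mean(), x.std()
scaled = (x - mean) / std if std != 0 else np.zeros_like(x)
# a standardized column has (approximately) mean 0 and standard deviation 1
print(scaled.mean(), scaled.std())
```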

If you want to inspect the pipeline as a chart:
```python
mypipeline.plot_sankey()
```
Output:

![Sankey Chart](https://raw.githubusercontent.com/martinhillebrand/tdprepview/main/media/example_sankey.png)



### 2.c persist transformed data as a view for later use
This approach is suitable for and compliant with the Teradata ModelOps framework, where `training.py` and `scoring.py`
are separate scripts.

__in `training.py`:__
```python
view_train_transf = table_train+"_transf_v"
view_score_transf = table_score+"_transf_v"

mypipeline.fit(schema_name = myschema, 
               table_name = table_train)

# 3. transform: create views for pipelines
# DF_train_transf is already based on the newly created View
DF_train_transf = mypipeline.transform(
                        schema_name = myschema, 
                        table_name = table_train,
                        # this triggers the creation of a VIEW, thus the transformation 
                        # pipeline is persisted
                        create_replace_view = True, 
                        output_schema_name = myschema, 
                        output_view_name= view_train_transf)

# 3.b create view for scoring table - no need for further inspection, thus no return
mypipeline.transform(   schema_name = myschema, 
                        table_name = table_score,
                        return_type = None,
                        create_replace_view = True, 
                        output_schema_name = myschema, 
                        output_view_name= view_score_transf)

# further steps:
# Model training with DF_train_transf, 
# e.g. local and use BYOM, or in-DB with TD_DecisionForest
# save model in DB
```

__in `scoring.py`:__
```python
view_score_transf = table_score+"_transf_v"
# 1. get DataFrame based on View, transform is happening as per view definition
DF_score_transf = tdml.DataFrame(tdml.in_schema(myschema, view_score_transf))

# 2. Model Scoring with trained model 
# (e.g. PMMLPredict, TD_DecisionForestPredict,...)
# get pointer for model from DB + execute scoring in DB

# 3. Save Scored Data in DB
```



# History

## v0.1.0 (2023-02-15)

### added

* First release on PyPI.
* Pipeline with fit and transform functions
* Preprocessor Functions
    * Impute
    * ImputeText
    * TryCast
    * Scale
    * CutOff
    * FixedWidthBinning
    * ThresholdBinarizer
    * ListBinarizer
    * VariableWidthBinning
    * LabelEncoder
    * CustomTransformer
* Notebooks for tests
* Demo Notebook

## v0.1.2 (2023-02-15)

### fixed

* added `*.sql` to MANIFEST.in so that SQL templates are also part of the distribution.

### changed

* HISTORY and README file from rst to Markdown

## v0.1.3 (2023-02-16)

### added

* Quickstart in README file

## v0.1.4 (2023-02-17)

### added

* DecisionTreeBinning as Preprocessing function

## v1.0.2 (2023-03-06)

Major overhaul of tdprepview. It now supports transformations that change the schema of the input DataFrame, such as one-hot encoding. The implementation is based on a directed acyclic graph (DAG). 

### added

* `plot_sankey()` function for the `Pipeline` class, which plots the DAG as a Sankey chart.
* Preprocessing functions
  * SimpleImputer (adapted from sklearn, based on Impute)
  * IterativeImputer (adapted from sklearn)
  * StandardScaler (adapted from sklearn, based on Scale)
  * MaxAbsScaler (adapted from sklearn, based on Scale)
  * MinMaxScaler (adapted from sklearn, based on Scale)
  * RobustScaler (adapted from sklearn, based on Scale)
  * Normalizer (adapted from sklearn)
  * QuantileTransformer (adapted from sklearn, based on VariableWidthBinning)
  * Binarizer (adapted from sklearn, based on ThresholdBinarizer)
  * PolynomialFeatures (adapted from sklearn)
  * OneHotEncoder (adapted from sklearn)
  * PCA (adapted from sklearn)

### changed

* exactly one query is generated: a single `WITH ... AS` clause that contains all preprocessing steps, followed by one final `SELECT` that returns all output columns with the correct column names.
* the steps in the `steps` argument of `Pipeline` are more flexible. They are either tuples with two elements `(input_column(s), preprocessor(s))` or tuples with three elements `(input_column(s), preprocessor(s), options)`
  * input_column(s): string, list of strings or _NEW_ dictionary for dynamic filtering:
    * `{'pattern':<regex pattern to search for column names>}`
    * `{'prefix':<string column names start with>}`
    * `{'suffix':<string column names end with>}`
    * `{'dtype_include':<list of tdtypes to include>}`
    * `{'dtype_exclude':<list of tdtypes to exclude>}`
    * filter conditions can be combined
  * preprocessor(s): one or list of Preprocessors.
  * options: dict for renaming column names after applying preprocessor. Useful for filtering them in a later step
    * `{'prefix':<adds string to beginning of column names>}`
    * `{'suffix':<adds string to end of column names>}`
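The dynamic filter semantics can be sketched in plain Python. `matching_columns` below is a hypothetical helper for illustration only (it is not part of tdprepview), and it assumes, for illustration, that combined conditions must all hold:

```python
import re

def matching_columns(columns, pattern=None, prefix=None, suffix=None):
    """Illustrative sketch: keep columns that satisfy every given filter."""
    selected = []
    for col in columns:
        if pattern is not None and re.search(pattern, col) is None:
            continue
        if prefix is not None and not col.startswith(prefix):
            continue
        if suffix is not None and not col.endswith(suffix):
            continue
        selected.append(col)
    return selected

cols = ["floatfeat_1", "floatfeat_2", "mypca_pc_1", "target_class"]
print(matching_columns(cols, pattern=r"^floatfeat_[0-9]+$"))  # ['floatfeat_1', 'floatfeat_2']
print(matching_columns(cols, prefix="mypca_"))                # ['mypca_pc_1']
```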

## v1.1.0 (2024-03-01)

bugfixes, compatibility with newer teradataml version, optional requirements, feature hashing

### added

* Preprocessing functions
  * SimpleHashEncoder (a text column can be hash encoded using in-DB HASHROW)
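In-DB, the hashing is done by Teradata's HASHROW; the idea can be mimicked in plain Python with `hashlib` (illustration only, not the actual in-DB function):

```python
import hashlib

texts = ["apple", "banana", "apple"]
# hash encoding: identical inputs deterministically map to identical codes
codes = [hashlib.md5(t.encode()).hexdigest()[:8] for t in texts]
print(codes[0] == codes[2])  # True  (same input, same code)
print(codes[0] != codes[1])  # True  (different inputs, different codes)
```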

### changed

* when a pipeline is crystallised in a view, the package checks the teradataml version and uses either `tdml.get_context().execute(q)` or `tdml.execute_sql(q)`.
* bugfix: `IterativeImputer` query generation
* plotly & seaborn are now optional requirements. Their availability is checked only when `plot_sankey()` is called.



## v1.2.0 (2024-03-06)

minor bugfixes, Multi-Label Binarizer

### added

* Preprocessing functions
  * MultiLabelBinarizer (a text column with multiple different values separated by a delimiter can be encoded with multiple binary variables)
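The encoding idea can be illustrated in plain Python (sketch only; the delimiter `|` and the values are made up, and the package performs this in-DB):

```python
rows = ["red|blue", "blue", "green|red"]  # text column, '|' as delimiter
# collect all distinct labels, then emit one binary indicator per label
labels = sorted({lab for row in rows for lab in row.split("|")})
encoded = [[int(lab in row.split("|")) for lab in labels] for row in rows]
print(labels)   # ['blue', 'green', 'red']
print(encoded)  # [[1, 0, 1], [1, 0, 0], [0, 1, 1]]
```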


## v1.3.0 (2024-03-06)

`column_exclude` options for inputs in steps, Cast function

### added

* Preprocessing functions
  * Cast (analogous to SQL, useful to convert everything but target and key to float as last step)


## v1.3.1 (2024-03-26)

### added

* Preprocessing functions
  * PowerTransformer (analogous to sklearn PowerTransformer)

## v1.3.2 (2024-03-28)

Fitted Pipeline can now be persisted in a serialised file and reused later.

### added

* Pipelines can now be serialized using
  * `mydict = mypipeline.to_dict()` or 
  * `mypipeline.to_json("mypipeline.json")`
* and analogously re-created from a serialized representation using
  * `mypipeline = Pipeline.from_dict(mydict)` or
  * `mypipeline = Pipeline.from_json("mypipeline.json")`

## v1.4.0 (2024-04-08)

Introducing automatic pipeline creation based on heuristics, either via `Pipeline.from_DataFrame(...)` or via `auto_code(...)`.

### added

* Pipelines can now be created automatically based on a tdml.DataFrame 
  * `mypipeline = Pipeline.from_DataFrame(DF, non_feature_cols=["rowid","target"], fit_pipeline=True)`
  * It uses heuristics based on data types and distributions to decide which preprocessing functions make sense.
* If you only want to see the code for the `steps` parameter, you can use
  * `steps_str = auto_code(DF, non_feature_cols=["rowid","target"])`
  * `print(steps_str) # see steps and adjust if needed...` 
  * `steps = eval(steps_str)`
  * `mypipeline = Pipeline(steps)`
