Metadata-Version: 2.1
Name: DI2
Version: 1.0.2
Summary: An an unsupervised discretization method, DI2, for variables with arbitrarily skewed distributions.
Home-page: https://github.com/JupitersMight/DI2
Author: L. Alexandre, R.S. Costa, R. Henriques
Author-email: leonardoalexandre@tecnico.ulisboa.pt
License: UNKNOWN
Project-URL: Bug Tracker, https://github.com/JupitersMight/DI2/issues
Keywords: multi-item discretization,prior-free discretization,heterogeneous biological data,data mining
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# DI2 (Distribution Discretizer)

## Discretizer

**distribution_discretizer(dataset, number_of_bins, statistical_test, cutoff_margin, kolmogorov_opt, normalizer, distributions, single_column_discretization)**

**distribution_discretizer(pandas.Dataframe, integer, optional:string, optional:float, optional:boolean, optional:string, optional:array)** - Discretizes data according to the best fitting distribution.

The **distribution_discretizer(pandas.Dataframe, integer, string, float, boolean, string, array)** receives the data (**pandas.Dataframe**), an **integer** representing the number of categories for discretization, an **string** with the name of the main statistical hypothesis test to apply (options available: **"chi2"**, **"ks"**), a **float** between 0 and 0.49 which indicates the width range to consider a value as being a border value, a **boolean** indicating if outliers should be removed, a **string**  indicating the normalization method to be used (options available: **"min_max"**,**"mean"**, **z_score**), an **array** of continuous distributions, from https://docs.scipy.org/doc/scipy/reference/stats.html, to be considered by the discretizer.


## Data normalization:

Normalizes data by min_max normalization https://en.wikipedia.org/wiki/Feature_scaling#Rescaling

Normalizes data z-score normalization https://en.wikipedia.org/wiki/Feature_scaling#Standardization

Normalizes data by mean normalization https://en.wikipedia.org/wiki/Feature_scaling#Mean_normalization


## Goodness of fit test

Person's chi-squared goodness of fit test https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test

Kolmogorovâ€“Smirnov goodness of fit test https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test


As an illustrative example we use the dataset available at the UCI machine learning repository https://archive.ics.uci.edu/ml/datasets/Breast+Tissue.

---> DI2 was developed by L. Alexandre (leonardoalexandre@tecnico.ulisboa.pt), R.S. Costa (rs.costa@fct.unl.pt) and R. Henriques <---


