Module statkit.dataset
Various methods for partitioning the dataset, such as downsampling and splitting.
Functions
def balanced_downsample(X: pandas.core.frame.DataFrame | numpy.ndarray[tuple[typing.Any, ...], numpy.dtype[~_ScalarT]],
y: pandas.core.series.Series | numpy.ndarray[tuple[typing.Any, ...], numpy.dtype[~_ScalarT]],
ratio: int = 1,
replace: bool = False,
verbose: bool = False) ‑> numpy.ndarray[tuple[typing.Any, ...], numpy.dtype[~_ScalarT]]
Expand source code
def balanced_downsample(
    X: DataFrame | NDArray,
    y: Series | NDArray,
    ratio: int = 1,
    replace: bool = False,
    verbose: bool = False,
) -> NDArray:
    r"""Downsample majority class while stratifying for variables `X`.

    This method uses propensity score matching to subsample the majority class so
    that covariates `X` of both groups are balanced [1]. The logits from a logistic
    regression model [i.e., \( \ln \frac{p}{1-p} \), where
    \( p = p(y=1 \mid \pmb{X}) \)] are used to match the cases (`y=1`) to the
    closest controls (`y=0`). After downsampling, samples in both groups are
    therefore about equally likely to belong to either class (according to the
    features `X`).

    Warning: In the worst-case scenario, this method has a time complexity of
    \( O(m^2) \), where \( m \) is the number of samples.

    Example:
        Let's say you have two groups with systematic group differences in sex.
        ```python
        import numpy as np
        import pandas as pd

        names = ["eve", "alice", "carol", "dian", "bob", "frank"]
        group_label = pd.Series([1, 1, 0, 0, 0, 0], index=names)
        # Notice, no men in the case group (systematic bias).
        x_gender = np.array([0, 0, 0, 0, 1, 1])  # Female: 0; Male: 1.
        x_age = np.array([55, 75, 50, 60, 70, 80])
        demographics = pd.DataFrame(
            data={"gender": x_gender, "age": x_age},
            index=names,
        )
        ```

        Make a subselection of the majority class matching on age and gender. After
        downsampling, the control group has similar age and gender distributions
        (namely, no men).

        >>> from statkit.dataset import balanced_downsample
        >>> controls = balanced_downsample(X=demographics, y=group_label)
        >>> controls
        Index(["carol", "dian"], dtype='object')

    Args:
        X: Downsample while balancing (stratifying) the classes based on these
            features/covariates/exogenous variables.
        y: Binary classes to match (e.g., `y=1` is case, `y=0` is control).
        ratio: Downsample majority class to achieve this `majority:minority` ratio.
        replace: By default, subsample without replacement.
        verbose: If True, print progress.

    Returns:
        Index of the matched majority class (control group): integer indices if `X`
        is a NumPy array, or index labels if `X` is a DataFrame.

    References:
        [1]: Rosenbaum and Rubin, Biometrika 70, 1, pp. 41-55 (1983).
    """
    if replace:
        raise NotImplementedError("Downsampling with replacement is not implemented.")
    if ratio != 1:
        raise NotImplementedError("Downsampling with ratio != 1 is not implemented.")

    y_ = LabelBinarizer().fit_transform(y)
    y_ = np.squeeze(y_)

    # Swap classes if y=1 is the majority class.
    if sum(y_) > sum(1 - y_):
        y_ = 1 - y_

    # 1) Compute logits.
    model = LogisticRegression(penalty=None).fit(X, y_)
    logits = model.decision_function(X)

    # 2) Match the case with controls using propensity scores.
    control_indices = _find_nearest_matches_greedily(logits, y_, verbose)

    if isinstance(X, DataFrame):
        return X.index[control_indices]
    return control_indices

Downsample majority class while stratifying for variables X.
This method uses propensity score matching to subsample the majority class so that covariates X of both groups are balanced [1]. The logits from a logistic regression model [i.e., \( \ln \frac{p}{1-p} \), where \( p = p(y=1 \mid \pmb{X}) \)] are used to match the cases (y=1) to the closest controls (y=0). After downsampling, samples in both groups are therefore about equally likely to belong to either class (according to the features X).
Warning: In the worst-case scenario, this method has a time complexity of \( O(m^2) \), where \( m \) is the number of samples.
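To make the matching step concrete, here is a minimal pure-Python sketch of greedy nearest-neighbour matching on logits. It is an illustration under stated assumptions, not statkit's implementation: the helper names `logit` and `greedy_match` and the example probabilities are hypothetical, and the private helper `_find_nearest_matches_greedily` may behave differently (for example in visiting order or tie-breaking).

```python
import math


def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def greedy_match(scores, labels):
    """Pair each case (label 1) with its nearest unused control (label 0).

    Hypothetical helper: cases are visited in input order and every control
    is used at most once (matching without replacement).
    """
    case_ids = [i for i, label in enumerate(labels) if label == 1]
    available = [i for i, label in enumerate(labels) if label == 0]
    matches = []
    for case in case_ids:
        if not available:
            break  # Ran out of controls to match.
        # Linear scan over the remaining controls; with roughly as many cases
        # as controls this makes the whole loop quadratic in the sample count.
        nearest = min(available, key=lambda j: abs(scores[j] - scores[case]))
        matches.append(nearest)
        available.remove(nearest)
    return matches


# Made-up propensities p(y=1 | x) for six samples; the first two are cases.
probabilities = [0.70, 0.50, 0.75, 0.45, 0.10, 0.95]
labels = [1, 1, 0, 0, 0, 0]
scores = [logit(p) for p in probabilities]
matched = greedy_match(scores, labels)
print(matched)  # Positions of the selected controls.
```

In this sketch the first case (logit of 0.70) is paired with the control at position 2 (logit of 0.75), and the second case with the control at position 3, so one control is selected per case, which corresponds to a 1:1 ratio.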
Example
Let's say you have two groups with systematic group differences in sex.
import numpy as np
import pandas as pd

names = ["eve", "alice", "carol", "dian", "bob", "frank"]
group_label = pd.Series([1, 1, 0, 0, 0, 0], index=names)
# Notice, no men in the case group (systematic bias).
x_gender = np.array([0, 0, 0, 0, 1, 1])  # Female: 0; Male: 1.
x_age = np.array([55, 75, 50, 60, 70, 80])
demographics = pd.DataFrame(
    data={"gender": x_gender, "age": x_age},
    index=names,
)

Make a subselection of the majority class matching on age and gender. After downsampling, the control group has similar age and gender distributions (namely, no men).

>>> from statkit.dataset import balanced_downsample
>>> controls = balanced_downsample(X=demographics, y=group_label)
>>> controls
Index(["carol", "dian"], dtype='object')

Args
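X- Downsample while balancing (stratifying) the classes based on these features/covariates/exogenous variables.
y- Binary classes to match (e.g., y=1 is case, y=0 is control).
ratio- Downsample majority class to achieve this majority:minority ratio. Only ratio=1 is currently implemented; any other value raises NotImplementedError.
replace- By default, subsample without replacement. Sampling with replacement (replace=True) is not implemented and raises NotImplementedError.
verbose- If True, print progress.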
Returns
Index of the matched majority class (control group): integer indices if X is a NumPy array, or index labels if X is a DataFrame.
References
[1]: Rosenbaum and Rubin, Biometrika 70, 1, pp. 41-55 (1983).
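Because balanced_downsample returns only the index of the matched controls, building the balanced dataset still requires combining that index with the cases. The sketch below shows one way to do this with pandas. The `controls` index is hard-coded to the value from the example above so that the snippet runs without statkit; the names `cases`, `keep`, `X_balanced` and `y_balanced` are illustrative.

```python
import pandas as pd

names = ["eve", "alice", "carol", "dian", "bob", "frank"]
group_label = pd.Series([1, 1, 0, 0, 0, 0], index=names)
demographics = pd.DataFrame(
    {"gender": [0, 0, 0, 0, 1, 1], "age": [55, 75, 50, 60, 70, 80]},
    index=names,
)

# Assumption: stands in for the value returned by
# balanced_downsample(X=demographics, y=group_label) in the example above.
controls = pd.Index(["carol", "dian"])

# Keep every case plus the matched controls.
cases = group_label[group_label == 1].index
keep = cases.append(controls)

X_balanced = demographics.loc[keep]
y_balanced = group_label.loc[keep]
print(X_balanced)
print(y_balanced.value_counts())
```

With a NumPy array as X the returned indices are integer positions, so plain array indexing (for example `X[control_indices]`) would take the place of `.loc`.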
def split_multinomial_dataset(X: numpy.ndarray[tuple[typing.Any, ...], numpy.dtype[numpy.int64]] | pandas.core.frame.DataFrame,
test_size: float = 0.5,
random_state=None) ‑> tuple[numpy.ndarray[tuple[typing.Any, ...], numpy.dtype[numpy.int64]] | pandas.core.frame.DataFrame, numpy.ndarray[tuple[typing.Any, ...], numpy.dtype[numpy.int64]] | pandas.core.frame.DataFrame]
Expand source code
def split_multinomial_dataset(
    X: NDArray[np.int_] | DataFrame, test_size: float = 0.5, random_state=None
) -> tuple[NDArray[np.int_] | DataFrame, NDArray[np.int_] | DataFrame]:
    """Partition dataset, with number of observations per row, in a train-test split.

    Each row in `X` counts the number of observations per category (columns). This
    function divides, for each row, the observations between a train and a test set
    (with the test set getting a proportion of `test_size`).

    Example:
        Let's say you have a dataset with questionnaire fields, with the total
        number of product ratings:
        ```python
        import pandas as pd

        product_names = ["a", "b"]
        rating_names = ["🙁", "😐", "😃"]
        product_ratings = pd.DataFrame(
            [[0, 1, 0], [2, 3, 7]],
            product_names,
            rating_names,
        )
        ```

        The ratings of each product are multinomially distributed.

        >>> product_ratings
           🙁  😐  😃
        a  0  1  0
        b  2  3  7

        Here is how you make a train-test split, equally partitioning the ratings
        per product:

        >>> from statkit.dataset import split_multinomial_dataset
        >>> x_train, x_test = split_multinomial_dataset(
        ...     product_ratings, test_size=0.5,
        ... )
        >>> x_train
           🙁  😐  😃
        a  0  1  0
        b  1  2  4
        >>> x_test
           🙁  😐  😃
        a  0  0  0
        b  1  1  3

    Args:
        X: A dataset where each row counts the number of observations per category
            (columns). That is, each row is a multinomial draw.
        test_size: Proportion of draws to reserve for the test set.
        random_state: Seed for numpy pseudo random number generator state.

    Returns:
        A pair `X_train`, `X_test` both shaped like `X`.
    """
    random_state = np.random.default_rng(random_state)
    _single_split = partial(_single_multinomial_train_test_split, test_size=test_size)

    X_np = X
    if isinstance(X, DataFrame):
        X_np = X.to_numpy()

    x_as = []
    x_bs = []
    for x_i in X_np:
        x_a, x_b = _single_split(random_state, x_i)
        x_as.append(x_a)
        x_bs.append(x_b)

    X_train = np.stack(x_as)
    X_test = np.stack(x_bs)

    if isinstance(X, DataFrame):
        df_train = DataFrame(X_train, index=X.index, columns=X.columns)
        df_test = DataFrame(X_test, index=X.index, columns=X.columns)
        return df_train, df_test
    return X_train, X_test

Partition dataset, with number of observations per row, in a train-test split.
Each row in X counts the number of observations per category (columns). This function divides, for each row, the observations between a train and a test set (with the test set getting a proportion of test_size).
Example
Let's say you have a dataset with questionnaire fields, with the total number of product ratings:
import pandas as pd

product_names = ["a", "b"]
rating_names = ["🙁", "😐", "😃"]
product_ratings = pd.DataFrame(
    [[0, 1, 0], [2, 3, 7]],
    product_names,
    rating_names,
)

The ratings of each product are multinomially distributed.

>>> product_ratings
   🙁  😐  😃
a  0  1  0
b  2  3  7

Here is how you make a train-test split, equally partitioning the ratings per product:

>>> from statkit.dataset import split_multinomial_dataset
>>> x_train, x_test = split_multinomial_dataset(
...     product_ratings, test_size=0.5,
... )
>>> x_train
   🙁  😐  😃
a  0  1  0
b  1  2  4
>>> x_test
   🙁  😐  😃
a  0  0  0
b  1  1  3

Args
X- A dataset where each row counts the number of observations per category (columns). That is, each row is a multinomial draw.
test_size- Proportion of draws to reserve for the test set.
random_state- Seed for the NumPy pseudo-random number generator (passed to numpy.random.default_rng).
Returns
A pair X_train, X_test, both shaped like X.
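The per-row split described above can be illustrated with a small pure-Python sketch. This is one possible scheme and an assumption throughout: the private helper `_single_multinomial_train_test_split` is not shown on this page and may allocate observations differently. The function name `split_count_row` and its `seed` parameter are hypothetical.

```python
import random


def split_count_row(row, test_size=0.5, seed=None):
    """Split one row of category counts into train and test counts.

    Hypothetical scheme: treat every observation as an individual item,
    shuffle the items, and move a `test_size` fraction to the test set.
    """
    rng = random.Random(seed)
    # Expand counts into one category id per observation: [2, 1] -> [0, 0, 1].
    observations = [cat for cat, count in enumerate(row) for _ in range(count)]
    rng.shuffle(observations)

    n_test = round(len(observations) * test_size)
    test = [0] * len(row)
    for cat in observations[:n_test]:
        test[cat] += 1
    # Whatever was not held out stays in the train set.
    train = [total - held_out for total, held_out in zip(row, test)]
    return train, test


row = [2, 3, 7]  # Ratings of product "b" in the example above.
train, test = split_count_row(row, test_size=0.5, seed=0)
print(train, test)
```

Whatever the allocation scheme, the invariant to expect is that the train and test counts add up to the original row, category by category, as they do in the x_train and x_test tables of the example.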