Metadata-Version: 2.3
Name: timebasedcv
Version: 0.1.0
Summary: Time based cross validation
Project-URL: documentation, https://fbruzzesi.github.io/timebasedcv/
Project-URL: repository, https://github.com/fbruzzesi/timebasedcv
Project-URL: issue-tracker, https://github.com/fbruzzesi/timebasedcv/issues
Author: Francesco Bruzzesi
License: MIT License
        
        Copyright (c) 2023 Francesco Bruzzesi
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.8
Requires-Dist: narwhals>=0.7.15
Requires-Dist: numpy
Requires-Dist: typing-extensions>=4.4.0; python_version < '3.11'
Provides-Extra: all
Requires-Dist: pandas>=1.2.0; extra == 'all'
Requires-Dist: polars>=0.20.3; extra == 'all'
Provides-Extra: all-dev
Requires-Dist: coverage==7.2.1; extra == 'all-dev'
Requires-Dist: hatch; extra == 'all-dev'
Requires-Dist: interrogate>=1.5.0; extra == 'all-dev'
Requires-Dist: mkdocs-autorefs; extra == 'all-dev'
Requires-Dist: mkdocs-material>=9.2.0; extra == 'all-dev'
Requires-Dist: mkdocs>=1.4.2; extra == 'all-dev'
Requires-Dist: mkdocstrings[python]>=0.20.0; extra == 'all-dev'
Requires-Dist: pandas>=1.2.0; extra == 'all-dev'
Requires-Dist: polars>=0.20.3; extra == 'all-dev'
Requires-Dist: pre-commit==2.21.0; extra == 'all-dev'
Requires-Dist: pytest-xdist==3.2.1; extra == 'all-dev'
Requires-Dist: pytest==7.2.0; extra == 'all-dev'
Requires-Dist: ruff>=0.4.0; extra == 'all-dev'
Requires-Dist: scikit-learn>=0.19; extra == 'all-dev'
Provides-Extra: dev
Requires-Dist: hatch; extra == 'dev'
Requires-Dist: pre-commit==2.21.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-autorefs; extra == 'docs'
Requires-Dist: mkdocs-material>=9.2.0; extra == 'docs'
Requires-Dist: mkdocs>=1.4.2; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.20.0; extra == 'docs'
Provides-Extra: lint
Requires-Dist: ruff>=0.4.0; extra == 'lint'
Provides-Extra: pandas
Requires-Dist: pandas>=1.2.0; extra == 'pandas'
Provides-Extra: polars
Requires-Dist: polars>=0.20.3; extra == 'polars'
Provides-Extra: test
Requires-Dist: coverage==7.2.1; extra == 'test'
Requires-Dist: interrogate>=1.5.0; extra == 'test'
Requires-Dist: pytest-xdist==3.2.1; extra == 'test'
Requires-Dist: pytest==7.2.0; extra == 'test'
Requires-Dist: scikit-learn>=0.19; extra == 'test'
Description-Content-Type: text/markdown

<img src="docs/img/timebasedcv-logo.svg" width=185 height=185 align="right">

![license-shield](https://img.shields.io/github/license/FBruzzesi/timebasedcv)
![interrogate-badge](docs/img/interrogate-shield.svg)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
![coverage-badge](docs/img/coverage.svg)
![versions-shield](https://img.shields.io/pypi/pyversions/timebasedcv)

# Time based cross validation

**timebasedcv** is a Python codebase that provides a cross validation strategy based on time.

---

[Documentation](https://fbruzzesi.github.io/timebasedcv) | [Repository](https://github.com/fbruzzesi/timebasedcv) | [Issue Tracker](https://github.com/fbruzzesi/timebasedcv/issues)

---

## Alpha Notice

This codebase is experimental and works for my use cases. There are very likely cases that are not covered and for which it breaks (badly). If you find one, please open an issue on the [issue page](https://github.com/FBruzzesi/timebasedcv/issues) of the repo.

## Description

The current implementation of [scikit-learn TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) splits the data by number of samples, so it lacks the flexibility to handle multiple samples within the same time period/unit.

This codebase addresses this problem by providing a cross validation strategy based on a **time period** rather than the number of samples. This is useful when the data is time dependent and the model should be trained on past data and tested on future data, regardless of the number of observations present within a given time period.

Temporal data leakage is an issue and we want to prevent that from happening!
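
To make the idea concrete, here is a minimal, standard-library-only sketch of splitting by time period rather than by sample count. It is an illustration under stated assumptions, not the library's implementation: the helper `time_based_masks` and its parameter names are hypothetical, windows are measured in whole days, and they advance forward from the earliest timestamp.

```python
from datetime import date, timedelta


def time_based_masks(times, train_days, forecast_days, gap_days=0,
                     stride_days=None, window="rolling"):
    """Yield (train_mask, forecast_mask) boolean lists, one flag per sample.

    Hypothetical helper for illustration only; not part of timebasedcv.
    """
    stride = timedelta(days=stride_days or forecast_days)
    first, last = min(times), max(times)
    train_start = first
    train_end = first + timedelta(days=train_days)
    while True:
        forecast_start = train_end + timedelta(days=gap_days)
        forecast_end = forecast_start + timedelta(days=forecast_days)
        if forecast_start > last:
            break  # no samples left to forecast on
        # Membership depends only on the timestamp, never on the sample count.
        train_mask = [train_start <= t < train_end for t in times]
        forecast_mask = [forecast_start <= t < forecast_end for t in times]
        yield train_mask, forecast_mask
        train_end += stride
        if window == "rolling":
            train_start += stride  # an expanding window keeps its start fixed


# Six days with a different number of samples per day: 3, 1, 5, 2, 4, 1.
samples_per_day = [3, 1, 5, 2, 4, 1]
times = [
    date(2023, 1, 1) + timedelta(days=offset)
    for offset, n in enumerate(samples_per_day)
    for _ in range(n)
]

rolling_counts = [
    (sum(train), sum(forecast))
    for train, forecast in time_based_masks(times, train_days=2, forecast_days=1, stride_days=1)
]
expanding_counts = [
    (sum(train), sum(forecast))
    for train, forecast in time_based_masks(
        times, train_days=2, forecast_days=1, stride_days=1, window="expanding"
    )
]
print(rolling_counts)    # [(4, 5), (6, 2), (7, 4), (6, 1)]
print(expanding_counts)  # [(4, 5), (9, 2), (11, 4), (15, 1)]
```

Each rolling train window always spans exactly two days, yet it contains between 4 and 7 samples depending on how many observations fall in those days. A split based on sample counts cannot guarantee this alignment with day boundaries.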

We introduce two main classes:

- [`TimeBasedSplit`](https://fbruzzesi.github.io/timebasedcv/api/timebasedsplit/#timebasedcv.timebasedsplit.TimeBasedSplit) lets you define a time based split with a given frequency, train size, test size, gap, stride and window type. Its core method `split` requires a time series as input in order to create the boolean masks for train and test from the instance configuration above. Therefore it is not compatible with [scikit-learn CV Splitters](https://scikit-learn.org/stable/common_pitfalls.html#id3).
- [`TimeBasedCVSplitter`](https://fbruzzesi.github.io/timebasedcv/api/timebasedsplit/#timebasedcv.timebasedsplit.TimeBasedCVSplitter) conforms to the scikit-learn CV Splitter interface, but requires the time series to be passed to the instance at construction time. That is because a CV Splitter needs to know the number of splits a priori, and its `split` method shouldn't take any extra arguments other than the arrays to split.
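
The constraint described in the second bullet can be sketched with a toy class that follows the scikit-learn splitter protocol (`get_n_splits` and `split(X, y, groups)`). The class `DayBasedCVSketch` and its parameters are hypothetical and only illustrate why the time series has to be given to the constructor. It is not the `TimeBasedCVSplitter` implementation, and it returns plain index lists where scikit-learn splitters yield integer index arrays.

```python
from datetime import date, timedelta


class DayBasedCVSketch:
    """Hypothetical splitter that receives the time series at construction time."""

    def __init__(self, time_series, train_days, forecast_days):
        self.time_series = list(time_series)
        self.train_days = train_days
        self.forecast_days = forecast_days
        # The windows depend only on the timestamps, so they can be computed
        # up front. This is what makes the number of splits known a priori.
        self._windows = list(self._make_windows())

    def _make_windows(self):
        start, last = min(self.time_series), max(self.time_series)
        while True:
            train_end = start + timedelta(days=self.train_days)
            forecast_end = train_end + timedelta(days=self.forecast_days)
            if train_end > last:
                break  # the forecast window would contain no samples
            yield start, train_end, forecast_end
            start += timedelta(days=self.forecast_days)

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self._windows)

    def split(self, X, y=None, groups=None):
        # Same signature as scikit-learn splitters: no extra time argument.
        if len(X) != len(self.time_series):
            raise ValueError("X and time_series must have the same length")
        for start, train_end, forecast_end in self._windows:
            train_idx = [i for i, t in enumerate(self.time_series) if start <= t < train_end]
            test_idx = [i for i, t in enumerate(self.time_series) if train_end <= t < forecast_end]
            yield train_idx, test_idx


# Four days with 2, 3, 1 and 2 samples respectively.
time_series = (
    [date(2023, 1, 1)] * 2 + [date(2023, 1, 2)] * 3
    + [date(2023, 1, 3)] * 1 + [date(2023, 1, 4)] * 2
)
X = list(range(len(time_series)))

cv = DayBasedCVSketch(time_series, train_days=2, forecast_days=1)
n_splits = cv.get_n_splits()
splits = list(cv.split(X))
print(n_splits)  # 2
print(splits)    # [([0, 1, 2, 3, 4], [5]), ([2, 3, 4, 5], [6, 7])]
```

Because the timestamps live on the instance, `split` only needs the arrays to split, which is the shape of interface that scikit-learn utilities expect from a `cv` argument.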

## Installation

**timebasedcv** is published on [PyPI](https://pypi.org/), so it can be installed directly via pip, from source using pip and git, or from a local clone:

<details open>

<summary> <b>pip</b> (suggested)</summary>

```bash
python -m pip install timebasedcv
```

</details>

<details closed>

<summary> <b>pip + source/git</b></summary>

```bash
python -m pip install git+https://github.com/FBruzzesi/timebasedcv.git
```

</details>

<details closed>

<summary> <b>local clone</b></summary>

```bash
git clone https://github.com/FBruzzesi/timebasedcv.git
cd timebasedcv
python -m pip install .
```

</details>

## Dependencies

As of **timebasedcv v0.1.0**, the only two dependencies are [`numpy`](https://numpy.org/doc/stable/index.html) and [`narwhals>=0.7.15`](https://marcogorelli.github.io/narwhals/).

The latter provides a compatibility layer between polars, pandas and other dataframe libraries. Therefore, as long as narwhals supports a given dataframe object, we will as well.

## Quickstart

The following code snippet is all you need to get started, yet consider checking out the [Getting Started](https://fbruzzesi.github.io/timebasedcv/getting-started/) section of the documentation for a detailed guide on how to use the library.

First, let's generate some data with a different number of points per day:

```python
import pandas as pd
import numpy as np
np.random.seed(42)

dates = pd.Series(pd.date_range("2023-01-01", "2023-01-31", freq="D"))
size = len(dates)

df = pd.concat([
    pd.DataFrame({
        "time": pd.date_range(start, end, periods=_size, inclusive="left"),
        "value": np.random.randn(_size-1)/25,
    })
    for start, end, _size in zip(dates[:size], dates[1:], np.random.randint(2, 24, size-1))
]).reset_index(drop=True)

time_series, X = df["time"], df["value"]
df.set_index("time").resample("D").count().head(5)
```

```terminal
time	        value
2023-01-01	14
2023-01-02	2
2023-01-03	22
2023-01-04	11
2023-01-05	1
```

Now let's run the split with a given frequency, train size, forecast horizon (test size), gap, stride and window type:

```python
from timebasedcv import TimeBasedSplit

configs = [
    {
        "frequency": "days",
        "train_size": 14,
        "forecast_horizon": 7,
        "gap": 2,
        "stride": 5,
        "window": "expanding"
    },
    ...
]

config = configs[0]  # pick one of the configurations defined above
tbs = TimeBasedSplit(**config)

fmt = "%Y-%m-%d"
for train_set, forecast_set in tbs.split(X, time_series=time_series):
    # Do some magic here
    pass
```

Let's see what the `train_set` and `forecast_set` splits would look like for different split strategies (or configurations).

The blue dots represent the train points, while the red dots represent the forecasting points.

![cross-validation](docs/img/cross-validation.png)

## Contributing

Please read the [Contributing guidelines](https://fbruzzesi.github.io/timebasedcv/contribute/) in the documentation site.

## License

The project is released under the [MIT License](https://github.com/FBruzzesi/timebasedcv/blob/main/LICENSE).
