Metadata-Version: 2.4
Name: benchstats
Version: 3.3.2
Summary: Statistical Testing for Benchmark Results Comparison
Author-email: Aleksei Rechinskii <5dpea9nhd@mozmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Arech/benchstats
Project-URL: Source, https://github.com/Arech/benchstats
Project-URL: Changelog, https://github.com/Arech/benchstats/blob/main/CHANGELOG.md
Project-URL: Bugtracker, https://github.com/Arech/benchstats/issues
Keywords: benchmarks,statistics,reports,analysis,optimization,performance,benchmarking,statistical analysis,brunner munzel,mann whitney u
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development
Classifier: Topic :: System :: Benchmark
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich
Requires-Dist: numpy
Requires-Dist: scipy
Dynamic: license-file

# Statistical Testing for Benchmark Results Comparison

`benchstats` is a Python 3.10+ package for performing comparison of benchmarking results<sup>*</sup>
using proper statistical tests and making a readable report on that. This removes guesswork and WAGs
from interpretation of benchmarking results, and gets a solid<sup>**</sup> answer to a question "is
the algorithm implementation A is faster than the algorithm implementation B".

Module has a CLI support to operate on data files, and a `benchstats.qbench` submodule has helper
methods for benchmarking of Python callables (see a corresponding section at the end of this README
for the latter).

Code that reads input data, perform statistical testing and then visualize results is coupled only by
shared data types, so it could easily be used separately if needed (for example, to build a
performance regression testing pipeline).

<sup>*</sup> - while currently `benchstats` has a built-in parser only for [Google Benchmark](https://github.com/google/benchmark)-produced `.json` files, it's very simple to make and use a parser for any other data source as long as you know how to interpret it. See a "Custom data sources" paragraph and `--files_parser` CLI argument below.

<sup>**</sup> - the solidness of the answer still depends on how well the underlying hardware and OS
are quiesced, and how inherently non-deterministic the algorithms under benchmark are. It isn't
covered here how to achieve these benchmarking basics, however, `benchstats` does have a couple of
features to alleviate testing if the benchmarking machine is stable enough, see `--expect_same` CLI
argument below. If individual latencies of each function under benchmark are available, it's highly recommended to use `benchstats.qbench.showBench()` method with sufficiently large
`pvalue_stats_bootstrap` parameter to validate very important execution order independence assumption.

## Rationale: why should you even bother to use statistical tests to compare benchmarking results?

*/\* First of all, let's agree that we're trying to detect rather small differences in latencies.
If code A runs an order of magnitude faster than code B, things are way simpler. \*/*

\- *Really, why bother? Can't you just measure and compare the run time of code variant A to the run time of code variant B?*

Unfortunately, this is usually meaningless. Just run the benchmark several times. The numbers will
fluctuate, sometimes wildly. Which particular number should you use for the comparison now?

\- *Ok, then why not take an average number over a certain amount of benchmark executions and compare these averages?*

This also suffers from the very same problem as above - the numbers from different re-runs will still fluctuate, even though a bit less. And how do you know if the difference is large enough to be attributed to real differences in the code instead of a random non-determinism?

\- *Ah, but that's simple! In addition to the average value, also compute a standard deviation and then if averages are within 1-2-whatever standard deviations apart one from the other, declare the results aren't different. Otherwise one is significantly less or greater than the other. If you want to be extra sure, go with a "5 sigma golden standard" just like they do it in Physics! Google Benchmark even computes standard deviation on its own when you run benchmark with repetitions! Easy!*

... and not a tiny bit more correct than comparing two random measurements directly. While one can
certainly compute [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) for any
sample, it's unclear if it make sense to do so: (a) standard deviation applies only to distributions
with a finite variance, and (b) it has meaningful interpretation only for very few specific
distributions. Certain distributions, like [Pareto](https://en.wikipedia.org/wiki/Pareto_distribution),
might not not have finite variance at all. If you ever plotted benchmarked latencies, you'd immediately notice that it looks much more as a Pareto, than a [Normal/Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution). Measurements you get when benchmarking are anything, but normally distributed, and a value computed as their "standard deviation" don't describe what it's meant to describe - a real dispersion of measurements around its mean.

None of above is a good idea. But what is? Let's step back and think what do we want to learn from comparing benchmark results? We want to learn if a code variant A is faster, i.e. have smaller run time, than a code variant B. But we can't just measure their run times and compare them directly, because each measurement is essentially a random variable. Even if algorithms are perfectly deterministic, the hardware inherently isn't, so instead of measuring a run time precisely, we always sample a random value from some unknown distribution instead. Hence, the only correct way to go is to sample a lot of these random variables for distributions A and B, and then compare these sets of samples against each other. If it's more likely to find a smaller value in the set A than in the set B, then A is faster than B.

But how to do that? It's a well known problem in Mathematics and several solutions exists, such as [Mann–Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) and a newer more generic [Brunner Munzel Test](https://en.wikipedia.org/wiki/Brunner_Munzel_Test). `benchstats` is essentially a convenience wrapper around these tests with fully customizable inputs and results visualization/exporting, letting you get the answers effortlessly.

## Installation

```bash
pip install benchstats
```

For command line use see section "Examples" and "Command Line Interface Reference" below. For API overview see a corresponding section below.

## Examples of Benchmark Comparison

*/\* I'll use a Google Benchmark-derived binary as a data source for the examples, but just reminding that any data source could be easily plugged-in and used instead. For conciseness, I'll shorten "Google Benchmark" to simply "GBench".*

*For the examples of benchmarking of Python callables, see "Benchmarking Python callables" section below \*/*

Note that in all cases, you should **always use at least 10 repetitions** of a benchmark (`--benchmark_repetitions=10`). Even 2 repetitions would work, but the more data you have, the more reliable results are, and at least 10 is a reasonable minimum, especially for Brunner Munzel test, which is a default test. `--benchmark_enable_random_interleaving` flag is highly recommended too.

Lets consider two use-cases to illustrate how to use `benchstats`.

### How to compare two different implementations of the same benchmark?

For example, you already have a benchmark in the `main` code branch and want to ensure that a `feature` branch, which modifies some underlying code, doesn't contain unexpected performance regressions. How to do that?

1. Run the benchmark from the `main` branch with `--benchmark_out=./my_main.json` argument to save results to `./my_main.json` file:

```bash
benchmark_binary --benchmark_repetitions=10 \
    --benchmark_enable_random_interleaving \
    --benchmark_out=./my_main.json
```

2. Checkout the `feature` code branch and similarly run the same benchmark saving results to `./my_feature.json` file.

3. Let the `benchstats` analyze the results:

```bash
benchstats ./my_main.json ./my_feature.json real_time
```

The first two positional arguments of `benchstats` module are just file paths and the third argument is a json field name that contain a metric value for comparison. (`real_time` is the default field/metric, so it doesn't even have to be specified, btw). GBench typically saves two metrics: `real_time` (shown as just `Time` in reports) which is an average time of one iteration, and a `cpu_time` (shown as `CPU` in reports) which is an average time CPU spent executing the user-mode code. GBench also supports custom timers, and you can use them here as well. Typically, custom timers have the same field name as the name that was given to a counter (in case of doubts, just peek into a `.json` file).

If the `benchmark_binary` had only a single benchmark, the result of the command could look something like this:

<img src="https://github.com/Arech/benchstats/blob/main/docs/imgs/simple1.svg?raw=true" alt="Sample benchmark comparison results for a single benchmark" width=600>

`benchstats` present results as a table: the first column correspond to a benchmark name and all other columns shows results of applying a statistical test to a corresponding metric (yes, you can supply many metrics at once and all of them will be analyzed. Try to append `cpu_time` to the command line above).

Here's what a data cell of the `real_time` column mean: there were 10 samples of an alternative A (from the first positional argument - `my_main.json`) and 10 samples of an alternative B (from the second positional argument `my_feature.json`), - hence this `(10 vs 10)` text in the end. Mean value (hence `(means)` in the column title) for alternative A over its 10 samples was 516.5 microseconds, while mean for B was 502.8 microseconds. The test was using significance level 0.001 (hence the string `alpha=0.00100` in the table title) and on that level test declared that both sets aren't different from each other (hence symbol `~` between the mean values). Actual [p-value](https://en.wikipedia.org/wiki/P-value) for tests that didn't find a difference is not shown by default, only blank space is allocated to report it (so no text shift happens when some benchmark has differences and others don't) but you can pass `--always_show_pval` flag to show it.

To show how `benchstats` report could look like when a significant difference is found, I'll just rerun the same analysis with less restrictive significance level (and will use the opportunity to show effects of some above-mentioned flags)

```bash
benchstats ./my_main.json ./my_feature.json real_time cpu_time \
    --always_show_pvalues --alpha=0.03
```

![Benchmark comparison results when a difference is detected](https://github.com/Arech/benchstats/blob/main/docs/imgs/simple2.svg?raw=true)

The main change is that now the test reports that according to `real_time` metric, the benchmark that generated `./my_main.json` run significantly slower than the benchmark that made `./my_feature.json`. Due to that the script has exited with exit code 1, letting one to detect that a human intervention might be needed. But what are these "main metrics" that the warning string refers to? Main metrics are such metrics, any difference in which makes the script to exit with error code 1. By default, a metric mentioned the first in the command line list of metrics is the main metric, but this could be changed with `--main_metrics` argument, which accepts a list of indices in the metrics list. Let's make `cpu_time` the main metric and see what would change:

```bash
benchstats ./my_main.json ./my_feature.json real_time cpu_time \
    --always_show_pvalues --alpha=0.03 --main_metrics 1
```

![Benchmark comparison results when a difference in non-main metric is detected](https://github.com/Arech/benchstats/blob/main/docs/imgs/simple3.svg?raw=true)

Now `cpu_time` is the main metric and the difference in it isn't significant. `real_time` now is a secondary metric, which is only reported, but doesn't influence the script exit code.

### How to compare two different benchmarks against each other?

For example, we are optimizing a function and want to compare how a new implementation compares to the old. We've created an individual benchmark for both implementations and for convenience have placed them in the same binary. That implies the benchmark functions must have different names. Now, if we run the benchmark binary and obtain `./my.json` results file and then will follow the approach above blindly, we won't get what we need. `benchstats` compares a benchmark from the source 1 to a benchmark with the same name in the source 2, so we'll end up comparing old vs old in one row, and new vs new in the other, just like this:

```bash
benchstats ./my.json ./my.json real_time
```
<img src="https://github.com/Arech/benchstats/blob/main/docs/imgs/simple4_bad.svg?raw=true" alt="Wrong benchmark comparison" width=700>

To solve that, `benchstats` has two features based on the Python standard [regular expression](https://docs.python.org/3.10/library/re.html#regular-expression-syntax) engine:

1. filtering benchmarks by name when reading a source file using a regexp in `--filter1` or `--filter2` flags correspondingly.
2. a benchmark name rewrite feature, based on regexp search and replace. `--from` (also `--from1` for source 1 only and `--from2` for source 2 only) flag sets a regular expression pattern to find in each benchmark name, and an optional `--to` (also `--to1` and `--to2`) flag sets replacement string to substitute the pattern found.

This allows us to make a plan:

1. source 1 will represent the new implementation. For that we'll read `./my.json` and take only benchmark names that doesn't have `Old` substring using a regexp `^(?!.*?Old).*$`
2. source 2 will represent the old implementation. We'll read `./my.json` and take only benchmarks with names that contain `Old` substring. The regexp is trivial literal string `Old`.
3. For source 2 we'll rewrite benchmark names to remove the `Old` prefix

```bash
benchstats ./my.json ./my.json real_time \
    --filter1='^(?!.*?Old).*$' --filter2=Old \
    --from2='(?:Old)(.*?)' --to2=\\1
```

Note the single `'` quotes around `--filter1` and `--from2` regexps. They contain special characters that bash will try interpret spawning an error, if we won't escape them. Single quotes prevent bash expansion (double quotes `"` don't!) and pass strings as is to the program. Also note that string replacement pattern for `--to2` has a back slash escaped for that very reason too: `\\1` instead of `\1`. Single quotes is an option here too.

Now we get a proper result:

<img src="https://github.com/Arech/benchstats/blob/main/docs/imgs/simple4.svg?raw=true" alt="Correct benchmark comparison when a single source file is used" width=700>

## Custom Data Sources

While `benchstats` today support one Google Benchmark -produced `.json` files out of the box, it's intended to be able to run statistical tests for any data sources. If you know how to parse your custom file format, connect to a remote database, generate data on the fly, or obtain data any other way, you can easily let `benchstats` to run statistical tests on it by writing a custom parser.

Any python class (a parser) named X, that:
- inherits from `benchstats.common.ParserBase` class
- can be instantiated with the following 4 parameters:
    1. `source_id` - a string identifier, describing a data source within parser's controlled namespace. It's is obtained as the first or the second positional argument of `benchstats` module CLI.
        - In the examples above, for the built-in `parser_GbenchJson` this is a source file name.
    2. `filter` - a string identifier of a filter, used to restrict a set of benchmarks to load. Obtained as a value of `--filter1` or `--filter2` CLI flags.
        - for the built-in `parser_GbenchJson` it's a string with regular expression to use for filtering benchmark names.
    3. `metrics` - is a non-empty list or a tuple of string identifiers, corresponding to metrics to read for benchmarks. Obtained from third and subsequent CLI positional arguments.
    4. `debug_log` - an instance of `benchstats.common.LoggingConsole` class that provides a
    console logging interface to report errors if any <small>(built-in parsers should also support `None`
    values and `bool` values and instantiate a `LoggingConsole` on its own, but support for this
    isn't needed for a custom parser)</small>
- has a `def getStats(self)` method returning a dictionary that maps strings describing benchmark names to dictionaries linking metric names with `np.ndarray()` of metric values.
- is saved to a file `X.py`

can be used as a custom data source for `benchstats` CLI. Just pass a path to the `X.py` as value of `--files_parser` argument (or `--file1_parser` or `--file2_parser` if you want to use different parsers for source1 and source2)

### The simplest example of a custom data source

Imagine you have a one-column CSV with data on which you want to run a statistical significance test. Save the following to `./myCSV.py`:

```python
import numpy as np
from benchstats.common import ParserBase

class myCSV(ParserBase):
    def __init__(self, fpath, filter, metrics, debug_log=None) -> None:
        self.stats = np.loadtxt(fpath, dtype=np.float64)

    def getStats(self) -> dict[str, dict[str, np.ndarray]]:
        return {"bm": {"real_time": self.stats}}
```

There's always only one benchmark name `bm` hardcoded, and only one metric `real_time` that is read from CSV, but you get the idea. It's that simple.

Now just run `benchstats` passing a path to the parser in `--files_parser` flag:

```bash
benchstats ./src1.csv ./src2.csv --files_parser ./myCSV.py
```

If you make a parser that might be usable by other people, please consider adding it to the project's built-in parsers set by opening a thread with a suggestion in https://github.com/Arech/benchstats/issues or by making a PR into the repo.

## Command Line Interface Reference

You can always get actual CLI help output by passing `--help` flag to the invocation of `benchstats`. Note that `--help` output has precedence over the description below.

```
usage: __main__.py [-h] [--show_debug | --no-show_debug]
    [--files_parser <files parser class or path>]
    [--file1_parser <file1 parser class or path>]
    [--filter1 <reg expr>]
    [--file2_parser <file2 parser class or path>]
    [--filter2 <reg expr>]
    [--from <reg expr>] [--to <replacement string>]
    [--from1 <reg expr>] [--to1 <replacement string>]
    [--from2 <reg expr>] [--to2 <replacement string>]
    [--main_metrics <metric idx> [<metric idx> ...]]
    [--method <method id>]
    [--alpha <positive float less than 0.5>]
    [--bonferroni]
    [--always_show_pvalues]
    [--sample_sizes | --no-sample_sizes]
    [--percent_diff | --no-percent_diff]
    [--sample_stats <stat ids> [<stat ids> ...]]
    [--metric_precision <int>=3>]
    [--expect_same]
    [--colors | --no-colors]
    [--multiline | --no-multiline]
    [--export_to <path/to/export_file>]
    [--export_fmt <format id>]
    [--export_light]
    <path/to/file1> <path/to/file2> [metrics ...]
```
### Options
- `-h, --help` show this help message and exit
- `--show_debug, --no-show_debug` Shows some additional debugging info. Default: True

#### Input data:
Arguments describing how a source data sets are obtained and are optionally
transformed.

- `<path/to/file1>` Path to the first data file with benchmark results.
                    See also `--file1_parser` and `--filter1` arguments.
- `<path/to/file2>` Path to the second data file with benchmark results.
                    See also `--file2_parser` and `--filter2` arguments
- `--files_parser <files parser class or path>` Sets files parser class identifier, if a built-in parser is used (options are: `GbenchJson`). Or sets a path to `.py` file defining a custom parser, inherited from `benchstats.common.ParserBase` class. The parser class name must be the same as the file name.
- `--file1_parser <file1 parser class or path>` Same as `--files_parser`, but only applies to `file1`.
- `--filter1 <reg expr>` If specified, sets a Python regular expression to select benchmarks by name from `<file1>`
- `--file2_parser <file2 parser class or path>` Same as --files_parser, but only applies to `file2`.
- `--filter2 <reg expr>` Same as `--filter1`, but for `<file2>`
- `--from <reg expr>` Arguments `--from` and `--to` works in a pair and are used to describe benchmark names transformation. `--from` sets a regular expression (Python re module flavor) to define a pattern to replace to `--to` replacement string. Typical use-cases include removing unnecessary parts of benchmark names (for example, Google Benchmark adds some suffixes that might not convey useful information) or "glueing" results of different benchmarks so they become comparable by the tool (for example, you have two benchmarks, the first is 'old_foo' with old algorithm implementation and the other is 'foo' with a new competing algorithm implementation, - to compare their performance against each other, you need to remove 'old_' prefix from the first benchmark with `--from old_` and apply `--filter1` and `--filter2` to restrict loading only corresponding data for old or new variations). Remember to escape characters that have special meaning for the shell.
- `--to <replacement string>` See help for `--from`.
- `--from1 <reg expr>`  Like `--from`, but for `<file1>` only.
- `--to1 <replacement string>` Like `--to`, but for `<file1>` only
- `--from2 <reg expr>`  Like `--from`, but for `<file2>` only
- `--to2 <replacement string>` Like `--to`, but for `<file2>` only
- `metrics`             List of metric identifiers to use in tests for each benchmark. Default: `real_time`. For deterministic algorithms highly recommend to measure and to use a minimum latency per repetition.
- `--main_metrics <metric idx> [<metric idx> ...]` Indexes in `metrics` list specifying the main metrics. These are displayed differently and differences detected cause script to exit(1).
Default: `0`.

#### Statistical settings:

- `--method <method id>`  Selects a method of statistical testing. Possible values are are: `brunnermunzel` for [Brunner Munzel test](https://en.wikipedia.org/wiki/Brunner_Munzel_Test) and `mannwhitneyu` for [Mann-Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test). Default is `brunnermunzel`
- `--alpha <positive float less than 0.5>` Set statistical significance level (a desired mistake probability level). Default `0.001`
    - Note! This is an actual probability of a mistake only when all preconditions of the chosen statistical test are met. One of the most important preconditions like independence of individual measurements, or constancy of underlying distribution parameters are almost never met in a real-life benchmarking, even on a properly quiesced hardware. Real mistake rates are higher. 
- `--bonferroni`          If set, applies a [Bonferroni multiple comparisons correction](https://en.wikipedia.org/wiki/Bonferroni_correction).

#### Rendering:
Controls how results are rendered

- `--always_show_pvalues` If set, always show p-values. By default it shows p-values only for significant differences. Note that when two sets are compared stochastically
same (`~`), or more precisely, not stochastically less and not stochastically greater (see [Stochastic ordering](https://en.wikipedia.org/wiki/Stochastic_ordering)),
p-value shown is a minimum of two p-values for less and greater comparison.
- `--sample_sizes, --no-sample_sizes` Controls whether to show sizes of datasets used in a test. Default is `True`.
- `--percent_diff, --no-percent_diff` Controls whether to show percentage difference between comparable values. When on, shows it in curly {} braces. Default is `True`.
- `--sample_stats <stat ids> [<stat ids> ...]` Sets which additional statistics about compared samples sets to show. Could be any sequence of floats (must be in range `[0,100]`, designating a percentile value to calculate), or aliases `extremums`, `median`, `iqr` (interquartile range), `std` (standard deviation). Shortenings are accepted. Use of `std`, though isn't recommended as standard deviation doesn't really mean anything for most distributions beyond Normal.
- `--metric_precision <int>=3>` Total number of digits in a metric value reported. Minimum is 3. Default is 4.
- `--expect_same` If set, assumes that distributions are the same (i.e. `H0` hypothesis is true) and shows some additional statistics useful for ensuring that a
benchmark code is stable enough, or the machine is quiesced enough. One good example is when two different source `<file1>` and `<file2>` were made with exactly the same binary running on exactly the same machine in the same state - on a machine with a proper setup tests shouldn't find a difference.
- `--colors, --no-colors` Controls if the output should be colored. Default: `True`
- `--multiline, --no-multiline` Controls if benchmark results could span several lines. Default: `False`

#### Export:
Controls how to export results.

- `--export_to <path/to/export_file>` Path to file to store comparison results to.
- `--export_fmt <format id>` Format of the export file. Options are: `txt`, `svg`, `html`. If not set, inferred from `--export_to` file extension.
- `--export_light` If set, uses light theme instead of dark

## API overview

The code is pretty trivial, so I'll post only a high-level overview: there are 3 major components interacting in `benchstats` when it runs in CLI mode:
1. one or two source data parsers, depending on value of `--files_parser`, `file1_parser` or `file2_parser`.
    - built-in parsers are stored in files starting with `parser_` prefix. What goes after the prefix is a parser identifier for CLI `--file?_parser` flag.
    - all parsers obey requirements listed above in "Custom Data Sources" section.
    - essentially a parser could be just a single function returning `dict[str, dict[str, np.ndarray]]` mapping from benchmark names to metric names to metric values. It's a class only for the simplicity of writing actual code and validating user-supplied parsers.
2. a comparison method `compareStats()` from `benchstats.compare`. It:
    - takes two dictionaries `dict[str, dict[str, np.ndarray]]` with source data plus some control arguments, run stat tests on all metric values of the dictionaries that have the same benchmark name (key of the outer-most dictionary) and a metric name (key of the inner-most dictionary), and then
    - produces `benchstats.compare.CompareStatsResult` namedtuple object with comparison results for that benchmarks+metrics, common to both source data sets.
3. a rendering method `renderComparisonResults()` from `benchstats.render`:
    - takes `benchstats.compare.CompareStatsResult` namedtuple produced by `compareStats()`, plus various control parameters, and renders it according to the rules and styles.

### Benchmarking Python Callables

`benchstats.qbench` module contains:
-  a simple benchmark runner `bench()` function, that takes a set
of callables and runs them with a given number of repetitions / iterations / warmups, and returns
a 3D numpy tensor of shape `[n_callables, repetitions, iterations]` with measured latencies
- a visualizer function `showBench()` that takes measured latencies and the names of benchmarks and
emits a report on them
- and a combined `benchmark()` function, that benchmarks given callables and show a report.

Check `src/benchstats/qbench.py` source for details parameters description.

Examples of use:

*Note, contrary to the examples above, this is a text log of a REPL session, so text is colored by GitHub's markdown formatter*

```python
>>> import time;
>>> from benchstats import qbench as qb;
>>> # benchmark with default settings (repetitions=10, iterations=100)
>>> r = qb.bench((lambda:time.sleep(0.001), lambda:time.sleep(0.00102), lambda:time.sleep(0.0012)))
>>> s = qb.showBench(r) # use default benchmark name and numeric alternative names
Benchmark comparison results (Brunner Munzel test, alpha=0.00100)
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃               ┃ min (means),             ┃ mean (means),            ┃
┃ Benchmark     ┃ [0%, 50%, 100%]          ┃ [0%, 50%, 100%]          ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ code | 0 vs 1 │ 1.031ms < 1.048ms        │ 1.118ms ~ 1.108ms        │
│               │ [1.021m,1.029m,1.057m] < │ [1.085m,1.113m,1.159m] ~ │
│               │ [1.039m,1.048m,1.062m]   │ [1.093m,1.096m,1.141m]   │
│               │ p=0.00024 (10 vs 10)     │           (10 vs 10)     │
│ code | 0 vs 2 │ 1.031ms < 1.235ms        │ 1.118ms < 1.293ms        │
│               │ [1.021m,1.029m,1.057m] < │ [1.085m,1.113m,1.159m] < │
│               │ [1.224m,1.230m,1.257m]   │ [1.274m,1.283m,1.320m]   │
│               │ p=0.00000+(10 vs 10)     │ p=0.00000+(10 vs 10)     │
│ code | 1 vs 2 │ 1.048ms < 1.235ms        │ 1.108ms < 1.293ms        │
│               │ [1.039m,1.048m,1.062m] < │ [1.093m,1.096m,1.141m] < │
│               │ [1.224m,1.230m,1.257m]   │ [1.274m,1.283m,1.320m]   │
│               │ p=0.00000+(10 vs 10)     │ p=0.00000+(10 vs 10)     │
└───────────────┴──────────────────────────┴──────────────────────────┘
>>> s = qb.showBench(r, bm_names=("c|A","c|B","c|C"), alt_delimiter="|") # custom names
Benchmark comparison results (Brunner Munzel test, alpha=0.00100)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃            ┃ min (means),             ┃ mean (means),            ┃
┃ Benchmark  ┃ [0%, 50%, 100%]          ┃ [0%, 50%, 100%]          ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ c | A vs B │ 1.031ms < 1.048ms        │ 1.118ms ~ 1.108ms        │
│            │ [1.021m,1.029m,1.057m] < │ [1.085m,1.113m,1.159m] ~ │
│            │ [1.039m,1.048m,1.062m]   │ [1.093m,1.096m,1.141m]   │
│            │ p=0.00024 (10 vs 10)     │           (10 vs 10)     │
│ c | A vs C │ 1.031ms < 1.235ms        │ 1.118ms < 1.293ms        │
│            │ [1.021m,1.029m,1.057m] < │ [1.085m,1.113m,1.159m] < │
│            │ [1.224m,1.230m,1.257m]   │ [1.274m,1.283m,1.320m]   │
│            │ p=0.00000+(10 vs 10)     │ p=0.00000+(10 vs 10)     │
│ c | B vs C │ 1.048ms < 1.235ms        │ 1.108ms < 1.293ms        │
│            │ [1.039m,1.048m,1.062m] < │ [1.093m,1.096m,1.141m] < │
│            │ [1.224m,1.230m,1.257m]   │ [1.274m,1.283m,1.320m]   │
│            │ p=0.00000+(10 vs 10)     │ p=0.00000+(10 vs 10)     │
└────────────┴──────────────────────────┴──────────────────────────┘
>>> # benchmark 4 callables
>>> r = qb.bench((lambda:time.sleep(0.001), lambda:time.sleep(0.00102), lambda:time.sleep(0.0012), lambda: time.sleep(0.00125)))
>>> s = qb.showBench(r,bm_names=("A","B")) # compare first two as alternatives of A, and the last two as B
Benchmark comparison results (Brunner Munzel test, alpha=0.00100)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃            ┃ min (means),             ┃ mean (means),            ┃
┃ Benchmark  ┃ [0%, 50%, 100%]          ┃ [0%, 50%, 100%]          ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ A | 0 vs 1 │ 1.050ms ~ 1.064ms        │ 1.110ms ~ 1.125ms        │
│            │ [1.028m,1.055m,1.061m] ~ │ [1.074m,1.112m,1.149m] ~ │
│            │ [1.047m,1.064m,1.079m]   │ [1.095m,1.128m,1.155m]   │
│            │           (10 vs 10)     │           (10 vs 10)     │
│ B | 0 vs 1 │ 1.254ms < 1.300ms        │ 1.321ms < 1.358ms        │
│            │ [1.231m,1.259m,1.263m] < │ [1.308m,1.321m,1.332m] < │
│            │ [1.275m,1.307m,1.312m]   │ [1.323m,1.362m,1.394m]   │
│            │ p=0.00000+(10 vs 10)     │ p=0.00007 (10 vs 10)     │
└────────────┴──────────────────────────┴──────────────────────────┘

```

## Stability and Changelog

The project uses semantic versioning scheme.

Depending on a feedback the project or its individual components might get breaking changes, so prefer to use version pinning to prevent unexpected breakages. See [CHANGELOG.md](https://github.com/Arech/benchstats/blob/main/CHANGELOG.md) for details.

## FAQ

### Isn't default value for alpha/p-value = 0.001 too small? I heard people generally use 0.05.

There are several things to unpack here.

First, just to remind, `--alpha` CLI argument sets any value one wants.

Second, p-value=0.05 is just a [1 mistake per 20 tests](https://xkcd.com/882/). This isn't a lot even for a single test, but in the process of optimizing the code you might want to run a benchmark dozens of times. 20 runs isn't much, but it's almost certain that with a default alpha/p-value=0.05 you'll get at least one false positive (i.e. the test will erroneously detect a change when there's no real difference). Also it's a common practice to have dozens benchmarks in a single binary. That means dozens tests will happen in parallel and also just due to a bad luck one of them almost certainly will produce a false positive. `--bonferroni` flag could help in this case though, but it won't help against many re-runs. So just because of these practical considerations it makes sense to use smaller p-values. 

But there also exist a much bigger fundamental reason to use smaller p-value/alpha. The problem is - p-value is really a probability of mistake if and only if all preconditions of the statistical test are fulfilled. One of such requirements in case of Brunner Munzel and Mann-Whitney U tests is about different measurements being independent one from the other. The problem is - in case of benchmarking this requirements is (almost?) always broken, because most of the times we aren't using fully properly quiesced machines. Instead, there's almost always a ton of sh*t happening in the background interrupting a benchmark here and there, corrupting CPU caches, and so on, in a totally unpredictable, but systematic manner. This brings subtle hidden dependencies into measurements benchmarks take, and that breaks the important precondition of a statistical test. The result is, - if you run the same benchmark two times, it's probable (unless you've invested a notable conscious effort in quiescing the machine), that the test will detect differences. This is especially acute on Windows, btw, though Linux could be prone too.

So using `0.001` is a good and safe enough compromise to accommodate the fact that real false positive rate is much higher than the p-value used for the statistical test.

And by the way, don't worry much that lowering p-value makes the test less sensitive. It's already sensitive enough to detect differences in latencies that are much smaller than a clock granularity due to use of all data samples, especially when benchmarks repetition number is high.
