Metadata-Version: 2.4
Name: pyCoDaMath
Version: 1.1
Summary: Compositional data (CoDa) analysis tools for Python
Author-email: Christian Brinch <cbri@food.dtu.dk>
License: MIT
Project-URL: Homepage, https://bitbucket.org/genomicepidemiology/pycodamath
Project-URL: Bug Tracker, https://bitbucket.org/genomicepidemiology/pycodamath/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: adjustText>=0.7.3
Requires-Dist: matplotlib>=3.1.1
Requires-Dist: numpy>=1.17.2
Requires-Dist: pandas>=0.25.1
Requires-Dist: python-ternary>=1.0.6
Requires-Dist: scipy>=1.3.1
Requires-Dist: webcolors>=1.13
Dynamic: license-file

# pyCoDaMath

[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)


pyCoDaMath provides compositional data (CoDa) analysis tools for Python.

- **Source code:** https://bitbucket.org/genomicepidemiology/pycodamath

## Getting Started

This package extends the Pandas DataFrame object with various CoDa tools. It also provides a set of plotting functions for CoDa figures.

### Installation

Clone the git repo to your local hard drive:

    git clone https://bitbucket.org/genomicepidemiology/pycodamath.git

Enter the directory and install:

    pip install .

### Usage

Import the pyCoDaMath module:

    import pycodamath

The CoDa tools are then available through the `coda` attribute of any Pandas DataFrame. To get CLR values from a DataFrame `df`, call

    df.coda.clr()


## Documentation

### CLR transformation - point estimate
    df.coda.clr()

Returns centered logratio (CLR) coefficients. If the DataFrame contains zeros, values
are replaced by the Aitchison mean point estimate before the transformation.
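
For orientation, the CLR of a single composition can be sketched in plain Python as below. This only illustrates the textbook definition (log of each part minus the mean of the logs); the function name `clr` and the sample values are illustrative, and the sketch does not reproduce the package's zero handling.

```python
import math

def clr(composition):
    """Centered logratio: log of each part minus the mean of the logs."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

sample = [10.0, 30.0, 60.0]      # illustrative data, all parts > 0
clr_values = clr(sample)
rescaled = clr([1.0, 3.0, 6.0])  # same composition on a different scale
print(clr_values)
```

CLR coefficients sum to zero and do not depend on the scale of the input, so `clr_values` and `rescaled` agree.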

### CLR transformation - standard deviation
    df.coda.clr_std(n_samples=5000)

Returns the standard deviation of `n_samples` random draws in CLR space.

**Parameters**

- n_samples (int) - Number of random draws from a Dirichlet distribution.
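
The idea of estimating uncertainty from Dirichlet draws can be sketched with the standard library alone. This is a rough illustration, not the package's implementation: the Dirichlet parameters are assumed to be the counts plus a hypothetical `prior` of 0.5, and the names `dirichlet` and `clr_std_sketch` are made up for the example.

```python
import math
import random
import statistics

def dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def clr(composition):
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_std_sketch(counts, n_samples=5000, prior=0.5, seed=0):
    rng = random.Random(seed)
    alphas = [c + prior for c in counts]  # assumed Dirichlet parameters
    draws = [clr(dirichlet(alphas, rng)) for _ in range(n_samples)]
    # Standard deviation of each CLR coordinate across the draws.
    return [statistics.stdev(column) for column in zip(*draws)]

spread = clr_std_sketch([120, 30, 0, 850], n_samples=2000)
print(spread)
```

Parts with low or zero counts come out with a larger spread than well-sampled parts, which is the information a standard deviation in CLR space is meant to convey.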


### ALR transformation - point estimate
    df.coda.alr(part=None)

Returns additive logratio values. If `part` is None, the last part of the composition is used as the denominator.

**Parameters**

- part (str) - Name of the part to use as denominator.
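
As a point of reference, the ALR of one composition can be written as follows. The sketch assumes the usual definition, in which the denominator part is dropped from the output; the helper `alr` and its `denominator_index` argument are illustrative and select the part by position rather than by name.

```python
import math

def alr(composition, denominator_index=-1):
    """Additive logratio: log of every other part divided by the denominator part."""
    index = denominator_index % len(composition)
    denominator = composition[index]
    return [math.log(part / denominator)
            for i, part in enumerate(composition) if i != index]

sample = [10.0, 30.0, 60.0]
alr_last = alr(sample)       # denominator is the last part: ln(10/60), ln(30/60)
alr_first = alr(sample, 0)   # denominator is the first part: ln(30/10), ln(60/10)
print(alr_last, alr_first)
```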


### ALR transformation - standard deviation
    df.coda.alr_std(part=None, n_samples=5000)

Returns the standard deviation of `n_samples` random draws in ALR space.

**Parameters**

- part (str) - Name of the part to use as denominator.
- n_samples (int) - Number of random draws from a Dirichlet distribution.


### ILR transformation - point estimate
    df.coda.ilr(psi=None)

Returns isometric logratio values. If no basis is given, a default sequential binary partition basis is used.

**Parameters**

- psi (array_like) - Orthonormal basis. If None, the default SBP basis is used.


### ILR inverse transformation
    df.coda.ilr_inv(psi=None)

Returns the composition corresponding to a set of ILR coordinates. The same basis used for the forward transform must be supplied.

**Parameters**

- psi (array_like) - Orthonormal basis. If None, the default SBP basis is used.
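
The need to reuse the same basis for the forward and inverse transforms can be seen in a small round trip. The sketch below builds one possible orthonormal basis (pivot coordinates); this is an assumption for illustration and is not necessarily the package's default SBP basis. The names `pivot_basis`, `ilr` and `ilr_inv` are local helpers, not the accessor methods.

```python
import math

def pivot_basis(n_parts):
    """Return n_parts - 1 orthonormal rows, each of which sums to zero."""
    basis = []
    for i in range(n_parts - 1):
        remaining = n_parts - i - 1  # number of parts after part i
        row = [0.0] * n_parts
        row[i] = math.sqrt(remaining / (remaining + 1))
        for j in range(i + 1, n_parts):
            row[j] = -1.0 / math.sqrt(remaining * (remaining + 1))
        basis.append(row)
    return basis

def ilr(composition, basis):
    """Project the log composition onto each basis row."""
    logs = [math.log(p) for p in composition]
    return [sum(b * l for b, l in zip(row, logs)) for row in basis]

def ilr_inv(coordinates, basis):
    """Map coordinates back to CLR space, exponentiate, and close to 1."""
    n_parts = len(basis[0])
    clr_values = [sum(z * row[j] for z, row in zip(coordinates, basis))
                  for j in range(n_parts)]
    exps = [math.exp(v) for v in clr_values]
    total = sum(exps)
    return [e / total for e in exps]

basis = pivot_basis(3)
sample = [0.1, 0.3, 0.6]
coords = ilr(sample, basis)
recovered = ilr_inv(coords, basis)
print(coords, recovered)
```

A composition with three parts yields two ILR coordinates, and inverting with the same basis recovers the original composition closed to 1.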


### Aitchison point estimate
    df.coda.aitchison_mean(alpha=1.0)

Returns the Bayesian point estimate based on the Dirichlet concentration parameter alpha.
Use values between 0.5 (sparse prior) and 1.0 (flat prior).

**Parameters**

- alpha (float) - Dirichlet concentration parameter. Defaults to 1.0.


### Bayesian zero replacement
    df.coda.zero_replacement(n_samples=5000)

Returns a count table with zero values replaced by finite values using Bayesian inference.

**Parameters**

- n_samples (int) - Number of random draws from a Dirichlet distribution.


### Closure
    df.coda.closure(N)

Closes the composition to the constant N, i.e. rescales each composition so that its parts sum to N.

**Parameters**

- N (float) - Closure constant.
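
Closure is a simple rescaling, shown here for a single composition in plain Python. The function name and the default of 1.0 are illustrative only.

```python
def closure(composition, total=1.0):
    """Rescale the parts so that they sum to `total`."""
    current = sum(composition)
    return [part * total / current for part in composition]

counts = [2.0, 6.0, 12.0]
percentages = closure(counts, 100.0)
print(percentages)
```

The ratios between parts are unchanged; only the sum is fixed.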


### Variance matrix
    df.coda.varmatrix(nmp=False)

Returns the total variation matrix of a composition. For large datasets, variance is
estimated from at most 500 rows.

**Parameters**

- nmp (bool) - If True, return a numpy array instead of a DataFrame. Defaults to False.


### Total variance
    df.coda.totvar()

Returns the total variance of a set of compositions, computed as the sum of the
variance matrix divided by twice the number of parts.
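
The relation between the variation matrix and the total variance can be sketched as follows. The sketch assumes the standard definition of the matrix entries as the variance of ln(x_i / x_j) across samples, uses the sample variance, and works on all rows rather than a subsample; the data and function names are illustrative.

```python
import math
import statistics

def variation_matrix(rows):
    """Entry (i, j) is the variance of ln(x_i / x_j) across all rows."""
    n_parts = len(rows[0])
    logs = [[math.log(p) for p in row] for row in rows]
    return [[statistics.variance([l[i] - l[j] for l in logs])
             for j in range(n_parts)]
            for i in range(n_parts)]

def total_variance(rows):
    """Sum of the variation matrix divided by twice the number of parts."""
    matrix = variation_matrix(rows)
    return sum(sum(r) for r in matrix) / (2 * len(matrix))

def clr(composition):
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

data = [[10.0, 30.0, 60.0], [20.0, 20.0, 60.0],
        [5.0, 45.0, 50.0], [25.0, 35.0, 40.0]]
totvar = total_variance(data)
clr_variance_sum = sum(statistics.variance(col)
                       for col in zip(*[clr(r) for r in data]))
print(totvar, clr_variance_sum)
```

With this definition the total variance equals the sum of the variances of the CLR coordinates, which is a convenient cross-check.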


### Geometric mean
    df.coda.gmean()

Returns the geometric mean of a set of compositions as percentages.
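
A plain-Python sketch of this quantity, assuming it is the part-wise geometric mean across samples closed to 100, looks like this. The helper name and data are illustrative.

```python
import math

def geometric_mean_percent(rows):
    """Part-wise geometric mean across rows, closed to 100."""
    n_rows = len(rows)
    means = [math.exp(sum(math.log(v) for v in column) / n_rows)
             for column in zip(*rows)]
    total = sum(means)
    return [100.0 * m / total for m in means]

data = [[1.0, 4.0, 5.0], [4.0, 1.0, 5.0]]
center = geometric_mean_percent(data)
print(center)
```

Here the first two parts have the same geometric mean (2), so they receive equal shares of the closed result.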


### Power transformation
    df.coda.power(alpha)

Applies compositional scalar multiplication (power transformation).

**Parameters**

- alpha (float) - Scalar multiplier.


### Perturbation
    df.coda.perturbation(comp)

Applies a compositional perturbation (Aitchison addition) with another composition.

**Parameters**

- comp (array_like) - Composition to perturb with.
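
Power transformation and perturbation play the roles of scalar multiplication and addition in the simplex. A minimal sketch for single compositions, with illustrative helper names and results closed to 1:

```python
def closure(parts, total=1.0):
    current = sum(parts)
    return [p * total / current for p in parts]

def power(composition, alpha):
    """Raise every part to the power alpha, then close."""
    return closure([p ** alpha for p in composition])

def perturb(composition, other):
    """Multiply the compositions part by part, then close."""
    return closure([p * q for p, q in zip(composition, other)])

x = [0.1, 0.3, 0.6]
y = [0.5, 0.25, 0.25]
doubled = power(x, 2.0)
twice = perturb(x, x)
shifted = perturb(x, y)
print(doubled, shifted)
```

Perturbing a composition with itself gives the same result as a power transformation with `alpha=2`, mirroring `x + x = 2x` in ordinary vector arithmetic.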


### Scaling
    df.coda.scale()

Scales the composition by the reciprocal of the square root of the total variance.


### Centering
    df.coda.center()

Centers the composition by perturbing with the reciprocal of the geometric mean.
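
The effect of centering can be checked with a short sketch: after perturbing every sample with the reciprocal of the geometric mean, the geometric mean of the data set becomes the uniform composition. The helper names and data are illustrative, and results are closed to 1 here.

```python
import math

def closure(parts, total=1.0):
    current = sum(parts)
    return [p * total / current for p in parts]

def column_gmeans(rows):
    """Geometric mean of each part across all rows."""
    n_rows = len(rows)
    return [math.exp(sum(math.log(v) for v in col) / n_rows)
            for col in zip(*rows)]

def center(rows):
    """Perturb every row with the reciprocal of the geometric mean."""
    gmeans = column_gmeans(rows)
    return [closure([v / g for v, g in zip(row, gmeans)]) for row in rows]

data = [[10.0, 30.0, 60.0], [20.0, 20.0, 60.0], [5.0, 45.0, 50.0]]
centered = center(data)
new_center = closure(column_gmeans(centered))
print(new_center)
```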


---

## Plotting functions

### Ternary diagram
    pycodamath.plot.ternary(data, descr=None, center=False, conf=False)

Plots a ternary diagram from a three-part composition closed to 100.

**Parameters**

- data (DataFrame) - Three-part compositional data, closed to 100.
- descr (Series) - Optional grouping variable; if provided, points are coloured by group.
- center (bool) - If True, the composition is centred before plotting. Defaults to False.
- conf (bool) - If True, a 95% confidence ellipse is overlaid. Defaults to False.


### Scree plot
    pycodamath.pca.scree_plot(axis, eig_val)

Plots a scree plot of explained variance from singular values.

**Parameters**

- axis - A Matplotlib axes object.
- eig_val (array_like) - Singular values from SVD.
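
For reference, the explained variance shown in a scree plot is conventionally derived from the squared singular values. The sketch below assumes that convention; whether `scree_plot` applies exactly this normalisation is not stated here, and the function name is illustrative.

```python
def explained_variance(singular_values):
    """Share of total variance per component, from squared singular values."""
    squared = [s ** 2 for s in singular_values]
    total = sum(squared)
    return [value / total for value in squared]

ratios = explained_variance([3.0, 2.0, 1.0])
print(ratios)
```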


### PCA biplot
    class pycodamath.pca.Biplot(data, axis=None, default=True)

Creates a PCA biplot based on a centered log-ratio transformation of the data.

**Parameters**

- data (DataFrame) - Compositional count data to analyse.
- axis - A Matplotlib axes object. If None, a new figure is created.
- default (bool) - If True, loadings and scores are plotted immediately. Defaults to True.

The following methods are available for customising the biplot:

- `plotloadings(cutoff=0, scale=None, labels=None, cluster=False)` — plot loading arrows.
  Set `cutoff` (as a fraction of the maximum loading length) to suppress short loadings.
  Set `cluster=True` to reduce the number of loadings by hierarchical clustering; the
  resulting cluster legend is accessible as `biplot.clusterlegend`.
- `plotloadinglabels(labels=None, loadings=None, cutoff=0)` — add text labels to loadings.
- `adjustloadinglabels()` — shift loading labels to reduce overlap.
- `plotscores(group=None, palette=None, legend=True, labels=None)` — plot sample scores
  as points, optionally coloured by group.
- `plotscorelabels(labels=None)` — add text labels to the scores.
- `plotellipses(group, palette=None, legend=False)` — plot 90% confidence ellipses for
  each group (requires at least 3 samples per group).
- `plotcentroids(group, palette=None, legend=False)` — plot the centroid of each group.
- `plothulls(group, palette=None, legend=True)` — plot convex hulls around each group
  (requires at least 3 samples per group).
- `plotcontours(group, palette=None, legend=True, plot_outliers=True, percent_outliers=0.1, linewidth=2.2)` — plot kernel density contours for each group. Samples outside the outermost contour are optionally shown as individual points.
- `labeloutliers(group, conf=3.0)` — label samples more than `conf` standard deviations
  from their group centroid.
- `displaylegend(loc=2)` — display the group legend at Matplotlib legend location `loc`.
- `removepatches()` — remove loading arrows and hull polygons from the plot.
- `removescores()` — remove score points from the plot.
- `removelabels()` — remove text labels from the plot.
- `removecontours()` — remove contour fills from the plot.

The keyword `labels` is a list of label names. If `labels` is None, all labels are plotted.

The keyword `group` is a Pandas Series with an index matching the data index.

The keyword `palette` is a dict mapping each unique group value to a colour.

**Example**

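    import pycodamath as coda
    import pandas as pd

    # Load the example data; with default=True the biplot is drawn immediately.
    data = pd.read_csv('example/kilauea_iki_chem.csv')
    mypca = coda.pca.Biplot(data)

    # Remove the text labels, then plot loadings reduced by hierarchical clustering.
    mypca.removelabels()
    mypca.plotloadings(cluster=True)
    print(mypca.clusterlegend)

    # Remove the labels again and plot a selected subset of loadings.
    mypca.removelabels()
    mypca.plotloadings(labels=['FeO', 'Al2O3', 'CaO'], cluster=False)
    mypca.adjustloadinglabels()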
