Metadata-Version: 2.1
Name: cu-cat
Version: 0.7.0
Summary: An end-to-end gpu Python library that encodes categorical variables into machine-learnable numerics
Home-page: https://github.com/graphistry/cu-cat
Download-URL: https://github.com/graphistry/cu-cat
Author: The Graphistry Team
Author-email: pygraphistry@graphistry.com
License: BSD
Project-URL: Homepage, http://github.com/graphistry/cu-cat/
Project-URL: Source, https://github.com/graphistry/cu-cat
Keywords: cudf,cuml,GPU,Rapids
Platform: any
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: scikit-learn >=1.2.1
Requires-Dist: numpy >=1.23.5
Requires-Dist: scipy >=1.9.3
Requires-Dist: pandas >=1.5.3
Requires-Dist: packaging >=23.1
Provides-Extra: benchmarks
Requires-Dist: numpy ; extra == 'benchmarks'
Requires-Dist: pandas ; extra == 'benchmarks'
Requires-Dist: matplotlib ; extra == 'benchmarks'
Requires-Dist: seaborn ; extra == 'benchmarks'
Requires-Dist: tqdm ; extra == 'benchmarks'
Requires-Dist: thefuzz ; extra == 'benchmarks'
Requires-Dist: autofj ; extra == 'benchmarks'
Requires-Dist: pyarrow ; extra == 'benchmarks'
Requires-Dist: loguru ; extra == 'benchmarks'
Provides-Extra: dev
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: pytest-cov ; extra == 'dev'
Requires-Dist: pytest-xdist ==2.5.0 ; extra == 'dev'
Requires-Dist: pytest-xdist[psutil] ; extra == 'dev'
Requires-Dist: coverage ; extra == 'dev'
Requires-Dist: mypy ; extra == 'dev'
Requires-Dist: numpydoc ; extra == 'dev'
Requires-Dist: flake8 ; extra == 'dev'
Requires-Dist: openml ; extra == 'dev'
Requires-Dist: pre-commit ; extra == 'dev'
Provides-Extra: doc
Requires-Dist: pydata-sphinx-theme ; extra == 'doc'
Requires-Dist: sphinxext-opengraph ; extra == 'doc'
Requires-Dist: sphinx-copybutton ; extra == 'doc'
Requires-Dist: matplotlib ; extra == 'doc'
Requires-Dist: seaborn ; extra == 'doc'
Requires-Dist: statsmodels ; extra == 'doc'
Requires-Dist: numpydoc ; extra == 'doc'
Requires-Dist: jupyterlite-sphinx ; extra == 'doc'
Requires-Dist: jupyterlite-pyodide-kernel ; extra == 'doc'
Requires-Dist: pyarrow ; extra == 'doc'
Provides-Extra: min-py310
Requires-Dist: scikit-learn ==1.2.1 ; extra == 'min-py310'
Requires-Dist: numpy ==1.23.5 ; extra == 'min-py310'
Requires-Dist: scipy ==1.9.3 ; extra == 'min-py310'
Requires-Dist: pandas ==1.5.3 ; extra == 'min-py310'
Provides-Extra: polars
Requires-Dist: pyarrow ; extra == 'polars'
Requires-Dist: polars ; extra == 'polars'
Provides-Extra: pyarrow
Requires-Dist: pyarrow ; extra == 'pyarrow'


# **cu-cat** 

**cu-cat** is an end-to-end GPU Python library that encodes
categorical variables into machine-learnable numerics. It is a
CUDA-accelerated port of what was dirty_cat, now rebranded as
[skrub](https://github.com/skrub-data/skrub), and allows more ambitious interactive analysis and real-time pipelines.

[Loom video walkthrough](https://www.loom.com/share/d7fd4980b31949b7b840b230937a636f?sid=6d56b82e-9f50-4059-af9f-bfdc32cd3509)

# What can **cu-cat** do?

The latest PyGraphistry[AI] release GPU-accelerates its automatic feature encoding pipeline, and to do so, we are delighted to introduce the newest member of the open source GPU dataframe ecosystem: cu_cat!
The Graphistry team has been growing the library out of need. The straw that broke the camel’s back came in December 2022, when we were hacking on our winning entry to the US Cyber Command AI competition for automatically correlating and triaging gigabytes of alerts. We realized that what was slowing down our team's iteration cycles was CPU-based feature engineering, which was basically pouring sand into our otherwise humming end-to-end GPU AI pipeline. Two months later, cu_cat was born. Fast forward to now, and we are getting ready to make it default-on for all our work.

As its name hints, cu_cat is our GPU-accelerated open source fork of the popular CPU Python library dirty_cat. Like dirty_cat, cu_cat makes it easy to convert messy dataframes filled with numbers, strings, and timestamps into numeric feature columns optimized for AI models. It adds interoperability for GPU dataframes and replaces key kernels and algorithms with faster and more scalable GPU variants. Even on low-end GPUs, we can now tackle much larger datasets in the same amount of time, or for the first time at all, with end-to-end pipelines. We typically see **3-5X speedups, and sometimes 10X+**, and the more data you encode, the more time you save!
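
To make "numeric feature columns" concrete, here is a toy, CPU-only sketch in plain Python. It is not cu_cat's or dirty_cat's actual algorithm, and the `encode_rows` helper and the `price`, `color`, and `seen` column names are hypothetical, chosen only for illustration. It shows the general idea: numbers pass through, strings become indicator columns, and timestamps become epoch seconds.

```python
from datetime import datetime, timezone


def encode_rows(rows):
    """Turn mixed-type records into purely numeric feature rows (toy illustration)."""
    # Collect the distinct categories, sorted so the column order is deterministic.
    categories = sorted({row["color"] for row in rows})
    encoded = []
    for row in rows:
        # Numbers pass through unchanged (as floats).
        features = [float(row["price"])]
        # Strings become one indicator column per category (one-hot encoding).
        features += [1.0 if row["color"] == c else 0.0 for c in categories]
        # Timestamps become seconds since the Unix epoch, interpreted as UTC.
        ts = datetime.fromisoformat(row["seen"]).replace(tzinfo=timezone.utc)
        features.append(ts.timestamp())
        encoded.append(features)
    columns = ["price"] + [f"color_{c}" for c in categories] + ["seen_epoch"]
    return columns, encoded


# Hypothetical sample records, not taken from any real dataset.
sample = [
    {"price": 3, "color": "red", "seen": "2023-01-01T00:00:00"},
    {"price": 4.5, "color": "blue", "seen": "2023-01-02T00:00:00"},
]
columns, encoded = encode_rows(sample)
print(columns)
for row in encoded:
    print(row)
```

A real encoder also has to cope with typos, high-cardinality strings, and missing values, which is where a dedicated library earns its keep.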

## Startup Code:

    # !pip install graphistry[ai] ## future releases will have this by default
    !pip install git+https://github.com/graphistry/pygraphistry.git@dev/depman_gpufeat

    import cudf
    import graphistry
    df = cudf.read_csv(...)
    g = graphistry.nodes(df).featurize(feature_engine='cu_cat')
    print(g._node_features.describe()) # friendly dataframe interfaces
    g.umap().plot() # ML/AI embedding model using the features


## Example notebooks 

[Hello cu-cat notebook](https://github.com/dcolinmorgan/grph/blob/main/Hello_cu_cat.ipynb) goes in-depth on how to identify and deal with messy data using the **cu-cat** library.

**CPU vs. GPU Biological Demos:**
- Single Cell analysis [generically](https://github.com/dcolinmorgan/grph/blob/main/single_cell_umap_before_gpu.ipynb) and Single Cell analysis [accelerated by **cu-cat**](https://github.com/dcolinmorgan/grph/blob/main/single_cell_after_gpu.ipynb)

- Chemical Mapping [generically](https://github.com/dcolinmorgan/grph/blob/main/generic_chemical_mappings.ipynb) and Chemical Mapping [accelerated with **cu-cat**](https://github.com/dcolinmorgan/grph/blob/main/accelerating_chemical_mappings.ipynb)

- Metagenomic Analysis [generically](https://github.com/dcolinmorgan/grph/blob/main/generic_metagenomic_demo.ipynb) and Metagenomic Analysis [accelerated with **cu-cat**](https://github.com/dcolinmorgan/grph/blob/main/accelerating_metagenomic_demo.ipynb)


# Installation

**cu-cat** v0.06.05 can be easily installed via `pip`:

    pip install git+http://github.com/graphistry/cu-cat.git@v0.06.05

## Dependencies

The major dependencies are the cuml and cudf libraries, as well as [standard
Python
libraries](https://github.com/skrub-data/skrub/blob/main/setup.cfg).
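
Because cudf and cuml are usually installed separately from `pip`-managed packages, it can help to confirm they are visible to your interpreter before running a GPU pipeline. The sketch below uses only the standard library. The `check_dependencies` helper is a hypothetical name, not part of cu-cat, and the check only looks for the modules on the import path; it does not verify that a working GPU is present.

```python
from importlib.util import find_spec


def check_dependencies(names=("cudf", "cuml")):
    """Report which top-level modules are importable, without importing them."""
    # find_spec returns None when a top-level module cannot be located.
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name}: {'found' if available else 'missing'}")
```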

# Related projects

dirty_cat has been rebranded as
[skrub](https://github.com/skrub-data/skrub), part of the sklearn family.


