Metadata-Version: 2.1
Name: embetter
Version: 0.6.4
Summary: Just a bunch of useful embeddings to get started quickly.
Home-page: https://koaning.github.io/embetter/
Author: Vincent D. Warmerdam
Project-URL: Documentation, https://koaning.github.io/embetter/
Project-URL: Source Code, https://github.com/koaning/embetter/
Project-URL: Issue Tracker, https://github.com/koaning/embetter/issues
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Description-Content-Type: text/markdown
Requires-Dist: scikit-learn >=1.0.0
Requires-Dist: pandas >=1.0.0
Requires-Dist: diskcache >=5.6.1
Requires-Dist: skops >=0.8.0
Requires-Dist: sentence-transformers >=2.2.2
Provides-Extra: all
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'all'
Requires-Dist: pandas >=1.0.0 ; extra == 'all'
Requires-Dist: diskcache >=5.6.1 ; extra == 'all'
Requires-Dist: skops >=0.8.0 ; extra == 'all'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'all'
Requires-Dist: sense2vec ==2.0.0 ; extra == 'all'
Requires-Dist: bpemb >=0.3.3 ; extra == 'all'
Requires-Dist: gensim >=4.3.1 ; extra == 'all'
Requires-Dist: scipy <1.13.0 ; extra == 'all'
Requires-Dist: timm >=0.6.7 ; extra == 'all'
Requires-Dist: openai >=0.25.0 ; extra == 'all'
Provides-Extra: bpemb
Requires-Dist: bpemb >=0.3.3 ; extra == 'bpemb'
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'bpemb'
Requires-Dist: pandas >=1.0.0 ; extra == 'bpemb'
Requires-Dist: diskcache >=5.6.1 ; extra == 'bpemb'
Requires-Dist: skops >=0.8.0 ; extra == 'bpemb'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'bpemb'
Provides-Extra: cohere
Requires-Dist: cohere >=4.11.2 ; extra == 'cohere'
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'cohere'
Requires-Dist: pandas >=1.0.0 ; extra == 'cohere'
Requires-Dist: diskcache >=5.6.1 ; extra == 'cohere'
Requires-Dist: skops >=0.8.0 ; extra == 'cohere'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'cohere'
Provides-Extra: dev
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'dev'
Requires-Dist: pandas >=1.0.0 ; extra == 'dev'
Requires-Dist: diskcache >=5.6.1 ; extra == 'dev'
Requires-Dist: skops >=0.8.0 ; extra == 'dev'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'dev'
Requires-Dist: sense2vec ==2.0.0 ; extra == 'dev'
Requires-Dist: bpemb >=0.3.3 ; extra == 'dev'
Requires-Dist: gensim >=4.3.1 ; extra == 'dev'
Requires-Dist: scipy <1.13.0 ; extra == 'dev'
Requires-Dist: timm >=0.6.7 ; extra == 'dev'
Requires-Dist: openai >=0.25.0 ; extra == 'dev'
Requires-Dist: mkdocs ==1.5.2 ; extra == 'dev'
Requires-Dist: mkdocs-material ==9.1.21 ; extra == 'dev'
Requires-Dist: mkdocstrings ==0.22.0 ; extra == 'dev'
Requires-Dist: mkdocstrings-python ==1.3.0 ; extra == 'dev'
Requires-Dist: mktestdocs ==0.1.2 ; extra == 'dev'
Requires-Dist: interrogate >=1.5.0 ; extra == 'dev'
Requires-Dist: flake8 >=3.6.0 ; extra == 'dev'
Requires-Dist: pytest >=4.0.2 ; extra == 'dev'
Requires-Dist: black >=19.3b0 ; extra == 'dev'
Requires-Dist: pre-commit >=2.2.0 ; extra == 'dev'
Requires-Dist: datasets ==2.8.0 ; extra == 'dev'
Requires-Dist: matplotlib ==3.4.3 ; extra == 'dev'
Requires-Dist: pytest-xdist ; extra == 'dev'
Provides-Extra: gensim
Requires-Dist: gensim >=4.3.1 ; extra == 'gensim'
Requires-Dist: scipy <1.13.0 ; extra == 'gensim'
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'gensim'
Requires-Dist: pandas >=1.0.0 ; extra == 'gensim'
Requires-Dist: diskcache >=5.6.1 ; extra == 'gensim'
Requires-Dist: skops >=0.8.0 ; extra == 'gensim'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'gensim'
Provides-Extra: openai
Requires-Dist: openai >=0.25.0 ; extra == 'openai'
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'openai'
Requires-Dist: pandas >=1.0.0 ; extra == 'openai'
Requires-Dist: diskcache >=5.6.1 ; extra == 'openai'
Requires-Dist: skops >=0.8.0 ; extra == 'openai'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'openai'
Provides-Extra: pytorch
Requires-Dist: torch >=1.12.0 ; extra == 'pytorch'
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'pytorch'
Requires-Dist: pandas >=1.0.0 ; extra == 'pytorch'
Requires-Dist: diskcache >=5.6.1 ; extra == 'pytorch'
Requires-Dist: skops >=0.8.0 ; extra == 'pytorch'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'pytorch'
Provides-Extra: sense2vec
Requires-Dist: sense2vec ==2.0.0 ; extra == 'sense2vec'
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'sense2vec'
Requires-Dist: pandas >=1.0.0 ; extra == 'sense2vec'
Requires-Dist: diskcache >=5.6.1 ; extra == 'sense2vec'
Requires-Dist: skops >=0.8.0 ; extra == 'sense2vec'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'sense2vec'
Provides-Extra: spacy
Requires-Dist: spacy >=3.5.0 ; extra == 'spacy'
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'spacy'
Requires-Dist: pandas >=1.0.0 ; extra == 'spacy'
Requires-Dist: diskcache >=5.6.1 ; extra == 'spacy'
Requires-Dist: skops >=0.8.0 ; extra == 'spacy'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'spacy'
Provides-Extra: text
Requires-Dist: sense2vec ==2.0.0 ; extra == 'text'
Requires-Dist: bpemb >=0.3.3 ; extra == 'text'
Requires-Dist: gensim >=4.3.1 ; extra == 'text'
Requires-Dist: scipy <1.13.0 ; extra == 'text'
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'text'
Requires-Dist: pandas >=1.0.0 ; extra == 'text'
Requires-Dist: diskcache >=5.6.1 ; extra == 'text'
Requires-Dist: skops >=0.8.0 ; extra == 'text'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'text'
Provides-Extra: vision
Requires-Dist: timm >=0.6.7 ; extra == 'vision'
Requires-Dist: scikit-learn >=1.0.0 ; extra == 'vision'
Requires-Dist: pandas >=1.0.0 ; extra == 'vision'
Requires-Dist: diskcache >=5.6.1 ; extra == 'vision'
Requires-Dist: skops >=0.8.0 ; extra == 'vision'
Requires-Dist: sentence-transformers >=2.2.2 ; extra == 'vision'


# embetter

> "Just a bunch of useful embeddings to get started quickly."

<img src="https://raw.githubusercontent.com/koaning/embetter/main/docs/images/icon.png" width="125" height="125" align="right" />

<br> 

Embetter implements scikit-learn compatible embeddings for computer vision and text. It is meant to make it easy to build proofs of concept quickly with scikit-learn pipelines and, in particular, to help with [bulk labelling](https://www.youtube.com/watch?v=gDk7_f3ovIk). It is also designed to work well with [bulk](https://github.com/koaning/bulk) and [scikit-partial](https://github.com/koaning/scikit-partial), and it can be combined with your favorite ANN solution, such as [lancedb](https://lancedb.github.io/lancedb/).

## Install 

You can install via pip.

```
python -m pip install embetter
```

Many of the embeddings are optional, depending on your use case. To install
only the tools that you need, pick the matching extra:

```
python -m pip install "embetter[text]"
python -m pip install "embetter[spacy]"
python -m pip install "embetter[sense2vec]"
python -m pip install "embetter[gensim]"
python -m pip install "embetter[bpemb]"
python -m pip install "embetter[vision]"
python -m pip install "embetter[all]"
```
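
If you are unsure which optional backends ended up in your environment, a small check along these lines can help. This is a minimal sketch that only uses the standard library; it assumes the import names match the package names behind the extras listed above.

```python
from importlib.util import find_spec

# Import names assumed to match the optional packages behind the extras.
OPTIONAL_MODULES = ["spacy", "sense2vec", "gensim", "bpemb", "timm"]

# find_spec returns None when a top-level module cannot be found.
available = {name: find_spec(name) is not None for name in OPTIONAL_MODULES}

for name, is_installed in available.items():
    status = "installed" if is_installed else "missing"
    print(f"{name}: {status}")
```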

## API Design 

These are the components that are currently implemented.

```python
# Helpers to grab text or image from pandas column.
from embetter.grab import ColumnGrabber

# Representations/Helpers for computer vision
from embetter.vision import ImageLoader, TimmEncoder, ColorHistogramEncoder

# Representations for text
from embetter.text import SentenceEncoder, MatryoshkaEncoder, Sense2VecEncoder, BytePairEncoder, spaCyEncoder, GensimEncoder

# Representations from multi-modal models
from embetter.multi import ClipEncoder

# Finetuning components 
from embetter.finetune import FeedForwardTuner, ContrastiveTuner, ContrastiveLearner, SbertLearner

# External embedding providers, typically needs an API key
from embetter.external import CohereEncoder, OpenAIEncoder
```

All of these components are scikit-learn compatible, which means that you
can use them as you normally would in a scikit-learn pipeline. Just be aware
that the encoding components are stateless: they are pretrained tools, so
they do not need to be trained before use.
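
To make "stateless" concrete, here is a minimal sketch of what a stateless scikit-learn transformer can look like. `TextLengthEncoder` is a hypothetical stand-in written for this illustration, not a class from embetter; a pretrained encoder would sit in the same position in a pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class TextLengthEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical stateless encoder: maps each text to two simple numbers."""

    def fit(self, X, y=None):
        # Nothing to learn, so fit only returns self.
        return self

    def transform(self, X, y=None):
        # One row per text: character count and word count.
        return np.array([[len(text), len(text.split())] for text in X], dtype=float)


texts = ["positive sentiment", "super negative", "really great", "quite bad"]
labels = ["pos", "neg", "pos", "neg"]

# No call to fit is needed before transform, because there is no state.
features = TextLengthEncoder().transform(texts)

# The same encoder slots into a pipeline in front of a trainable model.
pipeline = make_pipeline(TextLengthEncoder(), LogisticRegression())
pipeline.fit(texts, labels)
predictions = pipeline.predict(texts)
```

Only the final `LogisticRegression` step learns anything here; the encoder step passes through `fit` untouched.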

## Text Example

```python
import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# This pipeline grabs the `text` column from a dataframe,
# which then gets fed into Sentence-Transformers' all-MiniLM-L6-v2.
text_emb_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)

# This pipeline can also be trained to make predictions, using
# the embedded features. 
text_clf_pipeline = make_pipeline(
  text_emb_pipeline,
  LogisticRegression()
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)
```

## Image Example

The goal of the API is to allow pipelines like this: 

```python
import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.vision import ImageLoader
from embetter.multi import ClipEncoder

# This pipeline grabs the `img_path` column from a dataframe,
# loads each image path as a `PIL.Image` object, and feeds
# those into CLIP, which can handle images as well as text.
image_emb_pipeline = make_pipeline(
  ColumnGrabber("img_path"),
  ImageLoader(convert="RGB"),
  ClipEncoder()
)

dataf = pd.DataFrame({
  "img_path": ["tests/data/thiscatdoesnotexist.jpeg"]
})
image_emb_pipeline.fit_transform(dataf)
```

## Batched Learning 

All of the encoding tools you've seen here are also compatible
with the [`partial_fit` mechanism](https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning)
in scikit-learn. That means you can leverage
[scikit-partial](https://github.com/koaning/scikit-partial)
to build pipelines that can handle out-of-core datasets.
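
The batched pattern itself can be sketched with plain scikit-learn. In the example below, `HashingVectorizer` is used as a stand-in for a stateless embedding step and the `batches` generator is a hypothetical data source; neither is part of embetter or scikit-partial, so treat this as an illustration of the idea rather than the library's own API.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stand-in for a stateless embedding step (an assumption for this sketch);
# a pretrained encoder would play this role in practice.
encoder = HashingVectorizer(n_features=2**10)
classifier = SGDClassifier(random_state=0)


def batches():
    """Hypothetical data source that yields small (texts, labels) batches."""
    yield ["positive sentiment", "really great"], ["pos", "pos"]
    yield ["super negative", "quite bad"], ["neg", "neg"]
    yield ["great stuff", "bad stuff"], ["pos", "neg"]


for texts, labels in batches():
    # The encoder has no state, so each batch can be transformed on its own.
    X = encoder.transform(texts)
    # All classes are declared up front, since a batch may not contain them all.
    classifier.partial_fit(X, labels, classes=["neg", "pos"])

predictions = classifier.predict(encoder.transform(["really great", "quite bad"]))
```

Because only one batch is held in memory at a time, the same loop works when the batches are read from disk instead of a list.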

