Metadata-Version: 2.1
Name: scikit-embeddings
Version: 0.3.0
Summary: Tools for training word and document embeddings in scikit-learn.
License: MIT
Author: Márton Kardos
Author-email: power.up1163@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Provides-Extra: glove
Provides-Extra: spacy
Requires-Dist: catalogue (>=2.0.8,<3.0.0)
Requires-Dist: confection (>=0.1.0,<0.2.0)
Requires-Dist: gensim (>=4.3.0,<5.0.0)
Requires-Dist: glovpy (>=0.1.0,<0.2.0) ; extra == "glove"
Requires-Dist: huggingface-hub (>=0.16.0,<0.17.0)
Requires-Dist: scikit-learn (>=1.2.0,<2.0.0)
Requires-Dist: tokenizers (>=0.13.0,<0.14.0)
Description-Content-Type: text/markdown

<img align="left" width="82" height="82" src="assets/logo.svg">

# scikit-embeddings

<br>
Utilities for training, storing, and using word and document embeddings in scikit-learn pipelines.

## Features
 - Train word and paragraph embeddings in scikit-learn compatible pipelines.
 - Fast, trainable tokenizer components from `tokenizers`.
 - Components and pipelines that are easy to integrate into your scikit-learn workflows.
 - Easy serialization and integration with Hugging Face Hub for quickly publishing your embedding pipelines.

### What scikit-embeddings is not for:
 - Training transformer models and deep neural language models (use [transformers](https://huggingface.co/docs/transformers/index) for that)
 - Using pretrained sentence transformers (use [embetter](https://github.com/koaning/embetter))

## Installation

You can easily install scikit-embeddings from PyPI:

```bash
pip install scikit-embeddings
```

If you want to use GloVe embedding models, install it along with glovpy:

```bash
pip install scikit-embeddings[glove]
```
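
If you are not sure whether the optional extra is present in your environment, a small check like the one below can tell you. This is only a sketch using the standard library; `has_module` is a hypothetical helper, and it assumes the `glovpy` distribution is importable under the same name.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return find_spec(name) is not None


# Assumption: the `glove` extra installs a module importable as "glovpy".
if has_module("glovpy"):
    print("GloVe support is available.")
else:
    print("GloVe support is missing; run: pip install scikit-embeddings[glove]")
```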

## Example Pipelines

scikit-embeddings supports many different pipeline architectures. Here are a few:

### Word Embeddings

You can train classic word embeddings by building a pipeline that contains a `WordLevel` tokenizer and an embedding model:

```python
from skembeddings.tokenizers import WordLevelTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

# The tokenizer splits texts into words; the model learns a vector per token.
embedding_pipe = EmbeddingPipeline(
    WordLevelTokenizer(),
    Word2VecEmbedding(n_components=100, algorithm="cbow")
)
# `texts` is your training corpus (not defined here).
embedding_pipe.fit(texts)
```
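
To make the `algorithm` argument less abstract, here is a rough pure-Python sketch of the training examples the two classic word2vec variants are built from. It assumes `"cbow"` and `"sg"` carry their usual word2vec meaning; `training_pairs` is a hypothetical helper written for illustration and is not part of scikit-embeddings.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in `tokens`."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


tokens = "the quick brown fox".split()

# CBOW: predict the target word from its whole context at once.
cbow = list(training_pairs(tokens, window=1))

# Skip-gram: predict each context word from the target, one pair at a time.
skipgram = [(target, ctx) for context, target in cbow for ctx in context]

print(cbow[1])       # context and target for "quick"
print(skipgram[:2])  # first two (target, context) pairs
```

CBOW produces one example per position, while skip-gram produces one example per (target, context word) pair, which is why skip-gram tends to see more training examples from the same text.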

### Fasttext-like

You can train an embedding pipeline that uses subword information by choosing a subword tokenizer.
`Unigram`, `BPE` and `WordPiece` are all suitable for this purpose.
Fasttext also uses skip-gram by default, so let's switch to that.

```python
from skembeddings.tokenizers import UnigramTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    UnigramTokenizer(),
    Word2VecEmbedding(n_components=250, algorithm="sg")
)
embedding_pipe.fit(texts)
```
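
Why does subword information help? The sketch below illustrates the idea with fastText-style character n-grams: related word forms share pieces, so they can share what the model learns about those pieces. Note that this is an illustration only. The tokenizers mentioned above learn their subword vocabulary from data instead of using fixed n-grams, and `char_ngrams` is a hypothetical helper, not part of scikit-embeddings.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return all character n-grams of `word`, wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


# Trigrams only, to keep the output short.
walking = set(char_ngrams("walking", 3, 3))
walked = set(char_ngrams("walked", 3, 3))

# Pieces that both word forms have in common.
shared = walking & walked
print(sorted(shared))
```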

### Paragraph Embeddings

You can train Doc2Vec paragraph embeddings with the tokenization of your choice.

```python
from skembeddings.tokenizers import WordPieceTokenizer
from skembeddings.models import ParagraphEmbedding
from skembeddings.pipeline import EmbeddingPipeline, PretrainedPipeline

embedding_pipe = EmbeddingPipeline(
    WordPieceTokenizer(),
    ParagraphEmbedding(n_components=250, algorithm="dm")
)
embedding_pipe.fit(texts)
```

## Serialization

Pipelines can be safely serialized to disk:

```python
from skembeddings.pipeline import PretrainedPipeline

embedding_pipe.to_disk("output_folder/")

# Load the saved pipeline back from disk.
pretrained = PretrainedPipeline("output_folder/")
```

Or published to Hugging Face Hub:

```python
from huggingface_hub import login

login()
embedding_pipe.to_hub("username/name_of_pipeline")

pretrained = PretrainedPipeline("username/name_of_pipeline")
```

## Text Classification

You can include an embedding model in your classification pipelines by adding a classification head after it.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# X holds the raw texts and y the labels (not defined here).
X_train, X_test, y_train, y_test = train_test_split(X, y)

cls_pipe = make_pipeline(pretrained, LogisticRegression())
cls_pipe.fit(X_train, y_train)

y_pred = cls_pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```
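
If you want to try the same pattern without a trained pipeline at hand, the sketch below swaps in a toy stand-in for the embedding step. `ToyEmbedder` is hypothetical and only hashes tokens into a fixed-size count vector; it is not how scikit-embeddings computes embeddings. The point is just that any transformer mapping texts to fixed-size vectors can sit in front of a classifier.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class ToyEmbedder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a pretrained embedding pipeline."""

    def __init__(self, n_components=16):
        self.n_components = n_components

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy example

    def transform(self, X):
        out = np.zeros((len(X), self.n_components))
        for row, text in enumerate(X):
            for token in text.lower().split():
                # Deterministic bucket: sum of code points modulo vector size.
                out[row, sum(map(ord, token)) % self.n_components] += 1.0
        return out


# Tiny made-up dataset, repeated so the classifier has something to fit.
X = ["good film", "great film", "bad film", "awful film"] * 5
y = ["pos", "pos", "neg", "neg"] * 5

cls_pipe = make_pipeline(ToyEmbedder(), LogisticRegression())
cls_pipe.fit(X, y)
predictions = cls_pipe.predict(["great film", "awful film"])
print(predictions)
```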


