Metadata-Version: 2.1
Name: daluke
Version: 0.0.4
Summary: A Danish-speaking language model with entity-aware self-attention
Home-page: https://github.com/peleiden/daLUKE
Author: Søren Winkel Holm, Asger Laurits Schultz
Author-email: s18911@dtu.dk, s183912@dtu.dk
License: MIT License
Download-URL: https://pypi.org/project/daluke/
Keywords: nlp,ai,pytorch,ner
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: matplotlib
Requires-Dist: seqeval
Requires-Dist: torch (>=1.8.0)
Requires-Dist: transformers (>=4.3.3)
Requires-Dist: pelutils[ds] (==0.6.7)
Requires-Dist: click (>=7.0.0)
Requires-Dist: tqdm
Requires-Dist: wget
Provides-Extra: full
Requires-Dist: danlp ; extra == 'full'
Requires-Dist: pyicu ; extra == 'full'
Requires-Dist: wikipedia2vec ; extra == 'full'
Requires-Dist: umap-learn ; extra == 'full'
Requires-Dist: scikit-learn ; extra == 'full'
Requires-Dist: scipy ; extra == 'full'

# DaLUKE: The Entity-aware, Danish Language Model

<img src="https://raw.githubusercontent.com/peleiden/daluke/master/daluke-mascot.png" align="right"/>

[![pytest](https://github.com/peleiden/daLUKE/actions/workflows/pytest.yml/badge.svg?branch=master)](https://github.com/peleiden/daLUKE/actions/workflows/pytest.yml)

Implementation of the knowledge-enhanced transformer [LUKE](https://github.com/studio-ousia/luke) pretrained on the Danish Wikipedia and evaluated on named entity recognition (NER).

## Installation

```bash
pip install daluke
```
To include the optional dependencies required for training and general analysis:
```bash
pip install daluke[full]
```
(In shells such as zsh, quote the argument: `pip install 'daluke[full]'`.)
Python 3.8 or newer is required.

## Explanation
For an explanation of the model, see our [bachelor's thesis](https://peleiden.github.io/bug-free-guacamole/main.pdf) or the original [LUKE paper](https://www.aclweb.org/anthology/2020.emnlp-main.523/).

## Usage
### Inference on simple NER or masked language modeling (MLM) examples

#### Python
To perform NER predictions:
```py
from daluke import AutoNERDaLUKE, predict_ner

daluke = AutoNERDaLUKE()

document = "Det Kgl. Bibliotek forvalter Danmarks største tekstsamling, der strækker sig fra middelalderen til den nyeste litteratur."
iob_list = predict_ner(document, daluke)
```
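The returned `iob_list` contains IOB tags for the document's entities. Grouping such tags into entity spans can be done with a small helper like the one below. This is a sketch, not part of the daluke API: it assumes standard IOB2 tags (`B-PER`, `I-PER`, `O`, ...), and the token/tag pairing shown is illustrative.

```py
def iob_to_spans(tokens, tags):
    """Group IOB2 tags into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any open span
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # Continuation of the current entity
            current[1].append(token)
        else:
            # Outside any entity (or an inconsistent tag): close the span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = "Thomas Delaney spiller for Danmark".split()
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(iob_to_spans(tokens, tags))  # [('PER', 'Thomas Delaney'), ('LOC', 'Danmark')]
```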

To test MLM predictions:
```py
from daluke import AutoMLMDaLUKE, predict_mlm

daluke = AutoMLMDaLUKE()
# Empty list => No entity annotations in the string
document = "Professor i astrofysik, [MASK] [MASK], udtaler til avisen, at den nye måling sandsynligvis ikke er en fejl."
best_prediction, table = predict_mlm(document, list(), daluke)
```
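A document may contain several `[MASK]` tokens, as above. Substituting predicted tokens back into the masked string can be sketched with a small helper like this (not part of the daluke API; it assumes predictions are given in reading order, and the example fill-ins are illustrative):

```py
def fill_masks(document, predictions, mask_token="[MASK]"):
    """Replace each mask token, left to right, with the corresponding prediction."""
    for pred in predictions:
        document = document.replace(mask_token, pred, 1)
    return document

masked = "Professor i astrofysik, [MASK] [MASK], udtaler til avisen, at den nye måling sandsynligvis ikke er en fejl."
print(fill_masks(masked, ["Anja", "Andersen"]))
```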

#### CLI
```bash
daluke ner --text "Thomas Delaney fører Danmark til sejr ved EM i fodbold."
daluke masked --text "Slutresultatet af kampen mellem Danmark og Rusland bliver [MASK]-[MASK]."
```
For Windows, or systems where `#!/usr/bin/env python3` is not linked to the correct Python interpreter, the command `python -m daluke.api.cli` can be used instead of `daluke`.

### Training DaLUKE yourself

This section shows how to recreate the entire DaLUKE training pipeline, from dataset preparation to fine-tuning.
This guide is designed to be run in a bash shell.
If you use Windows, you will probably have to make some modifications to the shell scripts used.

```bash
# Download forked luke submodule
git submodule update --init --recursive
# Install requirements
pip install -r requirements.txt
pip install -r optional-requirements.txt
pip install -r luke/requirements.txt

# Build dataset
# The script performs all the steps of building the dataset, including downloading the Danish Wikipedia
# You only need to modify DATA_PATH to where you want the data to be saved
# Be aware that this takes several hours
dev/build_data.sh

# Start pretraining using default hyperparameters
python daluke/pretrain/run.py <INSERT DATA_PATH HERE> -c configs/pretrain-main.ini --name daluke --save-every 5 --epochs 150 --fp16
# Optional: Make plots of pretraining
python daluke/plot/plot_pretraining.py <DATA_PATH>/daluke

# Fine-tune on DaNE
python daluke/collect_modelfile.py <DATA_PATH>/daluke <DATA_PATH>/ner/daluke.tar.gz
python daluke/ner/run.py <DATA_PATH>/ner/daluke -c configs/main-finetune.ini --model <DATA_PATH>/ner/daluke.tar.gz --name finetune --eval
# Evaluate on DaNE test set
python daluke/ner/run_eval.py <DATA_PATH>/ner/daluke/finetune --model <DATA_PATH>/ner/daluke/finetune/daluke_ner_best.tar.gz
# Optional: Fine-tuning plots
python daluke/plot/plot_finetune_ner.py <DATA_PATH>/ner/daluke/finetune/train-results
```


