Metadata-Version: 2.1
Name: extralit-server
Version: 0.1.0a0
Summary: Open-source tool for accurate & fast scientific literature data extraction with LLM and human-in-the-loop.
Keywords: literature-review,data-annotation,artificial-intelligence,machine-learning,human-in-the-loop,mlops
Home-page: https://www.argilla.io
Author-Email: Jonny Tran <nhat.c.tran@gmail.com>, argilla <contact@argilla.io>
Maintainer-Email: Jonny Tran <nhat.c.tran@gmail.com>, argilla <contact@argilla.io>
License: Apache-2.0
Project-URL: Homepage, https://www.argilla.io
Project-URL: Documentation, https://docs.argilla.io
Project-URL: Repository, https://github.com/argilla-io/argilla
Requires-Python: <3.11,>=3.9
Requires-Dist: pydantic
Requires-Dist: rich!=13.1.0
Requires-Dist: tqdm
Requires-Dist: python-dotenv
Requires-Dist: fastapi<1.0.0,>=0.103.1; extra == "argilla-server"
Requires-Dist: pydantic<2.0,>=1.10.7; extra == "argilla-server"
Requires-Dist: uvicorn[standard]<0.25.0,>=0.15.0; extra == "argilla-server"
Requires-Dist: opensearch-py~=2.0.0; extra == "argilla-server"
Requires-Dist: elasticsearch8[async]~=8.7.0; extra == "argilla-server"
Requires-Dist: smart-open; extra == "argilla-server"
Requires-Dist: brotli-asgi<1.3,>=1.1; extra == "argilla-server"
Requires-Dist: alembic~=1.9.0; extra == "argilla-server"
Requires-Dist: SQLAlchemy~=2.0.0; extra == "argilla-server"
Requires-Dist: greenlet>=2.0.0; extra == "argilla-server"
Requires-Dist: aiosqlite>=0.19.0; extra == "argilla-server"
Requires-Dist: scikit-learn>=0.24.2; extra == "argilla-server"
Requires-Dist: aiofiles<22.2,>=0.6; extra == "argilla-server"
Requires-Dist: PyYAML<6.1.0,>=5.4.1; extra == "argilla-server"
Requires-Dist: python-multipart~=0.0.5; extra == "argilla-server"
Requires-Dist: python-jose[cryptography]<3.4,>=3.2; extra == "argilla-server"
Requires-Dist: passlib[bcrypt]~=1.7.4; extra == "argilla-server"
Requires-Dist: httpx~=0.26.0; extra == "argilla-server"
Requires-Dist: oauthlib~=3.2.0; extra == "argilla-server"
Requires-Dist: social-auth-core~=4.5.0; extra == "argilla-server"
Requires-Dist: psutil<5.10,>=5.8; extra == "argilla-server"
Requires-Dist: segment-analytics-python==2.2.0; extra == "argilla-server"
Requires-Dist: rich!=13.1.0; extra == "argilla-server"
Requires-Dist: typer<0.10.0,>=0.6.0; extra == "argilla-server"
Requires-Dist: packaging>=23.2; extra == "argilla-server"
Requires-Dist: minio>=7.2.7; extra == "argilla-server"
Requires-Dist: psycopg2~=2.9.5; extra == "postgresql"
Requires-Dist: asyncpg>=0.27.0; extra == "postgresql"
Requires-Dist: minio; extra == "extraction"
Requires-Dist: argilla; extra == "extraction"
Requires-Dist: hypothesis; extra == "extraction"
Requires-Dist: html5lib; extra == "extraction"
Requires-Dist: fastapi<1.0.0; extra == "extraction"
Requires-Dist: pydantic; extra == "extraction"
Requires-Dist: pypandoc~=1.13; extra == "extraction"
Requires-Dist: weaviate-client~=4.5.7; extra == "extraction"
Requires-Dist: beautifulsoup4~=4.12.2; extra == "extraction"
Requires-Dist: pandas~=2.2.2; extra == "extraction"
Requires-Dist: pandera[io]~=0.19.3; extra == "extraction"
Requires-Dist: numpy~=1.26.4; extra == "extraction"
Requires-Dist: spacy~=3.7.2; extra == "extraction"
Requires-Dist: pyarrow==14.*; extra == "extraction"
Requires-Dist: natsort~=8.4.0; extra == "extraction"
Requires-Dist: rapidfuzz~=3.8.1; extra == "extraction"
Requires-Dist: dill~=0.3.8; extra == "extraction"
Requires-Dist: json-repair~=0.19.2; extra == "extraction"
Requires-Dist: fastparquet; extra == "extraction"
Requires-Dist: textdescriptives; extra == "nlp"
Requires-Dist: setfit~=0.7.0; extra == "nlp"
Requires-Dist: nougat-ocr[api]; extra == "ocr"
Requires-Dist: timm==0.5.4; extra == "ocr"
Requires-Dist: unstructured[pdf]~=0.12.3; extra == "pdf"
Requires-Dist: deepdoctection~=0.31.0; extra == "pdf"
Requires-Dist: llmsherpa~=0.1.3; extra == "pdf"
Requires-Dist: python-doctr~=0.8.1; extra == "pdf"
Requires-Dist: pypdf; extra == "pdf"
Requires-Dist: pypdfium2; extra == "pdf"
Requires-Dist: pdf2image~=1.16.0; extra == "pdf"
Requires-Dist: llama-index~=0.10.40; extra == "llm"
Requires-Dist: llama-index-core~=0.10.40; extra == "llm"
Requires-Dist: llama-index-vector-stores-weaviate~=1.0.0; extra == "llm"
Requires-Dist: llama-index-callbacks-langfuse~=0.1.4; extra == "llm"
Requires-Dist: llama-index-llms-openai; extra == "llm"
Requires-Dist: llama-index-embeddings-openai; extra == "llm"
Requires-Dist: llama-index-multi-modal-llms-openai; extra == "llm"
Provides-Extra: argilla-server
Provides-Extra: postgresql
Provides-Extra: extraction
Provides-Extra: nlp
Provides-Extra: ocr
Provides-Extra: pdf
Provides-Extra: llm
Description-Content-Type: text/markdown


<h1 align="center">
  <a href=""><img src="https://github.com/dvsrepo/imgs/raw/main/rg.svg" alt="Argilla" width="150"></a>
  <br>
  Extralit
  <br>
</h1>

<h2 align="center">Open-source feedback layer for LLM-assisted data extractions</h2>

<h3>
<p align="center">
<a href="https://docs.argilla.io">📄 Documentation</a> <span> | </span>
<a href="#-development-quickstart">🚀 Quickstart</a> <span> | </span>
<a href="#-project-architecture">🛠️ Architecture</a>
</p>
</h3>

## What is Extralit?

Extralit is a UI and platform for LLM-based document data extraction. It integrates human and model feedback loops for continuous LLM refinement and oversight of extracted data.

With a Python SDK and flexible UI, you can create human and model-in-the-loop workflows for:

* Data extraction validation
* Supervised fine-tuning
* Preference tuning (RLHF, DPO, RLAIF, and more)
* Small, specialized NLP models
* Scalable evaluation

## 🚀 Development Quickstart

### Install the Pre-requisites
These steps are required to run and develop Argilla locally.

1. Install [Docker Desktop](https://docs.docker.com/get-docker/)
2. Install [kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation)
3. Install [ctlptl](https://github.com/tilt-dev/ctlptl/tree/main#how-do-i-install-it)
4. Install [Tilt](https://docs.tilt.dev/)
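
Before moving on, it can help to confirm that the tools from the list above are actually on your `PATH`. The snippet below is a minimal sketch, not part of the project's scripts: `missing_tools` is a hypothetical helper name, and `kubectl` is added to the list only because later steps in this guide call it.

```bash
# Print each tool from the argument list that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools from the prerequisites list, plus kubectl (used in later steps).
missing="$(missing_tools docker kind ctlptl tilt kubectl)"
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names print on a single line.
  echo "Missing prerequisites:" $missing
else
  echo "All prerequisites found."
fi
```

The check only looks for executables by name; it does not verify versions or that Docker is running.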

### Set up local infrastructure for Kind

1. Create a `kind` cluster

```bash
ctlptl create registry ctlptl-registry --port=5005
ctlptl create cluster kind --registry=ctlptl-registry
```


2. Apply config to mount local directory

```bash
ctlptl apply -f k8s/kind/kind-config.yaml
kubectl taint node kind-control-plane node-role.kubernetes.io/control-plane:NoSchedule-
```

### Start local development

1. Run Tilt 

Select the K8s cluster context:
```bash
kubectl config use-context <context_name>
```

Setting the `ENV` variable to `dev` enables hot-reloading of Docker containers for 🚀 rapid deployment:
```bash
kubectl create ns <namespace>
ENV=dev tilt up --namespace=<namespace>
```

### Start staging/prod K8s deployment

```bash
ENV=dev DOCKER_REPO=<remote docker repository> tilt up --namespace <namespace> --context <K8s cluster context>
```

## 🛠️ Developer guide

### Editing database schema
Editing the database schema files at `src/argilla/server/models/*.py` requires running the following commands to apply revisions to the database.

1. Create revision
```bash
cd src/argilla
alembic revision -m "<message>"
```

If you run into errors caused by revisions from the upstream argilla-io/argilla repo, set the down-revision of your new revision file in `src/argilla/server/alembic/versions` to the latest upstream revision, `"7552df94427a"`.

2. Apply the revision
```bash
# Be sure to set environment variables ARGILLA_ELASTICSEARCH and ARGILLA_DATABASE_URL
python -m argilla server database migrate
```

3. Rebuild the frontend against the updated API backend and package it

```bash
bash scripts/build_frontend.sh
python setup.py bdist_wheel
```
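
The revision steps above can be guarded with two small checks. This is a sketch under stated assumptions, not part of the project's tooling: `down_revision_of` and `require_env` are hypothetical helper names, the first assumes a standard Alembic revision file containing a `down_revision = "..."` line, and the second only sees variables that have been exported.

```bash
# Print the down_revision value recorded in an Alembic revision file.
# Assumes a line such as: down_revision = "7552df94427a"
down_revision_of() {
  grep -E '^down_revision' "$1" | head -n 1 |
    sed -E "s/.*=[[:space:]]*//; s/[\"']//g"
}

# Name every exported variable from the argument list that is unset or empty.
# Returns non-zero if at least one is missing.
require_env() {
  local name status=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `down_revision_of <path to your revision file>` shows which parent revision the file points at, and `require_env ARGILLA_ELASTICSEARCH ARGILLA_DATABASE_URL && python -m argilla server database migrate` runs the migration only when both variables are set.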

## 🛠️ Project Architecture

Argilla is built on 5 core components:

- **Python SDK**: A Python SDK, installable with `pip install argilla`, for interacting with the Argilla Server and the Argilla UI. It provides an API to manage the data, configuration, and annotation workflows.
- **FastAPI Server**: The core of Argilla is a *Python FastAPI* server that manages the data, by pre-processing it and storing it in the vector database. Also, it stores application information in the relational database. It provides a REST API to interact with the data from the Python SDK and the Argilla UI. It also provides a web interface to visualize the data.
- **Relational Database**: A relational database to store the metadata of the records and the annotations. *SQLite* is the default built-in option; a separate *PostgreSQL* instance can be used instead.
- **Vector Database**: A vector database to store the records data and perform scalable vector similarity searches and basic document searches. We currently support *ElasticSearch* and *AWS OpenSearch* and they can be deployed as separate Docker images.
- **Vue.js UI**: A web application to visualize and annotate your data, users and teams. It is built with *Vue.js* and is directly deployed alongside the Argilla Server within our Argilla Docker image.



<p align="center">
<a  href="https://pypi.org/project/argilla-server/">
<img alt="PyPI version" src="https://img.shields.io/pypi/v/argilla.svg?style=flat-round&logo=pypi&logoColor=white">
</a>
<img alt="Codecov" src="https://codecov.io/gh/argilla-io/argilla-server/branch/main/graph/badge.svg?token=VDVR29VOMG"/>
<a href="https://pepy.tech/project/argilla-server">
<img alt="PyPI downloads per month" src="https://static.pepy.tech/personalized-badge/argilla-server?period=month&units=international_system&left_color=grey&right_color=blue&left_text=pypi%20downloads/month">
</a>
<a href="https://huggingface.co/new-space?template=argilla/argilla-template-space">
<img src="https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-sm.svg"/>
</a>
</p>

<p align="center">
<a href="https://twitter.com/argilla_io">
<img src="https://img.shields.io/badge/twitter-black?logo=x"/>
</a>
<a href="https://www.linkedin.com/company/argilla-io">
<img src="https://img.shields.io/badge/linkedin-blue?logo=linkedin"/>
</a>
<a href="https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g">
<img src="https://img.shields.io/badge/slack-purple?logo=slack"/>
</a>
</p>


## Clone repository

`argilla-server` uses the `argilla` repository as a submodule to build the frontend static files, so clone it with the following command:

```sh
git clone --recurse-submodules git@github.com:argilla-io/argilla-server.git
```

If you already cloned the repository without `--recurse-submodules`, you can initialize and update the submodules with:

```sh
git submodule update --remote --recursive --init
```

> [!IMPORTANT]
> By default the `argilla` submodule tracks the `develop` branch, so the previous command fetches the latest commit from that branch.
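
To confirm that the submodules were actually initialized, you can filter the output of `git submodule status`, which prefixes uninitialized submodules with `-`. The helper below is a small sketch; the name `uninitialized_submodules` is hypothetical and not part of this repository.

```bash
# Read `git submodule status` output on stdin and print the paths of
# submodules that are not initialized (status line starts with "-").
uninitialized_submodules() {
  awk '/^-/ { print $2 }'
}
```

Run it as `git submodule status --recursive | uninitialized_submodules`; empty output means every submodule has been initialized.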

### Specify a tag for argilla submodule

When doing a release, we should pin the `argilla` submodule to a specific tag. The following example sets tag `v1.22.0`:

```sh
cd argilla
git fetch --tags
git checkout v1.22.0
```

> [!NOTE]
> You should see changes in the `argilla-server` root folder, where the subproject commit now points to the commit of the tagged version. Feel free to commit these changes.

## Development environment

By default, all commands executed with `pdm run` read environment variables from `.env.dev`, except `pdm test`, which overrides some of them with values from the `.env.test` file.

These environment variables can be overridden if necessary, so feel free to define your own locally.
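
To see which variables are candidates for overriding, you can list the keys defined in a dotenv file. This is a sketch for inspection only: `env_file_keys` is a hypothetical helper, and it assumes plain `KEY=value` lines, optionally prefixed with `export`. Whether a value exported in your shell takes precedence over the file depends on how the project's PDM scripts are configured.

```bash
# Print the variable names defined in a dotenv-style file,
# skipping comments and blank lines.
env_file_keys() {
  grep -E '^[[:space:]]*(export[[:space:]]+)?[A-Za-z_][A-Za-z0-9_]*=' "$1" |
    sed -E 's/^[[:space:]]*(export[[:space:]]+)?//; s/=.*//'
}
```

For example, `env_file_keys .env.dev` prints one variable name per line.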

### Run cli

```sh
pdm cli
```

### Run database migrations

By default, a SQLite database located at `~/.argilla/argilla.db` will be used. You can create the database and run migrations with the following custom PDM command:

```sh
pdm migrate
```

### Run tests

A SQLite database located at `~/.argilla/argilla-test.db` will be automatically created to run tests. You can run the entire test suite using the following custom PDM command:

```sh
pdm test
```

## Run development server

### Build frontend static files

Before running the Argilla development server, we need to build the frontend static files. Node version 18 is required for this step:

```sh
brew install node@18
```

After that you can build the frontend static files:

```sh
./scripts/build_frontend.sh
```

After running the previous script you should have a folder at `src/argilla_server/static` with all the frontend static files successfully generated.
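
A quick way to confirm that the build produced output is to check that the folder exists and is not empty. The helper name `has_static_files` is hypothetical, and the check does not validate the contents of the build.

```bash
# Succeed only if the directory exists and contains at least one entry.
has_static_files() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}
```

For example, `has_static_files src/argilla_server/static && pdm server` starts the development server only when the static files are in place.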

### Run uvicorn development server

```sh
pdm server
```
