Metadata-Version: 2.4
Name: docetl
Version: 0.2.5
Summary: ETL with LLM operations.
Author-email: Shreya Shankar <shreyashankar@berkeley.edu>
License: MIT
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: asteval>=1.0.4
Requires-Dist: boto3>=1.37.27
Requires-Dist: diskcache>=5.6.3
Requires-Dist: frozendict>=2.4.4
Requires-Dist: jsonschema>=4.23.0
Requires-Dist: litellm>=1.75.4
Requires-Dist: lzstring>=1.0.4
Requires-Dist: pandas>=2.3.0
Requires-Dist: pydantic>=2.9.2
Requires-Dist: pyrate-limiter>=3.7.0
Requires-Dist: python-multipart>=0.0.20
Requires-Dist: rapidfuzz>=3.10.0
Requires-Dist: rich>=13.7.1
Requires-Dist: scikit-learn>=1.5.2
Requires-Dist: tqdm>=4.66.4
Requires-Dist: typer>=0.16.0
Requires-Dist: websockets>=13.1
Provides-Extra: parsing
Requires-Dist: azure-ai-documentintelligence>=1.0.0b4; extra == 'parsing'
Requires-Dist: openpyxl>=3.1.5; extra == 'parsing'
Requires-Dist: paddlepaddle>=2.6.2; extra == 'parsing'
Requires-Dist: pydub>=0.25.1; extra == 'parsing'
Requires-Dist: pymupdf>=1.24.10; extra == 'parsing'
Requires-Dist: python-docx>=1.1.2; extra == 'parsing'
Requires-Dist: python-pptx>=1.0.2; extra == 'parsing'
Provides-Extra: server
Requires-Dist: azure-ai-documentintelligence>=1.0.0b4; extra == 'server'
Requires-Dist: azure-ai-formrecognizer>=3.3.3; extra == 'server'
Requires-Dist: docling>=2.5.2; extra == 'server'
Requires-Dist: fastapi>=0.115.4; extra == 'server'
Requires-Dist: httpx>=0.27.2; extra == 'server'
Requires-Dist: uvicorn>=0.31.0; extra == 'server'
Description-Content-Type: text/markdown

# 📜 DocETL: Powering Complex Document Processing Pipelines

[![Website](https://img.shields.io/badge/Website-docetl.org-blue)](https://docetl.org)
[![Documentation](https://img.shields.io/badge/Documentation-docs-green)](https://ucbepic.github.io/docetl)
[![Discord](https://img.shields.io/discord/1285485891095236608?label=Discord&logo=discord)](https://discord.gg/fHp7B2X3xx)
[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2410.12189)

![DocETL Figure](docs/assets/readmefig.png)

DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:

1. An interactive UI playground for iterative prompt engineering and pipeline development
2. A Python package for running production pipelines from the command line or Python code

> 💡 **Need Help Writing Your Pipeline?**  
> Want to use an LLM like ChatGPT or Claude to help you write your pipeline? See [docetl.org/llms.txt](https://docetl.org/llms.txt) for a big prompt you can copy paste into ChatGPT or Claude, before describing your task.


### 🌟 Community Projects

- [Conversation Generator](https://github.com/PassionFruits-net/docetl-conversation)
- [Text-to-speech](https://github.com/PassionFruits-net/docetl-speaker)
- [YouTube Transcript Topic Extraction](https://github.com/rajib76/docetl_examples)

### 📚 Educational Resources

- [UI/UX Thoughts](https://x.com/sh_reya/status/1846235904664273201)
- [Using Gleaning to Improve Output Quality](https://x.com/sh_reya/status/1843354256335876262)
- [Deep Dive on Resolve Operator](https://x.com/sh_reya/status/1840796824636121288)


## 🚀 Getting Started

There are two main ways to use DocETL:

### 1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)

[DocWrangler](https://docetl.org/playground) helps you iteratively develop your pipeline:
- Experiment with different prompts and see results in real-time
- Build your pipeline step by step
- Export your finalized pipeline configuration for production use

![DocWrangler](docs/assets/tutorial/one-operation.png)

DocWrangler is hosted at [docetl.org/playground](https://docetl.org/playground). But to run the playground locally, you can either:
- Use Docker (recommended for quick start): `make docker`
- Set up the development environment manually

See the [Playground Setup Guide](https://ucbepic.github.io/docetl/playground/) for detailed instructions.

### 2. 📦 Python Package (For Production Use)

If you want to use DocETL as a Python package:

#### Prerequisites
- Python 3.10 or later
- OpenAI API key

```bash
pip install docetl
```

Create a `.env` file in your project directory:
```bash
OPENAI_API_KEY=your_api_key_here  # Required for LLM operations (or the key for the LLM of your choice)
```

To see examples of how to use DocETL, check out the [tutorial](https://ucbepic.github.io/docetl/tutorial/).

### 2. 🎮 DocWrangler Setup

To run DocWrangler locally, you have two options:

#### Option A: Using Docker (Recommended for Quick Start)

The easiest way to get the DocWrangler playground running:

1. Create the required environment files:

Create `.env` in the root directory:
```bash
OPENAI_API_KEY=your_api_key_here
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True

# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081

# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
```

Create `.env.local` in the `website` directory:
```bash
OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false
```

2. Run Docker:
```bash
make docker
```

This will:
- Create a Docker volume for persistent data
- Build the DocETL image
- Run the container with the UI accessible at http://localhost:3000

To clean up Docker resources (note that this will delete the Docker volume):
```bash
make docker-clean
```

##### AWS Bedrock

This framework supports integration with AWS Bedrock. To enable:

1. Configure AWS credentials:
```bash
aws configure
```

2. Test your AWS credentials:
```bash
make test-aws
```

3. Run with AWS support:
```bash
AWS_PROFILE=your-profile AWS_REGION=your-region make docker
```

Or using Docker Compose:
```bash
AWS_PROFILE=your-profile AWS_REGION=your-region docker compose --profile aws up
```

Environment variables:
- `AWS_PROFILE`: Your AWS CLI profile (default: 'default')
- `AWS_REGION`: AWS region (default: 'us-west-2')

Bedrock models are pefixed with `bedrock`. See liteLLM [docs](https://docs.litellm.ai/docs/providers/bedrock#supported-aws-bedrock-models) for more details.

#### Option B: Manual Setup (Development)

For development or if you prefer not to use Docker:

1. Clone the repository:
```bash
git clone https://github.com/ucbepic/docetl.git
cd docetl
```

2. Set up environment variables in `.env` in the root/top-level directory:
```bash
OPENAI_API_KEY=your_api_key_here
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True

# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081

# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
```

And create an .env.local file in the `website` directory with the following:
```bash
OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
NEXT_PUBLIC_HOSTED_DOCWRANGLER=false
```

3. Install dependencies:
```bash
make install      # Install Python deps with uv and set up pre-commit
make install-ui   # Install UI dependencies
```

If you prefer using uv directly instead of Make:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --all-groups --all-extras
```

Note that the OpenAI API key, base, and model name are for the UI assistant only; not the DocETL pipeline execution engine.

4. Start the development server:
```bash
make run-ui-dev
```

5. Visit http://localhost:3000/playground to access the interactive UI.

### 🛠️ Development Setup

If you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:

```bash
make tests-basic  # Runs basic test suite (costs < $0.01 with OpenAI)
```

For detailed documentation and tutorials, visit our [documentation](https://ucbepic.github.io/docetl).
