Metadata-Version: 2.4
Name: data_element_extractor
Version: 0.1.0
Summary: An adaptable Python framework for data extraction. It manages and extracts data from text across multiple topics using Large Language Models (LLMs).
Author-email: Fabio Dennstädt <fabiodennstaedt@gmx.de>
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# Data Element Extractor

An adaptable Python framework for extracting structured data from unstructured text using Large Language Models (LLMs). This library provides a flexible system for managing and extracting data across multiple topics, with support for categorical classification, number extraction, date parsing, and custom text extraction.

## Features

- **Multiple LLM Backends**: Support for Hugging Face Transformers models, OpenAI API, DeepInfra, and custom inference servers
- **Topic Management**: Define and manage multiple extraction topics with custom prompts and categories
- **Conditional Extraction**: Extract data elements conditionally based on previous extraction results
- **Prompt Optimization**: Iterative prompt improvement and performance evaluation tools
- **Flexible Data Types**: Support for categorical (value list), number, date, and text extraction
- **Batch Processing**: Extract data from CSV files with batch processing capabilities
- **Server Integration**: Load data elements and lists from remote CDE (Common Data Element) servers
- **Graphical UI**: Built-in Tkinter-based user interface for interactive data extraction
- **Prompt Generation**: Automatic prompt creation and few-shot learning support

## Installation

Install the package using pip:

```bash
pip install data-element-extractor
```

### Dependencies

The package requires:
- `torch` - PyTorch for local model inference
- `transformers` - Hugging Face Transformers library
- `openai` - OpenAI Python client
- `dateparser` - Date parsing utilities
- `requests` - HTTP client for server communication
- `tk` - Tkinter for the UI (usually included with Python)

## Quick Start

### Basic Usage

```python
from data_element_extractor import DataElementExtractor

# Initialize the extractor
extractor = DataElementExtractor()

# Set up your LLM model
# Option 1: Use a local Transformers model
extractor.set_model("microsoft/Phi-3-mini-4k-instruct", model_type="Transformers")

# Option 2: Use OpenAI API
# extractor.set_model("gpt-3.5-turbo", model_type="OpenAI", api_key="your-api-key")

# Add a categorical topic (classification)
topic_id = extractor.add_topic(
    topic_name="Sentiment",
    topic_data=["Positive", "Negative", "Neutral"]
)

# Add a number extraction topic
topic_id_2 = extractor.add_topic(
    topic_name="Age",
    topic_data="number"
)

# Extract data from text
text = "The customer was very happy with the service. They are 25 years old."
results, probabilities = extractor.extract(text)

print(f"Sentiment: {results[0]} (confidence: {probabilities[0]:.2%})")
print(f"Age: {results[1]} (confidence: {probabilities[1]:.2%})")
```

### Working with CSV Files

```python
# Extract data from a CSV file
results = extractor.extract_from_table(
    csv_file_path="data.csv",
    delimiter=";",
    batch_size=100,
    constrained_output=True
)
```

## Core Concepts

### Topics

A **topic** defines what data you want to extract from text. Each topic has:
- **Name**: A descriptive name for the data element (e.g., "Sentiment", "Age")
- **Data Type**: The type of extraction:
  - List of categories (categorical/classification)
  - `"number"` - Extract numeric values
  - `"date"` - Extract dates
  - `"text"` - Extract free-form text
- **Prompt**: Custom instruction template for the LLM
- **Condition**: Optional condition to control when extraction runs

### Model Configuration

The library supports multiple inference backends:

#### Local Transformers Models
```python
extractor.set_model(
    model_name="microsoft/Phi-3-mini-4k-instruct",
    model_type="Transformers",
    inference_type="transformers",
    attn_implementation="flash_attention_2",
    move_to_gpu=True
)
```

#### OpenAI API
```python
extractor.set_model(
    model_name="gpt-3.5-turbo",
    model_type="OpenAI",
    api_key="your-api-key",
    inference_type="cloud"
)
```

#### DeepInfra
```python
extractor.set_model(
    model_name="meta-llama/Llama-3-70b-chat-hf",
    model_type="DeepInfra",
    api_key="your-api-key",
    inference_type="cloud"
)
```

#### Custom Inference Server
```python
extractor.set_inference_server_url("http://127.0.0.1:5000")
extractor.set_model(
    model_name="your-model-name",
    model_type="Transformers",
    inference_type="server"
)
```

### Conditional Extraction

You can make topic extraction conditional on previous results:

```python
# First topic extracts a category
topic1_id = extractor.add_topic(
    topic_name="Document Type",
    topic_data=["Contract", "Invoice", "Receipt"]
)

# Second topic only extracts if first topic was "Contract"
topic2_id = extractor.add_topic(
    topic_name="Contract Date",
    topic_data="date",
    condition="T1 == 'Contract'"
)
```

Conditions reference topic IDs (e.g., `T1`, `T2`) and support:
- Category matches: `T1 == 'CategoryName'`
- Non-empty checks: `T1 != ''`
- Compound expressions using `and`, `or`, `not`
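Conceptually, a condition string is evaluated as a boolean expression in which each topic ID is bound to that topic's extracted value. The sketch below illustrates this behavior with a hypothetical `evaluate_condition` helper (not part of the library's API; the internal mechanism may differ):

```python
# Conceptual sketch of condition evaluation, not the library's actual
# implementation: each topic ID (T1, T2, ...) is exposed as a variable
# holding that topic's extracted value, then the condition string is
# evaluated as a boolean expression.
def evaluate_condition(condition, results):
    """Evaluate a condition string like "T1 == 'Contract'" against results."""
    namespace = dict(results)  # {"T1": "Contract", "T2": "", ...}
    return bool(eval(condition, {"__builtins__": {}}, namespace))

results = {"T1": "Contract", "T2": ""}
evaluate_condition("T1 == 'Contract'", results)               # category match
evaluate_condition("T2 != ''", results)                       # non-empty check
evaluate_condition("T1 == 'Contract' and T2 == ''", results)  # compound expression
```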

### Prompt Customization

Customize prompts for each topic:

```python
topic_id = extractor.add_topic(
    topic_name="Medical Condition",
    topic_data=["Diabetes", "Hypertension", "Asthma"],
    prompt="You are a medical expert. Classify the following medical text into one of the categories: [CATEGORIES]. Text: [TEXT]. Category:"
)
```

The library automatically replaces:
- `[TOPIC]` - The topic name
- `[TEXT]` - The input text
- `[CATEGORIES]` - The list of categories (for categorical topics)
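The substitution behaves like plain string replacement. The following sketch, with a hypothetical `fill_prompt` helper (not the library's internal code), shows what the rendered prompt looks like:

```python
# Sketch of the placeholder substitution, assuming simple string replacement
# (the library's internal mechanism may differ).
def fill_prompt(template, topic_name, text, categories=None):
    prompt = template.replace("[TOPIC]", topic_name).replace("[TEXT]", text)
    if categories is not None:
        prompt = prompt.replace("[CATEGORIES]", ", ".join(categories))
    return prompt

template = "Classify the text into one of: [CATEGORIES]. Text: [TEXT]. Category:"
fill_prompt(template, "Sentiment", "Great service!", ["Positive", "Negative"])
# → "Classify the text into one of: Positive, Negative. Text: Great service!. Category:"
```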

## Advanced Features

### Prompt Optimization

Evaluate and improve prompt performance:

```python
# Evaluate current prompt performance
performance = extractor.evaluate_prompt_performance_for_topic(
    topic_id="T1",
    dataset_path="evaluation_data.csv",
    truth_col=1,
    text_col=0,
    delimiter=";"
)

# Iteratively improve a prompt
extractor.iteratively_improve_prompt(
    topic_id="T1",
    dataset_path="training_data.csv",
    text_column_index=0,
    ground_truth_column_index=1,
    num_iterations=3,
    delimiter=";"
)
```

### Few-Shot Learning

Generate few-shot prompts automatically:

```python
# Create few-shot prompt for a single topic
extractor.create_few_shot_prompt(
    topic_id="T1",
    csv_path="examples.csv",
    text_col_idx=0,
    label_col_idx=1,
    delimiter=";",
    num_examples=3
)

# Create few-shot prompts for all topics
extractor.create_few_shot_prompts_for_all_topics(
    csv_path="examples.csv",
    delimiter=";",
    num_examples=3
)
```

### Server Integration

Load data elements from remote servers:

```python
# Get all available CDEs (Common Data Elements)
all_cdes = extractor.get_all_cdes_from_server()

# Get CDE lists
cde_lists = extractor.get_cde_lists_from_server()

# Load a data element list
topics = extractor.load_data_element_list_from_server(cde_list_id="list-123")

# Load a single data element
topics = extractor.load_data_element_from_server(cde_id="cde-456")
```

### Topic Management

```python
# Get topic information
topic = extractor.get_topic_by_id("T1")
topic_id = extractor.get_topic_id_by_name("Sentiment")

# Modify topics
extractor.set_prompt(topic_id="T1", new_prompt="New prompt text")
extractor.increase_topic_order("T1")
extractor.decrease_topic_order("T1")

# Category management
extractor.add_category(topic_id="T1", category_name="New Category")
extractor.remove_category(topic_id="T1", category_id="cat-uuid")

# Category conditions
extractor.add_category_condition(topic_id="T1", category_id="cat-uuid", condition_str="T2 > 10")

# Save and load topics
extractor.save_topics("topics.json")
extractor.load_topics("topics.json")

# Display all topics
extractor.show_topics_and_categories()
```

### Thinking/Chain-of-Thought

Configure chain-of-thought reasoning for improved accuracy:

```python
# Set global thinking config
extractor.thinking_config = {
    "enabled": True,
    "temperature": 0.7,
    "max_length": 500
}

# Or configure per topic when adding
topic_id = extractor.add_topic(
    topic_name="Complex Classification",
    topic_data=["Category A", "Category B"],
    thinking_config={
        "enabled": True,
        "temperature": 0.5
    }
)
```

## Configuration

```python
# Configure choice symbols for categorical output
# Options: "none", "alphabetical", "numerical", or a custom comma-separated list such as "A,B,C,D"
extractor.set_choice_symbol_config("alphabetical")

# Set inference server URL
extractor.set_inference_server_url("http://127.0.0.1:5000")
```
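The choice-symbol setting controls how categories are labelled when rendered into a prompt. The sketch below illustrates the idea with a hypothetical `label_categories` helper; the library's exact prompt formatting may differ:

```python
import string

# Hypothetical illustration of the choice-symbol options; not the
# library's actual formatting code.
def label_categories(categories, config):
    if config == "none":
        return list(categories)
    if config == "alphabetical":
        symbols = string.ascii_uppercase            # A, B, C, ...
    elif config == "numerical":
        symbols = [str(i) for i in range(1, len(categories) + 1)]
    else:                                           # custom list, e.g. "A,B,C,D"
        symbols = config.split(",")
    return [f"{s}) {c}" for s, c in zip(symbols, categories)]

label_categories(["Positive", "Negative"], "alphabetical")
# → ['A) Positive', 'B) Negative']
```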

## User Interface

The library includes a graphical user interface built with Tkinter:

```python
from data_element_extractor.ui.main_app import ExtractorApp
import tkinter as tk

root = tk.Tk()
app = ExtractorApp(root)
root.mainloop()
```

Or use the UI module directly:

```python
from data_element_extractor import ui
# Launch UI (if available)
```

The UI provides:
- Model configuration management
- Topic creation and editing
- Interactive extraction interface
- Prompt editing and optimization
- CSV file processing

## API Reference

### Main Class

#### `DataElementExtractor()`

Main class for data extraction.

**Model Management:**
- `set_model(model_name, model_type="Transformers", api_key="", inference_type="transformers", ...)` - Configure the main extraction model
- `set_prompt_model(model_name, model_type="OpenAI", ...)` - Configure model for prompt generation
- `set_model_as_prompt_model()` - Use main model for prompt generation

**Topic Management:**
- `add_topic(topic_name, topic_data, condition="", prompt="", thinking_config={})` - Add a new extraction topic
- `get_topic_by_id(topic_id)` - Get topic by ID
- `get_topic_id_by_name(topic_name)` - Get topic ID by name
- `update_topics(topics)` - Update all topics
- `remove_topic(topic_id_str)` - Remove a topic
- `save_topics(filename)` - Save topics to file
- `load_topics(filename)` - Load topics from file
- `show_topics_and_categories()` - Display all topics

**Extraction:**
- `extract(text, is_single_extraction=True, constrained_output=True, with_evaluation=False, ground_truth_row=None)` - Extract data from text
- `extract_element(topic_id, text, constrained_output=False, thinking_data=None)` - Extract a single element
- `extract_from_table(csv_file_path, delimiter=';', batch_size=100, ...)` - Extract from CSV file

**Prompt Optimization:**
- `evaluate_prompt_performance_for_topic(topic_id, truth_col, dataset_path, ...)` - Evaluate prompt performance
- `iteratively_improve_prompt(topic_id, dataset_path, ...)` - Improve prompt iteratively
- `create_few_shot_prompt(topic_id, csv_path, ...)` - Generate few-shot prompt

## Examples

### Example 1: Document Classification

```python
from data_element_extractor import DataElementExtractor

extractor = DataElementExtractor()
extractor.set_model("microsoft/Phi-3-mini-4k-instruct", model_type="Transformers")

# Classify document type
extractor.add_topic(
    topic_name="Document Type",
    topic_data=["Invoice", "Receipt", "Contract", "Letter"]
)

# Extract date (conditional)
extractor.add_topic(
    topic_name="Document Date",
    topic_data="date",
    condition="T1 != ''"
)

text = "Invoice dated March 15, 2024 for services rendered."
results, probs = extractor.extract(text)
print(f"Type: {results[0]}, Date: {results[1]}")
```

### Example 2: Medical Data Extraction

```python
extractor = DataElementExtractor()
extractor.set_model("gpt-3.5-turbo", model_type="OpenAI", api_key="your-key")

# Extract diagnosis
extractor.add_topic(
    topic_name="Diagnosis",
    topic_data=["Diabetes", "Hypertension", "Asthma", "Other"]
)

# Extract age (number)
extractor.add_topic(
    topic_name="Patient Age",
    topic_data="number"
)

# Extract date of diagnosis
extractor.add_topic(
    topic_name="Diagnosis Date",
    topic_data="date",
    condition="T1 != 'Other'"
)

medical_text = "Patient is 45 years old, diagnosed with Diabetes on 2023-01-15."
results, _ = extractor.extract(medical_text)
```

## Requirements

- Python >= 3.7
- PyTorch (for local models)
- transformers
- openai
- dateparser
- requests

## License

MIT License

## Author

Fabio Dennstädt (fabiodennstaedt@gmx.de)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Support

For issues, questions, or contributions, please open an issue on the project repository.
