Metadata-Version: 2.4
Name: preempt
Version: 0.1.8
Summary: Sanitize text containing PII attributes
Requires-Python: >=3.11
Requires-Dist: aiohappyeyeballs==2.6.1
Requires-Dist: aiohttp==3.11.16
Requires-Dist: aiosignal==1.3.2
Requires-Dist: annotated-types==0.7.0
Requires-Dist: anyio==4.9.0
Requires-Dist: attrs==25.3.0
Requires-Dist: certifi==2025.1.31
Requires-Dist: charset-normalizer==3.4.1
Requires-Dist: click==8.1.8
Requires-Dist: distro==1.9.0
Requires-Dist: fastapi==0.115.12
Requires-Dist: fastchat==0.1.0
Requires-Dist: filelock==3.18.0
Requires-Dist: frozenlist==1.5.0
Requires-Dist: fschat==0.2.36
Requires-Dist: fsspec==2025.3.2
Requires-Dist: h11==0.14.0
Requires-Dist: httpcore==1.0.7
Requires-Dist: httpx==0.28.1
Requires-Dist: huggingface-hub==0.30.1
Requires-Dist: idna==3.10
Requires-Dist: jinja2==3.1.6
Requires-Dist: jiter==0.9.0
Requires-Dist: latex2mathml==3.77.0
Requires-Dist: markdown-it-py==3.0.0
Requires-Dist: markdown2==2.5.3
Requires-Dist: markupsafe==3.0.2
Requires-Dist: mdurl==0.1.2
Requires-Dist: mpmath==1.3.0
Requires-Dist: multidict==6.4.3
Requires-Dist: names-dataset==3.1.0
Requires-Dist: names==0.3.0
Requires-Dist: networkx==3.4.2
Requires-Dist: nh3==0.2.21
Requires-Dist: numpy==2.2.4
Requires-Dist: nvidia-cublas-cu12==12.4.5.8
Requires-Dist: nvidia-cuda-cupti-cu12==12.4.127
Requires-Dist: nvidia-cuda-nvrtc-cu12==12.4.127
Requires-Dist: nvidia-cuda-runtime-cu12==12.4.127
Requires-Dist: nvidia-cudnn-cu12==9.1.0.70
Requires-Dist: nvidia-cufft-cu12==11.2.1.3
Requires-Dist: nvidia-curand-cu12==10.3.5.147
Requires-Dist: nvidia-cusolver-cu12==11.6.1.9
Requires-Dist: nvidia-cusparse-cu12==12.3.1.170
Requires-Dist: nvidia-cusparselt-cu12==0.6.2
Requires-Dist: nvidia-nccl-cu12==2.21.5
Requires-Dist: nvidia-nvjitlink-cu12==12.4.127
Requires-Dist: nvidia-nvtx-cu12==12.4.127
Requires-Dist: openai==1.70.0
Requires-Dist: packaging==24.2
Requires-Dist: prompt-toolkit==3.0.50
Requires-Dist: propcache==0.3.1
Requires-Dist: protobuf==6.30.2
Requires-Dist: pycountry==24.6.1
Requires-Dist: pycryptodome==3.10.1
Requires-Dist: pydantic-core==2.33.0
Requires-Dist: pydantic==2.11.1
Requires-Dist: pyfpe==0.10.3
Requires-Dist: pygments==2.19.1
Requires-Dist: pyyaml==6.0.2
Requires-Dist: regex==2024.11.6
Requires-Dist: requests==2.32.3
Requires-Dist: rich==14.0.0
Requires-Dist: safetensors==0.5.3
Requires-Dist: sentencepiece==0.2.0
Requires-Dist: shortuuid==1.0.13
Requires-Dist: six==1.17.0
Requires-Dist: sniffio==1.3.1
Requires-Dist: starlette==0.46.2
Requires-Dist: svgwrite==1.4.3
Requires-Dist: sympy==1.13.1
Requires-Dist: tiktoken==0.9.0
Requires-Dist: tokenizers==0.21.1
Requires-Dist: torch==2.6.0
Requires-Dist: tqdm==4.67.1
Requires-Dist: transformers==4.50.3
Requires-Dist: triton==3.2.0
Requires-Dist: typing-extensions==4.13.0
Requires-Dist: typing-inspection==0.4.0
Requires-Dist: urllib3==2.3.0
Requires-Dist: uv==0.6.14
Requires-Dist: uvicorn==0.34.1
Requires-Dist: wavedrom==2.0.3.post3
Requires-Dist: wcwidth==0.2.13
Requires-Dist: yarl==1.19.0
Description-Content-Type: text/markdown

# preempt
This is a modular version of Prϵϵmpt, meant to be used as part of other projects. 

For the experiments and results found in [Prϵϵmpt: Sanitizing Sensitive Prompts for LLMs](https://arxiv.org/abs/2504.05147), please refer to [this repo](https://github.com/danshumaan/preempt-experiments).
## Setup
1. Clone this repo and navigate to the root directory (`preempt`).
2. Install uv following the [instructions here](https://docs.astral.sh/uv/getting-started/installation/).
3. Create a virtual environment with Python 3.11, activate it, and install the dependencies:
```
uv venv --python 3.11
. ./.venv/bin/activate
uv sync
```


## Usage
Additional usage examples can be found in `demo.ipynb`.

We will add support for generalized NER and sanitization in the near future. 

### Complete Usage Example
This is a complete usage example in which we sanitize names and currency values. Make sure you have either [Universal NER](https://huggingface.co/Universal-NER/UniNER-7B-all) or [Llama-3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) available.

1. Import all utilities:
```
# Import utils
from preempt.utils import *
```

2. Initialize a `NER` and `Sanitizer` object:
```
# Load NER object
# ner_model = NER("/path/to/uniner-7b-pii-v3", device="cuda:1")
ner_model = NER("/path/to/Meta-Llama-3-8B-Instruct/", device="cuda:1")

# Load Sanitizer object
sanitizer_name = Sanitizer(ner_model, key = "EF4359D8D580AA4F7F036D6F04FC6A94", tweak = "D8E7920AFA330A73")
sanitizer_money = Sanitizer(ner_model, key = "FF4359D8D580AA4F7F036D6F04FC6A94", tweak = "E8E7920AFA330A73")

# Sentences
sentences = ["Ben Parker and John Doe went to the bank and withdrew $200.", "Adam won $20 in the lottery."]
```

3. Sanitize names in `sentences`:
```
# Sanitizing names
sanitized_sentences, _ = sanitizer_name.encrypt(sentences, entity='Name', epsilon=1)
print("Sanitized sentences:")
print(sanitized_sentences)
"""
Prints:

Sanitized sentences:
['Jay Francois and Lamine Franklin went to the bank and withdrew $200.', 'Elie Vinod won $20 in the lottery.']
"""
```

4. Sanitize currency values in `sanitized_sentences`:
```
# Sanitizing currency values
sanitized_sentences, _ = sanitizer_money.encrypt(sanitized_sentences, entity='Money', epsilon=1)
print("Sanitized sentences:")
print(sanitized_sentences)
"""
Prints:

Sanitized sentences:
['Jay Francois and Lamine Franklin went to the bank and withdrew $769451698.', 'Elie Vinod won $37083668 in the lottery.']
"""
```

5. Desanitize encrypted names in `sanitized_sentences`:
```
# Desanitizing names
desanitized_sentences = sanitizer_name.decrypt(sanitized_sentences, entity='Name')
print("Desanitized sentences:")
print(desanitized_sentences)
"""
Prints:

Desanitized sentences:
['Ben Parker and John Doe went to the bank and withdrew $769451698.', 'Adam won $37083668 in the lottery.']
"""
```

6. Desanitize encrypted currency values in `desanitized_sentences`:
```
# Desanitizing currency values
desanitized_sentences = sanitizer_money.decrypt(desanitized_sentences, entity='Money')
print("Desanitized sentences:")
print(desanitized_sentences)
"""
Prints:

Desanitized sentences:
['Ben Parker and John Doe went to the bank and withdrew $200.', 'Adam won $20 in the lottery.']
"""
```

### Extraction
We currently support [Universal NER](https://huggingface.co/Universal-NER/UniNER-7B-all) and [Llama-3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) for NER. We will add support for including your own NER models in the near future. 

Initialize a `NER` class object by passing the path to one of the supported NER models mentioned above:
```
ner_model = NER("/path/to/Meta-Llama-3-8B-Instruct/", device="cuda:0")
```
Extract PII values found in a list of target strings using `ner_model.extract()`:
```
sentences = ["Ben Parker and John Doe went to the bank.", "Who was late today? Adam."]
# entity_type is one of 'Name', 'Money' or 'Age'
extracted = ner_model.extract(sentences, entity_type='Name')
```

### Sanitization
We currently only support sanitization for names, currency values and age, using either FPE or m-LDP.

Initialize a `Sanitizer` class object by passing the previously initialized `ner_model`, together with the `key` and `tweak` parameters (required for the FF3 cipher used for FPE).
```
sanitizer = Sanitizer(ner_model, key = "EF4359D8D580AA4F7F036D6F04FC6A94", tweak = "D8E7920AFA330A73")
```
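
The `key` and `tweak` in the example above are hexadecimal strings of 32 and 16 characters respectively. If you need fresh values, the following is a minimal sketch of one way to generate them with the standard library. The helper name `generate_ff3_params` is hypothetical, and the lengths are assumed from the examples in this README rather than from a documented requirement of `Sanitizer`.

```python
import secrets

def generate_ff3_params():
    """Return a random (key, tweak) pair as uppercase hex strings.

    Assumption: lengths mirror the README examples
    (32 hex chars for the key, 16 hex chars for the tweak).
    """
    key = secrets.token_hex(16).upper()   # 16 random bytes -> 32 hex characters
    tweak = secrets.token_hex(8).upper()  # 8 random bytes -> 16 hex characters
    return key, tweak

key, tweak = generate_ff3_params()
# These values could then be passed as Sanitizer(ner_model, key=key, tweak=tweak).
```

Because FPE is a symmetric cipher, the same `key` and `tweak` are needed again for decryption, so store them somewhere safe rather than regenerating them.
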
Sanitize a list of target strings using `sanitizer.encrypt()`:
```
sanitized_sentences, _ = sanitizer.encrypt(sentences, entity='Name', epsilon=1, use_fpe=True, use_mdp=False)
```
PII values found during NER are stored under `sanitizer.new_entities` as a nested list.

The mappings between plaintext and ciphertext PII values are stored under `sanitizer.entity_mapping`. With FPE, PII values are typically re-extracted from the sanitized sentences before decryption.

Sanitized sentences can be desanitized using `sanitizer.decrypt()`:
```
desanitized_sentences = sanitizer.decrypt(sanitized_sentences, entity='Name')
```

#### Sanitizing multiple PII attributes
If you want to sanitize multiple sensitive attributes, create a sanitizer for each category separately. 
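
One way to organize this is to keep a list of `(sanitizer, entity)` pairs and apply them in turn. The sketch below assumes only the `encrypt()` and `decrypt()` call shapes shown in this README; the helper names `sanitize_all` and `desanitize_all` are hypothetical and not part of `preempt`.

```python
def sanitize_all(sentences, sanitizers, epsilon=1):
    """Apply each (sanitizer, entity) pair in order and return the sanitized sentences."""
    for sanitizer, entity in sanitizers:
        # encrypt() returns a tuple; only the sanitized sentences are kept here
        sentences, _ = sanitizer.encrypt(sentences, entity=entity, epsilon=epsilon)
    return sentences

def desanitize_all(sentences, sanitizers):
    """Undo sanitize_all by calling decrypt() for each (sanitizer, entity) pair."""
    for sanitizer, entity in sanitizers:
        sentences = sanitizer.decrypt(sentences, entity=entity)
    return sentences
```

With the objects from the complete usage example, this could be called as `sanitize_all(sentences, [(sanitizer_name, 'Name'), (sanitizer_money, 'Money')])`.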

For more examples, check out `demo.ipynb`.

### Usage tips
NER typically works better when the inputs are smaller. Consider breaking a large chunk of text into smaller sentences when using the sanitizer.
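
As a rough illustration, a large block of text could be split with a simple regular expression before it is passed to the sanitizer. The helper `split_into_sentences` is hypothetical and not part of `preempt`. It is a heuristic that will mis-split abbreviations such as "Dr. Smith", so a dedicated sentence tokenizer may be a better fit for real data.

```python
import re

def split_into_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty
    return [p for p in parts if p]

text = "Ben Parker and John Doe went to the bank and withdrew $200. Adam won $20 in the lottery."
sentences = split_into_sentences(text)
# sentences is now a list of shorter strings suitable for sanitizer.encrypt(...)
```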
