Metadata-Version: 2.4
Name: datafog
Version: 4.3.0b2
Summary: Lightning-fast PII detection and anonymization library with 190x performance advantage
Author: Sid Mohan
Author-email: sid@datafog.ai
Project-URL: Homepage, https://datafog.ai
Project-URL: Documentation, https://docs.datafog.ai
Project-URL: Discord, https://discord.gg/bzDth394R4
Project-URL: Twitter, https://twitter.com/datafoginc
Project-URL: GitHub, https://github.com/datafog/datafog-python
Keywords: pii detection anonymization privacy regex performance
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Security
Requires-Python: >=3.10,<3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic<3.0,>=2.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: typing-extensions>=4.0
Provides-Extra: nlp
Requires-Dist: spacy<4.0,>=3.7.0; extra == "nlp"
Provides-Extra: nlp-advanced
Requires-Dist: gliner>=0.2.5; extra == "nlp-advanced"
Requires-Dist: torch<2.7,>=2.1.0; extra == "nlp-advanced"
Requires-Dist: transformers>=4.20.0; extra == "nlp-advanced"
Requires-Dist: huggingface-hub>=0.16.0; extra == "nlp-advanced"
Provides-Extra: ocr
Requires-Dist: pytesseract>=0.3.0; extra == "ocr"
Requires-Dist: Pillow>=10.0.0; extra == "ocr"
Requires-Dist: sentencepiece>=0.2.0; extra == "ocr"
Requires-Dist: protobuf>=4.0.0; extra == "ocr"
Provides-Extra: distributed
Requires-Dist: pandas>=2.0.0; extra == "distributed"
Requires-Dist: numpy>=1.24.0; extra == "distributed"
Provides-Extra: web
Requires-Dist: fastapi>=0.100.0; extra == "web"
Requires-Dist: aiohttp>=3.8.0; extra == "web"
Requires-Dist: requests>=2.30.0; extra == "web"
Provides-Extra: cli
Requires-Dist: typer>=0.12.0; extra == "cli"
Requires-Dist: pydantic-settings>=2.0.0; extra == "cli"
Provides-Extra: crypto
Requires-Dist: cryptography>=40.0.0; extra == "crypto"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: sphinx>=7.0.0; extra == "dev"
Provides-Extra: all
Requires-Dist: spacy<4.0,>=3.7.0; extra == "all"
Requires-Dist: gliner>=0.2.5; extra == "all"
Requires-Dist: torch<2.7,>=2.1.0; extra == "all"
Requires-Dist: transformers>=4.20.0; extra == "all"
Requires-Dist: huggingface-hub>=0.16.0; extra == "all"
Requires-Dist: pytesseract>=0.3.0; extra == "all"
Requires-Dist: Pillow>=10.0.0; extra == "all"
Requires-Dist: sentencepiece>=0.2.0; extra == "all"
Requires-Dist: protobuf>=4.0.0; extra == "all"
Requires-Dist: pandas>=2.0.0; extra == "all"
Requires-Dist: numpy>=1.24.0; extra == "all"
Requires-Dist: fastapi>=0.100.0; extra == "all"
Requires-Dist: aiohttp>=3.8.0; extra == "all"
Requires-Dist: requests>=2.30.0; extra == "all"
Requires-Dist: typer>=0.12.0; extra == "all"
Requires-Dist: pydantic-settings>=2.0.0; extra == "all"
Requires-Dist: cryptography>=40.0.0; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: keywords
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# DataFog Python

DataFog is a Python library for detecting and redacting personally identifiable information (PII).

It provides:

- Fast structured PII detection via regex
- Optional NER support via spaCy and GLiNER
- A simple agent-oriented API for LLM applications
- Backward-compatible `DataFog` and `TextService` classes

## Installation

```bash
# Core install (regex engine)
pip install datafog

# Add spaCy support
pip install "datafog[nlp]"

# Add GLiNER support (this extra does not install spaCy; use "datafog[nlp,nlp-advanced]" for both)
pip install "datafog[nlp-advanced]"

# Everything
pip install "datafog[all]"
```

## Quick Start

```python
import datafog

text = "Contact john@example.com or call (555) 123-4567"
clean = datafog.sanitize(text, engine="regex")
print(clean)
# Contact [EMAIL_1] or call [PHONE_1]
```

## For LLM Applications

```python
import datafog

# 1) Scan prompt text before sending to an LLM
prompt = "My SSN is 123-45-6789"
scan_result = datafog.scan_prompt(prompt, engine="regex")
if scan_result.entities:
    print(f"Detected {len(scan_result.entities)} PII entities")

# 2) Redact model output before returning it
output = "Email me at jane.doe@example.com"
safe_result = datafog.filter_output(output, engine="regex")
print(safe_result.redacted_text)
# Email me at [EMAIL_1]

# 3) One-liner redaction
print(datafog.sanitize("Card: 4111-1111-1111-1111", engine="regex"))
# Card: [CREDIT_CARD_1]
```

### Guardrails

```python
import datafog

# Reusable guardrail object
guard = datafog.create_guardrail(engine="regex", on_detect="redact")

@guard
def call_llm() -> str:
    return "Send to admin@example.com"

print(call_llm())
# Send to [EMAIL_1]
```

## Engines

Use the engine that matches your accuracy and dependency constraints:

- `regex`:
  - Fastest and always available.
  - Best for structured entities: `EMAIL`, `PHONE`, `SSN`, `CREDIT_CARD`, `IP_ADDRESS`, `DATE`, `ZIP_CODE`.
- `spacy`:
  - Requires `pip install datafog[nlp]`.
  - Useful for unstructured entities like person and organization names.
- `gliner`:
  - Requires `pip install datafog[nlp-advanced]`.
  - Stronger NER coverage than regex for unstructured text.
- `smart`:
  - Cascades regex with optional NER engines.
  - If optional deps are missing, it degrades gracefully and warns.
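If you would rather pick an engine explicitly than rely on `smart`, the sketch below checks which optional packages are importable and returns one of the engine names listed above. `pick_engine` is a hypothetical helper and not part of DataFog, and the preference order (GLiNER, then spaCy, then regex) is an assumption made for illustration.

```python
from importlib.util import find_spec
from typing import Callable, Optional


def pick_engine(is_installed: Optional[Callable[[str], bool]] = None) -> str:
    """Return an engine name based on which optional packages are importable.

    Hypothetical helper, not part of DataFog. The preference order is an
    illustrative assumption, not a recommendation from the library.
    """
    if is_installed is None:
        # find_spec returns None for a top-level package that is not installed.
        is_installed = lambda name: find_spec(name) is not None

    if is_installed("gliner"):
        return "gliner"
    if is_installed("spacy"):
        return "spacy"
    # The regex engine ships with the core install, so it is the safe fallback.
    return "regex"


engine = pick_engine()
print(engine)
```

The result can then be passed as the `engine` argument, assuming the functions shown earlier accept the other engine names the same way they accept `regex`. An importable package does not guarantee that an engine is fully usable: spaCy, for example, may still need a language model to be downloaded.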

## Backward-Compatible APIs

The existing public API remains available.

### `DataFog` class

```python
from datafog import DataFog

result = DataFog().scan_text("Email john@example.com")
print(result["EMAIL"])
```

### `TextService` class

```python
from datafog.services import TextService

service = TextService(engine="regex")
result = service.annotate_text_sync("Call (555) 123-4567")
print(result["PHONE"])
```

## CLI

```bash
# Scan text
datafog scan-text "john@example.com"

# Redact text
datafog redact-text "john@example.com"

# Replace detected entities with pseudonyms
datafog replace-text "john@example.com"

# Hash detected entities
datafog hash-text "john@example.com"
```

## Telemetry

DataFog collects anonymous telemetry, which is enabled by default.

To opt out:

```bash
export DATAFOG_NO_TELEMETRY=1
# or
export DO_NOT_TRACK=1
```

Telemetry does not include input text or detected PII values.
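Where exporting a shell variable is inconvenient (a notebook or a test harness, for example), the same opt-out flag can be set from Python. This is a minimal sketch. It assumes DataFog reads the variable from the process environment, so the flag is set before the library is imported.

```python
import os

# Set the opt-out flag before importing datafog.
# Assumption: the library checks the process environment, so the variable
# should exist before any DataFog code runs.
os.environ["DATAFOG_NO_TELEMETRY"] = "1"

# import datafog  # import only after the variable is set

print(os.environ["DATAFOG_NO_TELEMETRY"])
```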

## Development

```bash
git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[all,dev]"
pytest tests/
```
