Metadata-Version: 2.4
Name: idscrub
Version: 2.0.15
Author: Department for Business and Trade
Classifier: Development Status :: 3 - Alpha
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ipykernel>=7.1.0
Requires-Dist: ipywidgets
Requires-Dist: numpy>=2.3.4
Requires-Dist: pandas<3.0
Requires-Dist: phonenumbers>=9.0.18
Requires-Dist: pip>=25.3
Requires-Dist: spacy
Requires-Dist: transformers
Requires-Dist: tqdm>=4.67.1
Requires-Dist: presidio-analyzer
Requires-Dist: presidio-anonymizer
Provides-Extra: trf
Requires-Dist: en_core_web_trf; extra == "trf"
Dynamic: license-file

![Development](https://img.shields.io/badge/status-development-orange)

# idscrub 🧽✨

* Names and other personally identifying information are often present in text, even if they are not clearly visible or requested.
* This information often needs to be removed before further analysis.
* `idscrub` identifies and removes (*✨scrubs✨*) personal data from text using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).

> [!IMPORTANT]
> * This package is undergoing frequent internal development. Major updates will be made public periodically.

## Installation

`idscrub` can be installed using `pip` into a Python **>=3.12** environment. 

We recommend installing with the spaCy transformer model (`en_core_web_trf`) as a dependency:

```console
pip install idscrub[trf]
```

If you do not need spaCy:

```console
pip install idscrub
```

## How to use the code

Basic usage example (see [basic_usage.ipynb](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further examples):

```python
from idscrub import IDScrub

scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
scrubbed_texts = scrub.scrub(
    pipeline=[
        {"method": "spacy_entities", "entity_types": ["PERSON"]},
        {"method": "uk_phone_numbers"},
        {"method": "uk_postcodes"},
    ]
)

print(scrubbed_texts)

# Output: ['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE].']
```

> [!IMPORTANT]
> * This package will identify and scrub many types of data that you might not want to scrub, such as context-relevant, fictional or organisational names. 
> * We therefore recommend reviewing the data that `idscrub` identifies and removing it from your original dataset manually, on a case-by-case basis.

Scrubbed data can be identified using the following methods (see the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) for further information):

```python
import pandas as pd
from idscrub import IDScrub

# From lists of text:
scrub = IDScrub(['Our names are Hamish McDonald, L. Salah, and Elena Suárez.', 'My number is +441111111111 and I live at AA11 1AA.'])
scrubbed_texts = scrub.scrub(
    pipeline=[
        {"method": "spacy_entities", "entity_types": ["PERSON"]},
        {"method": "uk_phone_numbers"},
        {"method": "uk_postcodes"},
    ]
)
scrubbed_df = scrub.get_scrubbed_data()
print(scrubbed_df)

# From a Pandas DataFrame:
scrubbed_df, scrubbed_data = IDScrub.dataframe(
    df=pd.read_csv('path/to/csv'), 
    id_col="ID", 
    pipeline=[
        {"method": "spacy_entities", "entity_types": ["PERSON"]},
        {"method": "uk_phone_numbers"},
        {"method": "uk_postcodes"},
    ]
)
print(scrubbed_data)
```

The key-value pairs available for customising each method, e.g. `"entity_types"`, are listed in the method's docstring, e.g. `?IDScrub.spacy_entities`.

Key-value pairs represent method arguments:

`IDScrub.spacy_entities(entity_types=["PERSON"])` is equivalent to `pipeline=[{"method": "spacy_entities", "entity_types": ["PERSON"]}]`.
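
As a rough illustration of this equivalence, the sketch below dispatches pipeline dictionaries to methods on a toy class. `ToyScrubber`, `replace_word` and `run` are hypothetical names invented for this example; it is not a description of how `idscrub` is implemented internally.

```python
class ToyScrubber:
    """Hypothetical stand-in for illustration only; not idscrub's implementation."""

    def __init__(self, texts):
        self.texts = list(texts)

    def replace_word(self, word, replacement="[REDACTED]"):
        # A deliberately simple "scrubbing" method with keyword arguments.
        self.texts = [text.replace(word, replacement) for text in self.texts]
        return self.texts

    def run(self, pipeline):
        for step in pipeline:
            kwargs = dict(step)  # copy so the caller's dictionary is not mutated
            method_name = kwargs.pop("method")
            # Remaining key-value pairs are passed on as keyword arguments.
            getattr(self, method_name)(**kwargs)
        return self.texts


direct = ToyScrubber(["Call Alice today."]).replace_word(word="Alice")
via_pipeline = ToyScrubber(["Call Alice today."]).run(
    [{"method": "replace_word", "word": "Alice"}]
)
print(direct == via_pipeline)  # True
```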

## Personal data types supported

| Method                | Scrubs                                                                 |
|-------------------------|------------------------------------------------------------------------|
| `all`                  | All supported personal data types (see `IDScrub.all()` for further customisation) |
| `spacy_entities`        | Entities detected by spaCy's `en_core_web_trf` or other user-selected spaCy models (e.g. persons (names), organisations) |
| `presidio_entities`     | Entities supported by [Microsoft Presidio](https://microsoft.github.io/presidio/) (e.g. persons (names), URLs, NHS numbers, NINO, IBAN codes) |
| `huggingface_entities`  | Entities detected by user-selected Hugging Face models |
| `email_addresses`      | Email addresses (e.g. john@email.com)   |
| `titles`               | Titles (e.g. Mr., Mrs., Dr.)    |
| `handles`              | Social media handles (e.g. @username)  |
| `urls`                 | URLs (e.g. www.bbc.co.uk) |
| `ip_addresses`         | IP addresses (e.g. 8.8.8.8)  |
| `uk_postcodes`         | UK postal codes (e.g. SW1A 2AA) |
| `uk_addresses`         | UK addresses (e.g. 10 Downing Street)  |
| `uk_phone_numbers`     | UK phone numbers (e.g. +441111111111) |
| `google_phone_numbers` | Phone numbers detected by [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers), the Python port of Google's libphonenumber |

> [!IMPORTANT]
> If you wish to scrub something not included in this list or contribute another method to the codebase, see the [custom methods notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/custom_methods.ipynb) for guidance and examples. 

## Considerations before use

- You must follow [GDPR guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/the-research-provisions/principles-and-grounds-for-processing/) when processing personal data using this package.
- This package has been designed as a *first pass* for standardised personal data removal. 
- Users are encouraged to check and confirm outputs and conduct manual reviews where necessary, e.g. when cleaning high risk datasets.
- It is up to the user to assess whether this removal process needs to be supplemented by other methods for their given dataset and security requirements.
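
One possible way to organise a manual review is to draw a reproducible random sample of original and scrubbed texts and compare them side by side. The sketch below uses only the standard library; `sample_for_review` and the example texts are hypothetical and are not part of `idscrub`.

```python
import random


def sample_for_review(original_texts, scrubbed_texts, k=5, seed=0):
    """Pick up to k (index, original, scrubbed) triples for manual inspection."""
    if len(original_texts) != len(scrubbed_texts):
        raise ValueError("original and scrubbed lists must be the same length")
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    count = min(k, len(original_texts))
    indices = sorted(rng.sample(range(len(original_texts)), count))
    return [(i, original_texts[i], scrubbed_texts[i]) for i in indices]


# Invented example data, not real idscrub output.
originals = ["Email Ann.", "Call Bob.", "Meet Cy."]
scrubbed = ["Email [PERSON].", "Call [PERSON].", "Meet Cy."]
for index, before, after in sample_for_review(originals, scrubbed, k=2):
    print(index, before, "->", after)
```

The sample size and sampling strategy should reflect the risk level of the dataset; a high risk dataset may warrant a full review rather than a sample.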

### Input data

- This package is designed for text-based documents structured as a list of strings.
- It performs best when contextual meaning can be inferred from the text.
- For best results, input text should therefore resemble natural language. 
- **Highly fragmented, informal, technical, or syntactically broken text may reduce detection accuracy and lead to incomplete or incorrect name detection.**
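
A minimal sketch of preparing raw records as a list of strings is shown below. `prepare_texts` is a hypothetical helper, not part of `idscrub`, and whether this kind of normalisation helps will depend on your data. Missing values are kept as empty strings so that positions still line up with the original records.

```python
import re


def prepare_texts(raw_items):
    """Convert mixed raw values into a list of whitespace-normalised strings."""
    prepared = []
    for item in raw_items:
        if item is None:
            # Keep a placeholder so list positions match the input records.
            prepared.append("")
            continue
        # Collapse runs of whitespace (including line breaks) into single spaces.
        prepared.append(re.sub(r"\s+", " ", str(item)).strip())
    return prepared


texts = prepare_texts(["Hello\n  world ", None, 42])
print(texts)  # ['Hello world', '', '42']
```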

### Biases and evaluation

- `idscrub` supports integration with spaCy and Hugging Face models for name cleaning.
- These models are state-of-the-art, capable of identifying approximately 90% of named entities, but **may not remove all names**.
- **Biases present in these models due to their training data may affect performance**. For example:
    - English names may be more reliably identified than names common in other languages.
    - Uncommon or non-Western naming conventions may be missed or misclassified.
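
To get a rough sense of how many names are missed on your own data, one option is to hand-label a small sample and check which known names survive scrubbing. The sketch below is an assumption about how such a check could look; `name_recall` is a hypothetical helper and the texts are invented for illustration, not real `idscrub` output or evaluation results.

```python
def name_recall(scrubbed_texts, known_names):
    """Fraction of hand-labelled names absent from the matching scrubbed text."""
    total = 0
    removed = 0
    for text, names in zip(scrubbed_texts, known_names):
        for name in names:
            total += 1
            # Case-insensitive substring check: a surviving name counts as a miss.
            if name.lower() not in text.lower():
                removed += 1
    return removed / total if total else None


# Invented example: one name is left behind in the first text.
missed_name = "Nguyễn Văn An"
scrubbed = [f"Our names are [PERSON] and {missed_name}.", "Contact [PERSON]."]
labels = [["Hamish McDonald", missed_name], ["Elena Suárez"]]

recall = name_recall(scrubbed, labels)
print(f"{recall:.2f}")  # 0.67
```

Computing this separately for different groups of names can help show whether some naming conventions are missed more often than others in your dataset.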

> [!IMPORTANT]
> * See [our wiki](https://github.com/uktrade/idscrub/wiki/Evaluation) for further details and notes on our evaluation of `idscrub`.

### Models

* Only spaCy's `en_core_web_trf` has been formally evaluated. No Hugging Face models have been formally evaluated.
* We therefore recommend that the current default `en_core_web_trf` is used for name scrubbing. **Other models need to be evaluated by the user.**

## Similar Python packages

* Similar packages exist for undertaking this task, such as [Presidio](https://microsoft.github.io/presidio/), [Scrubadub](https://github.com/LeapBeyond/scrubadub) and [Sanityze](https://github.com/UBC-MDS/sanityze). 
* Development of `idscrub` was undertaken to: 

    * Bring together different scrubbing methods across the Department for Business and Trade.
    * Adhere to infrastructure requirements.
    * Guarantee future stability and maintainability.
    * Encourage future scrubbing methods to be added collaboratively and transparently.
    * Allow for full flexibility depending on the use case and required outputs.
    
* To leverage the power of other packages, we have added methods that allow you to interact with them. These include `IDScrub.presidio_entities()` and `IDScrub.google_phone_numbers()`. See the [usage example notebook](https://github.com/uktrade/idscrub/blob/main/notebooks/basic_usage.ipynb) and method docstrings for further information.

## AI declaration

AI has been used in the development of `idscrub`, primarily to develop regular expressions, suggest code refinements and draft documentation.

## Development setup

This project is managed by [uv](https://docs.astral.sh/uv/).

To install all dependencies for this project, run:

```console
uv sync
```

If you do not have Python 3.12, run:

```console
uv python install 3.12
```

To run tests:

```console
uv run pytest
```

or

```console
make test
```

## Author 

Analytical Data Science, Department for Business and Trade
