Metadata-Version: 2.4
Name: deduplicate_lib
Version: 0.0.3
Summary: A Python package for deduplicating data.
Author-email: Julian Holland <holland@fhi.mpg.de>
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: numba
Requires-Dist: scipy
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-mock; extra == "dev"
Requires-Dist: coverage; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: nbconvert; extra == "dev"
Requires-Dist: nbclient; extra == "dev"
Requires-Dist: notebook; extra == "dev"
Requires-Dist: ipykernel; extra == "dev"
Requires-Dist: matplotlib; extra == "dev"
Dynamic: license-file

<div align="center">
  <h1><code>deduplicate_lib</code></h1>
  <p><i>deduplication algorithms in Python</i></p>
</div>

***
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/julianholland/deduplicate)
[![codecov](https://codecov.io/gh/julianholland/deduplicate/graph/badge.svg?token=JL3OTRCXZD)](https://codecov.io/gh/julianholland/deduplicate)
[![PyPI version](https://badge.fury.io/py/deduplicate_lib.svg)](https://badge.fury.io/py/deduplicate_lib)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![CI](https://github.com/julianholland/deduplicate/actions/workflows/ci.yml/badge.svg)](https://github.com/julianholland/deduplicate/actions/workflows/ci.yml)
[![Code style: ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
<!-- [![Documentation Status](https://readthedocs.org/projects/alomancy/badge/?version=latest)](https://alomancy.readthedocs.io/en/latest/index.html) -->

## Key Features


- Easy-to-use deduplication algorithms for any array of vectors
- A suite of tolerance-tuning algorithms to help you find the right tolerance value for your system
- A suite of benchmarking tools to ensure rigor, accuracy, and speed (not yet implemented)
- Factory plugin architecture for easy extensibility and modification

***
## Implemented Algorithms

- Distance Matrix (simple, accurate, expensive): computes the pairwise distance matrix for all vectors and treats as duplicates those vectors whose distance falls below a given tolerance.
- Multi Hashing (fast): smears and rounds the vectors using a normal distribution, computes a set of hashes for each vector, and identifies duplicates by the proportion of hashes that clash.
<!-- - Locality Sensitive Hashing (Fast, Accurate) -->
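
The two approaches above can be sketched in plain NumPy and SciPy. This is an illustrative sketch only, not the package's implementation: the function names, the parameters (`tolerance`, `bin_width`, `n_hashes`, `clash_fraction`), the greedy keep-first rule, and the choice of one shared Gaussian offset per hashing round are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

import numpy as np
from scipy.spatial.distance import pdist, squareform


def unique_by_distance_matrix(vectors, tolerance):
    """Return indices of rows kept after dropping rows within `tolerance` of a kept row."""
    vectors = np.asarray(vectors, dtype=float)
    dist = squareform(pdist(vectors))  # full (n, n) Euclidean distance matrix
    kept = []
    for i in range(len(vectors)):
        # Keep row i only if it is not closer than `tolerance` to any kept row.
        if all(dist[i, j] >= tolerance for j in kept):
            kept.append(i)
    return kept


def unique_by_multi_hash(vectors, bin_width, n_hashes=32, clash_fraction=0.5, seed=0):
    """Return indices of rows kept, judging duplicates by how often their hashes clash."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    pair_clashes = Counter()
    for _ in range(n_hashes):
        # Assumption: "smearing" is modelled as one normally distributed offset
        # per round, shared by all vectors, followed by rounding onto a grid.
        offset = rng.normal(scale=bin_width, size=vectors.shape[1])
        keys = np.round((vectors + offset) / bin_width).astype(np.int64)
        buckets = defaultdict(list)
        for i, row in enumerate(keys):
            buckets[row.tobytes()].append(i)  # bytes of the rounded row act as the hash
        for members in buckets.values():
            pair_clashes.update(combinations(members, 2))  # pairs (a, b) with a < b

    # A pair counts as duplicates if it clashed in a large enough share of rounds.
    earlier_duplicates = defaultdict(set)
    for (a, b), count in pair_clashes.items():
        if count / n_hashes >= clash_fraction:
            earlier_duplicates[b].add(a)

    kept, kept_set = [], set()
    for i in range(len(vectors)):
        if not (earlier_duplicates[i] & kept_set):
            kept.append(i)
            kept_set.add(i)
    return kept


points = np.array(
    [[0.0, 0.0], [0.001, 0.0], [5.0, 5.0], [5.0, 5.001], [10.0, -3.0]]
)
print(unique_by_distance_matrix(points, tolerance=0.01))  # [0, 2, 4]
print(unique_by_multi_hash(points, bin_width=0.1))
```

The distance-matrix version needs memory and time quadratic in the number of vectors, which is why it is listed as expensive. The hashing version only compares vectors that land in the same bucket, so its cost depends mainly on how many vectors share a bucket in each round.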

## Quick Start
Install using pip:
```bash
pip install deduplicate_lib
```

Load your data into Python and deduplicate it:

```python
from deduplicate_lib.plugins.deduplication_algorithms.multi_hash import MultiHash

# define your parameters in the MultiHash object
dda = MultiHash(dataset_array=your_data_array)

# return a list of all unique values
print(dda.deduplicate())
```

A more detailed example can be seen in the `examples` directory.
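
The Quick Start snippet leaves `your_data_array` undefined. As a minimal sketch for trying it out, the following builds a synthetic array with known near-duplicates. It assumes the input is a 2D NumPy array with one vector per row; that shape, the sizes, and the noise scale are assumptions for illustration, not documented requirements.

```python
import numpy as np

rng = np.random.default_rng(42)

# 100 distinct 3D vectors drawn uniformly from [-1, 1).
base = rng.uniform(-1.0, 1.0, size=(100, 3))

# Near-duplicates: the first 20 vectors with tiny Gaussian perturbations.
noisy_copies = base[:20] + rng.normal(scale=1e-6, size=(20, 3))

# Stack originals and near-duplicates into one array, one vector per row.
your_data_array = np.vstack([base, noisy_copies])
print(your_data_array.shape)  # (120, 3)
```

Passing this array as `dataset_array=your_data_array` gives a dataset in which 20 of the 120 rows are near-copies of earlier rows, which is convenient when checking how a tolerance setting behaves.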

### Dependencies

- Python 3.9+
- `numpy`
- `numba`
- `scipy`

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Development Setup

```bash
# Clone the repository
git clone https://github.com/julianholland/deduplicate.git
cd deduplicate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check .
ruff format .
```

### Running Tests

```bash
# Run all tests
pytest

# Run specific test categories
pytest tests/core/
pytest tests/plugins/
pytest tests/plugins/duplicate_detection_algorithms/distance_matrix

# Run with coverage
pytest --cov
```

## 📝 Citation

If you use deduplicate_lib in your research, please cite:

```bibtex
@software{deduplicate2026,
  title={deduplicate_lib: Auto Tolerance Finding Deduplication Algorithms in Python},
  author={Julian Holland},
  year={2026},
  url={https://github.com/julianholland/deduplicate},
  version={0.0.3}
}
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- The Fritz Haber Institute
- Juan Manuel Lombardi <3


## Project Links

- [GitHub Repository](https://github.com/julianholland/deduplicate)

## Project To-Do

- [x] Add example.ipynb
- [ ] Create general pre-allocation protocol
- [ ] Add benchmarks for time and robustness
- [ ] Add Locality-Sensitive Hashing as an option
- [x] Speedup slow tasks with Numba
- [ ] Set up Read the Docs
- [x] Create general deduplicate function
