Metadata-Version: 2.1
Name: fuzzyjoin
Version: 0.4.1
Summary: Join two tables by a fuzzy comparison of text columns.
Home-page: https://github.com/chancyk/fuzzyjoin
Author: Chancy Kennedy
Author-email: kennedychancy+fuzzyjoin@gmail.com
License: MIT license
Keywords: fuzzyjoin
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Dist: Click (<8.0,>=7.0)
Requires-Dist: pylev (<1.4.0,>=1.3.0)
Provides-Extra: fast
Requires-Dist: editdistance (<0.6.0,>=0.5.3) ; extra == 'fast'

fuzzyjoin
=========

Join two tables by a fuzzy comparison of text columns.

Features
--------
* Command line utility to quickly join CSV files.
* Ngram blocking to reduce the total number of comparisons.
* Pure Python Levenshtein edit distance using [pylev](https://github.com/toastdriven/pylev).
* Fast Levenshtein edit distance using [editdistance](https://github.com/aflc/editdistance).
* License: [MIT](https://opensource.org/licenses/MIT)
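
As a rough illustration of what n-gram blocking means, the sketch below only pairs up values that share at least one character trigram, so that the more expensive edit distance would run on a fraction of all possible pairs. The names `ngrams` and `candidate_pairs` are hypothetical helpers written for this example and are not part of the `fuzzyjoin` API; the package's own blocking may differ in its details.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of lowercased character n-grams in `text`."""
    text = text.lower()
    if len(text) < n:
        # Short strings are kept whole so they can still be blocked.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def candidate_pairs(left: List[str], right: List[str], n: int = 3) -> Set[Tuple[int, int]]:
    """Return (left_index, right_index) pairs that share at least one n-gram."""
    # Inverted index: n-gram -> positions of right-hand values containing it.
    index: Dict[str, List[int]] = defaultdict(list)
    for j, value in enumerate(right):
        for gram in ngrams(value, n):
            index[gram].append(j)

    pairs: Set[Tuple[int, int]] = set()
    for i, value in enumerate(left):
        for gram in ngrams(value, n):
            for j in index.get(gram, ()):
                pairs.add((i, j))
    return pairs


left = ["Acme Corp", "Globex"]
right = ["ACME Corporation", "Initech", "Globex Inc"]

# Only 2 of the 6 possible pairs survive blocking.
print(sorted(candidate_pairs(left, right)))  # [(0, 0), (1, 2)]
```

Pairs that share no n-gram are never compared, which is where the reduction in comparisons comes from. The trade-off is that two very short or heavily misspelled values may share no n-gram and be missed.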


Installation
------------
* Pure Python: `pip install fuzzyjoin`
* Optimized: `pip install "fuzzyjoin[fast]"` (the quotes keep shells such as zsh from interpreting the brackets)


Description
-----------
The goal of this package is to provide a quick and convenient way to
join two tables on a pair of text columns, which often contain variations
of names for the same entity. `fuzzyjoin` satisfies the simple and common case
of joining by a single column from each table for datasets in the thousands of records.

For a more sophisticated and comprehensive treatment of the topic that will allow
you to join records using multiple fields, see the packages below:

* [dedupe](https://github.com/dedupeio/dedupe)
* [recordlinkage](https://recordlinkage.readthedocs.io/en/latest/about.html)


TODO
----
- Test transformation and exclude functions.
- Implement left join and full join.
- Check that the ID is actually unique.
- Add documentation.
- Option to rename headers and disambiguate duplicate header names.


History
=======

0.3.4 (2019-04-11)
------------------
* Fix function defaults.
* Minor optimizations.
* Additional CLI parameters.

0.3.3 (2019-04-10)
------------------
* Cleanup checks.

0.3.2 (2019-04-10)
------------------
* Include basic installation instructions.

0.3.1 (2019-04-10)
------------------
* Minor README updates.


0.3.0 (2019-04-10)
------------------
* Use editdistance if available, otherwise fall back to pylev.
* Report progress by default.
* Number comparison options.
* Renamed get_multiples to filter_multiples.


0.2.1 (2019-04-10)
------------------
* Additional docs and tests.

0.2.0 (2019-04-09)
------------------
* Write multiple matches to a separate file.
* Added types and docstrings.

0.1.2 (2019-04-09)
------------------
* Duplicate release of 0.1.1.

0.1.1 (2019-04-09)
------------------
* First release on PyPI.


