Metadata-Version: 2.1
Name: pyjanitor
Version: 0.20.13
Summary: Tools for cleaning pandas DataFrames
Home-page: https://github.com/ericmjl/pyjanitor
Author: pyjanitor devs
Author-email: ericmajinglong@gmail.com
License: MIT
Platform: UNKNOWN
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
Requires-Dist: natsort
Requires-Dist: pandas-flavor
Requires-Dist: sklearn
Requires-Dist: multipledispatch
Provides-Extra: all
Requires-Dist: black (>=19.3b0) ; extra == 'all'
Requires-Dist: darglint ; extra == 'all'
Requires-Dist: ipython ; extra == 'all'
Requires-Dist: hypothesis (>=4.4.0) ; extra == 'all'
Requires-Dist: isort (>=4.3.18) ; extra == 'all'
Requires-Dist: biopython ; extra == 'all'
Requires-Dist: interrogate ; extra == 'all'
Requires-Dist: unyt ; extra == 'all'
Requires-Dist: pytest (>=3.4.2) ; extra == 'all'
Requires-Dist: flake8 ; extra == 'all'
Requires-Dist: py (>=1.10.0) ; extra == 'all'
Requires-Dist: pip-tools ; extra == 'all'
Requires-Dist: pandas-vet ; extra == 'all'
Requires-Dist: sphinxcontrib-fulltoc (==1.2.0) ; extra == 'all'
Requires-Dist: pytest-cov ; extra == 'all'
Requires-Dist: tqdm ; extra == 'all'
Requires-Dist: pre-commit ; extra == 'all'
Requires-Dist: pyspark ; extra == 'all'
Requires-Dist: nbsphinx (>=0.4.2) ; extra == 'all'
Requires-Dist: sphinx ; extra == 'all'
Provides-Extra: biology
Requires-Dist: biopython ; extra == 'biology'
Provides-Extra: chemistry
Requires-Dist: tqdm ; extra == 'chemistry'
Provides-Extra: dev
Requires-Dist: pip-tools ; extra == 'dev'
Requires-Dist: pre-commit ; extra == 'dev'
Requires-Dist: isort (>=4.3.18) ; extra == 'dev'
Requires-Dist: black (>=19.3b0) ; extra == 'dev'
Requires-Dist: darglint ; extra == 'dev'
Requires-Dist: flake8 ; extra == 'dev'
Provides-Extra: docs
Requires-Dist: sphinx ; extra == 'docs'
Requires-Dist: nbsphinx (>=0.4.2) ; extra == 'docs'
Requires-Dist: sphinxcontrib-fulltoc (==1.2.0) ; extra == 'docs'
Requires-Dist: ipython ; extra == 'docs'
Requires-Dist: biopython ; extra == 'docs'
Requires-Dist: tqdm ; extra == 'docs'
Requires-Dist: unyt ; extra == 'docs'
Requires-Dist: pyspark ; extra == 'docs'
Provides-Extra: engineering
Requires-Dist: unyt ; extra == 'engineering'
Provides-Extra: spark
Requires-Dist: pyspark ; extra == 'spark'
Provides-Extra: test
Requires-Dist: pytest-cov ; extra == 'test'
Requires-Dist: pytest (>=3.4.2) ; extra == 'test'
Requires-Dist: hypothesis (>=4.4.0) ; extra == 'test'
Requires-Dist: interrogate ; extra == 'test'
Requires-Dist: pandas-vet ; extra == 'test'
Requires-Dist: py (>=1.10.0) ; extra == 'test'



``pyjanitor`` is a Python implementation of the R package `janitor`_, and
provides a clean API for cleaning data.

.. _janitor: https://github.com/sfirke/janitor

Why janitor?
------------

Originally a port of the R package,
``pyjanitor`` has evolved from a set of convenient data cleaning routines
into an experiment with the `method chaining`__ paradigm.

.. _chaining: https://towardsdatascience.com/the-unreasonable-effectiveness-of-method-chaining-in-pandas-15c2109e3c69

__ chaining_

Data preprocessing usually consists of a series of steps
that transform raw data into an understandable/usable format.
This series of steps needs to be run in a certain sequence to achieve success.
We take a base data file as the starting point,
and perform actions on it,
such as removing null/empty rows,
replacing missing values with other values,
adding/renaming/removing columns of data,
filtering rows, and so on.
More formally, these steps, along with their relationships
and dependencies, are commonly represented as a Directed Acyclic Graph (DAG).
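
To make the DAG idea concrete, here is a minimal sketch that uses only the
Python standard library (``graphlib``, Python 3.9+). The step names are
hypothetical and chosen purely for illustration; they are not ``pyjanitor``
functions. Each step lists the steps it depends on, and a topological sort
yields one valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical preprocessing steps (illustrative names only).
# Each key maps to the set of steps that must run before it.
steps = {
    "load_raw_file": set(),
    "remove_empty_rows": {"load_raw_file"},
    "fill_missing_values": {"remove_empty_rows"},
    "rename_columns": {"load_raw_file"},
    "filter_rows": {"fill_missing_values", "rename_columns"},
}

# static_order() returns the steps in an order that respects every dependency.
order = list(TopologicalSorter(steps).static_order())
print(order)
```

Several orderings can be valid for the same graph; the only guarantee is that
every step appears after the steps it depends on.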

The ``pandas`` API has been invaluable for the Python data science ecosystem,
and supports method chaining for a subset of its methods.
For example, resetting indexes (``.reset_index()``),
dropping null values (``.dropna()``), and more,
are accomplished via the appropriate ``pd.DataFrame`` method calls.
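
As a small illustration of the chaining that ``pandas`` already supports,
the sketch below strings together the two methods mentioned above.
The toy data and column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame(
    {
        "name": ["a", "b", None, "d"],
        "value": [1.0, None, 3.0, 4.0],
    }
)

cleaned = (
    raw
    .dropna()                # drop rows that contain any null value
    .reset_index(drop=True)  # renumber the remaining rows starting from 0
)
print(cleaned)
```

Each method returns a new ``DataFrame``, so the calls read top to bottom in the
order they are applied, and ``raw`` itself is left unchanged.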

Inspired by the ease-of-use
and expressiveness of the ``dplyr`` package
of the R statistical language ecosystem,
we have evolved ``pyjanitor`` into a language
for expressing the data processing DAG for ``pandas`` users.



Functionality
-------------

Current functionality includes:

- Cleaning column names (multi-indexes are possible!)
- Removing empty rows and columns
- Identifying duplicate entries
- Encoding columns as categorical
- Splitting your data into features and targets (for machine learning)
- Adding, removing, and renaming columns
- Coalescing multiple columns into a single column
- Date conversions (from MATLAB, Excel, Unix) to Python datetime format
- Expanding a single column that has delimited, categorical values
  into dummy-encoded variables
- Concatenating and deconcatenating columns, based on a delimiter
- Syntactic sugar for filtering the dataframe based on queries on a column
- Experimental submodules for finance, biology, chemistry, engineering, and pyspark
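
To show what two of the operations above mean in practice, here is a sketch
written in plain ``pandas``. It deliberately does not use ``pyjanitor``'s own
method names, which are not shown in this section; the data and column names
are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data with a messy header and a delimited categorical column.
df = pd.DataFrame(
    {
        "Sample ID": [1, 2, 3],
        "Tags": ["red,blue", "blue", "green,red"],
    }
)

# Clean column names: strip whitespace, lowercase, replace spaces with underscores.
df.columns = (
    df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

# Expand the delimited column into one dummy-encoded (0/1) column per category.
dummies = df["tags"].str.get_dummies(sep=",")
expanded = pd.concat([df.drop(columns="tags"), dummies], axis=1)
print(expanded)
```

The result has one indicator column per distinct tag, alongside the cleaned
``sample_id`` column.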



