Metadata-Version: 2.1
Name: piicatcher
Version: 0.5.0
Summary: Find PII data in databases
Home-page: https://github.com/vrajat/piicatcher
Author: Rajat Venkatesh
Author-email: vrajat@dblint.io
License: Apache 2.0
Download-URL: https://github.com/vrajat/piicatcher/tarball/0.5.0
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
Requires-Dist: pypi-publisher
Requires-Dist: commonregex
Requires-Dist: spacy
Requires-Dist: tableprint
Requires-Dist: pymysql
Requires-Dist: pymssql
Requires-Dist: cx-Oracle
Requires-Dist: psycopg2-binary
Requires-Dist: cryptography
Requires-Dist: peewee
Requires-Dist: pyyaml
Requires-Dist: python-magic

[![CircleCI](https://circleci.com/gh/tokern/piicatcher.svg?style=svg)](https://circleci.com/gh/tokern/piicatcher)
[![codecov](https://codecov.io/gh/tokern/piicatcher/branch/master/graph/badge.svg)](https://codecov.io/gh/tokern/piicatcher)
[![PyPI](https://img.shields.io/pypi/v/piicatcher.svg)](https://pypi.python.org/pypi/piicatcher)
[![image](https://img.shields.io/pypi/l/piicatcher.svg)](https://pypi.org/project/piicatcher/)
[![image](https://img.shields.io/pypi/pyversions/piicatcher.svg)](https://pypi.org/project/piicatcher/)

Pii Catcher for Files and Databases
===================================

Overview
--------

PiiCatcher finds PII data in your databases. It scans all the columns in your 
database and finds the following types of PII information:
* PHONE
* EMAIL
* CREDIT_CARD
* ADDRESS
* PERSON
* LOCATION
* BIRTH_DATE
* GENDER
* NATIONALITY
* IP_ADDRESS
* SSN
* USER_NAME
* PASSWORD

PiiCatcher uses two types of scanners to detect PII information:
1. [CommonRegex](https://github.com/madisonmay/CommonRegex) uses a set of regular expressions 
for common types of information
2. [Spacy Named Entity Recognition](https://spacy.io/usage/linguistic-features#named-entities) 
uses Natural Language Processing to detect named entities. Only English language is currently supported.

Supported Technologies
----------------------
PiiCatcher supports the following filesystems:
* POSIX
* AWS S3 _(Coming soon)_
* Google Cloud Storage _(Coming Soon)_
* ADLS _(Coming Soon)_

PiiCatcher supports the following databases:
1. **Sqlite3** v3.24.0 or greater
2. **MySQL** 5.6 or greater
3. **PostgreSQL** 9.4 or greater
4. **AWS Redshift**
5. **SQL Server**
6. **Oracle**

Installation
------------
PiiCatcher is available as a command-line application.

To install use pip:

    python3 -m venv .env
    source .env/bin/activate
    pip install piicatcher


Or clone the repo:

    git clone https://github.com/vrajat/piicatcher.git
    python3 -m venv .env
    source .env/bin/activate
    python setup.py install

Install Spacy Language Model

    python -m spacy download en_core_web_sm 

Install Oracle Client

PiiCatcher on Oracle, requires a working client. Please refer to [cx_Oracle documentation](https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html#oracle-client-and-oracle-database-interoperability)
for more information.

Usage
-----
    # Print usage to scan databases
    piicatcher db -h
    usage: piicatcher db [-h] -s HOST [-R PORT] [-u USER] [-p PASSWORD]
                     [-t {sqlite,mysql,postgres}] [-c {deep,shallow}]
                     [-o OUTPUT] [-f {ascii_table,json,orm}]

    optional arguments:
      -h, --help            show this help message and exit
      -s HOST, --host HOST  Hostname of the database. File path if it is SQLite
      -R PORT, --port PORT  Port of database.
      -u USER, --user USER  Username to connect database
      -p PASSWORD, --password PASSWORD
                            Password of the user
      -t {sqlite,mysql,postgres}, --connection-type {sqlite,mysql,postgres}
                            Type of database
      -c {deep,shallow}, --scan-type {deep,shallow}
                            Choose deep(scan data) or shallow(scan column names
                            only)
      -o OUTPUT, --output OUTPUT
                            File path for report. If not specified, then report is
                            printed to sys.stdout
      -f {ascii_table,json,orm}, --output-format {ascii_table,json,orm}
                            Choose output format type

    usage: piicatcher files [-h] [--path PATH] [--output OUTPUT]
                        [--output-format {ascii_table,json,orm}]


    piicatcher files -h
    # Print usage to scan databases
    optional arguments:
      -h, --help            show this help message and exit
      --path PATH           Path to file or directory
      --output OUTPUT       File path for report. If not specified, then report is
                            printed to sys.stdout
      --output-format {ascii_table,json,orm}
                            Choose output format type


Example
-------

    # run piicatcher on a sqlite db and print report to console
    piicatcher db -c '/db/sqlqb'
    ╭─────────────┬─────────────┬─────────────┬─────────────╮
    │   schema    │    table    │   column    │   has_pii   │
    ├─────────────┼─────────────┼─────────────┼─────────────┤
    │        main │    full_pii │           a │           1 │
    │        main │    full_pii │           b │           1 │
    │        main │      no_pii │           a │           0 │
    │        main │      no_pii │           b │           0 │
    │        main │ partial_pii │           a │           1 │
    │        main │ partial_pii │           b │           0 │
    ╰─────────────┴─────────────┴─────────────┴─────────────╯



