Metadata-Version: 2.1
Name: AdvancedAnalytics
Version: 1.37
Summary: Python support for 'The Art and Science of Data Analytics'
Home-page: https://github.com/tandonneur/AdvancedAnalytics
Author: Edward R Jones
Author-email: ejones@tamu.edu
Keywords: Analytics,data map,preprocessing,pre-processing,postprocessing,post-processing,NLTK,Sci-Learn,sklearn,StatsModels,web scraping,word cloud,regression,decision trees,random forest,neural network,cross validation,topic analysis,sentiment analytic,natural language processing,NLP
Classifier: Development Status :: 5 - Production/Stable
Classifier: Topic :: Utilities
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Requires-Dist: scikit-learn
Requires-Dist: scikit-image
Requires-Dist: statsmodels
Requires-Dist: nltk
Requires-Dist: pydotplus

AdvancedAnalytics
===================

A collection of python modules, classes and methods for simplifying the use of machine learning solutions.  **AdvancedAnalytics** provides easy access to advanced tools in **Sci-Learn**, **NLTK** and other machine learning packages.  **AdvancedAnalytics** was developed to simplify learning python from the book *The Art and Science of Data Analytics*.

Description
===========

From a high level view, building machine learning applications typically proceeds through three stages:

    1. Data Preprocessing
    2. Modeling or Analytics
    3. Postprocessing

The classes and methods in **AdvancedAnalytics** primarily support the first and last stages of machine learning applications. 

Data scientists report they spend 80% of their total effort in first and last stages. The first stage, *data preprocessing*, is concerned with preparing the data for analysis.  This includes:

    1. identifying and correcting outliers, 
    2. imputing missing values, and 
    3. encoding data. 

The last stage, *solution postprocessing*, involves developing graphic summaries of the solution, and metrics for evaluating the quality of the solution.

Documentation and Examples
============================

The API and documentation for all classes and examples are available at https://github.com/tandonneur/AdvancedAnalytics/. 

Usage
=====

Currently the most popular usage is for supporting solutions developed using these advanced machine learning packages:

    * Sci-Learn
    * StatsModels
    * NLTK

The intention is to expand this list to other packages.  This is a simple example for linear regression that uses the data map structure to preprocess data:

.. code-block:: python

    from AdvancedAnalytics.ReplaceImputeEncode import DT
    from AdvancedAnalytics.ReplaceImputeEncode import ReplaceImputeEncode
    from AdvancedAnalytics.Tree import tree_regressor
    from sklearn.tree import DecisionTreeRegressor, export_graphviz 
    # Data Map Using DT, Data Types
    data_map = {
        "Salary":         [DT.Interval, (20000.0, 2000000.0)],
        "Department":     [DT.Nominal, ("HR", "Sales", "Marketing")] 
        "Classification": [DT.Nominal, (1, 2, 3, 4, 5)]
        "Years":          [DT.Interval, (18, 60)] }
    # Preprocess data from data frame df
    rie = ReplaceImputeEncode(data_map=data_map, interval_scaling=None,
                              nominal_encoding= "SAS", drop=True)
    encoded_df = rie.fit_transform(df)
    y = encoded_df["Salary"]
    X = encoded_df.drop("Salary", axis=1)
    dt = DecisionTreeRegressor(criterion= "gini", max_depth=4,
                                min_samples_split=5, min_samples_leaf=5)
    dt = dt.fit(X,y)
    tree_regressor.display_importance(dt, encoded_df.columns)
    tree_regressor.display_metrics(dt, X, y)

Current Modules and Classes
=============================

ReplaceImputeEncode
    Classes for Data Preprocessing
        * DT defines new data types used in the data dictionary
        * ReplaceImputeEncode a class for data preprocessing

Regression
    Classes for Linear and Logistic Regression
        * linreg support for linear regressino
        * logreg support for logistic regression
        * stepwise a variable selection class

Tree
    Classes for Decision Tree Solutions
        * tree_regressor support for regressor decision trees
        * tree_classifier support for classification decision trees

Forest
    Classes for Random Forests
        * forest_regressor support for regressor random forests
        * forest_classifier support for classification random forests

NeuralNetwork
    Classes for Neural Networks
        * nn_regressor support for regressor neural networks
        * nn_classifier support for classification neural networks

Text
    Classes for Text Analytics
        * text_analysis support for topic analysis
        * text_plot for word clouds
        * sentiment_analysis support for sentiment analysis

Internet
    Classes for Internet Applications
        * scrape support for web scrapping
        * metrics a class for solution metrics

Installation and Dependencies
=============================

**AdvancedAnalytics** is designed to work on any operating system running python 3.  It can be installed using **pip** or **conda**.

.. code-block:: python

    pip install AdvancedAnalytics
    # or
    conda install -c dr.jones AdvancedAnalytics

General Dependencies
    There are dependencies.  Most classes import one or more modules from    
    **Sci-Learn**, referenced as *sklearn* in module imports, and 
    **StatsModels**.  These are both installed with the current version
    of **anaconda**.

Installed with AdvancedAnalytics
    Most packages used by **AdvancedAnalytics** are automatically 
    installed with its installation.  These consist of the following 
    packages.

        * statsmodels
        * scikit-learn
        * scikit-image
        * nltk
        * pydotplus

Other Dependencies
    The *Tree* and *Forest* modules plot decision trees and importance
    metrics using **pydotplus** and the **graphviz** packages.  These
    should also be automatically installed with **AdvancedAnalytics**.

    However, the **graphviz** install is sometimes not fully complete 
    with the conda install.  It may require an additional pip install.

    .. code-block:: python

        pip install graphviz

Text Analytics Dependencies
    The *TextAnalytics* module uses the **NLTK**, **Sci-Learn**, and 
    **wordcloud** packages.  Usually these are also automatically 
    installed automatically with **AdvancedAnalytics**.  You can verify 
    they are installed using the following commands.

    .. code-block:: python

        conda list nltk
        conda list sci-learn
        conda list wordcloud

    However, when the **NLTK** package is installed, it does not 
    install the data used by the package.  In order to load the
    **NLTK** data run the following code once before using the 
    *TextAnalytics* module.

    .. code-block:: python

        #The following NLTK commands should be run once
        nltk.download("punkt")
        nltk.download("averaged_preceptron_tagger")
        nltk.download("stopwords")
        nltk.download("wordnet")

    The **wordcloud** package also uses a little know package
    **tinysegmenter** version 0.3.  Run the following code to ensure
    it is installed.

    .. code-block:: python

        conda install -c conda-forge tinysegmenter==0.3
        # or
        pip install tinysegmenter==0.3

Internet Dependencies
    The *Internet* module contains a class *scrape* which has some   
    functions for scraping newsfeeds. Some of these use the 
    **newspaper3k** package.  It should be automatically installed with 
    **AdvancedAnalytics**.

    However, it also uses the package **newsapi-python**, which is not 
    automatically installed.  If you intended to use this news scraping
    scraping tool, it is necessary to install the package using the 
    following code:

    .. code-block:: python

        conda install -c conda-forge newsapi
        # or
        pip install newsapi

    In addition, the newsapi service is sponsored by a commercial company
    www.newsapi.com.  You will need to register with them to obtain an 
    *API* key required to access this service.  This is free of charge 
    for developers, but there is a fee if *newsapi* is used to broadcast 
    news with an application or at a website.

Code of Conduct
---------------

Everyone interacting in the AdvancedAnalytics project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct: https://www.pypa.io/en/latest/code-of-conduct/ .


