Metadata-Version: 2.1
Name: contentai-activity-classifier
Version: 1.3.5
Summary: ContentAI Activity Classification Service
Home-page: https://gitlab.research.att.com/turnercode/activity-classifier-extractor
Author: Eric Zavesky
License: Apache
Platform: UNKNOWN
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
Requires-Dist: pandas (==1.0.4)
Requires-Dist: numexpr (==2.7.1)
Requires-Dist: scikit-learn (==0.23.1)
Requires-Dist: h5py (==2.10.0)
Requires-Dist: matplotlib (==3.2.1)
Requires-Dist: imblearn (==0.0)
Requires-Dist: tensorflow (==2.2.0)
Requires-Dist: contentaiextractor (>=1.0.4)

activity-classifier-extractor
=============================


Generates activity classifications from low-level feature inputs in support
of analytic workflows within the `ContentAI Platform <https://www.contentai.io>`__, 
published as the extractor
``dsai_activity_classifier``. 

1. `Getting Started <#getting-started>`__
2. `Execution <#execution-and-deployment>`__
3. `Creating Models <#creating-models>`__
4. `Testing <#testing>`__
5. `Future Development <#future-development>`__
6. `Changes <CHANGES.md>`__

Getting Started
===============

| This library is used as a `single-run executable <#command-line-standalone>`__.
| Runtime parameters configure processing and the returned results; they
  can be examined in more detail in the `main <main.py>`__ script.

-  ``verbose`` - *(bool)* - verbose input/output configuration printing (*default=false*)
-  ``path_content`` - *(str)* - input video path for files to label (*default=video.mp4*)
-  ``path_result`` - *(str)* - output path for samples (*default=.*)
-  ``path_models`` - *(str)* - manifest path for model information (*default=data/models/manifest.json*)
-  ``time_interval`` - *(float)* - time interval for predictions from models (*default=3.0*)
-  ``average_predictions`` - *(bool)* - flatten predictions across time and class (*default=false*)
-  ``round_decimals`` - *(int)* - rounding decimals for predictions (*default=5*)
-  ``score_min`` - *(float)* - apply a minimum score threshold for classes (*default=0.1*)
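
As a rough illustration of how ``score_min`` and ``round_decimals`` shape the output, the sketch below filters and rounds a dictionary of class scores. The function name ``filter_scores``, the input layout, and the choice of an inclusive threshold are hypothetical and are not taken from the package; only the two parameter names and their defaults come from the list above.

```python
from typing import Dict


def filter_scores(scores: Dict[str, float],
                  score_min: float = 0.1,
                  round_decimals: int = 5) -> Dict[str, float]:
    """Drop classes scoring below score_min and round the rest.

    Hypothetical helper: mirrors the documented defaults, not the package's code.
    """
    return {
        name: round(score, round_decimals)
        for name, score in scores.items()
        if score >= score_min  # assumption: the threshold is inclusive
    }


example = {"Running": 0.8765432, "Walking": 0.05, "Jumping": 0.1}
filtered = filter_scores(example)
print(filtered)
```

Raising ``score_min`` trims low-confidence classes from the result, while ``round_decimals`` only affects how the surviving scores are written.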


dependencies
------------

To install package dependencies on a fresh system, the recommended
approach is plain pip. The current requirements are listed in
``requirements.txt`` and can be installed with the following command.

.. code:: shell

   pip install --no-cache-dir -r requirements.txt 

Execution and Deployment
========================

This package is meant to be run as a one-off processing tool that
aggregates the insights of other extractors.

command-line standalone
-----------------------

Run the code as if it is an extractor. In this mode, configure a few
environment variables to let the code know where to look for content.

One can also run the command line with a single argument as input and
optionally add runtime configuration (see `runtime
variables <#getting-started>`__) as JSON in the ``EXTRACTOR_METADATA``
variable.

.. code:: shell

   EXTRACTOR_METADATA='{"compressed":true}'
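
Because the value must be valid JSON (booleans are lowercase ``true``/``false``), one way to avoid quoting mistakes is to build it programmatically. This is a minimal sketch that assumes the extractor reads the variable from its environment; the parameter values are illustrative only.

```python
import json
import os

# Runtime parameters documented in this README (values here are examples).
runtime_config = {"verbose": True, "time_interval": 1.5, "score_min": 0.2}

# json.dumps emits lowercase true/false, which is what JSON requires.
metadata = json.dumps(runtime_config)
os.environ["EXTRACTOR_METADATA"] = metadata

# Round-trip to confirm the value parses back into the same settings.
parsed = json.loads(os.environ["EXTRACTOR_METADATA"])
print(metadata)
```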

Locally Run Classifier on Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For convenience, local execution has been wrapped in the bash script
``run_local.sh``.

.. code:: shell

    ./run_local.sh <docker_image> [<source_directory> <output_data_dir> [<json_args>]] [<all_args>]
       - run clip extraction on source with prior processing

      <docker_image> = 0 IF local command-line based (args using arg parse) 
                     = 1 IF local docker emulation
                     = IMAGE_NAME IF docker image name to run

      ./run_local.sh 0 --path_content features/ --path_result results/ --verbose 
      ./run_local.sh 1 features/ results/ 0 '{\"verbose\":true}' 

In all of the above examples, the underlying command-line execution is 
similar to this execution run on the testing data.

.. code:: shell

    python -u activity_classifier/main.py --path_content testing/data/launch/video.mp4 \
            --path_result testing/class --path_models activity_classifier/data/models/manifest.json --verbose

Feature-Based Similarity
~~~~~~~~~~~~~~~~~~~~~~~~

A helper script is also available to compute the similarity of clips in 
one or more feature files. *(v1.1.0)*

.. code:: shell

    python -u activity_classifier/features.py --path_content testing/data/dummy.txt \
            --feature_type dsai_videocnn dsai_vggish --path_result testing/dist
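
The exact metric used by ``features.py`` is not described here. As an illustration of what clip similarity over feature vectors can look like, the following sketch L2-normalizes two vectors and takes their dot product (cosine similarity). The function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0.0:
        return list(vector)
    return [value / norm for value in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two L2-normalized vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


clip_a = [1.0, 0.0, 2.0]
clip_b = [2.0, 0.0, 4.0]   # same direction as clip_a
clip_c = [0.0, 3.0, 0.0]   # orthogonal to clip_a

print(cosine_similarity(clip_a, clip_b))
print(cosine_similarity(clip_a, clip_c))
```

Clips whose features point in the same direction score close to 1, and unrelated clips score close to 0, regardless of the raw feature magnitudes.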


ContentAI
---------

Deployment
~~~~~~~~~~

Deployment is easy and follows standard ContentAI steps.

.. code:: shell

   contentai deploy dsai_activity_classifier
   Deploying...
   writing workflow.dot
   done

Alternatively, you can pass an image name to avoid rebuilding a Docker
image.

.. code:: shell

   docker build -t dsai_activity_classifier .
   contentai deploy metadata-flatten dsai_activity_classifier

Locally Downloading Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can locally download data from a specific job for this extractor to
directly analyze.

.. code:: shell

   contentai data wHaT3ver1t1s --dir data

Run as an Extractor
~~~~~~~~~~~~~~~~~~~

.. code:: shell

   contentai run https://bucket/video.mp4  -w 'digraph { dsai_videocnn -> dsai_activity_classifier; dsai_vggish -> dsai_activity_classifier }'

   JOB ID:     1Tfb1vPPqTQ0lVD1JDPUilB8QNr
   CONTENT:    s3://bucket/video.mp4
   STATE:      complete
   START:      Fri Feb 15 04:38:05 PM (6 minutes ago)
   UPDATED:    1 minute ago
   END:        Fri Feb 15 04:43:04 PM (1 minute ago)
   DURATION:   4 minutes 

   EXTRACTORS

   my_extractor

   TASK      STATE      START           DURATION
   724a493   complete   5 minutes ago   1 minute 

Or run it via the docker image.  Please review the ``run_local.sh`` file for more information.
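
The ``-w`` workflow string used above is a DOT digraph with one edge from each upstream feature extractor to this extractor. When the set of upstream extractors varies, the string can be assembled rather than typed by hand. The helper ``build_workflow`` below is a hypothetical convenience and is not part of the ContentAI CLI.

```python
from typing import Iterable


def build_workflow(upstream: Iterable[str], target: str) -> str:
    """Return a DOT digraph with one edge from each upstream extractor to target."""
    edges = "; ".join(f"{name} -> {target}" for name in upstream)
    return f"digraph {{ {edges} }}"


workflow = build_workflow(["dsai_videocnn", "dsai_vggish"], "dsai_activity_classifier")
print(workflow)
```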


View Extractor Logs (stdout)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: shell

   contentai logs -f <my_extractor>
   my_extractor Fri Nov 15 04:39:22 PM writing some data
   Job complete in 4m58.265737799s


Creating Models
===============

There are two steps to adding new models. 

1. Train the models and save them in a well-known structure (this can be
   done exhaustively across a number of model types).
2. Update the manifest to match that structure. Each entry indicates how the activity
   classifier should load the model (e.g. the ``framework``), the required features, and
   a few descriptive fields (e.g. the ``name`` and the ``id``).

Exhaustive Training
-------------------

To ease standardized training across different models, several scripting options have
been created and are described below.  Coupled with the manifest file described above,
one can choose the best individual model (via cross-validation analysis) for every 
label or dataset configuration and still combine them all into a unified output with this
package.


Binary Models
~~~~~~~~~~~~~

Also referred to as one-vs-all models, binary models offer only two outputs.

*(label format to be described here)*

Models may be trained, tested, and saved as follows, using ``modeling.py``.

.. code:: shell

    python modeling.py  -b labelset_1 -l completions/ -v content/video_feature.hdf5 -a content/audio_feature.hdf5 \
                        -p Logistic-performance.csv -m models/Logistic

The preceding command uses a LogisticRegression estimator with its default settings.  Customized settings may be added directly on the command line.  The following is a 'tuned' Logistic estimator which has better overall performance.

.. code:: shell

    python modeling.py  -b labelset_1 -l completions/ -v content/video_feature.hdf5 -a content/audio_feature.hdf5 \
                        -p Logistic_tuned-performance.csv -m models/LogisticTuned \
                        --est_params '{"C":4.0,"random_state":0,"max_iter":500,"class_weight":"balanced","solver":"lbfgs"}'

In fact the estimator itself may be specified on the command line as well, and if it already exists (such as those in sklearn) no additional coding is necessary.  For example the following uses a Multi-Layer Perceptron estimator.

.. code:: shell

    python modeling.py  -b labelset_1 -l completions/ -v content/video_feature.hdf5 -a content/audio_feature.hdf5 \
                        -p MLP-performance.csv -m models/MLP  --estimator sklearn.neural_network.MLPClassifier --est_params '{"max_iter":500}'
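
How ``modeling.py`` resolves these arguments is not shown here. One common pattern for turning a dotted path and a JSON string into an object is ``importlib`` plus ``json.loads``, sketched below. The helper ``build_from_path`` is hypothetical, and a standard-library class stands in for an estimator so the example runs without scikit-learn.

```python
import importlib
import json


def build_from_path(dotted_path: str, params_json: str = "{}"):
    """Import `package.module.ClassName` and instantiate it with JSON keyword arguments."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return cls(**json.loads(params_json))


# Stand-in for e.g. sklearn.neural_network.MLPClassifier with '{"max_iter":500}'.
obj = build_from_path("datetime.timedelta", '{"seconds": 90}')
print(type(obj).__name__, obj.total_seconds())
```

With this pattern, any importable class that accepts keyword arguments can be selected from the command line without further code changes, which matches the behavior described above for existing sklearn estimators.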

Custom estimators may need to specify the 'framework' so that they can be saved and loaded properly.  This example is a keras-based estimator with an sklearn-like wrapper.

.. code:: shell

    python modeling.py  -b labelset_1 -l labels/video/completions/ -v content/video_feature.hdf5 -a content/audio_feature.hdf5 \
                        -p NL-performance.csv -m models/NL --framework wrapped --estimator nine_layers.NineLayerEstimator

The script *collate.py* collects the statistics from all the runs and formats them for easier (web) viewing.

.. code:: shell

    python collate.py models




Adding Model Assets
-------------------

Adding models to the pre-determined set of models is as easy as editing a manifest file and 
adding a model into git LFS.  

1. Archive the new model into a serialized fileset.  At time of writing, this was serializing 
   models from `sklearn <https://scikit-learn.org>`__ with simple 
   `pickle load/save serialization <https://scikit-learn.org/stable/modules/model_persistence.html>`__. 
2. Gather all of the relevant output files and compress them if you can.  Currently, the library 
   understands gzip compression extensions (e.g. ".gz").
3. Choose the appropriate sub-directory that corresponds to the upstream feature extractor.  For 
   example, models built on ``3dcnn`` features process new videos by connecting (via `extractor chaining <https://www.contentai.io/docs/extractor-chaining>`__)  
   to the extractor ``dsai_3dcnn``.  If a matching directory doesn't exist yet, please create one, but
   remember what combination of audio and video features is required.
4. Modify the manifest file in ``activity_classifier/data/models/manifest.json`` for your new entry.
   Specifically, the input video and audio features must be defined as well as the serialization
   library.  Below is an example block that indicates ``3dcnn`` video and ``vggish`` audio features for 
   a model created with ``sklearn`` where prediction results will be nested with the name ``Running``.

    .. code:: shell

        [ ...
        {
            "path": "3dcnn-vggish/lr-Running.pkl.gz",
            "name": "Running",
            "id": "ugc",
            "framework": "sklearn",
            "video": "dsai_videocnn",
            "audio": "dsai_vggish"
        },
        ... ]

5. Prepare to add your model files to the repo.  **NOTE:** This repo uses `git-lfs <https://git-lfs.github.com/>`__
   to store all binary files like models.  **If your model is added with regular git tools alone, you will 
   get a sternly worded email (and friendly advice on how to re-add correctly).**

    .. code:: shell

        (from the base directory only)
        git lfs track activity_classifier/data/models/3dcnn/moonwalk_model.pkl.gz
        git add activity_classifier/data/models/3dcnn/moonwalk_model.pkl.gz
        git add activity_classifier/data/models/manifest.json

6. Test your model with the data in the ``testing`` directory.  The CI/CD process should do this too,
   but it's always easier to find and fix problems here than with a vague email.  The features in this
   directory came from processing of the `HBO Max Launch Video <https://www.youtube.com/watch?v=9yLNhhHs3-k>`__,
   which is publicly available as a reference.

    .. code:: shell

        (from the base directory)

        ./run_local.sh 0 --path_content testing/data/test.mp4 --time_interval 1.5

        (check for predictions from your new model in data.json) 
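
The serialization and manifest steps above can be sketched end to end with the standard library. This is an illustration only: ``save_model``, ``load_model``, and ``missing_keys`` are hypothetical helpers, the set of required fields is assumed from the example entry in step 4, and a plain dict stands in for a trained sklearn estimator.

```python
import gzip
import json
import pickle
import tempfile
from pathlib import Path

# Assumption: the fields shown in the example manifest entry are all required.
REQUIRED_KEYS = {"path", "name", "id", "framework", "video", "audio"}


def save_model(model, path: Path) -> None:
    """Pickle a model into a gzip-compressed file (".pkl.gz")."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path: Path):
    """Load a model written by save_model. Only unpickle files you trust."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


def missing_keys(entry: dict) -> set:
    """Return the assumed-required manifest fields that are absent from an entry."""
    return REQUIRED_KEYS - set(entry)


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "3dcnn-vggish"
    model_dir.mkdir()
    # A plain dict stands in for a trained sklearn estimator.
    stand_in_model = {"coef": [0.25, -0.5], "intercept": 0.1}
    save_model(stand_in_model, model_dir / "lr-Running.pkl.gz")

    entry = {
        "path": "3dcnn-vggish/lr-Running.pkl.gz",
        "name": "Running",
        "id": "ugc",
        "framework": "sklearn",
        "video": "dsai_videocnn",
        "audio": "dsai_vggish",
    }
    manifest_path = Path(tmp) / "manifest.json"
    manifest_path.write_text(json.dumps([entry], indent=4))

    # Re-read the manifest and load the model relative to the manifest location.
    manifest = json.loads(manifest_path.read_text())
    problems = [missing_keys(item) for item in manifest]
    restored = load_model(manifest_path.parent / manifest[0]["path"])

print(problems, restored == stand_in_model)
```

Running a check like this before committing catches a misspelled field or a model path that does not resolve relative to the manifest.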



Testing
=======

Testing is included via tox.  To launch testing for the entire package, just run ``tox`` at the command line. 
Testing can also be run for a specific file within the package by setting the environment variable ``TOX_ARGS``.

.. code:: shell

   TOX_ARGS=test_basic.py tox 



Future Development
==================

-  additional training hooks?




Changes
=======


1.3
---

1.3.5
~~~~~
- contentai key request fix

1.3.3
~~~~~
- docs update
- multiclass write

1.3.2
~~~~~
- docker build update, run example update

1.3.1
~~~~~
- docs fix for example of using package
- bug fix for default location, change inputs to classify function

1.3.0
~~~~~
- move models out of the primary package
- *breaking change*, rename input param ``path_models`` to ``path_manifest``

1.2
---

1.2.2
~~~~~
- bump version for model migration to LFS

1.2.1
~~~~~
- fix docker/deployed image run command

1.2.0
~~~~~
- switch to package representation, push to pypi
- several updates for MANIFEST definition (id)
- inclusion of multi-parameter training and testing framework
- safety for model loading, catch exceptions, return gracefully
- update documents to split for binary models 

1.1
---

1.1.1
~~~~~
- cosmetic change for reuse in other libraries

1.1.0
~~~~~

- refactor feature code, add utility for difference computation among segments
- min value thresholding to avoid low scoring results in output (default=0.1)
- refactor caching information for feature load (allow flatten, remove cache, allow multi-asset)
- allow recursive feature load for distance compute


1.0
---

1.0.2
~~~~~

- fixes for output, modify to require other extractors as dependencies
- fix order of parameters for local runs


1.0.1
~~~~~

- updates for integration of other models, fixes for prediction output
- add l2norm after average/merge in time of source features

1.0.0
~~~~~

- initial project merge from other sources
- generates json prediction dict
- callable as package
- includes some testing routines with windowing comparison



