Metadata-Version: 1.1
Name: sprm
Version: 0.1.6
Summary: Sparse Partial Robust M Regression, including plot functions
Home-page: https://github.com/SvenSerneels/sprm
Author: Sven Serneels
Author-email: svenserneels@gmail.com
License: UNKNOWN
Description: .. role:: raw-html-m2r(raw)
           :format: html
        
        
        sprm
        ====
        
        Sparse partial robust M regression
        ==================================
        
        Python code for Sparse Partial Robust M regression (SPRM) [1], a sparse and robust version of univariate partial least squares (PLS1). 
        
        Description
        -----------
        
        The method performs four tasks at the same time in a single, consistent estimate: 
        
        
        * *regression*\ : yields regression coefficients and predicts responses
        * *dimension reduction*\ : calculates interpretable PLS-like components that maximize covariance with the predictand in a robust way 
        * *variable selection*\ : depending on the parameter settings, can yield highly sparse regression coefficients that contain exact zero elements 
        * *outlier detection and compensation*\ : yields a set of case weights in [0,1]. The lower the weight, the more outlying a case is. The estimate itself is outlier robust. 
        
        The code is aligned with scikit-learn, such that modules such as GridSearchCV can be applied to it seamlessly. 
        
        The repository contains
        
        
        * The estimator (sprm.py) 
        * Plotting functionality based on Matplotlib (as well in sprm.py)
        * Robust data pre-processing (robcent.py) 
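
        As a hypothetical illustration of what the robust median/MAD pre-processing amounts to (a sketch only; the actual robcent implementation may differ, e.g. by applying a consistency constant such as 1.4826 to the MAD):

        ```python
        import numpy as np

        def robust_standardize(x):
            """Sketch of median/MAD standardization: centre by the median,
            scale by the raw median absolute deviation (no consistency
            constant applied here; robcent may handle this differently)."""
            x = np.asarray(x, dtype=float)
            loc = np.median(x)                   # robust location estimate
            sca = np.median(np.abs(x - loc))     # robust scale estimate (raw MAD)
            return (x - loc) / sca, loc, sca

        # a single outlier barely affects the robust location and scale:
        z, loc, sca = robust_standardize([1, 2, 3, 4, 100])
        ```

        Unlike mean/std standardization, the location and scale estimates above are essentially unaffected by the outlying value.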
        
        How to install
        --------------
        
        The package is distributed through PyPI, so it can be installed with: 
        
        .. code-block::
        
               pip install sprm 
        
        
        The SPRM estimator
        ==================
        
        The main SPRM implementation yields a class with the following structure:
        
        Dependencies
        ------------
        
        
        * From <sklearn.base>: BaseEstimator,TransformerMixin,RegressorMixin
        * From <sklearn.utils>: _BaseComposition
        * copy
        * From <scipy.stats>: norm,chi2
        * numpy 
        * From matplotlib: pyplot 
        
        Parameters
        ----------
        
        
        * eta: float. Sparsity parameter in [0,1)
        * n_components: int > 1. Note that when applied to data, n_components must take a value <= min(x_data.shape)
        * fun: str, downweighting function. 'Hampel' (recommended), 'Fair' or 'Huber'
        * probp1: float, probability cutoff for start of downweighting (e.g. 0.95)
        * probp2: float, probability cutoff for start of steep downweighting (e.g. 0.975, only relevant if fun='Hampel')
        * probp3: float, probability cutoff for start of outlier omission (e.g. 0.999, only relevant if fun='Hampel')
        * centring: str, type of centring ('mean' or 'median', the latter recommended)
        * scaling: str, type of scaling ('std','mad', the latter recommended, or 'None')
        * verbose: boolean, specifying verbose mode
        * maxit: int, maximal number of iterations in M algorithm
        * tol: float, tolerance for convergence in M algorithm 
        * start_cutoff_mode: str, value 'specific' will set starting value cutoffs specific to X and y (preferred); any other value will set X and y starting cutoffs identically. The non-specific setting yields results identical to the SPRM R implementation available from `CRAN <https://cran.r-project.org/web/packages/sprm/index.html>`_.
        * start_X_init: str, value 'pcapp' will include a PCA/broken stick projection to calculate the initial predictor block case weights; any other value will calculate initial predictor block case weights based on Euclidean distances within that block. The latter is less stable for very flat data (p >> n). 
        * columns (def False): either boolean or list. If False, no column names are supplied. If a list (only accepted if its length equals x_data.shape[1]), the column names of x_data given in this list will be printed in verbose mode
        * copy (def True): boolean, whether to create deep copy of the data in the calculation process 
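
        The 'pcapp' setting for start_X_init relies on the broken stick rule to pick the number of relevant principal components (see also the brokenstick ancillary function). A minimal sketch of that rule, assuming the common formulation (the packaged implementation may differ in detail):

        ```python
        import numpy as np

        def broken_stick(eigenvalues):
            """Broken stick rule: retain leading components whose share of
            variance exceeds the corresponding broken-stick threshold
            b_k = (1/p) * sum_{i=k}^{p} 1/i. Sketch only; the packaged
            brokenstick function may differ in detail."""
            ev = np.asarray(eigenvalues, dtype=float)
            p = len(ev)
            props = ev / ev.sum()                              # variance proportions
            thresholds = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p
                                   for k in range(1, p + 1)])
            below = np.where(props < thresholds)[0]            # first failing component
            return int(below[0]) if below.size else p

        # with a dominant first eigenvalue, only one component is retained:
        k = broken_stick([4.0, 1.0, 0.5, 0.25])
        ```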
        
        Attributes
        ----------
        
        
        * x_weights_: X block PLS weighting vectors (usually denoted W)
        * x_loadings_: X block PLS loading vectors (usually denoted P)
        * C_: vector of inner relationships between the response and the latent variables
        * x_scores_: X block PLS score vectors (usually denoted T)
        * coef_: vector of regression coefficients 
        * intercept_: intercept
        * coef_scaled_: vector of scaled regression coefficients (when the scaling option is used)
        * intercept_scaled_: scaled intercept
        * residuals_: vector of regression residuals
        * x_ev_: X block explained variance per component
        * y_ev_: y block explained variance 
        * fitted_: fitted response
        * x_Rweights_: X block SIMPLS style weighting vectors (usually denoted R)
        * x_caseweights_: X block case weights
        * y_caseweights_: y block case weights
        * caseweights_: combined case weights
        * colret_: names of variables retained in the sparse model
        * x_loc_: X block location estimate 
        * y_loc_: y location estimate
        * x_sca_: X block scale estimate
        * y_sca_: y scale estimate
        
        Methods
        -------
        
        
        * fit(X,y): fit model 
        * predict(X): make predictions based on fit 
        * transform(X): project X onto latent space 
        * weightnewx(X): calculate X case weights
        * get_params(): get a list of the object's parameters
        * set_params(**kwargs): set individual parameters of the sprm object 
        * valscore(X,y,scoring): option to use a weighted scoring function in cross-validation if scoring='weighted' 
        
        Ancillary functions 
        --------------------
        
        
        * snipls (class): sparse NIPALS regression (first described in: [2]) 
        * Hampel: Hampel weight function 
        * Huber: Huber weight function 
        * Fair: Fair weight function 
        * brokenstick: broken stick rule to estimate number of relevant principal components  
        * robcent (class): robust centring and scaling 
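
        For intuition, a minimal sketch of a Hampel-type redescending weight function: full weight up to a cutoff a, gentle downweighting between a and b, steep downweighting between b and c, and zero weight beyond c. In the estimator the cutoffs are derived from the probp1-probp3 probabilities; the exact form used in sprm.py may differ:

        ```python
        import numpy as np

        def hampel_weights(d, a, b, c):
            """Hampel-type case weights for distances d.
            a, b, c: cutoffs for the start of downweighting, steep
            downweighting, and omission, respectively (a < b < c).
            Sketch only; the exact form in sprm.py may differ."""
            d = np.abs(np.asarray(d, dtype=float))
            w = np.ones_like(d)                       # full weight inside a
            mild = (d > a) & (d <= b)
            w[mild] = a / d[mild]                     # gentle downweighting
            steep = (d > b) & (d <= c)
            w[steep] = a * (c - d[steep]) / ((c - b) * d[steep])  # steep descent
            w[d > c] = 0.0                            # omit harsh outliers
            return w

        # weights shrink smoothly from 1 toward 0 as cases become more outlying:
        w = hampel_weights([0.5, 2.0, 4.0, 10.0], 1.0, 3.0, 6.0)
        ```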
        
        Example
        -------
        
        To run a toy example: 
        
        
        * 
          Source packages and data: 
        
          .. code-block::
        
               import pandas as pd
               data = pd.read_csv("./Returns_shares.csv")
               columns = data.columns[2:8]
               data = data.values[:,2:8]
               X = data[:,0:5]
               y = data[:,5]
               X0 = X.astype('float')
               y0 = y.astype('float')
        
        * 
          Estimate and predict by SPRM
        
          .. code-block::
        
               from sprm import sprm
               # positional arguments: n_components, eta, fun, probp1, probp2, probp3,
               # centring, scaling, verbose, maxit, tol, start_cutoff_mode, start_X_init, columns, copy
               res_sprm = sprm.sprm(2,.8,'Hampel',.95,.975,.999,'median','mad',True,100,.01,'ally','xonly',columns,True)
               res_sprm.fit(X0[:2666],y0[:2666])
               res_sprm.predict(X0[2666:])
               res_sprm.transform(X0[2666:])
               res_sprm.weightnewx(X0[2666:])
               res_sprm.get_params()
               res_sprm.set_params(fun="Huber")
        
        * 
          Cross-validated using GridSearchCV: 
        
          .. code-block::
        
               import numpy as np
               from sklearn.model_selection import GridSearchCV 
               res_sprm_cv = GridSearchCV(sprm.sprm(), cv=10, param_grid={"n_components": [1, 2, 3], 
                                          "eta": np.arange(.1,.9,.05).tolist()})  
               res_sprm_cv.fit(X0[:2666],y0[:2666])  
               res_sprm_cv.best_params_
        
        Plotting functionality
        ======================
        
        The file sprm.py also contains a set of plot functions based on Matplotlib. The class plot contains plots for sprm objects, whereas the class plot_cv contains a plot for cross-validation. 
        
        Dependencies
        ------------
        
        
        * pandas
        * numpy
        * matplotlib.pyplot
        * for plotting cross-validation results: sklearn.model_selection.GridSearchCV
        
        Parameters
        ----------
        
        
        * res_sprm, an sprm class object that has been fit.  
        * colors, list of str entries. Only mandatory input. Elements determine colors as: 
        
          * [0]: borders of pane 
          * [1]: plot background
          * [2]: marker fill
          * [3]: diagonal line 
          * [4]: marker contour, if different from fill
          * [5]: marker color for new cases, if applicable
          * [6]: marker color for harsh calibration outliers
          * [7]: marker color for harsh prediction outliers
        
        * markers, a list of str entries. Elements determine markers for: 
        
          * [0]: regular cases 
          * [1]: moderate outliers 
          * [2]: harsh outliers 
        
        Methods
        -------
        
        
        * plot_coeffs(entity="coef_",truncation=0,columns=[],title=[]): Plot regression coefficients, loadings, etc., with the option to plot only the x% smallest and largest coefficients (truncation) 
        * plot_yyp(ytruev=[],Xn=[],label=[],names=[],namesv=[],title=[],legend_pos='lower right',onlyval=False): Plot y vs y predicted. 
        * plot_projections(Xn=[],label=[],components = [0,1],names=[],namesv=[],title=[],legend_pos='lower right',onlyval=False): Plot score space. 
        * plot_caseweights(Xn=[],label=[],names=[],namesv=[],title=[],legend_pos='lower right',onlyval=False,mode='overall'): Plot caseweights, with the option to plot 'x', 'y' or 'overall' case weights for cases used to train the model. For new cases, only 'x' weights can be plotted. 
        
        Remark
        ------
        
        The latter three methods work both for cases the model has been trained with (no additional input) and for new cases (requires Xn and, in the case of plot_yyp, ytruev), with the option to plot only the latter (option onlyval=True). All three functions can plot case names if these are supplied as a list.       
        
        Ancillary classes
        -----------------
        
        
        * plot_cv has method eta_ncomp_contour(title) to plot sklearn GridSearchCV results 
        * ABline2D plots the first diagonal in y vs y predicted plots. 
        
        Example (continued) 
        --------------------
        
        
        * 
          initialize some values: 
        
          .. code-block::
        
               import numpy as np
               colors = ["white","#BBBBDD","#0000DD",'#1B75BC','#4D4D4F','orange','red','black']
               markers = ['o','d','v']
               label = ["AIG"]
               names = [str(i) for i in range(1,len(res_sprm.y)+1)]
               namesv = [str(i) for i in range(1,len(y0[2667:])+1)]
        
        * 
          run sprm.plot: 
        
          .. code-block::
        
               res_sprm_plot = sprm.plot(res_sprm,colors)
        
        * 
          plot coefficients: 
        
          .. code-block::
        
               res_sprm_plot.plot_coeffs(title="All AIG SPRM scaled b")
               res_sprm_plot.plot_coeffs(truncation=.05,columns=columns,title="5% smallest and largest AIG sprm b")
        
        
        
          .. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_b.png
             :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_b.png
             :alt: AIG sprm regression coefficients
        
        
        * 
          plot y vs y predicted, training cases only: 
        
          .. code-block::
        
               res_sprm_plot.plot_yyp(label=label,title="AIG SPRM y vs. y predicted")
               res_sprm_plot.plot_yyp(label=label,names=names,title="AIG SPRM y vs. y predicted")
        
        
        
          .. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_yyp_train.png
             :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_yyp_train.png
             :alt: AIG sprm y vs y predicted, training set
        
        
        * 
          plot y vs y predicted, including test cases
        
          .. code-block::
        
               res_sprm_plot.plot_yyp(ytruev=y0[2667:],Xn=X0[2667:],label=label,names=names,namesv=namesv,title="AIG SPRM y vs. y predicted")
               res_sprm_plot.plot_yyp(ytruev=y0[2667:],Xn=X0[2667:],label=label,title="AIG SPRM y vs. y predicted")
        
        
           
          .. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_yyp_train_test.png
             :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_yyp_train_test.png
             :alt: AIG sprm y vs y predicted, training and test set
        
        
        * 
          plot y vs y predicted, only test set cases: 
        
          .. code-block::
        
               res_sprm_plot.plot_yyp(ytruev=y0[2667:],Xn=X0[2667:],label=label,title="AIG SPRM y vs. y predicted",onlyval=True)
        
        * 
          plot score space, options as above, with the second one shown here: 
        
          .. code-block::
        
               res_sprm_plot.plot_projections(Xn=X0[2667:],label=label,names=names,namesv=namesv,title="AIG SPRM score space, components 1 and 2")
               res_sprm_plot.plot_projections(Xn=X0[2667:],label=label,title="AIG SPRM score space, components 1 and 2")
               res_sprm_plot.plot_projections(Xn=X0[2667:],label=label,namesv=namesv,title="AIG SPRM score space, components 1 and 2",onlyval=True)
        
           
        .. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_T12.png
           :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_T12.png
           :alt: AIG sprm score space
        
        
        
        * 
          plot caseweights, options as above, with the second one shown here:
        
          .. code-block::
        
               res_sprm_plot.plot_caseweights(Xn=X0[2667:],label=label,names=names,namesv=namesv,title="AIG SPRM caseweights")
               res_sprm_plot.plot_caseweights(Xn=X0[2667:],label=label,title="AIG SPRM caseweights")
               res_sprm_plot.plot_caseweights(Xn=X0[2667:],label=label,namesv=namesv,title="AIG SPRM caseweights",onlyval=True)  
        
        
           
          .. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_caseweights.png
             :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_caseweights.png
             :alt: AIG sprm caseweights
        
        
        
        * 
          plot cross-validation results: 
        
          .. code-block::
        
               res_sprm_plot_cv = sprm.plot_cv(res_sprm_cv,colors)
               res_sprm_plot_cv.eta_ncomp_contour()
               res_sprm_plot_cv.cv_score_table_
        
        
        
          .. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_CV.png
             :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_CV.png
             :alt: AIG sprm CV results
        
        
        References
        ----------
        
        
        #. `Sparse partial robust M regression <https://www.sciencedirect.com/science/article/abs/pii/S0169743915002440>`_\ , Irene Hoffmann, Sven Serneels, Peter Filzmoser, Christophe Croux, Chemometrics and Intelligent Laboratory Systems, 149 (2015), 50-59.
        #. `Sparse and robust PLS for binary classification <https://onlinelibrary.wiley.com/doi/abs/10.1002/cem.2775>`_\ , I. Hoffmann, P. Filzmoser, S. Serneels, K. Varmuza, Journal of Chemometrics, 30 (2016), 153-162.
        
        Work to do
        ----------
        
        
        * while the code is aligned with sklearn, it does not yet fully follow its naming conventions
        * optimize for speed 
        * manipulations in robcent can be written more elegantly
        * suggestions always welcome
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
