Stanford statistical learning software
This is a collection of R packages written by current and former members
of the labs of Trevor Hastie, Jon Taylor and Rob Tibshirani. All of these packages
are actively supported by their authors.
Lasso, elastic net and regularized modelling
glmnet :
Our most popular package, actively updated and maintained.
Extremely efficient procedures for fitting the entire lasso or elastic-net regularization
path for linear regression, logistic and multinomial regression models, Poisson regression
and the Cox model. Two recent additions are the multiresponse
Gaussian and the grouped multinomial. The algorithm uses
cyclical coordinate descent in a pathwise fashion, as described in the paper listed below.
Maintained by
Trevor Hastie
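A minimal sketch of a typical glmnet workflow on simulated data (the data and tuning choices here are illustrative, not from the package documentation):

    library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- rnorm(100)
    fit <- glmnet(x, y)               # entire lasso path for linear regression
    plot(fit)                         # coefficient profiles along the path
    cvfit <- cv.glmnet(x, y)          # cross-validation over lambda
    coef(cvfit, s = "lambda.min")     # coefficients at the CV-selected lambda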
lars :
Least angle regression. Efficient procedures for fitting an entire lasso sequence
at the cost of a single least squares fit.
Stepwise regression and infinitesimal forward stagewise regression are options as well.
Less efficient than glmnet, but returns the entire continuous path of solutions,
including the knots. The latter is important for inference; see the covTest
package below.
Maintained by
Trevor Hastie
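For illustration, a small sketch of the lars interface on simulated data; the type argument selects among the lasso, LAR, forward stagewise, and stepwise options mentioned above:

    library(lars)
    set.seed(1)
    x <- matrix(rnorm(100 * 10), 100, 10)
    y <- x[, 1] + rnorm(100)
    fit <- lars(x, y, type = "lasso")   # full lasso path, knots included
    plot(fit)                           # piecewise-linear coefficient paths
    coef(fit)                           # coefficients at each knot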
glmpath :
A path-following algorithm for L1 regularized generalized
linear models and Cox proportional hazards model.
Like lars, it is less efficient than glmnet, but returns the entire continuous path of solutions,
including the knots.
Maintained by
Mee Young Park
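An illustrative L1-path fit for logistic regression with glmpath; the family argument follows glm conventions, and the simulated data are placeholders, so check the call against the package help:

    library(glmpath)
    set.seed(1)
    x <- matrix(rnorm(100 * 10), 100, 10)
    y <- rbinom(100, 1, 0.5)
    fit <- glmpath(x, y, family = binomial)   # L1 path for logistic regression
    plot(fit)                                 # coefficient paths with knots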
sparsenet :
Fits sparse linear regression models via nonconvex optimization.
sparsenet uses the MC+ penalty of Zhang. It computes the
regularization surface over both the family parameter and the
tuning parameter by coordinate descent.
Maintained by
Trevor Hastie
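A hedged sketch of a sparsenet call on simulated data (default settings assumed; the data are illustrative):

    library(sparsenet)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)
    fit <- sparsenet(x, y)   # surface over (gamma, lambda) via coordinate descent
    plot(fit)                # one coefficient-path plot per value of gamma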
SGL :
Group lasso and sparse group lasso.
Maintained by
Noah Simon
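A sketch of a sparse group lasso fit, assuming SGL's list-based data interface (x, y, and a group index vector); treat the argument names as assumptions to verify against the package documentation:

    library(SGL)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- rnorm(100)
    index <- rep(1:4, each = 5)    # group membership of the 20 features
    fit <- SGL(list(x = x, y = y), index, type = "linear")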
covTest :
Computes the covariance test for significance testing in adaptive linear modelling. Can be used with
LARS (lasso) for linear models, and with the elastic net, binomial, and Cox survival models. This package should
be considered EXPERIMENTAL: the background paper (Lockhart et al. 2013) is not yet published, and rigorous theory
does not yet exist for the logistic and Cox models.
Maintained by
Rob Tibshirani
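An illustrative call, assuming the documented pattern of passing a fitted lars object back to covTest along with the data (simulated here):

    library(lars)
    library(covTest)
    set.seed(1)
    x <- matrix(rnorm(100 * 10), 100, 10)
    y <- x[, 1] + rnorm(100)
    fit <- lars(x, y)
    covTest(fit, x, y)    # covariance test p-values at each knot of the path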
Fused lasso, trend filtering, generalized lasso
flsa :
This package implements a path algorithm for the Fused
Lasso Signal Approximator. It includes functions for 1D data (signals) and 2D data (images).
Maintained by
Holger Hoefling
genlasso :
Path algorithms for generalized lasso problems, including trend filtering (in 1D and 2D)
and the fused lasso. Maintained by
Ryan Tibshirani
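A brief sketch using genlasso's convenience functions for 1D problems (the data are simulated for illustration):

    library(genlasso)
    set.seed(1)
    y <- c(rep(0, 40), rep(3, 30), rep(1, 30)) + rnorm(100)
    f1 <- fusedlasso1d(y)           # 1D fused lasso path
    f2 <- trendfilter(y, ord = 1)   # linear trend filtering path
    plot(f1)                        # fitted piecewise-constant solutions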
Interactions
hierNet :
A Lasso for Hierarchical Interactions.
Fits sparse interaction models for continuous and binary
responses subject to the strong (or weak) hierarchy restriction
that an interaction between two variables only be included if
both (or at least one of) the variables is included as a main
effect. For more details, see Bien, J., Taylor, J., and Tibshirani, R. (2012), "A Lasso for Hierarchical Interactions", Annals of Statistics.
Maintained by
Jacob Bien
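A sketch of fitting a path of hierarchical interaction models, assuming the package's hierNet.path and hierNet.cv functions; the simulated data are illustrative:

    library(hierNet)
    set.seed(1)
    x <- matrix(rnorm(100 * 5), 100, 5)
    y <- x[, 1] + 2 * x[, 2] + 3 * x[, 1] * x[, 2] + rnorm(100)
    path <- hierNet.path(x, y)      # strong-hierarchy interaction models
    cv <- hierNet.cv(path, x, y)    # cross-validate over the lambda path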
Interact :
This package searches for marginal interactions in a
binary response model. Interact uses permutation methods to
estimate false discovery rates for these marginal interactions
and has some limited visualization capabilities.
Maintained by
Noah Simon
Graphical models
glasso :
Graphical lasso: estimation of the edges in an undirected graphical
model (inverse covariance model) using an L1 penalty.
Maintained by
Rob Tibshirani
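A minimal example: hand glasso a covariance matrix and a penalty, and read off the sparse precision matrix (data simulated for illustration):

    library(glasso)
    set.seed(1)
    x <- matrix(rnorm(50 * 5), 50, 5)
    fit <- glasso(var(x), rho = 0.1)   # L1-penalized inverse covariance
    fit$wi                             # estimated precision matrix; zeros mean absent edges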
Sparse SVD, principal components, canonical correlation analysis
PMA :
Penalized Multivariate Analysis: a penalized
matrix decomposition, sparse principal components analysis, and
sparse canonical correlation analysis.
Maintained by
Daniela Witten
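A hedged sketch of sparse PCA with PMA's SPC function; sumabsv bounds the L1 norm of each loading vector, and the values here are illustrative:

    library(PMA)
    set.seed(1)
    x <- matrix(rnorm(50 * 30), 50, 30)
    out <- SPC(x, sumabsv = 3, K = 2)   # two sparse principal components
    out$v                               # sparse loading vectors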
Clustering
sparcl :
Implements the sparse clustering methods of Witten and Tibshirani (2010): "A framework for feature selection in clustering"; published in Journal of the American Statistical Association 105(490): 713-726.
Maintained by
Daniela Witten
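A sketch of sparse k-means clustering with sparcl, using the package's permutation approach to pick the tuning parameter (simulated data; argument names assumed from the package help):

    library(sparcl)
    set.seed(1)
    x <- matrix(rnorm(60 * 50), 60, 50)
    x[1:30, 1:10] <- x[1:30, 1:10] + 2              # signal in 10 features
    perm <- KMeansSparseCluster.permute(x, K = 2)   # choose the L1 bound
    fit <- KMeansSparseCluster(x, K = 2, wbounds = perm$bestw)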
protoclust :
Performs minimax linkage hierarchical clustering. Every cluster has an associated prototype element that represents that cluster as described in Bien, J., and Tibshirani, R. (2011), "Hierarchical Clustering with Prototypes via Minimax Linkage," The Journal of the American Statistical Association.
Maintained by
Jacob Bien
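A small example of minimax-linkage clustering; protoclust takes a dist object like hclust does, and protocut returns the clusters together with their prototypes:

    library(protoclust)
    set.seed(1)
    x <- matrix(rnorm(50 * 4), 50, 4)
    hc <- protoclust(dist(x))   # minimax linkage hierarchical clustering
    plot(hc)                    # dendrogram
    protocut(hc, k = 3)         # cut into 3 clusters, with prototype indices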
Support Vector Machines
svmpath :
Path algorithm for Support Vector Machines.
Computes the entire regularization path for the two-class
SVM classifier with essentially the same cost as a single SVM fit.
Maintained by
Trevor Hastie
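For illustration, a call on simulated two-class data; svmpath expects labels coded -1/+1:

    library(svmpath)
    set.seed(1)
    x <- matrix(rnorm(60 * 2), 60, 2)
    y <- rep(c(-1, 1), each = 30)
    fit <- svmpath(x, y)   # entire SVM regularization path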
High-dimensional hypothesis testing and classification, especially for genomics
samr :
Significance analysis of microarrays. This package does significance testing
and estimates FDRs for high-dimensional problems. It can handle a wide variety
of outcome types: two-class and multiclass, quantitative, survival, time course, etc.
This package is the underlying "engine" for the popular SAM Excel add-in.
Maintained by
Rob Tibshirani
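A hedged sketch of a two-class SAM analysis; the list-based data format (x, y, gene identifiers, logged2 flag) follows the package documentation, and the data here are simulated:

    library(samr)
    set.seed(1)
    x <- matrix(rnorm(1000 * 20), 1000, 20)   # genes x samples
    y <- rep(1:2, each = 10)                  # two-class unpaired labels
    d <- list(x = x, y = y,
              geneid = as.character(1:1000),
              genenames = paste0("g", 1:1000),
              logged2 = TRUE)
    fit <- samr(d, resp.type = "Two class unpaired", nperms = 100)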
pamr :
Prediction analysis for microarrays. Some functions for sample classification in microarrays and other high-dimensional classification problems, using
the nearest shrunken centroid method.
Maintained by
Rob Tibshirani
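A minimal nearest-shrunken-centroid example; pamr expects genes in rows and samples in columns (data simulated for illustration):

    library(pamr)
    set.seed(1)
    x <- matrix(rnorm(1000 * 40), 1000, 40)
    y <- factor(rep(1:2, each = 20))
    d <- list(x = x, y = y)
    fit <- pamr.train(d)   # shrunken centroids over a threshold path
    pamr.cv(fit, d)        # cross-validated error along the path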
GSA :
Gene set analysis: an alternative approach to gene set enrichment analysis,
due to Efron and Tibshirani (2007), Annals of Applied Statistics.
Maintained by
Rob Tibshirani
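A hedged sketch of a GSA call on simulated data; genesets is a list of gene-name vectors, and the argument names here are assumptions to check against the package help:

    library(GSA)
    set.seed(1)
    x <- matrix(rnorm(1000 * 20), 1000, 20)
    y <- rep(1:2, each = 10)
    genenames <- paste0("g", 1:1000)
    genesets <- list(set1 = genenames[1:20], set2 = genenames[21:60])
    out <- GSA(x, y, genenames = genenames, genesets = genesets,
               resp.type = "Two class unpaired", nperms = 100)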
Generalized additive models
gam :
Fits generalized additive models. Maintained by
Trevor Hastie
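A quick illustration with smoothing-spline (s) and loess (lo) terms on simulated data:

    library(gam)
    set.seed(1)
    n <- 200
    x1 <- runif(n); x2 <- runif(n)
    y <- sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.3)
    fit <- gam(y ~ s(x1, df = 4) + lo(x2), data = data.frame(y, x1, x2))
    plot(fit, se = TRUE)   # fitted smooth terms with standard-error bands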
Independent components analysis
ProDenICA :
Product Density Independent Components Analysis. Estimates ICA components
using the product density maximum likelihood method due to Hastie and Tibshirani.
Maintained by
Trevor Hastie
Matrix completion
softImpute :
softImpute is a package for matrix completion, i.e., for imputing missing values in matrices.
It uses squared-error loss with nuclear-norm regularization (one can think of it as
the "lasso" for matrix approximation) to find a low-rank approximation to the observed entries in the matrix.
This low-rank approximation is then used to impute the missing entries.
softImpute works in a kind of "EM" fashion. Given a current guess, it fills in the missing entries.
Then it computes a soft-thresholded SVD of this completed matrix, which yields the next guess.
These steps are iterated until convergence to the solution of the convex optimization problem.
The algorithm can work with large matrices, such as the Netflix matrix (400K x 20K), by making heavy use
of sparse-matrix methods in the Matrix package. It creates new S4 classes such as "Incomplete" for storing the large
data matrix, and "SparseplusLowRank" for representing the completed matrix. SVD computations are done using
a specially built block-alternating algorithm, svd.als, that exploits these structures and uses warm starts.
Some of the methods used are described in
Mazumder, R., Hastie, T., and Tibshirani, R. (2010),
"Spectral Regularization Algorithms for Learning Large Incomplete Matrices",
JMLR 11, 2287-2322.
Other newer and more efficient methods that interweave the alternating block algorithm steps with imputation steps will
be described in a forthcoming article.
Maintained by
Trevor Hastie
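A small dense-matrix sketch of the workflow described above; for large problems one would build an "Incomplete" matrix rather than using NAs directly, and the rank and lambda values here are illustrative:

    library(softImpute)
    set.seed(1)
    x <- matrix(rnorm(30 * 20), 30, 20)
    x[sample(length(x), 150)] <- NA                  # punch out missing entries
    fit <- softImpute(x, rank.max = 5, lambda = 1)   # soft-thresholded SVD fit
    xhat <- complete(x, fit)                         # impute the missing entries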
Missing data
impute :
Imputation for microarray data and other high-dimensional datasets. Maintained by
Balasubramanian Narasimhan
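A minimal k-nearest-neighbor imputation example (impute is distributed through Bioconductor; the data are simulated):

    library(impute)
    set.seed(1)
    x <- matrix(rnorm(500 * 20), 500, 20)
    x[sample(length(x), 300)] <- NA
    out <- impute.knn(x)   # kNN imputation of the missing values
    xfull <- out$data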
Other packages that we like and use
gbm :
Gradient boosting machines
e1071 :
Support vector machines