Stanford statistical learning software
This is a collection of R packages written by current and former members
of the labs of Trevor Hastie, Jon Taylor and Rob Tibshirani. All of these packages
are actively supported by their authors.
Lasso, elastic net and regularized modelling
glmnet :
Our most popular package, actively updated and maintained.
Extremely efficient procedures for fitting the entire lasso or elastic-net regularization
path for linear regression, logistic and multinomial regression models, Poisson regression
and the Cox model. Two recent additions are the multiresponse
Gaussian and the grouped multinomial. The algorithm uses
cyclical coordinate descent in a pathwise fashion, as described in the paper listed below.
Maintained by
Trevor Hastie
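A minimal sketch of a typical glmnet workflow on simulated data (the data and tuning choices here are illustrative, not from the package documentation):

    library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- rnorm(100)
    fit <- glmnet(x, y)               # entire lasso path for linear regression
    plot(fit)                         # coefficient profiles along the path
    cvfit <- cv.glmnet(x, y)          # cross-validation over lambda
    coef(cvfit, s = "lambda.min")     # coefficients at the CV-selected lambda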
lars :
Least angle regression. Efficient procedures for fitting an entire lasso sequence
at the cost of a single least squares fit.
Stepwise regression and infinitesimal forward stagewise regression are options as well.
Less efficient than glmnet, but returns the entire continuous path of solutions,
including the knots. The latter is important for inference; see the covTest
package below.
Maintained by
Trevor Hastie
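For illustration, a small sketch of the lars interface on simulated data; the type argument selects among the lasso, LAR, forward stagewise, and stepwise options mentioned above:

    library(lars)
    set.seed(1)
    x <- matrix(rnorm(100 * 10), 100, 10)
    y <- x[, 1] + rnorm(100)
    fit <- lars(x, y, type = "lasso")   # full lasso path, knots included
    plot(fit)                           # piecewise-linear coefficient paths
    coef(fit)                           # coefficients at each knot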
glmpath :
A path-following algorithm for L1 regularized generalized
linear models and Cox proportional hazards model.
Like lars, it is less efficient than glmnet, but returns the entire continuous path of solutions,
including the knots.
Maintained by
Mee Young Park
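An illustrative L1-path fit for logistic regression with glmpath; the family argument follows glm conventions, and the simulated data are placeholders, so check the call against the package help:

    library(glmpath)
    set.seed(1)
    x <- matrix(rnorm(100 * 10), 100, 10)
    y <- rbinom(100, 1, 0.5)
    fit <- glmpath(x, y, family = binomial)   # L1 path for logistic regression
    plot(fit)                                 # coefficient paths with knots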
sparsenet :
Fits sparse linear regression models via nonconvex optimization.
sparsenet uses the MC+ penalty of Zhang. It computes the
regularization surface over both the family parameter and the
tuning parameter by coordinate descent.
Maintained by
Trevor Hastie
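A hedged sketch of a sparsenet call on simulated data (default settings assumed; the data are illustrative):

    library(sparsenet)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)
    fit <- sparsenet(x, y)   # surface over (gamma, lambda) via coordinate descent
    plot(fit)                # one coefficient-path plot per value of gamma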
SGL :
Group lasso and sparse group lasso.
Maintained by
Noah Simon
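A sketch of a sparse group lasso fit, assuming SGL's list-based data interface (x, y, and a group index vector); treat the argument names as assumptions to verify against the package documentation:

    library(SGL)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- rnorm(100)
    index <- rep(1:4, each = 5)    # group membership of the 20 features
    fit <- SGL(list(x = x, y = y), index, type = "linear")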
covTest :
Computes the covariance test for significance testing in adaptive linear modelling. Can be used with
LARS (lasso) for linear models, and with the elastic net, binomial, and Cox survival models. This package should
be considered EXPERIMENTAL: the background paper (Lockhart et al. 2013) is not yet published, and rigorous theory
does not yet exist for the logistic and Cox models.
Maintained by
Rob Tibshirani
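An illustrative call, assuming the documented pattern of passing a fitted lars object back to covTest along with the data (simulated here):

    library(lars)
    library(covTest)
    set.seed(1)
    x <- matrix(rnorm(100 * 10), 100, 10)
    y <- x[, 1] + rnorm(100)
    fit <- lars(x, y)
    covTest(fit, x, y)    # covariance test p-values at each knot of the path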
Fused lasso, trend filtering, generalized lasso
flsa :
This package implements a path algorithm for the Fused
Lasso Signal Approximator. It includes functions for 1D data (signals) and 2D data (images).
Maintained by
Holger Hoefling
genlasso :
Path algorithms for generalized lasso problems, including trend filtering (in 1D and 2D)
and the fused lasso. Maintained by
Ryan Tibshirani
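A brief sketch using genlasso's convenience functions for 1D problems (the data are simulated for illustration):

    library(genlasso)
    set.seed(1)
    y <- c(rep(0, 40), rep(3, 30), rep(1, 30)) + rnorm(100)
    f1 <- fusedlasso1d(y)           # 1D fused lasso path
    f2 <- trendfilter(y, ord = 1)   # linear trend filtering path
    plot(f1)                        # fitted piecewise-constant solutions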
Interactions
hierNet :
A Lasso for Hierarchical Interactions.
Fits sparse interaction models for continuous and binary
responses subject to the strong (or weak) hierarchy restriction
that an interaction between two variables only be included if
both (or at least one of) the variables is included as a main
effect. For more details, see Bien, J., Taylor, J., and Tibshirani, R. (2012), "A Lasso for Hierarchical Interactions", Annals of Statistics.
Maintained by
Jacob Bien
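A sketch of fitting a path of hierarchical interaction models, assuming the package's hierNet.path and hierNet.cv functions; the simulated data are illustrative:

    library(hierNet)
    set.seed(1)
    x <- matrix(rnorm(100 * 5), 100, 5)
    y <- x[, 1] + 2 * x[, 2] + 3 * x[, 1] * x[, 2] + rnorm(100)
    path <- hierNet.path(x, y)      # strong-hierarchy interaction models
    cv <- hierNet.cv(path, x, y)    # cross-validate over the lambda path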
Interact :
This package searches for marginal interactions in a
binary response model. Interact uses permutation methods to
estimate false discovery rates for these marginal interactions
and has some limited visualization capabilities.
Maintained by
Noah Simon
Graphical models
glasso :
Graphical lasso: estimation of the edges in an undirected graphical
model (inverse covariance model) using an L1 penalty.
Maintained by
Rob Tibshirani
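A minimal example: hand glasso a covariance matrix and a penalty, and read off the sparse precision matrix (data simulated for illustration):

    library(glasso)
    set.seed(1)
    x <- matrix(rnorm(50 * 5), 50, 5)
    fit <- glasso(var(x), rho = 0.1)   # L1-penalized inverse covariance
    fit$wi                             # estimated precision matrix; zeros mean absent edges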
Sparse SVD, principal components, canonical correlation analysis
PMA :
Penalized Multivariate Analysis: a penalized
matrix decomposition, sparse principal components analysis, and
sparse canonical correlation analysis.
Maintained by
Daniela Witten
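A hedged sketch of sparse PCA with PMA's SPC function; sumabsv bounds the L1 norm of each loading vector, and the values here are illustrative:

    library(PMA)
    set.seed(1)
    x <- matrix(rnorm(50 * 30), 50, 30)
    out <- SPC(x, sumabsv = 3, K = 2)   # two sparse principal components
    out$v                               # sparse loading vectors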
Clustering
sparcl :
Implements the sparse clustering methods of Witten and Tibshirani (2010): "A framework for feature selection in clustering"; published in Journal of the American Statistical Association 105(490): 713-726.
Maintained by
Daniela Witten
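A sketch of sparse k-means clustering with sparcl, using the package's permutation approach to pick the tuning parameter (simulated data; argument names assumed from the package help):

    library(sparcl)
    set.seed(1)
    x <- matrix(rnorm(60 * 50), 60, 50)
    x[1:30, 1:10] <- x[1:30, 1:10] + 2              # signal in 10 features
    perm <- KMeansSparseCluster.permute(x, K = 2)   # choose the L1 bound
    fit <- KMeansSparseCluster(x, K = 2, wbounds = perm$bestw)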
protoclust :
Performs minimax linkage hierarchical clustering. Every cluster has an associated prototype element that represents that cluster as described in Bien, J., and Tibshirani, R. (2011), "Hierarchical Clustering with Prototypes via Minimax Linkage," The Journal of the American Statistical Association.
Maintained by
Jacob Bien
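A small example of minimax-linkage clustering; protoclust takes a dist object like hclust does, and protocut returns the clusters together with their prototypes:

    library(protoclust)
    set.seed(1)
    x <- matrix(rnorm(50 * 4), 50, 4)
    hc <- protoclust(dist(x))   # minimax linkage hierarchical clustering
    plot(hc)                    # dendrogram
    protocut(hc, k = 3)         # cut into 3 clusters, with prototype indices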
Support Vector Machines
svmpath :
Path algorithm for Support Vector Machines.
Computes the entire regularization path for the two-class
SVM classifier with essentially the same cost as a single SVM fit.
Maintained by
Trevor Hastie
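For illustration, a call on simulated two-class data; svmpath expects labels coded -1/+1:

    library(svmpath)
    set.seed(1)
    x <- matrix(rnorm(60 * 2), 60, 2)
    y <- rep(c(-1, 1), each = 30)
    fit <- svmpath(x, y)   # entire SVM regularization path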
High-dimensional hypothesis testing and classification, especially for genomics
samr :
Significance analysis of microarrays. This package does significance testing
and estimates FDRs for high-dimensional problems. It can handle a wide variety
of outcome types: two-class and multiclass, quantitative, survival, time course, etc.
This package is the underlying "engine" for the popular SAM Excel add-in.
Maintained by
Rob Tibshirani
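A hedged sketch of a two-class SAM analysis; the list-based data format (x, y, gene identifiers, logged2 flag) follows the package documentation, and the data here are simulated:

    library(samr)
    set.seed(1)
    x <- matrix(rnorm(1000 * 20), 1000, 20)   # genes x samples
    y <- rep(1:2, each = 10)                  # two-class unpaired labels
    d <- list(x = x, y = y,
              geneid = as.character(1:1000),
              genenames = paste0("g", 1:1000),
              logged2 = TRUE)
    fit <- samr(d, resp.type = "Two class unpaired", nperms = 100)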
pamr :
Prediction analysis for microarrays. Some functions for sample classification in microarrays and other high-dimensional classification problems, using
the nearest shrunken centroid method.
Maintained by
Rob Tibshirani
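A minimal nearest-shrunken-centroid example; pamr expects genes in rows and samples in columns (data simulated for illustration):

    library(pamr)
    set.seed(1)
    x <- matrix(rnorm(1000 * 40), 1000, 40)
    y <- factor(rep(1:2, each = 20))
    d <- list(x = x, y = y)
    fit <- pamr.train(d)   # shrunken centroids over a threshold path
    pamr.cv(fit, d)        # cross-validated error along the path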
GSA :
Gene set analysis: an alternative approach to gene set enrichment analysis,
due to Efron and Tibshirani (2007), Annals of Applied Statistics.
Maintained by
Rob Tibshirani
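A hedged sketch of a GSA call on simulated data; genesets is a list of gene-name vectors, and the argument names here are assumptions to check against the package help:

    library(GSA)
    set.seed(1)
    x <- matrix(rnorm(1000 * 20), 1000, 20)
    y <- rep(1:2, each = 10)
    genenames <- paste0("g", 1:1000)
    genesets <- list(set1 = genenames[1:20], set2 = genenames[21:60])
    out <- GSA(x, y, genenames = genenames, genesets = genesets,
               resp.type = "Two class unpaired", nperms = 100)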
Generalized additive models
gam :
Fits generalized additive models. Maintained by
Trevor Hastie
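A quick illustration with smoothing-spline (s) and loess (lo) terms on simulated data:

    library(gam)
    set.seed(1)
    n <- 200
    x1 <- runif(n); x2 <- runif(n)
    y <- sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.3)
    fit <- gam(y ~ s(x1, df = 4) + lo(x2), data = data.frame(y, x1, x2))
    plot(fit, se = TRUE)   # fitted smooth terms with standard-error bands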
Independent components analysis
ProDenICA :
Product Density Independent Components Analysis. Estimates ICA components
using the product density maximum likelihood method due to Hastie and Tibshirani.
Maintained by
Trevor Hastie
Matrix completion
softImpute :
softImpute is a package for matrix completion, i.e., for imputing missing values in matrices.
It uses squared-error loss with nuclear-norm regularization (one can think of it as
the "lasso" for matrix approximation) to find a low-rank approximation to the observed entries in the matrix.
This low-rank approximation is then used to impute the missing entries.
softImpute works in a kind of "EM" fashion. Given a current guess, it fills in the missing entries.
Then it computes a soft-thresholded SVD of this completed matrix, which yields the next guess.
These steps are iterated until convergence to the solution of the convex optimization problem.
The algorithm can work with large matrices, such as the Netflix matrix (400K x 20K), by making heavy use
of sparse-matrix methods in the Matrix package. It creates new S4 classes such as "Incomplete" for storing the large
data matrix, and "SparseplusLowRank" for representing the completed matrix. SVD computations are done using
a specially built block-alternating algorithm, svd.als, that exploits these structures and uses warm starts.
Some of the methods used are described in
Mazumder, R., Hastie, T., and Tibshirani, R. (2010),
"Spectral Regularization Algorithms for Learning Large Incomplete Matrices",
JMLR 11, 2287-2322.
Other newer and more efficient methods that interweave the alternating block algorithm steps with imputation steps will
be described in a forthcoming article.
Maintained by
Trevor Hastie
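A small dense-matrix sketch of the workflow described above; for large problems one would build an "Incomplete" matrix rather than using NAs directly, and the rank and lambda values here are illustrative:

    library(softImpute)
    set.seed(1)
    x <- matrix(rnorm(30 * 20), 30, 20)
    x[sample(length(x), 150)] <- NA                  # punch out missing entries
    fit <- softImpute(x, rank.max = 5, lambda = 1)   # soft-thresholded SVD fit
    xhat <- complete(x, fit)                         # impute the missing entries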
Missing data
impute :
Imputation for microarray data and other high-dimensional datasets. Maintained by
Balasubramanian Narasimhan
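A minimal k-nearest-neighbor imputation example (impute is distributed through Bioconductor; the data are simulated):

    library(impute)
    set.seed(1)
    x <- matrix(rnorm(500 * 20), 500, 20)
    x[sample(length(x), 300)] <- NA
    out <- impute.knn(x)   # kNN imputation of the missing values
    xfull <- out$data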
Other packages that we like and use
gbm :
Gradient boosting machines
e1071 :
Support vector machines