NIPS : Conferences : 2014 : Program

Accepted Papers

Orals

"How hard is my MDP?" The distribution-norm to the rescue
In Reinforcement Learning (RL), state-of-the-art algorithms require a large number of samples per state-action pair to estimate the transition kernel $p$. In many problems, a good approximation of $p$ is not needed. For instance, if from one state-action pair $(s,a)$, one can only transit to states with the same value, learning $p(\cdot|s,a)$ accurately is irrelevant (only its support matters). This paper aims at capturing such behavior by defining a novel hardness measure for Markov Decision Processes (MDPs) we call the {\em distribution-norm}. The distribution-norm w.r.t.~a measure $\nu$ is defined on zero $\nu$-mean functions $f$ by the standard variation of $f$ with respect to $\nu$. We first provide a concentration inequality for the dual of the distribution-norm. This allows us to replace the generic but loose $||\cdot||_1$ concentration inequalities used in most previous analyses of RL algorithms, to benefit from this new hardness measure. We then show that several common RL benchmarks have low hardness when measured using the new norm. The distribution-norm captures finer properties than the number of states or the diameter and can be used to assess the difficulty of MDPs.

Timothy Mann, Odalric-Ambrym Maillard, Shie Mannor
A Statistical Decision-Theoretic Framework for Social Choice
In this paper, we take a statistical decision-theoretic viewpoint on social choice, putting a focus on the decision to be made on behalf of a system of agents. In our framework, we are given a statistical ranking model, a decision space, and a loss function defined on (parameter, decision) pairs, and formulate social choice mechanisms as statistical estimators that minimize expected loss. This suggests a general framework for the design and analysis of new social choice mechanisms. We compare Bayesian estimators, which minimize Bayesian expected loss, for two variants of the Mallows model and the Kemeny rule. We consider various normative properties, in addition to computational complexity and asymptotic behavior.

Hossein Azari Soufiani, David Parkes, Lirong Xia
A Wild Bootstrap for Degenerate Kernel Tests
A wild bootstrap method for nonparametric hypothesis tests based on kernel distribution embeddings is proposed. This bootstrap method is used to construct provably consistent tests that apply to random processes, for which the naive permutation-based bootstrap fails. It applies to a large group of kernel tests based on V-statistics, which are degenerate under the null hypothesis, and non-degenerate elsewhere. To illustrate this approach, we construct a two-sample test, an instantaneous independence test and a multiple lag independence test for time series. In experiments, the wild bootstrap gives strong performance on synthetic examples, on audio data, and in performance benchmarking for the Gibbs sampler.

Kacper Chwialkowski, Dino Sejdinovic, Arthur Gretton
A* Sampling
The problem of drawing exact samples from a discrete distribution can be converted into a discrete optimization problem. In this work, we show how sampling from general continuous distributions can be converted into an optimization problem over continuous space. Central to the method is a special case of a stochastic process recently described in mathematical statistics that we call a Gumbel process. We introduce a novel construction of Gumbel processes and A* Sampling, a general sampling algorithm that searches for the maximum of a Gumbel process using A* search. When the global optimum is provably reached, an exact sample is returned. We analyze the convergence time of the algorithms and demonstrate empirically that they make more efficient use of bound and likelihood evaluations than adaptive rejection sampling-based algorithms.
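
The first sentence of the abstract above, sampling from a discrete distribution by solving an optimization problem, is commonly illustrated with the Gumbel-max trick. The following is a minimal sketch of that discrete case only, not of A* Sampling itself; the function names `gumbel_max_sample` and `empirical_frequencies` are hypothetical and chosen for illustration.

```python
import math
import random
from collections import Counter


def gumbel_max_sample(log_weights, rng):
    """Draw one index i with probability proportional to exp(log_weights[i])."""
    best_index, best_value = None, -math.inf
    for i, log_w in enumerate(log_weights):
        # Standard Gumbel noise: -log(-log(U)) with U uniform on (0, 1).
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        perturbed = log_w - math.log(-math.log(u))
        # The sample is the argmax of the perturbed log-weights.
        if perturbed > best_value:
            best_index, best_value = i, perturbed
    return best_index


def empirical_frequencies(log_weights, n_draws, seed=0):
    """Repeat the argmax draw and report how often each index wins."""
    rng = random.Random(seed)
    counts = Counter(gumbel_max_sample(log_weights, rng) for _ in range(n_draws))
    return [counts[i] / n_draws for i in range(len(log_weights))]
```

With enough draws, the frequencies returned by `empirical_frequencies` should approach the normalized weights, which is what makes the argmax an exact sample.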

Christopher Maddison, Daniel Tarlow, Tom Minka
Analog Memories in a Balanced Rate-Based Network of E-I Neurons
The dynamics of brain circuits often settles in persistent and graded activity states that are sometimes seen as signatures of the autoassociative retrieval of memorized items. Despite decades of theoretical work on the subject, the mechanisms that support the storage and retrieval of memories remain unclear. Previous proposals concerning the dynamics of memory networks have fallen short of incorporating some key physiological constraints in a unified way. Specifically, some models violate Dale’s law (i.e. allow neurons to be both excitatory (E) and inhibitory (I)), while some others restrict the representation of memories to a binary format, or induce recall states in which most neurons fire at saturation. We propose a novel control-theoretic framework to build functioning attractor networks that meet all of the above physiological constraints. We directly optimize networks of E and I neurons to force collections of arbitrary, analog activity patterns to become stable fixed points of the dynamics. The resulting networks operate in the balanced regime, are robust to corruptions of the memory cue as well as ongoing noise, and incidentally explain the reduction of trial-to-trial variability following stimulus onset that is ubiquitously observed in the sensory and motor cortices. Our results constitute a step forward in our growing understanding of the neural substrate of memory.

Dylan Festa, Guillaume Hennequin, Mate Lengyel
Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)
We present the first provably sublinear time algorithm for approximate \emph{Maximum Inner Product Search} (MIPS). Searching with (un-normalized) inner product as the underlying similarity measure is a known difficult problem and finding hashing schemes for MIPS was considered hard. While the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, in this paper we extend the LSH framework to allow asymmetric hashing schemes. Our proposal is based on an interesting mathematical phenomenon: the problem of searching for inner products, after independent asymmetric transformations, can be converted into the problem of approximate near neighbor search. This key observation makes an efficient sublinear hashing scheme for MIPS possible. In the extended asymmetric LSH (ALSH) framework, we provide an explicit construction of a provably fast hashing scheme for MIPS. Our proposed algorithm is simple and easy to implement. The proposed hashing scheme outperforms the two popular existing LSH heuristics, (i) Signed Random Projection (SRP) and (ii) hashing based on p-stable distributions for L2 norm (L2LSH), in the collaborative filtering task of item recommendations on Netflix and Movielens (10M) datasets.
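
The abstract above does not spell out the asymmetric transformations, so the following is only a sketch under stated assumptions: items are rescaled to norm at most $U < 1$ and padded with powers of their norm, while queries are normalized and padded with constants $1/2$. Under these assumptions the squared distance between a transformed query and a transformed item equals $1 + m/4 - 2 q^\top x + \|x\|^{2^{m+1}}$, so a large inner product corresponds to a small distance up to a vanishing tail term. The names `preprocess_item`, `preprocess_query`, `scale_items` and the default `U=0.83` are illustrative choices, not taken from the text.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def scale_items(items, U=0.83):
    """Rescale all items by one common factor so the largest norm equals U < 1."""
    factor = U / max(norm(x) for x in items)
    return [[a * factor for a in x] for x in items]


def preprocess_item(x, m):
    """P(x): append ||x||^2, ||x||^4, ..., ||x||^(2^m). Assumes ||x|| < 1."""
    sq = sum(a * a for a in x)
    extras = []
    for _ in range(m):
        extras.append(sq)
        sq = sq * sq  # next power: ||x||^(2^(i+1))
    return list(x) + extras


def preprocess_query(q, m):
    """Q(q): normalize q to unit length and append m copies of 1/2."""
    n = norm(q)
    return [a / n for a in q] + [0.5] * m


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))
```

Because the item and query transformations differ, the scheme is asymmetric; any standard near neighbor hash for Euclidean distance could then, in principle, be applied to the transformed vectors.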

Anshumali Shrivastava, Ping Li
Asynchronous Anytime Sequential Monte Carlo
We introduce a new sequential Monte Carlo algorithm we call the particle cascade. The particle cascade is an asynchronous, anytime alternative to traditional particle filtering algorithms. It uses no barrier-type synchronizations which leads to improved particle throughput and memory efficiency. It is an anytime algorithm in the sense that it can be run forever to emit an unbounded number of particles while keeping within a fixed memory budget. We prove that the particle cascade is an unbiased marginal likelihood estimator which means that it can be straightforwardly plugged into existing pseudomarginal methods.

Brooks Paige, Frank Wood
Combinatorial Pure Exploration of Multi-Armed Bandits
We study the {\em combinatorial pure exploration (CPE)} problem in the stochastic multi-armed bandit setting, where a learner explores a set of arms with the objective of identifying the optimal member of a \emph{decision class}, which is a collection of subsets of arms with certain combinatorial structures such as size-$K$ subsets, matchings, spanning trees or paths, etc. The CPE problem represents a rich class of pure exploration tasks which covers not only many existing models but also novel cases where the object of interest has a non-trivial combinatorial structure. In this paper, we provide a series of results for the general CPE problem. We present general learning algorithms which work for all decision classes that admit offline maximization oracles in both fixed confidence and fixed budget settings. We prove problem-dependent upper bounds of our algorithms. Our analysis exploits the combinatorial structures of the decision classes and introduces a new analytic tool. We also establish a general problem-dependent lower bound for the CPE problem. Our results show that the proposed algorithms achieve the optimal sample complexity (within logarithmic factors) for many decision classes. In addition, applying our results back to the problems of top-$K$ arms identification and multiple bandit best arms identification, we recover the best available upper bounds up to constant factors and partially resolve a conjecture on the lower bounds.

Shouyuan Chen, Tian Lin, Michael Lyu, Irwin King, Wei Chen
Conditional Random Field Autoencoders for Unsupervised Structured Prediction
We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observable data using a feature-rich conditional random field. Then a reconstruction of the input is (re)generated, conditional on the latent structure, using models for which maximum likelihood estimation has a closed-form. Our autoencoder formulation enables efficient learning without making unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate insightful connections to traditional autoencoders, posterior regularization and multi-view learning. We show competitive results with instantiations of the model for two canonical NLP tasks: part-of-speech induction and bitext word alignment, and show that training our model can be substantially more efficient than comparable feature-rich baselines.

Waleed Ammar, Chris Dyer, Noah Smith
Feedforward Learning of Mixture Models
We develop a biologically-plausible learning rule that provably converges to the class means of general mixture models. It generalizes the classical BCM neural rule within a tensor framework, incorporates a multi-view assumption, and shows how learning of higher-order structure is possible. Together our model provides a novel information processing interpretation to spike-timing-dependent plasticity.

Matthew Lawlor, Steven Zucker
From Stochastic Mixability to Fast Rates
Empirical risk minimization (ERM) is a fundamental algorithm for statistical learning problems where the data is generated according to some unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting, depending upon $(\ell, \mathcal{F},\mathsf{P})$ ERM can have slow $(1/\sqrt{n})$ or fast $(1/n)$ rates of convergence of the excess risk as a function of the sample size $n$. There exist several results that give sufficient conditions for fast rates in terms of joint properties of $\ell$, $\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein condition. In the non-statistical prediction with experts setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the \emph{mixability} of the loss $\ell$ (there being no role there for $\mathcal{F}$ or $\mathsf{P}$). The notion of \emph{stochastic mixability} builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of $(\ell,\mathcal{F}, \mathsf{P})$, and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the generalized moment problem. We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible.

Nishant Mehta, Robert Williamson
Learning Generative Models with Visual Attention
Attention has long been proposed by psychologists to be important for efficiently dealing with the massive amounts of sensory stimulus in the neocortex. Inspired by the attention models in visual neuroscience and the need for object-centered data for generative models, we propose a deep-learning based generative framework using attention. The attentional mechanism propagates signals from the region of interest in a scene to an aligned canonical representation for generative modeling. By ignoring scene background clutter, the generative model can concentrate its resources on the object of interest. A convolutional neural net is employed to provide good initializations during posterior inference which uses Hamiltonian Monte Carlo. Upon learning images of faces, our model can robustly attend to face regions of novel test subjects. More importantly, our model can learn generative models of new faces from a novel dataset of large images where the face locations are not known.

Yichuan Tang, Nitish Srivastava, Ruslan Salakhutdinov
Median selection subset aggregation for parallel inference
For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines, minimizing communication costs. Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but challenges arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. We propose a MEdian Selection Subset AGgregation Estimator (message) algorithm, which attempts to solve these problems. The algorithm applies feature selection in parallel for each subset using Lasso or another method, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in both sample and feature size, and has theoretical guarantees. In particular, we show model selection consistency and coefficient estimation efficiency. Extensive experiments show excellent performance in variable selection, estimation, prediction, and computation time relative to usual competitors.
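
The aggregation steps described in the abstract above (per-subset feature selection, median inclusion index, per-subset coefficient estimates, averaging) can be sketched as follows. This is an illustration of the aggregation only: the per-subset selection and fitting are assumed to have been done elsewhere (for example with Lasso), the helper names are hypothetical, and keeping ties at a median of 0.5 is an assumption made here.

```python
from statistics import median


def median_inclusion(inclusion_by_subset):
    """Return indices of features whose median 0/1 inclusion across subsets is >= 0.5."""
    n_features = len(inclusion_by_subset[0])
    medians = [median(row[j] for row in inclusion_by_subset) for j in range(n_features)]
    # With an even number of subsets the median can be 0.5; ties are kept (an assumption).
    return [j for j, value in enumerate(medians) if value >= 0.5]


def average_coefficients(coefs_by_subset):
    """Average the per-subset coefficient vectors fitted on the selected features."""
    k = len(coefs_by_subset)
    n_coefs = len(coefs_by_subset[0])
    return [sum(c[j] for c in coefs_by_subset) / k for j in range(n_coefs)]
```

Only the inclusion vectors and the fitted coefficients cross machine boundaries in this sketch, which is consistent with the low communication cost the abstract emphasizes.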

Xiangyu Wang, Peichao Peng, David Dunson
On Communication Cost of Distributed Statistical Estimation and Dimensionality
We explore the connection between dimensionality and communication cost in distributed learning problems. Specifically we study the problem of estimating the mean $\vec{\theta}$ of an unknown $d$-dimensional Gaussian distribution in the distributed setting. In this problem, the samples from the unknown distribution are distributed among $m$ different machines. The goal is to estimate the mean $\vec{\theta}$ at the optimal minimax rate while communicating as few bits as possible. We show that in this setting, the communication cost scales linearly in the number of dimensions, i.e., one needs to deal with different dimensions individually. Applying this result to previous lower bounds for one dimension in the interactive setting by Zhang et al. NIPS'13 and to our improved bounds for the simultaneous setting, we prove new lower bounds of $\Omega(md/\log(m))$ and $\Omega(md)$ for the bits of communication needed to achieve the minimax squared loss, in the interactive and simultaneous settings respectively. To complement, we also demonstrate an interactive protocol achieving the minimax squared loss with $O(md)$ bits of communication. Given the strong lower bounds in the general setting, we initiate the study of the distributed parameter estimation problems with structured parameters. Specifically, when the parameter is known to be $s$-sparse, we show a protocol achieving the minimax squared loss with high probability and with communication cost proportional to $s$ rather than the dimension $d$ of the ambient space.

Ankit Garg, Tengyu Ma, Huy Nguyen
On the power of clamping
It was recently proved using graph covers (Ruozzi, 2012) that the Bethe partition function is upper bounded by the true partition function for a binary pairwise model that is attractive. Here we provide a new, arguably simpler proof from first principles. We make use of the idea of clamping a variable to a particular value. For an attractive model, we show that summing over the Bethe partition functions for each sub-model obtained after clamping any variable can only raise (and hence improve) the approximation. In fact, we derive a stronger result that may have other useful implications. Repeatedly clamping until we obtain a model with no cycles, where the Bethe approximation is exact, yields the result. We also provide a related lower bound on a broad class of approximate partition functions of general pairwise multi-label models that depends only on the topology. We demonstrate that clamping a few wisely chosen variables can be of practical value by dramatically reducing approximation error.

Adrian Weller
Probabilistic ODE Solvers with Runge-Kutta Means
Runge-Kutta methods are the classic family of solvers for ordinary differential equations (ODEs), and the basis for the state of the art. Like most numerical methods, they return point estimates. We construct a family of probabilistic numerical methods that instead return a Gauss-Markov process defining a probability distribution over the ODE solution. In contrast to prior work, we construct this family such that posterior means match the outputs of the Runge-Kutta family exactly, thus inheriting their proven good properties. Remaining degrees of freedom not identified by the match to Runge-Kutta are chosen such that the posterior probability measure fits the observed structure of the ODE. Our results shed light on the structure of Runge-Kutta solvers from a new direction, provide a richer, probabilistic output, have low computational cost, and raise new research questions.

Michael Schober, David Duvenaud, Philipp Hennig
Provable Submodular Minimization using Wolfe's Algorithm
Owing to several applications in large scale learning and vision problems, fast submodular function minimization (SFM) has become a critical problem. Theoretically, unconstrained SFM can be performed in polynomial time (Iwata and Orlin 2009), however these algorithms are not practical. In 1976, Wolfe proposed an algorithm to find the minimum Euclidean norm point in a polytope, and in 1980, Fujishige showed how Wolfe's algorithm can be used for SFM. For general submodular functions, the Fujishige-Wolfe minimum norm algorithm seems to have the best empirical performance. Despite its good practical performance, theoretically very little is known about Wolfe's minimum norm algorithm -- to our knowledge the only result is an exponential time analysis due to Wolfe himself. In this paper we give a maiden convergence analysis of Wolfe's algorithm. We prove that in t iterations, Wolfe's algorithm returns a O(1/t)-approximate solution to the min-norm point. We also prove a robust version of Fujishige's theorem which shows that an O(1/n^2)-approximate solution to the min-norm point problem implies exact submodular minimization. As a corollary, we get the first pseudo-polynomial time guarantee for the Fujishige-Wolfe minimum norm algorithm for submodular function minimization. In particular, we show that the min-norm point algorithm solves SFM in O(n^7F^2)-time, where $F$ is an upper bound on the maximum change a single element can cause in the function value.

Deeparnab Chakrabarty, Prateek Jain, Pravesh Kothari
Quantifying the transferability of features in deep neural networks
A high percentage of recently reported deep networks trained on natural images exhibit a curious aspect in common: they all learn features similar to Gabor filters on the first layer. Such features appear to be general, as opposed to specific for a particular task. Because higher layers are difficult to visualize, not as much is known about whether they are general or specific. In this paper we experimentally quantify the generality of neurons on each layer of a deep convolutional neural network and through this expose a few surprising aspects. First, we find that transferability is negatively affected by two distinct issues: not only the expected specialization of higher layer neurons to their original task at the expense of the target task, but also optimization difficulties related to splitting networks in the middle of co-adapted neurons. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also show how the transferability gap grows as the distance between base task and target task increases, but how even transfer from distant tasks can be better than using random features. Finally, we show a surprising result that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.

Jason Yosinski, Jeff Clune, Yoshua Bengio, Hod Lipson
Sequence to Sequence Learning with Neural Networks
Deep Neural Networks (DNNs) are powerful models that have shown significant success on difficult supervised learning tasks. Although generally their performance improves as dataset and model size increases, they are applicable only to problems whose inputs and outputs are encoded as vectors of fixed dimensionality, so they cannot straightforwardly learn to map sequences to sequences. Existing approaches for sequence problems (such as speech recognition and machine translation) use problem specific multistage pipelines that are highly complex. In this paper, we present a general end-to-end approach to sequence learning. Our method uses a Long Short-Term Memory (LSTM) to convert the input sequence into a vector of a fixed dimensionality, and then another LSTM to extract the target sequence from the input's fixed-size representation. We evaluate the feasibility of our approach on a machine translation problem. Our main result is that on the WMT dataset, our system achieves near state-of-the-art BLEU score on the German to English translation when all the words in the input sentences are frequent. We also obtained an improvement of 1.5 BLEU points on an English to French translation over a strong baseline by rescoring the 1000 most-likely hypotheses. Finally, we show that our LSTM learns sensible phrase and sentence representations that are sensitive to word order and are fairly invariant to the active and the passive voice.

Ilya Sutskever, Oriol Vinyals, Quoc Le
Sparse Polynomial Learning and Graph Sketching
Let $f: \{-1,1\}^n \rightarrow \mathbb{R}$ be a polynomial with at most $s$ non-zero real coefficients. We give an algorithm for exactly reconstructing $f$ given random examples only from the uniform distribution on $\{-1,1\}^n$ that runs in time polynomial in $n$ and $2^{s}$ and succeeds if the function satisfies \textit{unique sign property}: there is one output value which corresponds to a unique set of values of the participating parities. This sufficient condition is satisfied when every coefficient of $f$ is perturbed by a small random noise, or satisfied with high probability when $s$ parity functions are chosen randomly or when all the coefficients are positive. Learning sparse polynomials over the Boolean domain in time polynomial in $n$ and $2^{s}$ is considered a notoriously hard problem in the worst-case. Our result shows that the problem is tractable in some special but generic settings. Then, we show an application of this result to hypergraph sketching which is the problem of learning a sparse (both in the number of hyperedges and the size of the hyperedges) hypergraph from uniformly drawn random cuts. We also provide experimental results on a real world dataset.

Murat Kocaoglu, Karthikeyan Shanmugam, Alexandros Dimakis, Adam Klivans

Spotlights

(Almost) No Label No Cry
In Learning with Label Proportions (LLP), the objective is to learn a supervised classifier when, instead of labels, only label proportions for bags of observations are known. This setting has broad practical relevance, in particular for privacy preserving data processing. We first show that the mean operator, a statistic which aggregates all labels, is minimally sufficient for the minimization of many proper scoring losses with linear (or kernelized) classifiers without using labels. We provide a fast learning algorithm that estimates the mean operator via a manifold regularizer with guaranteed approximation bounds. Then, we present an iterative learning algorithm that uses this as initialization. We ground this algorithm in Rademacher-style generalization bounds that fit the LLP setting, introducing a generalization of Rademacher complexity and a Label Proportion Complexity measure. This latter algorithm optimizes tractable bounds for the corresponding bag-empirical risk. Experiments are provided on fourteen domains, whose size ranges up to 300K observations. They show that our algorithms are scalable and tend to consistently outperform the state of the art in LLP. Moreover, in many cases, our algorithms compete with or are just a few percent of AUC away from the Oracle that learns knowing all labels. On the largest domains, half a dozen proportions can suffice, i.e. roughly 40K times less than the total number of labels.

Giorgio Patrini, Richard Nock, Tiberio Caetano, Paul Rivera
A Bayesian model for identifying hierarchically organised states in neural population activity
Neural population activity in cortical circuits is not solely driven by external inputs, but is also modulated by endogenous states. These cortical states vary on multiple time-scales and also across areas and layers of the neocortex. To understand information processing in cortical circuits, we need to understand the statistical structure of internal states and their interaction with sensory inputs. Here, we present a statistical model for extracting hierarchically organized neural population states from multi-channel recordings of neural spiking activity. We model population states using a hidden Markov decision tree with state-dependent tuning parameters and a generalized linear observation model. Using variational Bayesian inference, we estimate the posterior distribution over parameters from population recordings of neural spike trains. On simulated data, we show that we can identify the underlying sequence of population states over time and reconstruct the ground truth parameters. Using extracellular population recordings from visual cortex, we find that a model with two levels of population states outperforms a generalized linear model which does not include state-dependence, as well as models which only include a binary state. Finally, modelling of state-dependence via our model also improves the accuracy with which sensory stimuli can be decoded from the population response.

Patrick Putzky, Florian Franzen, Giacomo Bassetto, Jakob Macke
A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights
We derive a second-order ordinary differential equation (ODE), which is the limit of Nesterov’s accelerated gradient method. This ODE exhibits approximate equivalence to Nesterov’s scheme and thus can serve as a tool for analysis. We show that the continuous time ODE allows for a better understanding of Nesterov’s scheme. As a byproduct, we obtain a family of schemes with similar convergence rates. The ODE interpretation also suggests restarting Nesterov’s scheme leading to an algorithm, which can be rigorously proven to converge at a linear rate whenever the objective is strongly convex.
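
For orientation, a minimal sketch of the discrete scheme that the abstract above takes the limit of. It assumes the common form of Nesterov's method with momentum coefficient $(k-1)/(k+2)$ and a fixed step size; the limiting ODE is usually written as $\ddot{X} + \frac{3}{t}\dot{X} + \nabla f(X) = 0$, which is stated here as background rather than taken from the abstract. The test function `quadratic` and all function names are illustrative.

```python
def nesterov(grad, x0, step, n_iters):
    """Nesterov's accelerated gradient scheme with momentum (k - 1) / (k + 2)."""
    x_prev = list(x0)
    y = list(x0)
    for k in range(1, n_iters + 1):
        g = grad(y)
        # Gradient step taken from the extrapolated point y.
        x = [yi - step * gi for yi, gi in zip(y, g)]
        # Momentum step: extrapolate along the last displacement.
        momentum = (k - 1) / (k + 2)
        y = [xi + momentum * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
    return x_prev


def quadratic(x):
    """Illustrative objective: 0.5 * (x1^2 + 10 * x2^2), minimized at the origin."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def quadratic_grad(x):
    return [x[0], 10.0 * x[1]]
```

A call such as `nesterov(quadratic_grad, [1.0, 1.0], step=0.1, n_iters=200)` uses a step size of one over the largest curvature of this particular quadratic; other objectives would need their own step size.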

Weijie Su, Stephen Boyd, Emmanuel Candes
A Latent Source Model for Online Collaborative Filtering
Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the ``online'' setting, where items are recommended to users over time. We address this theoretical gap by introducing a model for online recommendation systems, cast item recommendation under the model as a learning problem, and analyze the performance of a cosine-similarity collaborative filtering method. In our model, each of $n$ users either likes or dislikes each of $m$ items. Concretely, there are $k$ types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again. The goal is to maximize the number of likable items recommended to users over time. Our main result establishes that after nearly $\log(km)$ initial learning time steps, a simple collaborative filtering algorithm achieves essentially optimal performance without knowing $k$. The algorithm has an exploitation step that uses cosine similarity and two types of exploration steps, one to explore the space of items (standard in the literature) and the other to explore similarity between users (novel to this work).
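
The exploitation step mentioned in the abstract above relies on cosine similarity between users. The following is a simplified sketch of that idea only, without the two exploration steps; the rating encoding (+1 like, -1 dislike, 0 not yet consumed), the `threshold` parameter, and the function names are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two rating vectors with entries +1, -1, or 0 (unseen)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # no ratings yet: treat as uninformative
    return dot / (nu * nv)


def recommend(ratings, user, threshold=0.2):
    """Pick the unconsumed item with the highest total rating among similar users."""
    me = ratings[user]
    neighbors = [r for i, r in enumerate(ratings)
                 if i != user and cosine_similarity(me, r) >= threshold]
    best_item, best_score = None, -math.inf
    for item, mine in enumerate(me):
        if mine != 0:
            continue  # already consumed: cannot be recommended again
        score = sum(r[item] for r in neighbors)
        if score > best_score:
            best_item, best_score = item, score
    return best_item
```

Skipping consumed items mirrors the constraint in the abstract that an item cannot be recommended to the same user twice.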

Guy Bresler, George Chen, Devavrat Shah
Advances in Learning Bayesian Networks of Bounded Treewidth
This work presents novel algorithms for learning Bayesian networks of bounded treewidth. Both exact and approximate methods are developed. The exact method combines mixed integer linear programming formulations for structure learning and treewidth computation. The approximate method consists in sampling k-trees (maximal graphs of treewidth k), and subsequently selecting, exactly or approximately, the best structure whose moral graph is a subgraph of that k-tree. The approaches are empirically compared to each other and to state-of-the-art methods on a collection of public data sets with up to 100 variables. The experiments show that our algorithms are quite competitive.

Siqi Nie, Denis Maua, Cassio de Campos, Qiang Ji
Augur: Data-Parallel Probabilistic Modelling
Implementing inference procedures for each new probabilistic model is time-consuming and error-prone. Probabilistic programming addresses this problem by allowing a user to specify the model and automatically generating the inference procedure. To make this practical it is important to generate high performance inference code. In turn, on modern architectures, high performance implies parallel execution. In this paper we present Augur, a probabilistic modelling language and compiler for Bayesian networks designed to make effective use of data-parallel architectures such as GPUs. We show that the compiler can generate data-parallel inference code scalable to thousands of GPU cores by making use of the conditional independence relationships in the Bayesian network.

Jean-Baptiste Tristan, Daniel Huang, Joseph Tassarotti, Adam Pocock, Stephen Green, Guy Steele
Beyond Disagreement-Based Agnostic Active Learning
We study active learning of classifiers in an agnostic setting, where the goal is to learn a classifier in a hypothesis class interactively with as few label queries as possible. The primary strategy for general active learning in this setting is {\em{disagreement-based active learning}}, which has a relatively high label complexity. A major challenge in the literature is to find an algorithm which achieves better label complexity, applies to general classification problems, and is consistent in an agnostic setting. In this paper, we provide a solution to this problem. Our solution is based on two novel contributions: a reduction from confidence-rated predictors with guaranteed error to consistent active learning algorithms, as well as a new confidence-rated predictor with guaranteed error.

Chicheng Zhang, Kamalika Chaudhuri
Causal Strategic Inference in Networked Microfinance Economies
Performing interventions is a major challenge in economic policy-making. We propose \emph{causal strategic inference} as a framework for conducting interventions and apply it to large, networked microfinance economies. The basic solution platform consists of modeling a microfinance market as a networked economy, learning the parameters of the model from the real-world microfinance data, and designing algorithms for various computational problems in question. We adopt Nash equilibrium as the solution concept for our model. For a special case of our model, we show that an equilibrium point always exists and that the equilibrium interest rates are unique. For the general case, we give a constructive proof of the existence of an equilibrium point. Our empirical study is based on the microfinance data from Bangladesh and Bolivia, which we use to first learn our models. We show that causal strategic inference can assist policy-makers by evaluating the outcomes of various types of interventions, such as removing a loss-making bank from the market, imposing an interest rate cap, and subsidizing banks.

Mohammad Irfan, Luis Ortiz
Clustered factor analysis of multineuronal spike data
High-dimensional, simultaneous recordings of neural spiking activity are often explored, analyzed and visualized with the help of latent variable or factor models. Such models are however ill-equipped to extract structure beyond shared, distributed aspects of firing activity across multiple cells. Here, we extend unstructured factor models by proposing a model that discovers sub-populations or groups of cells from the pool of recorded neurons. The model combines aspects of mixture of factor analyzer models for capturing clustering structure in the data and aspects of latent dynamical system models for capturing temporal dependencies in the recordings. In the resulting model, we infer the sub-populations and the latent factors from data using variational inference and model parameters are estimated by Expectation Maximization (EM). We also address the crucial problem of initializing parameters for EM by extending a sparse subspace clustering algorithm to integer-valued spike count observations. We illustrate the merits of the proposed model by applying it to calcium-imaging data from spinal cord neurons, and we show that it uncovers meaningful clustering structure in the data.

Lars Buesing, John Cunningham, Timothy Machado, Liam Paninski
Clustering from Labels and Time-Varying Graphs
We present a general framework for graph clustering where a label is observed for each pair of nodes. This allows a very rich encoding of various types of pairwise interactions between nodes. We propose a new tractable approach to this problem based on a maximum likelihood estimator and convex optimization. We analyze our algorithm under a general generative model, and provide both necessary and sufficient conditions for successful recovery of the underlying clusters. Our theoretical results cover and subsume a wide range of existing graph clustering results including planted partition, weighted clustering and partially observed graphs. Furthermore, the result is applicable to novel settings including time-varying graphs, so that new insights can be gained on solving these problems. Our theoretical findings are further supported by empirical results on both synthetic and real data.

Shiau Hong Lim, Yudong Chen, Huan Xu
Consistent Binary Classification with Generalized Performance Metrics
Performance metrics for binary classification are designed to capture tradeoffs between four fundamental population quantities: true positives, false positives, true negatives and false negatives. Despite significant interest from the theoretical and applied communities, little is known about either optimal classifiers or consistent algorithms for optimizing binary classification performance metrics beyond a few special cases. We consider a fairly large family of general performance metrics given by ratios of linear combinations of the four fundamental population quantities. This family includes many well known binary classification metrics such as classification accuracy, AM measure, F-measure and the Jaccard similarity coefficient as special cases. Our analysis identifies the optimal classifiers as the sign of the thresholded conditional probability of the positive class, with a performance metric-dependent threshold. The optimal threshold can be constructed using simple plug-in estimators when the performance metric is a linear combination of the population quantities, but alternative techniques are required for the general case. Our results unify and extend known results for special cases. We propose two algorithms for estimating the optimal classifiers, and prove their statistical consistency. Both algorithms are straightforward modifications of standard approaches to address the key challenge of optimal threshold selection, and are thus simple to implement in practice. The first algorithm combines a plug-in estimate of the conditional probability of the positive class with optimal threshold selection. The second algorithm leverages recent work on calibrated asymmetric surrogate losses to construct candidate classifiers. We present empirical comparisons between these algorithms on benchmark datasets.

Nagarajan Natarajan, Oluwasanmi Koyejo, Pradeep Ravikumar, Inderjit Dhillon
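The plug-in idea described in the abstract of "Consistent Binary Classification with Generalized Performance Metrics" above can be illustrated with a minimal Python sketch: given estimated positive-class probabilities, search for the threshold that maximizes the empirical F-measure. The function names (`f1_score`, `best_threshold`) and the toy data are hypothetical, and this is only a simple instance of threshold selection, not the authors' algorithm.

```python
def f1_score(labels, preds):
    """Empirical F1 from binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels):
    """Try each observed probability as a threshold and keep the best F1."""
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        f = f1_score(labels, preds)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Made-up plug-in estimates of P(y = 1 | x) and the corresponding true labels.
probs = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
labels = [0, 0, 1, 1, 1, 1]
threshold, f1 = best_threshold(probs, labels)
```

On this toy data the selected threshold falls below 0.5, which shows how the best cut-off can depend on the metric rather than being fixed in advance.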
Convolutional Kernel Networks
An important goal in visual recognition is to devise image representations that are invariant to particular transformations. In this paper, we address this goal with a new type of convolutional neural network (CNN) whose invariance is encoded by a reproducing kernel. Unlike traditional approaches where neural networks are learned either to represent data or for solving a classification task, our network learns to approximate the kernel feature map on training data. Such an approach enjoys several benefits over classical ones. First, by teaching CNNs to be invariant, we obtain simple network architectures that achieve a similar accuracy to more complex ones, while being easy to train and robust to overfitting. Second, we bridge a gap between the neural network literature and kernels, which are natural tools to model invariance. We evaluate our methodology on visual recognition tasks where CNNs have proven to perform well, e.g., digit recognition with the MNIST dataset, and the more challenging CIFAR-10 and STL-10 datasets, where our accuracy is competitive with the state of the art.

Julien Mairal, Piotr Koniusz, Zaid Harchaoui, Cordelia Schmid
Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.

David Eigen, Christian Puhrsch, Rob Fergus
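The abstract of "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network" above does not spell out the scale-invariant error. As a hedged illustration, the Python sketch below assumes one natural form: the variance of the per-pixel log-depth differences, which does not change when every predicted depth is multiplied by the same constant. The function name and the depth values are hypothetical.

```python
import math


def scale_invariant_error(pred, target):
    """Variance of log-depth differences (assumed form of the error).

    Multiplying every entry of pred by a constant shifts all log
    differences by the same amount, so the variance is unchanged.
    """
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    mean_sq = sum(x * x for x in d) / n
    sq_mean = (sum(d) / n) ** 2
    return mean_sq - sq_mean


pred = [1.0, 2.0, 4.0]      # hypothetical predicted depths
target = [1.5, 2.5, 3.0]    # hypothetical ground-truth depths
err = scale_invariant_error(pred, target)
err_scaled = scale_invariant_error([3.0 * p for p in pred], target)
```

Here `err` and `err_scaled` agree up to floating-point rounding, which is the property the abstract refers to when it says the error measures depth relations rather than scale.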
Design Principles of the Hippocampal Cognitive Map
Hippocampal place fields have been shown to reflect behaviorally relevant aspects of space. For instance, place fields tend to be skewed in commonly traveled directions, they cluster around rewarded locations, and they are constrained by the geometric structure of the environment. Bearing this in mind, we address the question of how place fields represent space in a way that facilitates navigation and reinforcement learning. We hypothesize a set of design principles for the hippocampal cognitive map. In particular, place fields encode not just information about the current location, but also predictions about future locations under the current transition distribution. Under this model, a variety of place field phenomena arise naturally from the structure of rewards, barriers, and directional biases as reflected in the transition policy. Furthermore, we demonstrate that this representation of space can support efficient reinforcement learning. We also hypothesize that grid cells compute the eigendecomposition of place fields, which is useful for segmenting an enclosure along natural boundaries. When applied recursively, this segmentation can be used to discover a hierarchical decomposition of space. Thus, grid cells might be involved in computing subgoals for hierarchical reinforcement learning.

Kimberly Stachenfeld, Matthew Botvinick, Samuel Gershman
Difference of Convex Functions Programming for Reinforcement Learning
Large Markov Decision Processes (MDPs) are usually solved using Approximate Dynamic Programming (ADP) methods such as Approximate Value Iteration (AVI) or Approximate Policy Iteration (API). The main contribution of this paper is to show that, alternatively, the optimal state-action value function can be estimated using Difference of Convex functions (DC) Programming. To do so, we study the minimization of a norm of the Optimal Bellman Residual (OBR) $T^*Q-Q$, where $T^*$ is the so-called optimal Bellman operator. Controlling this residual allows controlling the distance to the optimal action-value function, and we show that minimizing an empirical norm of the OBR is consistent in the Vapnik sense. Finally, we frame this optimization problem as a DC program. That allows envisioning using the large related literature on DC Programming to address the Reinforcement Learning (RL) problem.

Bilal Piot, Matthieu Geist, Olivier Pietquin
Discrete Graph Hashing
Hashing has emerged as a popular technique for fast nearest neighbor search in gigantic databases. In particular, learning based hashing has received considerable attention due to its appealing storage and search efficiency. However, the performance of most unsupervised learning based hashing methods deteriorates rapidly as the hash code length increases. We argue that the degraded performance is due to inferior optimization procedures used to achieve discrete binary codes. This paper presents a graph-based unsupervised hashing model to preserve the neighborhood structure of massive data in a discrete code space. We cast the graph hashing problem into a discrete optimization framework which directly learns the binary codes. A tractable alternating maximization algorithm is then proposed to explicitly deal with the discrete constraints, yielding high-quality codes that capture the local neighborhoods well. Extensive experiments performed on four large datasets with up to one million samples show that our discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art unsupervised hashing methods, especially for longer codes.

Wei Liu, Cun Mu, Sanjiv Kumar, Shih-Fu Chang
Exploiting easy data in online optimization
We consider the problem of online optimization, where a learner picks a decision from a given decision set and suffers some loss associated with the decision and the state of the environment. The learner's objective is to minimize its cumulative regret against the best fixed decision \textit{in hindsight}. Over the past few decades numerous variants have been considered, with many algorithms designed to achieve sublinear regret in the worst case. However, this level of robustness comes at a cost. Proposed algorithms are often over-conservative, failing to adapt to the \textit{actual} complexity of the loss sequence, which is often far from the worst case. In this paper we introduce a general algorithm that, provided with a ``safe'' learning algorithm and an opportunistic ``benchmark'', is able to effectively combine good worst-case guarantees with much improved performance on ``easy'' data. We derive general theoretical bounds on the regret of the proposed algorithm and discuss its implementation in a wide range of applications, notably in the problem of learning with shifting experts (a recent COLT open problem). Finally, we provide numerical simulations in the setting of prediction with expert advice with comparison to the state-of-the-art.

Amir Sani, Gergely Neu, Alessandro Lazaric
Extended and Unscented Gaussian Processes
We present two new methods for inference in Gaussian process (GP) models with general nonlinear likelihoods. Inference is based on a variational framework where a Gaussian posterior is assumed and the likelihood is linearized about the variational posterior mean using either a Taylor series expansion, or statistical linearization. We show that the parameter updates obtained by these algorithms are equivalent to the state update equations in the iterative extended and unscented Kalman filters respectively, hence we refer to our algorithms as extended and unscented GPs. The unscented GP treats the likelihood as a 'black-box' by not requiring its derivative for inference, so it also applies to non-differentiable likelihood models. We evaluate the performance of our algorithms on a number of synthetic inversion problems and a binary classification dataset.

Daniel Steinberg, Edwin Bonilla
Fast Multivariate Spatio-temporal Analysis via Low Rank Tensor Learning
Analyzing multivariate spatio-temporal data accurately and efficiently is critical to climatology, geology and sociology applications. Existing models usually represent the data with matrices and underestimate the inter-dependence among variables, space and time. We formulate two main tasks in multivariate spatio-temporal analysis, cokriging and forecasting, into a unified low rank tensor learning framework which incorporates the multifaceted commonalities of the data. We develop two learning algorithms. The first is an alternating direction method for solving the convex relaxation of the problem. The second is an efficient greedy algorithm with a guaranteed globally optimal solution. We evaluate our framework on synthetic datasets, and conduct cokriging and forecasting tasks on real application datasets. We demonstrate that our framework is not only significantly faster than existing methods but also achieves lower estimation error and model complexity.

Rose Yu, Mohammad Taha Bahadori, Yan Liu
Fast and Robust Least Squares Estimation in Corrupted Linear Models
Subsampling methods have been recently proposed to speed up least squares estimation in large scale settings. However, these algorithms are typically not robust to outliers or corruptions in the observed covariates. The concept of influence that was developed for regression diagnostics can be used to detect such corrupted observations as shown in this paper. This property of influence -- for which we also develop a randomized approximation -- motivates our proposed subsampling algorithm for large scale corrupted linear regression which limits the influence of data points since highly influential points contribute most to the residual error. Under a general model of corrupted observations, we show theoretically and empirically on a variety of simulated and real datasets that our algorithm improves over the current state-of-the-art approximation schemes for ordinary least squares.

Brian McWilliams, Gabriel Krummenacher, Mario Lucic, Joachim Buhmann
Greedy Algorithms for Finding Diverse Subsets in Exponentially-Large Structured Item Sets
Intelligent systems for perception domains such as Computer Vision or Natural Language Processing are required to deal with tremendous levels of ambiguity. Hence, robust approaches search for a set of diverse, high-quality solutions. This is challenging in structured prediction problems because the space of the solutions (image segmentations, sentence parses, etc.) is exponentially large. We study greedy algorithms for finding diverse solutions in structured-output spaces by drawing new connections between submodular functions over combinatorial item sets and High-Order Potentials (HOPs) studied for graphical models. Specifically, we show via examples that when marginal gains of submodular diversity functions allow structured representations, this enables efficient (sub-linear time) approximate maximization by reducing the greedy augmentation step to an inference problem on a factor graph with appropriately constructed HOPs. We discuss benefits and trade-offs, and experimentally demonstrate that our constructions lead to efficient algorithms and significantly better solutions.

Adarsh Prasad, Stefanie Jegelka, Dhruv Batra
Hamming Ball Auxiliary Sampling for Factorial Hidden Markov Models
We introduce a novel sampling algorithm for Markov Chain Monte Carlo-based Bayesian inference for Factorial Hidden Markov Models. The sampling algorithm uses an auxiliary variable construction that restricts the model space, allowing iterative exploration in polynomial time. The method gives exact samples from the true posterior distribution and is computationally tractable even for quite large models. The sampling approach overcomes limitations with common conditional Gibbs samplers that use asymmetric updates and become easily trapped in local modes. Instead, our method uses symmetric moves that allow joint updating of the latent sequences and improve mixing. We illustrate the application of the approach with a simulated and a real data example.

Michalis Titsias, Christopher Yau
Large-scale L-BFGS using MapReduce
L-BFGS has been applied as an effective parameter estimation method for various machine learning algorithms since the 1980s. With an increasing demand to deal with massive instances and variables, it is very valuable to scale up and parallelize L-BFGS effectively in a distributed system. In this paper, we study the problem of parallelizing the L-BFGS algorithm in large clusters of tens of thousands of shared-nothing commodity machines. First, we show that a naive implementation of L-BFGS using Map-Reduce requires either a significant amount of memory or a large number of map-reduce operations with negative performance impact. Second, we propose a new L-BFGS algorithm, called Vector-free L-BFGS, which avoids the expensive dot product operations in the two-loop recursion and greatly improves computation efficiency with a great degree of parallelism. The algorithm scales very well and enables a variety of machine learning algorithms to handle a massive number of variables over large datasets. We prove the mathematical equivalence of the new Vector-free L-BFGS and demonstrate its excellent performance and scalability using real-world machine learning problems with billions of variables in production clusters.

Weizhu Chen, Zhenghao Wang, Jingren Zhou
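For readers unfamiliar with the two-loop recursion mentioned in the abstract of "Large-scale L-BFGS using MapReduce" above, the following is a minimal single-machine Python sketch of the standard recursion. It is not the Vector-free variant the paper proposes. The function names and the toy curvature pairs are hypothetical. The repeated `dot` calls on full-length vectors are the operations that become expensive when the vectors are spread across many machines.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def two_loop(grad, s_hist, y_hist):
    """Return H * grad, where H is the L-BFGS inverse-Hessian approximation.

    s_hist[i] = x_{i+1} - x_i and y_hist[i] = g_{i+1} - g_i, oldest first.
    """
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_hist, y_hist)]
    alphas = [0.0] * len(s_hist)
    # First loop: newest pair to oldest.
    for i in reversed(range(len(s_hist))):
        alphas[i] = rhos[i] * dot(s_hist[i], q)
        q = [qj - alphas[i] * yj for qj, yj in zip(q, y_hist[i])]
    # Initial scaling H0 = gamma * I, using the most recent pair.
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    else:
        gamma = 1.0
    r = [gamma * qj for qj in q]
    # Second loop: oldest pair to newest.
    for i in range(len(s_hist)):
        beta = rhos[i] * dot(y_hist[i], r)
        r = [rj + (alphas[i] - beta) * sj for rj, sj in zip(r, s_hist[i])]
    return r


# Toy curvature pairs from f(x) = 0.5 * (x0^2 + 4 * x1^2), so y = A s.
s_hist = [[1.0, 0.0], [0.5, 1.0]]
y_hist = [[1.0, 0.0], [0.5, 4.0]]
direction = two_loop([0.5, 4.0], s_hist, y_hist)
```

As a sanity check, applying the recursion to the most recent gradient difference returns the most recent step (the secant condition), so `direction` equals `s_hist[-1]` up to rounding.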
Learning Deep Features for Scene Recognition using PLACES Database
Scene recognition is one of the hallmark tasks of computer vision, allowing one to define a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This is because current deep features pre-trained on ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called PLACES with more than 6 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that PLACES is as dense as other scene datasets and has more diversity. Using CNNs, we learn deep features for scene recognition tasks, and establish new state-of-the-art performance on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show the differences in the internal representations of object-centric and scene-centric networks.

Bolei Zhou, Jianxiong Xiao, Agata Garcia, Aude Oliva, Antonio Torralba
Learning Distributional Representations for Structured Output Prediction
In recent years, distributed representations of inputs have led to performance gains in many applications by allowing statistical information to be shared across inputs. However, the predicted outputs (labels, or more generally, structures like trees) are still treated as discrete objects, even though outputs are also not discrete units of meaning. In this paper, we present a new formulation for structured prediction where we represent individual labels in a structure as real-valued vectors allowing semantically similar labels to share parameters. We extend this representation to larger structures by defining compositionality using tensor products and show that our approach is a natural extension to standard structured prediction approaches. We propose a learning objective for jointly learning the model parameters and the label vectors and define an alternating minimization algorithm for learning. We apply our formulation to two tasks -- multiclass document classification and part-of-speech tagging (a sequence model) and show that we outperform standard structured learning baselines.

Vivek Srikumar, Christopher Manning
Learning Mixtures of Ranking Models
This work concerns learning probabilistic models for ranking data in a heterogeneous population. The specific problem we study is learning the parameters of a {\em Mallows Mixture Model}. Despite being widely studied, current heuristics for this problem do not have theoretical guarantees and can get stuck in bad local optima. We present the first polynomial time algorithm which provably learns the parameters of a mixture of two Mallows models. A key component of our algorithm is a novel use of tensor decomposition techniques to learn the top-$k$ prefix in both the rankings. Before this work, even the question of {\em identifiability} in the case of a mixture of two Mallows models was unresolved.

Pranjal Awasthi, Avrim Blum, Or Sheffet, Aravindan Vijayaraghavan
Learning from Latent and Observable Patterns on Multi-Relational Data
Factorizations of adjacency tensors have become popular methods for learning from multi-relational data. For these approaches, the rank of the factorization is an important parameter that determines runtime as well as generalization ability. To determine conditions under which factorization is an efficient approach for learning from relational data, we derive upper and lower bounds on the required rank to recover adjacency tensors. Based on our findings, we propose a novel additive tensor factorization model and present a scalable algorithm for computing the factorization. Experimentally, we show that the proposed approach does not only improve the predictive performance of pure factorization methods but that it also reduces the required rank -- and therefore the runtime and memory complexity -- significantly.

Maximilian Nickel, Volker Tresp, Xueyan Jiang
Learning to Discover Efficient Mathematical Identities
In this paper we explore how machine learning techniques can be applied to the discovery of efficient mathematical identities. We introduce an attribute grammar framework for representing symbolic expressions. Given a set of grammar rules we build trees that combine different rules, looking for branches which yield compositions that are analytically equivalent to a target expression, but of lower computational complexity. However, as the size of the trees grows exponentially with the complexity of the target expression, brute force search is impractical for all but the simplest of expressions. Consequently, we introduce two novel learning approaches that are able to learn from simpler expressions to guide the tree search. The first of these is a simple n-gram model, the other being a recursive neural-network. We show how these approaches enable us to derive complex identities, beyond reach of brute-force search, or human derivation.

Wojciech Zaremba, Karol Kurach, Rob Fergus
Log-Hilbert-Schmidt metric between positive definite operators on Hilbert spaces
This paper introduces a novel mathematical and computational framework, namely {\it Log-Hilbert-Schmidt metric} between positive definite operators on a Hilbert space. This is a generalization of the Log-Euclidean metric on the Riemannian manifold of positive definite matrices to the infinite-dimensional setting. The general framework is applied in particular to compute distances between covariance operators on a Reproducing Kernel Hilbert Space (RKHS), for which we obtain explicit formulas via the corresponding Gram matrices. Empirically, we apply our formulation to the task of multi-category image classification, where each image is represented by an infinite-dimensional RKHS covariance operator. On several challenging datasets, our method significantly outperforms approaches based on covariance matrices computed directly on the original input features, including those using the Log-Euclidean metric, Stein and Jeffreys divergences, achieving new state of the art results.

Minh Ha Quang, Marco San Biagio, Vittorio Murino
Making Pairwise Binary Graphical Models Attractive
Estimating the partition function of a given pairwise binary graphical model is NP-hard in general. As a result, the partition function is typically estimated by approximate inference algorithms such as belief propagation (BP) and tree-reweighted belief propagation (TRBP). The former provides reasonable estimates in practice but has convergence issues. The latter has better convergence properties but typically provides worse estimates. In this work, we propose a novel scheme that has better convergence properties than BP and provably provides better partition function estimates than TRBP. We accomplish this using a special double-cover which essentially replaces a general graphical model with a fully attractive model.

Nicholas Ruozzi, Tony Jebara
Mode Estimation for High Dimensional Discrete Tree Graphical Models
This paper studies the following problem: given samples from a high dimensional discrete distribution, we want to estimate the leading $(\delta,\rho)$-modes of the underlying distributions. A point is defined to be a $(\delta,\rho)$-mode if it is a local optimum of the density within a $\delta$-neighborhood under metric $\rho$. As we increase the ``scale'' parameter $\delta$, the neighborhood size increases and the total number of modes monotonically decreases. The sequence of the $(\delta,\rho)$-modes reveals intrinsic topographical information of the underlying distributions. Though the mode finding problem is generally intractable in high dimensions, this paper unveils that, if the distribution can be approximated well by a tree graphical model, mode characterization is significantly easier. An efficient algorithm with provable theoretical guarantees is proposed and is applied to applications like data analysis and multiple prediction.

Chao Chen, Tianqi Zhao, Han Liu, Dimitris Metaxas
Near-optimal Reinforcement Learning in Factored MDPs
Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = \Omega(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a \emph{factored} MDP, it is possible to achieve regret that scales polynomially in the number of \emph{parameters} encoding the factored MDP, which may be exponentially smaller than $S$ or $A$. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).

Ian Osband, Benjamin Van Roy
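The step from the regret lower bound to the learning-time bound in the abstract of "Near-optimal Reinforcement Learning in Factored MDPs" above can be unpacked as follows. This is a sketch that ignores logarithmic factors; the constant $c > 0$ and the target average per-step regret $\epsilon$ are notation introduced here for illustration and do not appear in the abstract. On the worst-case MDP,

$$
\mathrm{Regret}(T) \ge c\sqrt{SAT}
\quad\Longrightarrow\quad
\frac{\mathrm{Regret}(T)}{T} \ge c\sqrt{\frac{SA}{T}} .
$$

For the average regret to be at most $\epsilon$, one therefore needs

$$
c\sqrt{\frac{SA}{T}} \le \epsilon
\quad\Longleftrightarrow\quad
T \ge \frac{c^2\, SA}{\epsilon^2},
$$

which for fixed $\epsilon$ gives $T = \Omega(SA)$.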
Non-convex Robust PCA
We propose a new provable method for robust PCA, where the task is to recover a low-rank matrix, which is corrupted with sparse perturbations. Our method consists of simple alternating projections onto the set of low rank and sparse matrices with intermediate de-noising steps. We prove correct recovery of the low rank and sparse components under tight recovery conditions, which match those for the state-of-the-art convex relaxation techniques. Our method is extremely simple to implement and has low computational complexity. For an $m \times n$ input matrix (say $m \geq n$), our method has $O(r^2 mn\log(1/\epsilon))$ running time, where $r$ is the rank of the low-rank component and $\epsilon$ is the accuracy. In contrast, the convex relaxation methods have a running time $O(mn^2/\epsilon)$, which is not scalable to large problem instances. Our running time nearly matches that of the usual PCA (i.e. non robust), which is $O(rmn\log (1/\epsilon))$. Thus, we achieve ``best of both worlds'', viz low computational complexity and provable recovery for robust PCA. Our analysis represents one of the few instances of global convergence guarantees for non-convex methods.

Praneeth Netrapalli, Niranjan U N, Sujay Sanghavi, Anima Anandkumar, Prateek Jain
Optimal Regret Minimization in Posted-Price Auctions with Strategic Buyers
We study revenue optimization learning algorithms for posted-price auctions with strategic buyers. We analyze a very broad family of monotone regret minimization algorithms for this problem, which includes the previous best known algorithm, and show that no algorithm in that family admits a strategic regret more favorable than $\Omega(\sqrt{T})$. We then introduce a new algorithm that achieves a strategic regret differing from the lower bound only by a factor in $O(\log T)$, an exponential improvement upon the previous best algorithm. Our new algorithm admits a natural analysis and simpler proofs, and the ideas behind its design are general. We also report the results of empirical evaluations comparing our algorithm with the previous best algorithm and show a consistent exponential improvement in several different scenarios.

Mehryar Mohri, Andres Munoz Medina
Optimal Teaching for Limited-Capacity Human Learners
Basic decisions, such as judging a person as a friend or foe, involve categorizing novel stimuli. Recent work finds that people's category judgments are guided by a small set of examples that are retrieved from memory at the time of decision. This limited and stochastic retrieval places limits on human performance for probabilistic classification decisions, such as classifying a mammogram as normal or tumorous. In light of this capacity limitation, recent work finds that idealizing training items, such that the saliency of ambiguous cases is reduced, improves human performance on novel test items. These idealization manipulations run contrary to common machine learning practices where the aim is to match training and test distributions. One shortcoming of previous work in idealization is that category distributions were idealized in an ad hoc or heuristic fashion, guided only by the intuitions of the experimenters. In this contribution, we take a first principles approach to constructing idealized training sets. We apply an optimal teaching procedure to a cognitive model that is either limited capacity (as humans are) or unlimited capacity (as most machine learning systems are). As predicted, we find that the optimal teacher recommends idealized training sets. We also find that human learners perform best when training recommendations from the optimal teacher are based on a limited-capacity model of the learner. As predicted, to the extent that the learning model used by the optimal teacher conforms to the true nature of human learners, the recommendations of the optimal teacher prove effective. Our results provide a normative basis (given capacity constraints) for idealization procedures and offer a novel selection procedure for models of human learning.

Kaustubh Patil, Xiaojin Zhu, Lukasz Kopec, Bradley Love
Optimal decision-making with time-varying evidence reliability
Previous theoretical and experimental work on optimal decision-making was restricted to the artificial setting in which the reliability of the momentary sensory evidence remains constant within single trials. The work presented here describes the computation and characterization of optimal decision-making in the more realistic case of an evidence reliability that varies across time even within a trial. It shows that, in this case, the optimal behavior is determined by a bound in the decision maker's belief that depends on the current reliability. We furthermore demonstrate that simpler heuristics fail to match the optimal performance for certain characteristics of the process that determines the time-course of this reliability.

Jan Drugowitsch, Ruben Moreno-Bote, Alexandre Pouget
Optimizing Energy Production Using Policy Search and Predictive State Representations
We consider the challenging practical problem of optimizing the power production of a complex of hydroelectric power plants, which involves control over three continuous action variables, uncertainty in the amount of water inflows and a variety of constraints that need to be satisfied. We propose a policy-search-based approach coupled with predictive modelling to address this problem. This approach has some key advantages compared to other alternatives, such as dynamic programming: the policy representation and search algorithm can conveniently incorporate domain knowledge; the resulting policies are easy to interpret; and the policy search algorithm is naturally parallelizable. We demonstrate that our algorithm obtains a policy which outperforms the solution found by dynamic programming both quantitatively and qualitatively.

Yuri Grinberg, Doina Precup, Michel Gendreau
Poisson Process Jumping between an Unknown Number of Rates: Application to Neural Spike Data
We introduce a model where the rate of an inhomogeneous Poisson process is modified by a Chinese restaurant process. Applying an MCMC sampler to this model allows us to do posterior Bayesian inference about the number of states in Poisson-like data. Our sampler is shown to get accurate results for synthetic data and we apply it to V1 neuron spike data to find discrete firing rate states depending on the orientation of a stimulus.

Florian Stimberg, Andreas Ruttor, Manfred Opper
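
The Chinese restaurant process mentioned in the abstract above can be illustrated with a short sampler. This is only a minimal sketch of the standard CRP prior over partitions, not the authors' model or MCMC sampler; the function name `sample_crp` and the parameters `alpha` and `seed` are assumptions introduced here for illustration.

```python
import random


def sample_crp(num_customers, alpha, seed=0):
    """Draw table assignments from a Chinese restaurant process (illustrative only)."""
    rng = random.Random(seed)
    assignments = []  # assignments[n] = table index of customer n
    counts = []       # counts[t] = number of customers seated at table t
    for _ in range(num_customers):
        # An existing table t is chosen with weight counts[t];
        # a new table is opened with weight alpha (alpha > 0).
        weights = counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, counts


assignments, counts = sample_crp(num_customers=100, alpha=1.0)
```

In a model of the kind described, each table would plausibly correspond to one discrete rate, so the number of occupied tables is not fixed in advance.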
Predictive Entropy Search for Efficient Global Optimization of Black-box Functions
We propose a novel information-theoretic approach for Bayesian optimization called Predictive Entropy Search (PES). At each iteration, PES selects the next evaluation point that maximizes the expected information gained with respect to the global maximum. PES codifies this intractable acquisition function in terms of the expected reduction in the differential entropy of the predictive distribution. This reformulation allows PES to obtain approximations that are both more accurate and efficient than other alternatives such as Entropy Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment of the model hyperparameters while ES cannot. We evaluate PES in both synthetic and real-world applications, including optimization problems in machine learning, finance, biotechnology, and robotics. We show that the increased accuracy of PES leads to significant gains in optimization performance.

José Miguel Hernández-Lobato, Matthew Hoffman, Zoubin Ghahramani
RAAM: The Benefits of Robustness in Approximating Aggregated MDPs in Reinforcement Learning
We describe how to use robust Markov decision processes for value function approximation with state aggregation. The robustness is introduced to reduce the sensitivity to the approximation error of sub-optimal policies in comparison with methods such as fitted value iteration. This results in reducing the bounds on the $\gamma$-discounted infinite horizon performance loss by a factor of $1/(1-\gamma)$ while preserving polynomial-time computational complexity. Our experimental results show that using the robust representation can significantly improve the solution quality with minimal additional computational cost.

Marek Petrik, Dharmashankar Subramanian
Randomized Experimental Design for Causal Graph Discovery
We examine the number of controlled experiments required to discover a causal graph. Hauser and Bühlmann showed that the number of experiments required is logarithmic in the cardinality of the maximum undirected clique in the essential graph. Their lower bounds, however, assume that the experiment designer cannot use randomization in selecting the experiments. We show that significant improvements are possible with the aid of randomization -- in an adversarial (worst-case) setting, the designer can then recover the causal graph using at most O(log log n) experiments in expectation. This bound cannot be improved; we show it is tight for some causal graphs. This result can be viewed in the adversary versus designer game framework of Eberhardt; it shows that huge performance improvements are possible when moving from single-variable interventions to multi-variable interventions. We then show that in a non-adversarial (average-case) setting, even larger improvements are possible: if the causal graph is chosen uniformly at random under an Erdős-Rényi model then the expected number of experiments to discover the causal graph is constant. Finally, we present computer simulations to complement our theoretical results. Our work exploits a structural characterization of essential graphs by Andersson et al. Their characterization is based upon a set of orientation forcing operations. Our results show a distinction between which forcing operations are most important in worst-case and average-case settings.

Huining Hu, Zhentao Li, Adrian Vetta
Rates of Convergence for Nearest Neighbor Classification
Nearest neighbor methods are a popular class of nonparametric estimators with several desirable properties, such as adaptivity to different distance scales in different regions of space. Prior work on convergence rates for nearest neighbor classification has not fully reflected these subtle properties. We analyze the behavior of these estimators in metric spaces and provide finite-sample, distribution-dependent rates of convergence under minimal assumptions. As a by-product, we are able to establish the universal consistency of nearest neighbor in a broader range of data spaces than was previously known. We illustrate our upper and lower bounds by introducing smoothness classes that are customized for nearest neighbor classification.

Kamalika Chaudhuri, Sanjoy Dasgupta
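
For readers unfamiliar with the estimator analyzed in the abstract above, a k-nearest-neighbor classifier over an arbitrary metric can be sketched in a few lines. This is the textbook majority-vote rule, not the paper's analysis; the names `knn_classify` and `abs_dist` and the toy data are assumptions made for illustration.

```python
from collections import Counter


def knn_classify(train, query, k, dist):
    """Majority vote among the k training points closest to `query` under `dist`."""
    # Sort labelled points by distance to the query and keep the k nearest.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def abs_dist(a, b):
    """Metric on the real line; any metric on the data space could be passed instead."""
    return abs(a - b)


train = [(0.0, "a"), (0.2, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
label_near_zero = knn_classify(train, 0.1, k=3, dist=abs_dist)
label_near_one = knn_classify(train, 0.95, k=3, dist=abs_dist)
```

Passing the distance as a function reflects the metric-space setting: the rule needs only pairwise distances, not coordinates.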
Recovery of Coherent Data via Low-Rank Dictionary Pursuit
The recently established RPCA [4] method provides a convenient way to restore low-rank matrices from grossly corrupted observations. While elegant in theory and powerful in reality, RPCA may not be an ultimate solution to the low-rank matrix recovery problem. Indeed, its performance may not be perfect even when data are strictly low-rank. This is because RPCA ignores the clustering structures of the data which are ubiquitous in modern applications. As the number of clusters grows, the coherence of data keeps increasing, and accordingly, the recovery performance of RPCA degrades. We show that the challenges raised by coherent data (i.e., the data with high coherence) could be alleviated by Low-Rank Representation (LRR) [15], provided that the dictionary in LRR is configured appropriately. More precisely, we mathematically prove that if the dictionary itself is low-rank then LRR is immune to the coherence parameter which increases with the underlying cluster number. This provides an elementary principle for dealing with coherent data and naturally leads to a practical algorithm for obtaining proper dictionaries in unsupervised environments. Our extensive experiments on randomly generated matrices and real motion sequences verify our claims.

Guangcan Liu, Ping Li
Recurrent Models of Visual Attention
Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.

Volodymyr Mnih, Nicolas Heess, Koray Kavukcuoglu
Robust Classification Under Sample Selection Bias
In many important machine learning applications, the source distribution used to estimate a probabilistic classifier differs from the target distribution on which the classifier will be used to make predictions. Due to its asymptotic properties, sample-reweighted loss minimization is a commonly employed technique to deal with this difference. However, given finite amounts of labeled source data, this technique suffers from significant estimation errors in settings with large sample selection bias. We develop a framework for robustly learning a probabilistic classifier that adapts to different sample selection biases using a minimax estimation formulation. Our approach requires only accurate estimates of statistics under the source distribution and is otherwise as robust as possible to unknown properties of the conditional label distribution, except when explicit generalization assumptions are incorporated. We demonstrate the behavior and effectiveness of our approach on synthetic and UCI binary classification tasks.

Anqi Liu, Brian Ziebart
Scalable Inference for Neuronal Connectivity from Calcium Imaging
Fluorescent calcium imaging provides a potentially powerful tool for inferring connectivity in neural circuits with up to thousands of neurons. However, a key challenge in using calcium imaging for connectivity detection is that current systems often have a temporal response and frame rate that can be orders of magnitude slower than the underlying neural spiking process. Bayesian inference methods based on expectation-maximization (EM) have been proposed to overcome these limitations, but they are often computationally demanding since the E-step in the EM procedure typically involves state estimation in a high-dimensional nonlinear dynamical system. In this work, we propose a computationally fast method for the state estimation based on a hybrid of loopy belief propagation and approximate message passing (AMP). The key insight is that a neural system as viewed through calcium imaging can be factorized into simple scalar dynamical systems for each neuron with linear interconnections between the neurons. Using the structure, the updates in the proposed hybrid AMP methodology can be computed by a set of one-dimensional state estimation procedures and linear transforms with the connectivity matrix. This yields a computationally scalable method for inferring connectivity of large neural circuits. Simulations of the method on realistic neural networks demonstrate good accuracy with computation times that are potentially significantly faster than current approaches based on Markov Chain Monte Carlo methods.

Alyson Fletcher, Sundeep Rangan
Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices
Numerous algorithms are used for nonnegative matrix factorization under the assumption that the matrix is nearly separable. In this paper, we show how to make these algorithms scalable for data matrices that have many more rows than columns, so-called "tall-and-skinny matrices." One key component to these improved methods is an orthogonal matrix transformation that preserves the separability of the NMF problem. Our final methods need to read the data matrix only once and are suitable for streaming, multi-core, and MapReduce architectures. We demonstrate the efficacy of these algorithms on terabyte-sized matrices from scientific computing and bioinformatics.

Austin Benson, Jason Lee, Bartek Rajwa, David Gleich
Searching for Higgs Boson Decay Modes with Deep Learning
Particle colliders enable us to probe the fundamental nature of matter by observing exotic particles produced by high-energy collisions. Because the experimental measurements from these collisions are necessarily incomplete and imprecise, machine learning algorithms play a major role in the analysis of experimental data. The high-energy physics community typically relies on standardized machine learning software packages for this analysis, and devotes substantial effort towards improving statistical power by hand crafting high-level features derived from the raw collider measurements. In this paper, we train artificial neural networks to detect the decay of the Higgs boson to tau leptons on a dataset of 82 million simulated collision events. We demonstrate that deep neural network architectures are particularly well-suited for this task with the ability to automatically discover high-level features from the data and increase discovery significance.

Peter Sadowski, Daniel Whiteson, Pierre Baldi
Semi-supervised Learning with Deep Generative Models
The ever-increasing size of modern data sets combined with the difficulty of obtaining label information has made semi-supervised learning one of the problems of significant practical importance in modern data analysis. We revisit the approach to semi-supervised learning with generative models and develop new models that allow for effective generalisation from small labelled data sets to large unlabelled ones. Generative approaches have thus far been either inflexible, inefficient or non-scalable. We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.

Diederik Kingma, Shakir Mohamed, Danilo Rezende, Max Welling
Sparse Space-Time Deconvolution for Calcium Image Analysis
We describe a unified formulation and algorithm to find an extremely sparse representation for Calcium image sequences in terms of cell locations, cell shapes, spike timings and impulse responses. Solution of a single optimization problem yields cell segmentations and activity estimates that are on par with the state of the art, without the need for heuristic pre- or postprocessing. Experiments on real and synthetic data demonstrate the viability of the proposed method.

Ferran Diego Andilla, Fred Hamprecht
Spatio-temporal Representations of Uncertainty in Spiking Neural Networks
It has been long argued that, because of inherent ambiguity and noise, the brain needs to represent uncertainty in the form of probability distributions. The neural encoding of such distributions remains however highly controversial. Here we present a novel circuit model for representing multidimensional real-valued distributions using a spike based spatio-temporal code. Our model combines the computational advantages of the currently competing models for probabilistic codes and exhibits realistic neural responses along a variety of classic measures. Furthermore, the model highlights the challenges associated with interpreting neural activity in relation to behavioral uncertainty and points to alternative population-level approaches for the experimental validation of distributed representations.

Cristina Savin, Sophie Deneve
Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing
The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.

Yuchen Zhang, Xi Chen, Dengyong Zhou, Michael Jordan
The Noisy Power Method: A Meta Algorithm with Applications
We provide a new robust convergence analysis of the well-known power method for computing the dominant singular vectors of a matrix, which we call the noisy power method. Our result characterizes the convergence behavior of the algorithm when a large amount of noise is introduced after each matrix-vector multiplication. The noisy power method can be seen as a meta-algorithm that has recently found a number of important applications in a broad range of machine learning problems including alternating minimization for matrix completion, streaming principal component analysis (PCA), and privacy-preserving spectral analysis. Our general analysis subsumes several existing ad-hoc convergence bounds and resolves a number of open problems in multiple applications. A recent work of Mitliagkas et al. (NIPS 2013) gives a space-efficient algorithm for PCA in a streaming model where samples are drawn from a spiked covariance model. We give a simpler and more general analysis that applies to arbitrary distributions. Moreover, even in the spiked covariance model our result gives quantitative improvements in a natural parameter regime. As a second application, we provide an algorithm for differentially private principal component analysis that runs in nearly linear time in the input sparsity and achieves nearly tight worst-case error bounds. Complementing our worst-case bounds, we show that the error dependence of our algorithm on the matrix dimension can be replaced by an essentially tight dependence on the coherence of the matrix. This result resolves the main problem left open by Hardt and Roth (STOC 2013) and leads to strong average-case improvements over the optimal worst-case bound.

Moritz Hardt, Eric Price
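
The idea in the abstract above, power iteration with a perturbation added after every matrix-vector product, can be sketched as follows. This is a simplified single-vector variant for a small symmetric matrix, written under assumptions of our own: the paper treats the dominant singular vectors in general, and the Gaussian noise model, the function name `noisy_power_method`, and the parameters `noise_scale` and `seed` are hypothetical choices for illustration.

```python
import math
import random


def noisy_power_method(matrix, num_iters=50, noise_scale=1e-3, seed=0):
    """Power iteration with Gaussian noise added after each matrix-vector product."""
    rng = random.Random(seed)
    n = len(matrix)
    # Random Gaussian start vector, normalized to unit length.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    for _ in range(num_iters):
        # Matrix-vector product y = A x.
        y = [sum(row[j] * x[j] for j in range(n)) for row in matrix]
        # Additive perturbation (assumed Gaussian here), standing in for
        # sampling error or deliberately injected privacy noise.
        y = [v + rng.gauss(0.0, noise_scale) for v in y]
        # Renormalize so the iterate stays on the unit sphere.
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 0.0, 1.0]]
v = noisy_power_method(A)
# The Rayleigh quotient v^T A v estimates the dominant eigenvalue of A.
Av = [sum(row[j] * v[j] for j in range(3)) for row in A]
rayleigh = sum(v[i] * Av[i] for i in range(3))
```

For this matrix the dominant eigenvalue is $(7 + \sqrt{5})/2$, and with small noise the Rayleigh quotient should land close to it; larger values of `noise_scale` relative to the eigenvalue gap degrade the estimate.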
The Residual Bootstrap for High-Dimensional Regression with Low-Rank Designs
We study the residual bootstrap (RB) method in the context of high-dimensional linear regression. Specifically, we analyze the distributional approximation of linear contrasts. When regression coefficients are estimated via least squares or other M-estimation procedures, classical results show that RB consistently approximates the laws of contrasts, provided that $p/n \ll 1$, where $n$ is the number of observations, and $p$ is the dimension of the coefficient vector. Up to now, relatively little work has considered how additional structure in the linear model may extend the validity of the bootstrap to situations where $p/n \asymp 1$. In this regime, we focus on bootstrapping residuals obtained from ridge regression. Our main structural assumption on the design matrix is that it is nearly low rank --- in the sense that its singular values decay according to a power-law profile. Under a few extra technical assumptions, we derive a simple criterion for ensuring that RB consistently approximates the law of a given contrast. We then apply this result to study how well RB approximates laws of fitted values, conditionally on a Gaussian design matrix. When the design is generated in this way, we show that with high probability, RB successfully approximates the conditional laws of \emph{all} fitted values (simultaneously). This result is also notable insofar as it imposes no sparsity assumption on the true regression coefficients. Finally, we note that our approach is based on the Mallows (Kantorovich) metric, which allows us to prove consistency even when a limiting distribution may not exist.

Miles Lopes
The limits of squared Euclidean distance regularization
Some of the simplest loss functions considered in Machine Learning are the square loss, the logistic loss and the hinge loss. The most common algorithms, including Gradient Descent (GD) and Weight Decay, find a linear weight vector by trading off the total loss over the example set with the squared Euclidean distance regularizer. We give a random construction for sets of examples where a linear weight vector with good generalization performance is trivial to learn but the common squared Euclidean distance regularization leads to drastically suboptimal algorithms. Our lower bound on the latter algorithms holds even if the algorithms are enhanced with an arbitrary kernel function. This type of result was known for the square loss. However, we develop new techniques that let us prove such hardness results for any loss function satisfying some minimal requirements (including the three listed above). We also show that algorithms that regularize with the squared Euclidean distance are easily confused by random features. Finally, we conclude by discussing related open problems regarding feed forward neural networks. We conjecture that our hardness results hold for any training algorithm that is based on the squared Euclidean distance regularization (i.e. the Back Propagation algorithm).

Michal Derezinski, Manfred Warmuth
Trajectory Optimization under Unknown Dynamics for Policy Search
We present a policy search method that uses iteratively refitted local linear models to optimize trajectory distributions for large, continuous problems. These trajectory distributions can be used within the framework of guided policy search to learn policies with an arbitrary parameterization. Our method uses iteratively refitted local linear models to speed up learning, but does not rely on learning a global model, which can be difficult when the dynamics are complex and discontinuous. We show that this hybrid approach requires many fewer samples than model-free methods, and can handle complex, nonsmooth dynamics that can pose a challenge for model-based techniques. We present experiments showing that our method can be used to learn complex neural network policies that successfully execute simulated robotic manipulation tasks in partially observed environments with numerous contact discontinuities and underactuation.

Sergey Levine, Pieter Abbeel
Transportability from Multiple Environments with Limited Experiments: Completeness Results
This paper addresses the problem of $mz$-transportability, that is, transferring causal knowledge collected in several heterogeneous domains to a target domain in which only passive observations and limited experimental data can be collected. The paper first establishes a necessary and sufficient condition for deciding the feasibility of $mz$-transportability, i.e., whether causal effects in the target domain are estimable from the information available. It further proves that a previously established algorithm for computing transport formula [1] is in fact complete, that is, failure of the algorithm implies non-existence of a transport formula. Finally, the paper shows that the do-calculus is complete for the $mz$-transportability class.

Elias Bareinboim, Judea Pearl
Tree-structured Gaussian Process Approximations
Gaussian process regression can be accelerated by constructing a small pseudo-dataset to summarise the observed data. This idea sits at the heart of many approximation schemes, but such an approach requires the number of pseudo-datapoints to be scaled with the range of the input space if the accuracy of the approximation is to be maintained. This presents problems in time-series settings or in spatial datasets where large numbers of pseudo-datapoints are required since computation typically scales quadratically with the pseudo-dataset size. In this paper we devise an approximation whose complexity grows linearly with the number of pseudo-datapoints. This is achieved by imposing a tree or chain structure on the pseudo-datapoints and calibrating the approximation using a Kullback-Leibler (KL) minimisation. Inference and learning can then be performed efficiently using the Gaussian belief propagation algorithm. We demonstrate the validity of our approach on a set of challenging regression tasks including missing data imputation for audio and spatial datasets. We trace out the speed-accuracy trade-off for the new method and show that the frontier dominates those obtained from a large number of existing approximation techniques.

Thang Bui, Richard Turner
Two-Layer Feature Reduction for Sparse-Group Lasso via Decomposition of Convex Sets
Sparse-Group Lasso (SGL) has been shown to be a powerful regression technique for simultaneously discovering group and within-group sparse patterns by using a combination of the $\ell_1$ and $\ell_2$ norms. However, in large-scale applications, the complexity of the regularizers entails great computational challenges. In this paper, we propose a novel two-layer feature reduction method (TLFre) for SGL via a decomposition of its dual feasible set. The two-layer reduction is able to quickly identify the inactive groups and the inactive features, respectively, which are guaranteed to be absent from the sparse representation and can be removed from the optimization. Existing feature reduction methods are only applicable to sparse models with one sparsity-inducing regularizer. To the best of our knowledge, TLFre is the first one that is capable of dealing with multiple sparsity-inducing regularizers. Moreover, TLFre has a very low computational cost and can be integrated with any existing solvers. Experiments on both synthetic and real data sets show that TLFre improves the efficiency of SGL by orders of magnitude.

Jie Wang, Jieping Ye
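
The combination of norms referred to in the abstract above can be made concrete by evaluating a Sparse-Group Lasso penalty on a small coefficient vector. This is a sketch of one common formulation, not the screening method TLFre itself; weighting each group by the square root of its size is an assumed convention, and the names `sgl_penalty`, `lam1` and `lam2` are hypothetical.

```python
import math


def sgl_penalty(beta, groups, lam1, lam2):
    """Sparse-group lasso penalty: weighted group l2 norms plus an overall l1 norm."""
    # Group term: sum over groups of sqrt(group size) * ||beta_g||_2
    # (the sqrt(size) weighting is an assumed convention).
    group_term = sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[i] ** 2 for i in g))
        for g in groups
    )
    # Element-wise term: ||beta||_1, which encourages within-group sparsity.
    l1_term = sum(abs(b) for b in beta)
    return lam1 * group_term + lam2 * l1_term


beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]  # index sets partitioning the coefficients
penalty = sgl_penalty(beta, groups, lam1=1.0, lam2=0.5)
```

The second group contributes nothing to either term, which is the sense in which an inactive group can be dropped from the optimization without changing the objective.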
Two-Stream Convolutional Networks for Action Recognition in Videos
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

Karen Simonyan, Andrew Zisserman
Unsupervised Transcription of Piano Music
We present a novel probabilistic model for transcribing piano music from audio to a symbolic form. Our model follows the process by which discrete musical events give rise to acoustic signals that are then superimposed to produce the observed data. As a result, the inference procedure for our model naturally resolves the source separation problem introduced by the piano's polyphony. In order to adapt to the acoustic properties of any specific piano being transcribed, we learn instrument-specific spectral profiles and temporal envelopes in an unsupervised fashion. Our system outperforms the best published approaches to this task, achieving a 10.6% relative gain in note onset F1 on a standard set of piano audio.

Taylor Berg-Kirkpatrick, Jacob Andreas, Dan Klein
Unsupervised learning of an efficient short-term memory network
Learning in recurrent neural networks has been a topic fraught with difficulties and problems. We here report substantial progress in the unsupervised learning of recurrent networks that can keep track of an input signal. Specifically, we show how these networks can learn to efficiently represent their present and past inputs, based on local learning rules only. Our results are based on several key insights. First, we develop a local learning rule for the recurrent weights whose main aim is to drive the network into a regime where, on average, feedforward signal inputs are canceled by recurrent inputs. We show that this learning rule minimizes a cost function. Second, we develop a local learning rule for the feedforward weights that, based on networks in which recurrent inputs already predict feedforward inputs, further minimizes the cost. Third, we show how the learning rules can be modified such that the network can directly encode non-whitened inputs. Fourth, we show that these learning rules can also be applied to a network that feeds a time-delayed version of the network output back into itself. As a consequence, the network starts to efficiently represent both its signal inputs and their history. We develop our main theory for linear networks, but then show that the learning rules also hold for nonlinear networks in which firing rates are constrained to be positive. Finally, we sketch how the learning rules can be transferred to balanced, spiking networks.

Pietro Vertechi, Wieland Brendel, Christian Machens

Posters

A (Tag) Structure Regularization Framework for Structured Prediction
In structured prediction applications, many studies emphasize intensifying structural dependencies. However, this tendency could be misleading, because our study suggests that structure complexity is actually harmful to generalization ability in structured prediction. To control overfitting from structures, we propose a structure regularization framework via \emph{tag structure decomposition}, which decomposes training samples into mini-samples with simpler structures, deriving a model with better generalization power. We show both theoretically and empirically that (tag) structure regularization can effectively control overfitting from structures and lead to better accuracy. Interestingly, as a by-product, the proposed method can also substantially accelerate the training speed. We conduct experiments on several well-known tasks with diversified natures. Results demonstrate that our method is robust and can easily beat state-of-the-art systems on those highly-competitive tasks, achieving record-breaking accuracies yet with substantially faster training speed.

Xu Sun
A Block-Coordinate Descent Approach for Large-scale Sparse Inverse Covariance Estimation
The Sparse Inverse Covariance Estimation problem arises in many statistical applications in Machine Learning and Signal Processing. In this problem, the inverse of a covariance matrix of a multivariate normal distribution is estimated, assuming that it is sparse. An $\ell_1$ regularized log-determinant optimization problem is typically solved to approximate such matrices. Because of memory limitations, most existing algorithms are unable to handle large scale instances of this problem. In this paper we present a new block-coordinate descent approach for solving the problem for large-scale data sets. Our method treats the sought matrix block-by-block using quadratic approximations (Newton's method), and we show that this approach has advantages over existing methods in several aspects. Numerical experiments on both synthetic and real gene expression data demonstrate the efficiency of this approach, especially for large-scale problems.

Eran Treister, Javier Turek
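For orientation, the $\ell_1$ regularized log-determinant problem mentioned in the abstract above is commonly written as follows. This is the standard textbook form, not necessarily the exact formulation used in the paper; the symbols $S$ (empirical covariance matrix), $A$ (the sought sparse inverse covariance) and $\lambda > 0$ (regularization weight) are notation assumed here.

```latex
\[
\hat{A} \;=\; \arg\min_{A \succ 0} \; -\log\det A \;+\; \operatorname{tr}(S A) \;+\; \lambda \,\|A\|_1,
\qquad
\|A\|_1 \;=\; \sum_{i,j} |A_{ij}| .
\]
```

The first two terms form the negative Gaussian log-likelihood up to constants, and the $\ell_1$ term promotes sparsity in $A$.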
A Boosting Framework on Grounds of Online Learning
By exploiting the duality between boosting and online learning, we present a boosting framework which proves to be extremely powerful thanks to employing the vast knowledge available in the online learning area. Using this framework, we develop various algorithms to address multiple practically and theoretically interesting questions including sparse boosting, smooth-distribution boosting, agnostic learning and, as a side-product, some generalizations to double-projection online learning algorithms.

Tofigh Naghibi, Beat Pfister
A Complete Variational Tracker
We introduce a novel probabilistic tracking algorithm that incorporates combinatorial data association constraints and model-based track management using variational Bayes. We use a Bethe entropy approximation to incorporate data association constraints that are often ignored in previous probabilistic tracking algorithms. Noteworthy aspects of our method include a model-based mechanism to replace heuristic logic typically used to initiate and destroy tracks, and an assignment posterior with linear computation cost in window length as opposed to the exponential scaling of previous MAP-based approaches. We demonstrate the applicability of our method on radar tracking and computer vision problems.

Ryan Turner, Steven Bottone, Bhargav Avasarala
A Drifting-Games Analysis for Online Learning and Applications to Boosting
We provide a general mechanism to design online learning algorithms based on a minimax analysis within a drifting-games framework. Different online learning settings (Hedge, multi-armed bandit problems and online convex optimization) are studied by converting them into various kinds of drifting games. The original minimax analysis for drifting games is then used and generalized by applying a series of relaxations, starting from choosing a convex surrogate of the 0-1 loss function. With different choices of surrogates, we not only recover existing algorithms, but also propose new algorithms that are totally parameter-free and enjoy other useful properties. Moreover, our drifting-games framework naturally allows us to study high probability bounds without resorting to any concentration results, and also a generalized notion of regret that measures how good the algorithm is compared to all but the top small fraction of candidates. Finally, we translate our new Hedge algorithm into a new adaptive boosting algorithm that is computationally faster as shown in experiments, since it ignores a large number of examples on each round.

Haipeng Luo, Robert Schapire
A Dual Algorithm for Olfactory Computation in the Locust Brain
We study the early locust olfactory system in an attempt to explain its well-characterized structure and dynamics. We first propose its computational function as recovery of high-dimensional sparse olfactory signals from a small number of measurements. Detailed experimental knowledge about this system rules out standard solutions to this problem. Instead, we show that solving a dual formulation of this problem yields structure and dynamics in good agreement with biological fact. Further biological constraints lead us to a reduced form of this dual formulation in which the system uses independent component analysis to continuously adapt to its olfactory environment to allow accurate sparse recovery. Our work demonstrates the challenges and rewards of attempting detailed understanding of experimentally well-characterized systems.

Sina Tootoonian, Mate Lengyel
A Filtering Approach to Stochastic Variational Inference
Stochastic variational inference (SVI) uses stochastic optimization to scale up Bayesian computation to massive data. We present an alternative perspective on SVI as approximate parallel coordinate ascent. SVI trades off bias and variance to step close to the unknown true coordinate optimum given by batch variational Bayes (VB). We define a model to automate this process. The model infers the location of the next VB optimum from a sequence of noisy realizations. As a consequence of this construction we update the variational parameters using Bayes' rule, rather than a hand-crafted optimization schedule. When our model is a Kalman filter this procedure can recover the original SVI algorithm and SVI with adaptive steps. We may also encode additional assumptions in the model, such as heavy-tailed noise. By doing so, our algorithm outperforms the original SVI schedule and a state-of-the-art adaptive SVI algorithm in two diverse domains.

Neil Houlsby, David Blei
A Framework for Testing Identifiability of Bayesian Models of Perception
Bayesian observer models are very effective in describing human performance in perceptual tasks, so much so that they are trusted to faithfully recover hidden mental representations of priors, likelihoods, or loss functions from the data. However, the intrinsic degeneracy of the Bayesian framework, as multiple combinations of elements can yield empirically indistinguishable results, prompts the question of model identifiability. We propose a novel framework for a systematic testing of the identifiability of a significant class of Bayesian observer models, with practical applications for improving experiment design. We examine the theoretical identifiability of the inferred internal representations in two case studies. First, we show which experimental designs work better to remove the underlying degeneracy in a time interval estimation task. Second, we find that the reconstructed representations in a speed perception task under a slow-speed prior are fairly robust.

Luigi Acerbi, Wei Ji Ma, Sethu Vijayakumar
A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input
We propose a method for automatically answering questions about images by bringing together recent advances from natural language processing and computer vision. We combine discrete reasoning with uncertain prediction by a multi-world approach that represents uncertainty about the perceived world in a Bayesian framework. Our approach can handle human questions of high complexity about realistic scenes and replies with a range of answers such as counts, true/false, object classes, instances and lists of them. The system is directly trained from question-answer pairs. We establish a first benchmark for this task that can be seen as a modern attempt at a visual Turing test.

Mateusz Malinowski, Mario Fritz
A Multiplicative Model for Learning Distributed Text-Based Attribute Representations
In this paper we propose a general framework for learning distributed representations of attributes: characteristics of text whose representations can be jointly learned with word embeddings. Attributes can correspond to document indicators (to learn sentence vectors), language indicators (to learn distributed language representations), meta-data and side information (such as the age, gender and industry of a blogger) or representations of authors. We describe a third-order model where word context and attribute vectors interact multiplicatively to predict the next word in a sequence. This leads to the notion of conditional word similarity: how meanings of words change when conditioned on different attributes. We perform several experimental tasks including sentiment classification, cross-lingual document classification, and blog authorship attribution. We also qualitatively evaluate conditional word neighbours and attribute-conditioned text generation.

Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov
A Probabilistic Framework for Multimodal Retrieval using Integrative Indian Buffet Process
We propose a multimodal retrieval procedure based on latent feature models. The procedure consists of a nonparametric Bayesian framework for learning underlying semantically meaningful abstract features in a multimodal dataset, a probabilistic retrieval model that allows cross-modal queries and an extension model for relevance feedback. Experiments on two multimodal datasets, PASCAL-Sentence and SUN-Attribute, demonstrate the effectiveness of the proposed retrieval procedure in comparison to the state-of-the-art algorithms for learning binary codes.

Bahadir Ozdemir, Larry Davis
A Representation Theory for Ranking Functions
This paper presents a representation theory for permutation-valued functions, which in their general form can also be called listwise ranking functions. Pointwise ranking functions assign a score to each object independently, without taking into account the other objects under consideration; whereas listwise loss functions evaluate the set of scores assigned to all objects as a whole. In many supervised learning to rank tasks, it might be of interest to use listwise ranking functions instead; in particular, the Bayes Optimal ranking functions might themselves be listwise, especially if the loss function is listwise. A key caveat to using listwise ranking functions has been the lack of an appropriate representation theory for such functions. We show that a natural symmetricity assumption that we call exchangeability allows us to explicitly characterize the set of such exchangeable listwise ranking functions. Our analysis draws from the theories of tensor analysis, functional analysis and De Finetti theorems. We also present experiments using a novel reranking method motivated by our representation theory.

Harsh Pareek, Pradeep Ravikumar
A Safe Screening Rule for Sparse Logistic Regression
$\ell_1$-regularized logistic regression (or sparse logistic regression) is a widely used method for simultaneous classification and feature selection. Although many recent efforts have been devoted to its efficient implementation, its application to high dimensional data still poses significant challenges. In this paper, we present a fast and effective sparse logistic regression screening rule (Slores) to identify the zero components in the solution vector, which may lead to a substantial reduction in the number of features to be entered into the optimization. An appealing feature of Slores is that the data set needs to be scanned only once to run the screening and its computational cost is negligible compared to that of solving the sparse logistic regression problem. Moreover, Slores is independent of solvers for sparse logistic regression, thus Slores can be integrated with any existing solver to improve the efficiency. We have evaluated Slores using high-dimensional data sets from different applications. Extensive experimental results demonstrate that Slores outperforms the existing state-of-the-art screening rules and the efficiency of solving sparse logistic regression is improved by one order of magnitude in general.

Jie Wang, Jiayu Zhou, Jun Liu, Jieping Ye
A State-Space Model for Decoding Auditory Attentional Modulation from MEG in a Competing-Speaker Environment
Humans are able to segregate auditory objects in a complex acoustic scene, through an interplay of bottom-up feature extraction and top-down selective attention in the brain. The detailed mechanism underlying this process is largely unknown and the ability to mimic this procedure is an important problem in artificial intelligence and computational neuroscience. We consider the problem of decoding the attentional state of a listener in a competing-speaker environment from magnetoencephalographic (MEG) recordings from the human brain. We develop a behaviorally inspired state-space model to account for the modulation of the MEG with respect to attentional state of the listener. We construct a decoder based on the maximum a posteriori (MAP) estimate of the state parameters via the Expectation-Maximization (EM) algorithm. The resulting decoder is able to track the attentional modulation of the listener with multi-second resolution using only the envelopes of the two speech streams as covariates. We present simulation studies as well as application to real MEG data from two human subjects. Our results reveal that the proposed decoder provides substantial gains in terms of temporal resolution, complexity, and decoding accuracy.

Sahar Akram, Jonathan Simon, Shihab Shamma, Behtash Babadi
A Synaptical Story of Persistent Activity with Graded Lifetime in a Neural System
Persistent activity refers to the phenomenon that cortical neurons keep firing even after the stimulus triggering the initial neuronal responses is removed. Persistent activity is widely believed to be the substrate for a neural system retaining a memory trace of the stimulus information. In a conventional view, persistent activity is regarded as an attractor of the network dynamics, but it faces a challenge of how to be closed properly. Here, in contrast to the view of attractor, we consider that the stimulus information is encoded in a marginally unstable state of the network which decays very slowly and exhibits persistent firing for a prolonged duration. We propose a simple yet effective mechanism to achieve this goal, which utilizes the property of short-term plasticity (STP) of neuronal synapses. STP has two forms, short-term depression (STD) and short-term facilitation (STF), which have opposite effects on retaining neuronal responses. We find that by properly combining STF and STD, a neural system can hold persistent activity of graded lifetime, and that persistent activity fades away naturally without relying on an external drive. The implications of these results on neural information representation are discussed.

Yuanyuan Mi, Luozheng Li, Dahui Wang, Si Wu
A Unified Semantic Embedding with Discriminative / Generative Tradeoff
We propose a method that learns a discriminative yet semantic space for object categorization, where we also embed auxiliary semantic entities such as supercategories and attributes. Contrary to prior work which only utilized them as side information, we explicitly embed the semantic entities into the same space where we embed categories, which enables us to represent a category as their linear combination. By exploiting such a unified model for semantics, we enforce each category to be generated as a sparse combination of a supercategory + attributes, with an additional exclusive regularization to learn discriminative composition. The proposed reconstructive regularization guides the discriminative learning process to learn a better generalizing model, as well as generates a compact semantic description of each category, which enables humans to analyze what has been learned. We validate our method on the Animals with Attributes dataset for categorization performance and qualitative analysis, which shows that our method is able to improve classification performance while learning a discriminative semantic decomposition of each category.

Sung Ju Hwang, Leonid Sigal
A framework for studying synaptic plasticity with neural spike train data
Learning and memory in the brain are implemented by complex, time-varying changes in neural circuitry. The computational rules according to which synaptic weights change over time are the subject of much research, and are not precisely understood. Until recently, limitations in experimental methods have made it challenging to test hypotheses about synaptic plasticity on a large scale. However, as such data become available and these barriers are lifted, it becomes necessary to develop analysis techniques to validate plasticity models. Here, we present a highly extensible framework for modeling arbitrary synaptic plasticity rules on spike train data in populations of interconnected neurons. We treat synaptic weights as a (potentially nonlinear) dynamical system embedded in a fully-Bayesian generalized linear model (GLM). In addition, we provide an algorithm for inferring synaptic weight trajectories alongside the parameters of the GLM and of the learning rules. Using this method, we perform model comparison of two proposed variants of the well-known spike-timing-dependent plasticity (STDP) rule, where nonlinear effects play a substantial role. On synthetic data generated from the biophysical simulator NEURON, we show that we can recover the weight trajectories, the pattern of connectivity, and the underlying learning rules.

Scott Linderman, Chris Stock, Ryan Adams
A provable SVD-based algorithm for learning topics in dominant admixture corpus
Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from such a collection of documents drawn from admixtures is NP-hard. Assuming separability, a strong assumption, \cite{AGM} gave the first provable algorithm for inference. Provable algorithms for inference under more realistic assumptions remain an open problem. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic specific \emph{Catchwords}, a group of words which occur with strictly greater frequency in a topic than any other topic individually and are required to have high frequency together rather than individually. A major contribution of the paper is to show that under this more realistic assumption (as we show empirically for real corpora), a singular value decomposition (SVD) based algorithm can provably recover the topics from a collection of documents drawn from \emph{Dominant admixtures}. Dominant admixtures are convex combinations of distributions in which one distribution has a significantly higher contribution than the other distributions. It is folklore that SVD can solve inference only if we have documents with pure topics. We overcome this by doing a thresholding procedure first before SVD and a $k$-means procedure after SVD. Using random matrix theory and recent results on $k$-means, we show that our algorithm correctly identifies the dominant topic in every document. Apart from the simplicity of the algorithm, the sample complexity (the number of documents needed) has near optimal dependence on $w_0$, the lowest probability that a topic is dominant, and is better than \cite{AGM}. Empirical evidence shows that on several real world corpora, both Catchwords and Dominant admixture assumptions hold and the proposed algorithm substantially outperforms the state of the art \cite{arora}.

Trapit Bansal, Chiranjib Bhattacharyya, Ravindran Kannan
A statistical model for tensor PCA
We consider the Principal Component Analysis problem for large tensors of arbitrary order $k$ under a single-spike (or rank-one plus noise) model. On the one hand, we use information theory and recent results in probability theory to establish necessary and sufficient conditions under which the principal component can be estimated using unbounded computational resources. It turns out that this is possible as soon as the signal-to-noise ratio $\beta$ becomes larger than $C\sqrt{k \log k}$ (and in particular $\beta$ can remain bounded as the problem dimensions increase). On the other hand, we analyze several polynomial-time estimation algorithms, based on tensor unfolding, power iteration and message passing ideas from graphical models. We show that, unless the signal-to-noise ratio diverges in the system dimensions, none of these approaches succeeds. This is possibly related to a fundamental limitation of computationally tractable estimators for this problem. For moderate dimensions, we propose a hybrid approach that uses unfolding together with power iteration, and show that it significantly outperforms baseline methods. Finally, we consider the case in which additional side information is available about the unknown signal. We characterize the amount of side information that allows the iterative algorithms to converge to a good estimate.

Emile Richard, Andrea Montanari
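As a reading aid, the single-spike (rank-one plus noise) model named in the abstract above can be sketched as follows. The symbols $\mathbf{X}$ (observed order-$k$ tensor), $v$ (unknown unit-norm signal), $n$ (dimension) and $\mathbf{Z}$ (noise tensor) are notation assumed here, and the noise normalization is deliberately left unspecified.

```latex
\[
\mathbf{X} \;=\; \beta \, v^{\otimes k} \;+\; \mathbf{Z},
\qquad v \in \mathbb{R}^{n}, \quad \|v\|_2 = 1,
\]
```

Here $\beta$ plays the role of the signal-to-noise ratio, and the estimation task is to recover $v$ from $\mathbf{X}$.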
Accelerated Mini-batch Randomized Block Coordinate Descent Method
We consider regularized empirical risk minimization problems. In particular, we minimize the sum of a smooth empirical risk function and a nonsmooth regularization function. When the regularization function is block separable, we can solve the minimization problems in a randomized block coordinate descent (RBCD) manner. Existing RBCD methods usually decrease the objective value by exploiting the partial gradient of a randomly selected block of coordinates in each iteration. Thus they need all data to be accessible so that the partial gradient of the selected block can be exactly obtained. However, such a ``batch'' setting may be computationally expensive in practice. In this paper, we propose a mini-batch randomized block coordinate descent (MRBCD) method, which estimates the partial gradient of the selected block based on a mini-batch of randomly sampled data in each iteration. We further accelerate the MRBCD method by exploiting the semi-stochastic optimization scheme, which effectively reduces the variance of the partial gradient estimators. Theoretically, we show that for strongly convex functions, the MRBCD method attains lower overall iteration complexity than existing RBCD methods. As an application, we further trim the MRBCD method to solve the regularized sparse learning problems. Our numerical experiments show that the MRBCD method naturally exploits the sparsity structure and achieves better computational performance than existing methods.

Tuo Zhao, Mo Yu, Yiming Wang, Raman Arora, Han Liu
Active Learning and Best-Response Dynamics
We consider a setting in which low-power distributed sensors are each making highly noisy measurements of some unknown target function. A center wants to accurately learn this function by querying a small number of sensors, which ordinarily would be impossible due to the high noise rate. The question we address is whether local communication among sensors, together with natural best-response dynamics in an appropriately-defined game, can denoise the system without destroying the true signal and allow the center to succeed from only a small number of active queries. We prove positive (and negative) results on the denoising power of several natural dynamics, and also show experimentally that when combined with recent agnostic active learning algorithms, this process can achieve low error from very few queries, performing substantially better than active or passive learning without these denoising dynamics as well as passive learning with denoising.

Maria-Florina Balcan, Christopher Berlind, Avrim Blum, Emma Cohen, Kaushik Patnaik, Le Song
Active Regression by Stratification
We propose a new active learning algorithm for parametric linear regression with random design. We provide finite sample convergence guarantees for general distributions in the misspecified model. This is the first active learner for this setting that provably can improve over passive learning. Unlike other learning settings (such as classification), in regression the passive learning rate of $O(1/\epsilon)$ cannot in general be improved upon. Nonetheless, the so-called `constant' in the rate of convergence, which is characterized by a distribution-dependent \emph{risk}, can be improved in many cases. For a given distribution, achieving the optimal risk requires prior knowledge of the distribution. Following the stratification technique advocated in Monte-Carlo function integration, our active learner approaches the optimal risk using piecewise constant approximations.

Sivan Sabato, Remi Munos
Algorithm selection by rational metareasoning as a model of human strategy selection
Selecting the right algorithm is an important problem in computer science, because the algorithm often has to exploit the structure of the input to be efficient. The human mind faces the same challenge. Therefore, solutions to the algorithm selection problem can inspire models of human strategy selection and vice versa. Here, we view the algorithm selection problem as a special case of metareasoning and derive a solution that outperforms existing methods in sorting algorithm selection. We apply our theory to model how people choose between cognitive strategies and test its prediction in a behavioral experiment. We find that people quickly learn to adaptively choose between cognitive strategies. People's choices in our experiment are consistent with our model but inconsistent with previous theories of human strategy selection. Rational metareasoning appears to be a promising framework for reverse-engineering how people select between cognitive strategies and translating the results into better solutions to the algorithm selection problem.

Falk Lieder, Dillon Plunkett, Jessica Hamrick, Stuart Russell, Nicholas Hay, Thomas Griffiths
Algorithms for CVaR Optimization in MDPs
In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in costs in addition to minimizing a standard criterion. Conditional value-at-risk (CVaR) is a relatively new risk measure that addresses some of the shortcomings of the well-known variance-related risk measures, and because of its computational efficiencies has gained popularity in finance and operations research. In this paper, we consider the mean-CVaR optimization problem in MDPs. We first derive a formula for computing the gradient of this risk-sensitive objective function. We then devise policy gradient and actor-critic algorithms that each use a specific method to estimate this gradient and update the policy parameters in the descent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem.

Yinlam Chow, Mohammad Ghavamzadeh
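To make the risk measure in the abstract above concrete, here is a minimal sketch of estimating value-at-risk and CVaR from sampled costs, using the standard Rockafellar-Uryasev representation. It illustrates only the risk measure itself, not the paper's gradient or actor-critic algorithms; the function name `empirical_var_cvar` and the confidence-level parameter `alpha` are hypothetical names chosen for this example.

```python
import math


def empirical_var_cvar(costs, alpha):
    """Estimate (VaR, CVaR) at confidence level alpha from sampled costs.

    Hypothetical helper for illustration only. Costs are "lower is better",
    so CVaR summarizes the average of the worst (largest) outcomes.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not costs:
        raise ValueError("costs must be non-empty")

    ordered = sorted(costs)
    n = len(ordered)

    # VaR: smallest sampled cost c whose empirical CDF reaches alpha.
    k = math.ceil(alpha * n) - 1
    var = ordered[k]

    # Rockafellar-Uryasev form: CVaR = VaR + E[(C - VaR)^+] / (1 - alpha).
    mean_excess = sum(max(c - var, 0.0) for c in ordered) / n
    cvar = var + mean_excess / (1.0 - alpha)
    return var, cvar


# Usage: with four equally likely costs and alpha = 0.5, CVaR averages
# the two largest costs.
var, cvar = empirical_var_cvar([4.0, 1.0, 3.0, 2.0], alpha=0.5)
```

When `alpha * n` is an integer, as in the usage line, the estimate coincides with the plain average of the costs above the VaR threshold.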
Altitude Training: Strong Bounds for Single-Layer Dropout
Dropout training, originally designed for deep neural networks, has been successful on high-dimensional single-layer natural language tasks. This paper proposes a theoretical explanation for this phenomenon: we show that, under a generative Poisson topic model with long documents, dropout training improves the exponent in the generalization bound for empirical risk minimization. Dropout achieves this gain much like a marathon runner who practices at altitude: once a classifier learns to perform reasonably well on training examples that have been artificially corrupted by dropout, it will do very well on the uncorrupted test set. We also show that, under similar conditions, dropout preserves the Bayes decision boundary and should therefore induce minimal bias in high dimensions.

Stefan Wager, William Fithian, Sida Wang, Percy Liang
An Accelerated Proximal Coordinate Gradient Method
We develop an accelerated randomized proximal coordinate gradient (APCG) method for solving a broad class of composite convex optimization problems. In particular, our method achieves faster linear convergence rates for minimizing strongly convex functions than existing randomized proximal coordinate gradient methods. We show how to apply the APCG method to solve the dual of the regularized empirical risk minimization (ERM) problem, and devise efficient implementations that can avoid full-dimensional vector operations. For ill-conditioned ERM problems, our method obtains better convergence rates than the state-of-the-art stochastic dual coordinate ascent (SDCA) method.

Lin Xiao, Qihang Lin, Zhaosong Lu
An Autoencoder Approach to Learning Bilingual Word Representations
Cross-language learning allows us to use training data from one language to build models for a different language. Many approaches to bilingual learning require that we have word-level alignment of sentences from parallel corpora. In this work we explore the use of autoencoder-based methods for cross-language learning of vectorial word representations that are aligned between two languages, while not relying on word-level alignments. We show that by simply learning to reconstruct the bag-of-words representations of aligned sentences, within and between languages, we can in fact learn high-quality representations and do without word alignments. We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline.

Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas Raykar, Amrita Saha
An Integer Polynomial Programming Based Framework for Lifted MAP Inference
In this paper, we present a new approach for lifted MAP inference in Markov logic networks (MLNs). The key idea in our approach is to compactly encode the MAP inference problem as an Integer Polynomial Program (IPP) by schematically applying three lifted inference steps to the MLN: lifted decomposition, lifted conditioning, and partial grounding. Our IPP encoding is lifted in the sense that an integer assignment to a variable in the IPP may represent a truth-assignment to multiple indistinguishable ground atoms in the MLN. We show how to solve the IPP by first converting it to an Integer Linear Program (ILP) and then solving the latter using state-of-the-art ILP techniques. Experiments on several benchmark MLNs show that our new algorithm is substantially superior to ground inference and existing methods in terms of computational efficiency and solution quality.

Somdeb Sarkhel, Deepak Venugopal, Parag Singla, Vibhav Gogate
Analysis of Brain States from Multi-Region LFP Time-Series
The local field potential (LFP) is a source of information about the broad patterns of brain activity, and the frequencies present in these time-series measurements are often highly correlated between regions. It is believed that these regions may jointly constitute a ``brain state,'' relating to cognition and behavior. An infinite hidden Markov model (iHMM) is proposed to model the evolution of brain states, based on electrophysiological LFP data measured at multiple brain regions. A brain state influences the spectral content of each region in the measured LFP. A new state-dependent tensor factorization is employed across brain regions, and the spectral properties of the LFPs are characterized in terms of Gaussian processes (GPs). The LFPs are modeled as a mixture of GPs, with state- and region-dependent mixture weights, and with the spectral content of the data encoded in GP spectral mixture covariance kernels. The model is able to infer an estimate of the number of brain states and the number of mixture components in the mixture of GPs. A new variational Bayesian split-merge algorithm is employed for inference. The model infers state changes as a function of external covariates in two novel electrophysiological datasets, using LFP data recorded simultaneously from multiple brain regions in mice; the results are validated and interpreted by subject-matter experts.

Kyle Ulrich, David Carlson, Wenzhao Lian, Jana Borg, Kafui Dzirasa, Lawrence Carin
Analysis of Learning from Positive and Unlabeled Data
Learning a classifier from positive and unlabeled data is an important class of classification problems that are conceivable in many practical applications. In this paper, we first show that this problem can be solved by cost-sensitive learning between positive and unlabeled data. Then we reveal that convex surrogate loss functions such as the hinge loss lead to a wrong classification boundary due to an intrinsic bias, and show that the use of non-convex loss functions such as the ramp loss is essential to avoid this problem. We next analyze the excess risk when the class prior is estimated from data, and show that the classification accuracy is not sensitive to class prior estimation if the unlabeled data is dominated by the positive data (this is naturally satisfied in inlier-based outlier detection because inliers are dominant in the unlabeled dataset). Finally, we provide generalization error bounds and show that the generalization error of learning only from positive and unlabeled samples is no worse than $2\sqrt{2}$ times the fully supervised case. These theoretical findings are also investigated through experiments.

Marthinus Du Plessis, Gang Niu, Masashi Sugiyama
Analysis of Variational Bayesian Latent Dirichlet Allocation: Weaker Sparsity Than MAP
Latent Dirichlet allocation (LDA) is a popular generative model of various objects such as texts and images, where an object is expressed as a mixture of latent topics. In this paper, we theoretically investigate variational Bayesian (VB) learning in LDA. More specifically, we analytically derive the leading term of the VB free energy under an asymptotic setup, and show that there exist transition thresholds in Dirichlet hyperparameters around which the sparsity-inducing behavior drastically changes. Then we further theoretically reveal the notable phenomenon that VB tends to induce weaker sparsity than MAP in the LDA model, in contrast to other models. We experimentally demonstrate the practical validity of our asymptotic theory on real-world Last.FM music data.

Shinichi Nakajima, Issei Sato, Masashi Sugiyama, Kazuho Watanabe, Hiroko Kobayashi
Approximating Hierarchical MV-sets for Hierarchical Clustering
The goal of hierarchical clustering is to construct a cluster tree, which can be viewed as the modal structure of a density. For this purpose, we use a convex optimization program that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. We further extend existing graph-based methods to approximate the cluster tree of a distribution. We present empirical results that demonstrate the superiority of our method over existing ones.

Assaf Glazer, Omer Weissbrod, Michael Lindenbaum, Shaul Markovitch
Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
We present a method for estimating articulated human pose from a single static image based on a graphical model with novel pairwise relations that make adaptive use of local image measurements. More precisely, we specify a graphical model for human pose which exploits the fact that local image measurements can be used both to detect parts (or joints) and also to predict the spatial relationships between them (Image Dependent Pairwise Relations). These spatial relationships are represented by a mixture model. We use Deep Convolutional Neural Networks (DCNNs) to learn conditional probabilities for the presence of parts and their spatial relationships within image patches. Hence our model combines the representational flexibility of graphical models with the efficiency and statistical power of DCNNs. Our method significantly outperforms the state-of-the-art methods on the LSP and FLIC datasets and also performs very well on the Buffy dataset without any training.

Xianjie Chen, Alan Yuille
Attentional Neural Network: Feature Selection Using Cognitive Feedback
Attentional Neural Network is a new framework that integrates top-down cognitive bias and bottom-up feature extraction in one coherent architecture. The top-down influence is especially effective when dealing with high noise or difficult segmentation problems. Our system is modular, making it easy to extend. It is easy to train and cheap to run, and yet can accommodate complex behavior as required. We obtain classification accuracy better than or competitive with state-of-the-art results on the MNIST variation dataset, and successfully disentangle overlaid digits with high success rates. We view such a general purpose framework as an essential foundation for a larger system emulating the cognitive abilities of the whole brain.

Qian Wang, Jiaxing Zhang, Sen Song, Zheng Zhang
Augmentative Message Passing for Traveling Salesman Problem and Graph Partitioning
The cutting plane method is an augmentative constrained optimization procedure that is often used with continuous-domain optimization techniques such as linear and convex programs. We investigate the viability of a similar idea within message passing -- which produces integral solutions -- in the context of two combinatorial problems: 1) For the Traveling Salesman Problem (TSP), we propose a factor graph based on the Held-Karp formulation, with an exponential number of constraint factors, each of which has an exponential but sparse tabular form. 2) For graph partitioning (a.k.a. community mining) using modularity optimization, we introduce a binary variable model with a large number of constraints that enforce formation of cliques. In both cases we are able to derive surprisingly simple message updates that lead to competitive solutions on benchmark instances. In particular, for TSP we are able to find near-optimal solutions in time that empirically grows with $N^3$, demonstrating that augmentation is practical and efficient.

Siamak Ravanbakhsh, Reihaneh Rabbany, Russ Greiner
Automated Variational Inference for Gaussian Process Models
We develop an automated variational method for approximate deterministic inference in Gaussian process (GP) models whose posteriors are often intractable. Using a mixture of Gaussians as the variational distribution, we show that (i) the variational objective and its gradients can be approximated efficiently via sampling from univariate Gaussian distributions and (ii) the gradients of the GP hyperparameters can be obtained analytically regardless of the model likelihood. We further propose two instances of the variational distribution whose covariance matrices can be parametrized linearly in the number of observations. These results allow gradient-based optimization to be done efficiently in a black-box manner. Our approach is thoroughly verified on 5 models using 6 benchmark datasets, performing as well as the exact or hard-coded implementations while running orders of magnitude faster than the alternative MCMC sampling approaches. Our method can be a valuable tool for practitioners and researchers to investigate new models with minimal effort in deriving model-specific inference algorithms.

Trung Nguyen, Edwin Bonilla
Automatic Discovery of Cognitive Skills to Improve the Prediction of Student Learning
To master a discipline such as algebra or physics, students must acquire a set of cognitive skills. Traditionally, educators and domain experts manually determine what these skills are and then select practice exercises to hone a particular skill. We propose a technique that uses student performance data to automatically discover the skills needed in a discipline. The technique assigns a latent skill to each exercise such that a student's expected accuracy on a sequence of same-skill exercises improves monotonically with practice. Rather than discarding the skills identified by experts, our technique incorporates a nonparametric prior over the exercise-skill assignments that is based on the expert-provided skills and a weighted Chinese restaurant process. We test our technique on datasets from five different intelligent tutoring systems designed for students ranging in age from middle school through college. We obtain two surprising results. First, in three of the five datasets, the skills inferred by our technique support significantly improved predictions of student performance over the expert-provided skills. Second, the expert-provided skills have little value: our technique predicts student performance nearly as well when it ignores the domain expertise as when it attempts to leverage it. We discuss explanations for these surprising results and also the relationship of our skill-discovery technique to alternative approaches.

Robert Lindsey, Mohammad Khajah, Michael Mozer
Bandit Convex Optimization: Towards Tight Bounds
Bandit Convex Optimization (BCO) is a fundamental framework for decision making under uncertainty, which generalizes many problems from the realm of online and statistical learning. While the special case of linear cost functions is well understood, a gap in the attainable regret for BCO with nonlinear losses remains an important open question. In this paper we take a step towards understanding the best attainable regret bounds for BCO: we give an efficient and near-optimal regret algorithm for BCO with strongly-convex and smooth loss functions. In contrast to previous works on BCO that use time-invariant exploration schemes, our method employs an exploration scheme that shrinks with time.

Elad Hazan, Kfir Levy
Bayes-Adaptive Simulation-based Search with Value Function Approximation
Bayes-adaptive planning offers a principled solution to the exploration-exploitation trade-off under model uncertainty. It finds the optimal policy in belief space, which explicitly accounts for the expected effect on future rewards of reductions in uncertainty. However, the Bayes-adaptive solution is typically intractable in domains with large or continuous state spaces. We present the first tractable method for approximating the Bayes-adaptive solution by combining simulation-based search with a novel value function approximation technique that generalises over belief space. Our method outperforms prior approaches in both discrete bandit tasks and simple continuous navigation and control tasks.

Arthur Guez, Nicolas Heess, David Silver, Peter Dayan
Bayesian Inference for Structured Spike and Slab Priors
Sparse signal recovery addresses the problem of solving underdetermined linear inverse problems subject to a sparsity constraint. We propose a novel prior formulation, the structured spike and slab prior, which allows incorporating a priori knowledge of the sparsity pattern by imposing a spatial Gaussian process on the spike and slab probabilities. Thus, prior information on the structure of the sparsity pattern can be encoded using generic covariance functions. Furthermore, we provide a Bayesian inference scheme for the proposed model based on the expectation propagation framework. Using numerical experiments on synthetic data, we demonstrate the benefits of the model.

Michael Andersen, Ole Winther, Lars Hansen
Bayesian Nonlinear Support Vector Machines and Discriminative Factor Modeling
A new Bayesian formulation is developed for nonlinear support vector machines (SVMs), based on a Gaussian process and with the SVM hinge loss expressed as a scaled mixture of normals. We then integrate the Bayesian SVM into a factor model, in which feature learning and nonlinear classifier design are performed jointly; almost all previous work on such discriminative feature learning has assumed a linear classifier. Inference is performed with expectation conditional maximization (ECM) and Markov Chain Monte Carlo (MCMC). An extensive set of experiments demonstrates the utility of using a nonlinear Bayesian SVM within discriminative feature learning and factor modeling, from the standpoints of accuracy and interpretability.

Ricardo Henao, Xin Yuan, Lawrence Carin
Bayesian Sampling Using Stochastic Gradient Thermostats
Dynamics-based sampling methods, such as Hybrid Monte Carlo (HMC) and Langevin dynamics (LD), are commonly used to sample target distributions. Recently, such approaches have been combined with stochastic gradient techniques to increase sampling efficiency when dealing with large datasets. An outstanding problem with this approach is that the stochastic gradient introduces an unknown amount of noise which can prevent proper sampling after discretization. To remedy this problem, we show that one can leverage a small number of additional variables in order to stabilize momentum fluctuations induced by the unknown noise. Our method is inspired by the idea of a thermostat in statistical physics and is justified by a general theory.

Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert Skeel, Hartmut Neven
Belief Propagation Recursive Neural Networks
Recursive Neural Networks have recently obtained very good performance on several natural language processing tasks. However, because of their feedforward architecture they cannot correctly predict phrase labels that are determined by context. This is a problem in tasks such as aspect-specific sentiment classification, which tries to, for instance, predict that the word ``Android'' is positive in the sentence ``Android beats iOS''. We introduce belief propagation recursive neural networks (BP-RNNs), which are based on the idea of extending purely feedforward neural networks to include one feedbackward step during inference. This allows phrase-level predictions and representations to give feedback to words. We show the effectiveness of this model on the task of contextual sentiment analysis. We also show that dropout can improve RNN training and that a concatenation of unsupervised and supervised word representations performs better than either alone. The BP-RNN has 3\% higher F1 than the standard RNN on this task and obtains state-of-the-art performance on the SemEval 2013, Task 2 sentiment challenge.

Romain Paulus, Richard Socher, Chris Manning
Best-Arm Identification in Linear Bandits
We study the best-arm identification problem in linear bandits, where the reward of the arms depends linearly on an unknown parameter $\theta^*$ and the objective is to return the arm with the largest reward. We characterize the complexity of the problem and introduce sample allocation strategies which pull arms to identify the best arm with a fixed confidence, while minimizing the sample budget. In particular, we show the importance of exploiting the global linear structure to improve the estimate of the reward of near-optimal arms. We analyze the proposed strategies and compare their empirical performance. Finally, as a by-product of our analysis, we point out the connection to the $G$-optimality criterion used in optimal experimental design.

Marta Soare, Alessandro Lazaric, Remi Munos
Beta-Negative Binomial Process and Exchangeable Random Partitions for Mixed-Membership Modeling
The beta-negative binomial process (BNBP), an integer-valued stochastic process, is employed to construct an exchangeable partition probability function (EPPF) for mixed-membership modeling. As the marginal probability distribution of the BNBP that governs the exchangeable random partitions of grouped data has not yet been developed, current inference for the BNBP has to truncate the number of atoms of the beta process. This paper introduces an EPPF to explicitly describe how the BNBP clusters the data points of each group into a random number of exchangeable partitions, which are shared across all the groups. A fully collapsed Gibbs sampler is developed for the BNBP, leading to a novel nonparametric Bayesian topic model that is distinct from existing ones, with simple implementation, fast convergence, good mixing, and state-of-the-art predictive performance.

Mingyuan Zhou
Beyond the Birkhoff Polytope: Convex Relaxations for Vector Permutation Problems
The use of the Birkhoff polytope (the convex hull of the set of permutation matrices) is standard in optimization problems over permutations. The Birkhoff polytope requires $\Theta(n^2)$ variables and constraints to represent, significantly more than the $n$ variables one could use to represent a permutation as a vector. Using a recent construction by Goemans, we show that when optimizing over the convex hull of the permutation vectors (the permutahedron), we can reduce the number of variables and constraints to $\Theta(n \log n)$ in theory and $\Theta(n \log^2 n)$ in practice. We apply this technique to the recent convex formulation of the 2-SUM problem introduced by Fogel et~al., and demonstrate how we can attain results of similar quality in much less computational time. We can also solve larger instances. To our knowledge, this is the first usage of Goemans' compact formulation of the permutahedron in a convex optimization problem.

Cong Han Lim, Stephen Wright
Biclustering by Message Passing
Biclustering is the analog of clustering on a bipartite graph. It is an important problem in network science with several applications in biology (e.g., clustering microarray gene expression data or predicting phenotypes from genotypes) and in other areas (e.g., discovering document clusters containing word clusters). Because most formulations of the biclustering problem are NP-hard, existing methods infer biclusters through {\it local} search strategies, finding one bicluster at a time. For instance, one popular iterative approach assigns rows to a bicluster based on the columns, and vice versa. These approaches have some shortcomings: first, they struggle to resolve overlapping clusters because a collection of locally optimal biclusters might not be globally optimal. Second, the lack of a well-defined global objective function precludes an analytical characterization of their expected results. In contrast, the biclustering method that we propose in this paper maximizes a global objective function that closely approximates an optimal likelihood function. Our objective function separates the cluster size penalty term of the likelihood function into the row and column count penalties. This decoupling enables our objective function to be optimized in near-linear time complexity using a message passing approach, allowing messages to be computed using a pair of optimizations: one optimization ranges over all row subsets and the other, over all column subsets. We show that these optimizations can be solved in linearithmic time by sorting the rows and the columns, respectively. Moreover, our approach can resolve overlapping biclusters, which occur frequently in real-life applications. We prove the optimality of our objective function in a stochastic block model and under certain more general conditions. Through simulations, we show that our method outperforms two of the best existing biclustering algorithms, ISA and LAS, when the planted clusters overlap.

Luke O'Connor, Soheil Feizi
Blossom Tree Graphical Models
Methods for graphical modeling of continuous data make strong assumptions. At one extreme, the Gaussian graphical model allows arbitrary graphs, but makes very strong distributional assumptions. A nonparametric extension called the nonparanormal relaxes the Gaussian assumption by allowing nonparametric single variable marginals, but still requires strong assumptions on the joint distribution. At another extreme, forest-structured graphical models permit arbitrary bivariate marginals, but maintain tractability by restricting to acyclic graphs. A large family of nonparametric and semiparametric graphical models lies between these extremes. In this paper we combine the ideas behind forests and the nonparanormal. Our approach is to attach nonparanormal "blossoms", with arbitrary graphs, to a collection of nonparametric trees. The tree edges are chosen to connect variables that most violate joint Gaussianity. The non-tree edges are partitioned into disjoint groups, and assigned to tree nodes using a nonparametric partial correlation statistic. A nonparanormal blossom is then "grown" for each group using established methods based on the graphical lasso. The result is a factorization with respect to the union of the tree branches and blossoms, defining a high-dimensional joint density that can be efficiently estimated and evaluated on test points. Theoretical properties and experiments with simulated data show that blossom trees can be an effective tool for nonparametric graphical modeling.

Zhe Liu, John Lafferty
Bounded Regret for Finite-Armed Structured Bandits
We study a new type of K-armed bandit problem where the expected return of one arm may depend on the returns of other arms. We present a new algorithm for this general class of problems and show that under certain circumstances it is possible to achieve finite expected cumulative regret. We also give problem-dependent lower bounds on the cumulative regret showing that at least in special cases the new algorithm is nearly optimal.

Tor Lattimore, Remi Munos
Bregman Alternating Direction Method of Multipliers
The mirror descent algorithm (MDA) generalizes gradient descent by using a Bregman divergence to replace squared Euclidean distance. In this paper, we similarly generalize the alternating direction method of multipliers (ADMM) to Bregman ADMM (BADMM), which allows the choice of different Bregman divergences to exploit the structure of problems. BADMM provides a unified framework for ADMM and its variants, including generalized ADMM, inexact ADMM and Bethe ADMM. We establish the global convergence and the $O(1/T)$ iteration complexity for BADMM. In some cases, BADMM can be faster than ADMM by a factor of $O(n/\log(n))$. In solving the linear program of the mass transportation problem, BADMM leads to massive parallelism and can easily run on a GPU. BADMM is several times faster than the highly optimized commercial software Gurobi, and takes tens of seconds to solve a linear program with millions of parameters.

Huahua Wang, Arindam Banerjee
Capturing Semantically Meaningful Word Dependencies with an Admixture of Poisson MRFs
The Admixture of Poisson MRFs (APM) model recently introduced by Inouye et al. (2014) is the first topic model that allows for word dependencies within each topic, unlike the usual Multinomial assumption in other topic models such as LDA. Research on both the semantic coherence of topic models (Mimno 2011, Newman 2010, Stevens 2012) and on checking model fit (Mimno 2011a) provides strong support that modeling word dependencies could be both semantically meaningful and essential for appropriately modeling real text data. Though APM shows significant promise for providing a better topic model, APM has a high computational complexity because an algorithm must estimate $O(p^2)$ parameters, where $p$ is the number of words (Inouye et al. (2014) could only provide results for datasets with $p = 200$). In addition, Inouye et al. (2014) only provided tentative and inconclusive results on the model's utility. Therefore, in this paper, we address the computational issues and more thoroughly evaluate the APM model. First, we develop a new alternating parallel Newton-like algorithm for APM that can train on the BNC corpus with 1646 words and can easily be extended for much larger datasets. Second, we propose a novel evaluation metric based on human evocation scores (i.e., how much one word "brings to mind" another word) motivated by the previous metrics and evaluations for topic models. We provide compelling quantitative and qualitative results on the BNC corpus that demonstrate the superiority of APM over previous independent topic models for identifying semantically meaningful word dependencies.

David Inouye, Pradeep Ravikumar, Inderjit Dhillon
Causal Inference through a Witness Protection Program
One of the most fundamental problems in causal inference is the estimation of a causal effect when variables are confounded. This is difficult in an observational study because one has no direct evidence that all confounders have been adjusted for. We introduce a novel approach for estimating causal effects that exploits observational conditional independencies to suggest ``weak'' paths in an unknown causal graph. The widely used faithfulness condition of Spirtes et al. is relaxed to allow for varying degrees of ``path cancellations'' that will imply conditional independencies but do not rule out the existence of confounding causal paths. The outcome is a posterior distribution over bounds on the average causal effect via a linear programming approach and Bayesian inference. We claim this approach should be used in regular practice along with other default tools in observational studies.

Ricardo Silva, Robin Evans
Communication Efficient Distributed Machine Learning with the Parameter Server
We propose a third-generation parameter server framework for distributed machine learning. Both data and workloads are distributed over worker nodes, while the server nodes jointly maintain globally shared parameters. We tackle data distribution, fault tolerance, and synchronization in an \emph{integrated} fashion. In contrast to prior works, our parameter server allows for a \emph{flexible} consistency model including eventual, fully synchronous, and bounded-delay settings. It allows us to perform inference with negligible communication overhead under network constraints. We present a new algorithm taking advantage of the proposed system to solve non-convex non-smooth problems with convergence guarantees. We demonstrate the efficacy on problems ranging from $\ell_1$-regularized logistic regression to reconstruction ICA, using 600TB of real data with hundreds of billions of samples and dimensions.

Mu Li, Alex Smola, David Andersen
Communication-Efficient Distributed Dual Coordinate Ascent
Communication remains a significant bottleneck in the performance of distributed convex optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, COCOA, that leverages the primal-dual structure of the optimization problems, together with local computation, to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations on Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, COCOA converges to the same .001-accurate solution quality 25× as fast on average.

Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, Michael I. Jordan
Compressive Sensing of Signals from a GMM with Sparse Precision Matrices
This paper is concerned with compressive sensing of signals drawn from a Gaussian mixture model (GMM) with sparse precision matrices. Previous work has shown: (i) a signal drawn from a given GMM can be perfectly reconstructed from r noise-free measurements if the (dominant) rank of each covariance matrix is less than r; (ii) a sparse Gaussian graphical model can be efficiently estimated from fully-observed training signals using graphical lasso. This paper addresses a problem more challenging than both (i) and (ii), by assuming that the GMM is unknown and each signal is only partially observed through incomplete linear measurements. Under these challenging assumptions, we develop a hierarchical Bayesian method to simultaneously estimate the GMM and recover the signals using solely the incomplete measurements and a Bayesian shrinkage prior that promotes sparsity of the Gaussian precision matrices. In addition, we provide theoretical performance bounds to relate the reconstruction error to the number of signals for which measurements are available, the sparsity level of precision matrices, and the “incompleteness” of measurements. The proposed method is demonstrated extensively on compressive sensing of imagery and video, and the results with simulated and hardware-acquired real measurements show significant performance improvement over state-of-the-art methods.

Jianbo Yang, Xuejun Liao, Minhua Chen, Lawrence Carin
Computing Nash Equilibria in Generalized Interdependent Security Games
We study the computational complexity of computing Nash equilibria in generalized interdependent-security (IDS) games. Like the traditional IDS games, originally introduced by economists and risk-assessment experts Heal and Kunreuther about a decade ago, generalized IDS games model the individual voluntary investment decision of multiple agents facing some potential direct risk along with transfer-risk exposure from other agents. A distinct feature of the generalized IDS games is that full investment can reduce transfer risk. Generalized IDS games may exhibit strategic complementarity (SC) or strategic substitutability (SS), depending on the transfer-risk reduction level. We consider three variants of generalized IDS games in which players exhibit only SC, only SS, and both SC+SS. We show that determining whether there is a pure-strategy Nash equilibrium (PSNE) in SC+SS-type games is NP-complete, while computing a single PSNE in SC-type games takes worst-case polynomial time. As for the problem of computing all mixed-strategy Nash equilibria (MSNE) efficiently, we produce a partial characterization. Whenever each agent in the game is indiscriminate in terms of the transfer-risk exposure to the other agents, a case that Kearns and Ortiz originally studied in the context of traditional IDS games in their NIPS 2003 paper, we can compute all MSNE that satisfy some ordering constraints in polynomial time in all three game variants. Yet, there is a computational barrier in the general (transfer) case: we show that the problem is as hard as the Pure-Nash-Extension problem, also originally introduced by Kearns and Ortiz, and that it is NP-complete for all three variants. Finally, we experimentally examine and discuss the practical impact that the additional protection from transfer risk allowed in generalized IDS games has on MSNE by solving several randomly-generated instances of SC+SS-type games with graph structures taken from several real-world datasets.

Hau Chan, Luis Ortiz
Concavity of reweighted Kikuchi approximation
We analyze a reweighted version of the Kikuchi approximation for estimating the log partition function of a product distribution defined over a region graph. We establish sufficient conditions for the concavity of our reweighted objective function in terms of weight assignments in the Kikuchi expansion, and show that a reweighted version of the sum product algorithm applied to the Kikuchi region graph will produce global optima of the Kikuchi approximation whenever the algorithm converges. When the region graph has two layers, corresponding to a Bethe approximation, we show that our sufficient conditions for concavity are also necessary. Finally, we provide an explicit characterization of the polytope of concavity in terms of the cycle structure of the region graph. We conclude with simulations that demonstrate the advantages of the reweighted Kikuchi approach.

Po-Ling Loh, Andre Wibisono
Conditional Swap Regret and Conditional Correlated Equilibrium
We introduce a natural extension of the notion of swap regret, conditional swap regret, that allows for action modifications conditioned on the player’s action history. We prove a series of new results for conditional swap regret minimization. We present algorithms for minimizing conditional swap regret with bounded conditioning history. We further extend these results to the case where conditional swaps are considered only for a subset of actions. We also define a new notion of equilibrium, conditional correlated equilibrium, that is tightly connected to the notion of conditional swap regret: when all players follow conditional swap regret minimization strategies, then the empirical distribution approaches this equilibrium. Finally, we extend our result to the bandit setting in the case of a conditional history of length one.

Mehryar Mohri, Scott Yang
Cone-Constrained Principal Component Analysis
Estimating a vector from noisy quadratic observations is a task that arises naturally in many contexts, from dimensionality reduction to synchronization and phase retrieval problems. It is often the case that additional information is available about the unknown vector (for instance, sparsity, sign or magnitude of its entries). Many authors propose non-convex quadratic optimization problems that aim at optimally exploiting this information. However, solving these problems is typically NP-hard. We consider a simple model for noisy quadratic observation of an unknown vector $\mathbf{v_0}$. The unknown vector is constrained to belong to a cone $\mathcal{C}\ni \mathbf{v_0}$. While optimal estimation appears to be intractable for the general problems in this class, we provide evidence that it is tractable when $\mathcal{C}$ is a convex cone. This is surprising, since the corresponding optimization problem is non-convex and, from a worst-case perspective, often NP-hard. We characterize the resulting minimax risk in terms of the statistical dimension of the cone $\delta(\mathcal{C})$. This quantity is already known to control the risk of estimation from linear observations, but its relevance to the case treated here was far from obvious.

Yash Deshpande, Andrea Montanari, Emile Richard
Consistency of Spectral Partitioning of Uniform Hypergraphs under Planted Partition Model
Spectral graph partitioning methods have received significant attention from both practitioners and theorists in computer science. Some notable studies have been carried out regarding the behavior of these methods for infinitely large sample size (von Luxburg et al., 2008; Rohe et al., 2011), which provide sufficient confidence to practitioners about the effectiveness of these methods. On the other hand, recent developments in computer vision have led to a plethora of applications where the model deals with multi-way affinity relations and can be posed as a uniform hypergraph. In this paper, we view these models as random m-uniform hypergraphs and establish the consistency of the spectral algorithm in this general setting. We develop a planted partition model, or stochastic blockmodel, for such problems using higher order tensors, present a spectral technique suited for the purpose, and study its large sample behavior. The analysis reveals that the algorithm is consistent for m-uniform hypergraphs for larger values of m, and that the rate of convergence improves for increasing m. Our result provides the first theoretical evidence that establishes the importance of m-way affinities.

Debarghya Ghoshdastidar, Ambedkar Dukkipati
Consistency of weighted majority votes
We revisit from a statistical learning perspective the classical decision-theoretic problem of weighted expert voting. In particular, we examine the consistency (both asymptotic and finitary) of the optimal Nitzan-Paroush weighted majority and related rules. In the case of known expert competence levels, we give sharp error estimates for the optimal rule. When the competence levels are unknown, they must be empirically estimated. We provide frequentist and Bayesian analyses for this situation. Some of our proof techniques are non-standard and may be of independent interest. The bounds we derive are nearly optimal, and several challenging open problems are posed. Experimental results are provided to illustrate the theory.
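
As a rough illustration of the known-competence case, the sketch below implements the classical Nitzan-Paroush rule: each expert's ±1 vote is weighted by the log-odds of that expert's competence, $w_i = \log(p_i/(1-p_i))$, and the decision is the sign of the weighted sum. The function names, the tie-breaking choice, and the example competences are assumptions made for illustration and are not taken from the paper.

```python
import math


def nitzan_paroush_weights(competences):
    """Log-odds weights w_i = log(p_i / (1 - p_i)) for competences p_i in (0, 1)."""
    return [math.log(p / (1.0 - p)) for p in competences]


def weighted_majority(votes, weights):
    """Sign of the weighted sum of +1/-1 votes; ties go to +1 (an arbitrary choice)."""
    total = sum(w * v for w, v in zip(weights, votes))
    return 1 if total >= 0 else -1


# Hypothetical panel: one strong expert (p = 0.9) and two weak ones (p = 0.6).
competences = [0.9, 0.6, 0.6]
weights = nitzan_paroush_weights(competences)

# The strong expert votes -1, the two weak experts vote +1.
votes = [-1, 1, 1]
decision = weighted_majority(votes, weights)
unweighted = weighted_majority(votes, [1.0, 1.0, 1.0])
```

With these numbers the log-odds weight of the strong expert exceeds the combined weight of the two weak experts, so the weighted rule follows the strong expert while the unweighted majority does not.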

Daniel Berend, Aryeh Kontorovitch
Constant Nullspace Strong Convexity and Fast Convergence of Proximal Methods under High-Dimensional Settings
State-of-the-art statistical estimators for high-dimensional problems take the form of regularized, and hence non-smooth, convex programs. A key facet of these statistical estimation problems is that they are typically not strongly convex under a high-dimensional sampling regime, when the Hessian matrix becomes rank-deficient. Under vanilla convexity, however, proximal optimization methods attain only a sublinear rate. In this paper, we investigate a novel variant of strong convexity, which we call Constant Nullspace Strong Convexity (CNSC), where we require that the objective function be strongly convex only over a constant subspace. As we show, the CNSC condition is naturally satisfied by high-dimensional statistical estimators. We then analyze the behavior of proximal methods under this CNSC condition: we show global linear convergence of Proximal Gradient and local quadratic convergence of Proximal Newton Method, when the regularization function comprising the statistical estimator is decomposable. We corroborate our theory via numerical experiments, and show a qualitative difference in the convergence rates of the proximal algorithms when the loss function does satisfy the CNSC condition.

En-Hsu Yen, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Constrained convex minimization via model-based excessive gap
We customize Nesterov's excessive gap technique to analyze first-order methods for constrained convex minimization. As a result, we construct first-order primal-dual methods with optimal convergence rates on the primal objective residual and the primal feasibility gap of their iterates separately. Through a dual smoothing strategy, our framework subsumes the augmented Lagrangian, alternating direction, and dual fast-gradient methods as special cases, where our rates apply.

Quoc Tran-Dinh, Volkan Cevher
Content-based recommendations with Poisson factorization
We develop collaborative topic Poisson factorization (CTPF), a generative model of articles and reader preferences. CTPF can be used to build recommender systems by learning from reader histories and content to recommend personalized articles of interest. In detail, CTPF models both reader behavior and article texts with Poisson distributions, connecting the latent topics that represent the texts with the latent preferences that represent the readers. This provides better recommendations than other methods and gives an interpretable latent space for understanding patterns of readership. Further, we exploit stochastic variational inference to model massive real-world datasets. For example, within hours we can fit CTPF to the full arXiv usage dataset, which contains over 43 million user observations and 54 million words. We demonstrate empirically that our model outperforms several baselines, including the previous state-of-the-art approach.

Prem Gopalan, Laurent Charlin, David Blei
Controlling privacy in recommender systems
Recommender systems involve an inherent trade-off between accuracy of recommendations and the extent to which users are willing to release information about their preferences. In this paper, we explore a two-tiered notion of privacy where there is a small set of "public" users who are willing to share their preferences openly, and a large set of "private" users who require privacy guarantees. We show theoretically and demonstrate empirically that a moderate number of public users with no access to private user information already suffices for reasonable accuracy. Moreover, we introduce a new privacy concept for gleaning relational information from private users while maintaining a first order deniability. We demonstrate gains from controlled access to private user preferences.

Yu Xin, Tommi Jaakkola
Convex Deep Learning via Normalized Kernels
Deep learning has been a long-standing pursuit in machine learning, which until recently was hampered by unreliable training methods, prior to the discovery of improved training heuristics for embedded layer training. A complementary research strategy is to develop alternative modeling architectures that admit efficient training methods while expanding the range of representable structures toward deep models. In this paper, we develop a new architecture for nested nonlinearities that allows arbitrarily deep compositions to be trained to global optimality. The approach admits both parametric and nonparametric forms through the use of normalized kernels to represent each latent layer. The outcome is the first fully convex formulation that is able to capture compositions of trainable nonlinear layers to arbitrary depth.

Özlem Aslan, Xinhua Zhang, Dale Schuurmans
Convex Optimization Procedure for Clustering: Theoretical Revisit
In this paper, we present a theoretical analysis of COP, a convex optimization procedure for clustering recently proposed in \cite{ICML2011Hocking_419,SON, Lindsten650707}. In particular, we show that if the samples are drawn from two cubes, each being one cluster, then COP can provably identify the cluster membership provided that the distance between the two cubes is larger than a threshold which (linearly) depends on the size of the cube and the ratio of the numbers of samples in each cluster. To the best of our knowledge, this paper is the first to provide a rigorous analysis of why and when COP works. We believe this may provide important insights for developing novel convex optimization based algorithms for clustering.

Changbo Zhu, Huan Xu, Chenlei Leng, Shuicheng Yan
Convolutional Neural Network Architectures for Matching Natural Language Sentences
Semantic matching is of central importance to many natural language tasks \cite{bordes2014semantic,RetrievalQA}. A successful matching algorithm needs to adequately model the internal structures of language objects and the interaction between them. As a step toward this goal, we propose convolutional neural network models for matching two sentences, by adapting the convolutional strategy in vision and speech. The proposed models not only nicely represent the hierarchical structure in sentences with their layer-by-layer composition and pooling, but also capture the rich matching patterns at different levels. Our models are rather generic, requiring no prior knowledge of language, and can hence be applied to matching tasks of different nature and in different languages. The empirical study on a variety of matching tasks demonstrates the efficacy of the proposed models and their superiority to competitor models.

Baotian Hu, Zhengdong Lu, Hang Li, Qingcai Chen
Coresets for k-Segmentation of Streaming Data
Life-logging video streams, financial time series, and Twitter tweets are a few examples of high-dimensional signals over practically unbounded time. We consider the problem of computing optimal segmentation of such signals by a $k$-piecewise linear function, using only one pass over the data by maintaining a coreset for the signal. The coreset enables fast further analysis such as automatic summarization and analysis of such signals. A coreset (core-set) is a compact representation of the data seen so far, which approximates the data well for a specific task -- in our case, segmentation of the stream. We show that, perhaps surprisingly, the segmentation problem admits coresets of cardinality only linear in the number of segments $k$ and independent of both the dimension $d$ of the signal, and its number $n$ of points. More precisely, we construct a representation of size $O(k/\varepsilon^2)$ that provides a $(1+\varepsilon)$-approximation for the sum of squared distances to any given $k$-piecewise linear function. Moreover, such coresets can be constructed in a parallel streaming approach. Our results rely on a novel reduction of statistical estimations to problems in computational geometry. We empirically evaluate our algorithms on very large synthetic and real data sets from GPS, video and financial domains, using 255 machines in Amazon cloud.

Guy Rosman, Mikhail Volkov, Dan Feldman, John Fisher III, Daniela Rus
Covariance shrinkage for autocorrelated data
The accurate estimation of covariance matrices is essential for many signal processing and machine learning algorithms. In high dimensional settings the sample covariance is known to perform poorly, hence regularisation strategies such as analytic shrinkage of Ledoit/Wolf are applied. In the standard setting, iid data is assumed; in practice, however, time series typically exhibit strong autocorrelation structure, which introduces a pronounced estimation bias. Recent work by Sancetta has extended the shrinkage framework beyond iid data. We contribute in this work by showing that the Sancetta estimator, while being consistent in the high-dimensional limit, suffers from a high bias in finite sample sizes. We propose an alternative estimator, which is (1) unbiased, (2) less sensitive to hyperparameter choice and (3) yields superior performance in simulations on toy data and on one real world data set from an EEG-based Brain-Computer-Interfacing experiment.

Daniel Bartz, Klaus-Robert Mueller
DFacTo: Distributed Factorization of Tensors
We present a technique for significantly speeding up Alternating Least Squares (ALS) and Gradient Descent (GD), two widely used algorithms for tensor factorization. By exploiting properties of the Khatri-Rao product, we show how to efficiently address a computationally challenging sub-step of both algorithms. Our algorithm, DFacTo, only requires two matrix-vector products and is easy to parallelize. DFacTo is not only scalable but also on average 4 to 10 times faster than competing algorithms on a variety of datasets. For instance, DFacTo only takes 480 seconds on 4 machines to perform one iteration of the ALS algorithm and 1,143 seconds to perform one iteration of the GD algorithm on a 6.5 million x 2.5 million x 1.5 million dimensional tensor with 1.2 billion non-zero entries.

Joon Hee Choi, S. Vishwanathan
Decomposing Parameter Estimation Problems
We propose a technique for decomposing the parameter learning problem in Bayesian networks into independent learning problems. Our technique applies to incomplete datasets and exploits variables that are either hidden or observed in the given dataset. We show empirically that the proposed technique can lead to orders-of-magnitude savings in learning time. We explain, analytically and empirically, the reasons behind our reported savings, and compare the proposed technique to related ones that are sometimes used by inference algorithms.

Khaled Refaat, Arthur Choi, Adnan Darwiche
Deconvolution of High Dimensional Mixtures via Boosting, with Application to Diffusion-Weighted MRI of Human Brain
Diffusion-weighted magnetic resonance imaging (DWI) and fiber tractography are the only methods to measure the structure of the white matter in the living human brain. The diffusion signal has been modelled as the combined contribution from many individual fascicles of nerve fibers passing through each location in the white matter. Typically, this is done via basis pursuit, but estimation of the exact directions is limited due to discretization. The difficulties inherent in modeling DWI data are shared by many other problems involving fitting non-parametric mixture models. Ekanadham et al. proposed an approach, continuous basis pursuit, to overcome discretization error in the 1-dimensional case (e.g., spike-sorting). Here, we propose a more general algorithm that fits mixture models of any dimensionality without discretization. Our algorithm uses the principles of L2-boost, together with refitting of the weights and pruning of the parameters. The addition of these steps to L2-boost both accelerates the algorithm and assures its accuracy. We refer to the resulting algorithm as elastic basis pursuit, or EBP, since it expands and contracts the active set of kernels as needed. We show that in contrast to existing approaches to fitting mixtures, our boosting framework (1) enables the selection of the optimal bias-variance tradeoff along the solution path, and (2) scales with high-dimensional problems. In simulations of DWI, we find that EBP yields better parameter estimates than a non-negative least squares (NNLS) approach, or the standard model used in DWI, the tensor model, which serves as the basis for diffusion tensor imaging (DTI). We demonstrate the utility of the method in DWI data acquired in parts of the brain containing crossings of multiple fascicles of nerve fibers.

Charles Zheng, Franco Pestilli, Ariel Rokem
Decoupled Variational Gaussian Inference
Variational Gaussian (VG) inference methods that optimize a lower bound to the marginal likelihood are a popular approach for Bayesian inference. These methods are fast and easy to use, while being reasonably accurate. A difficulty remains in computation of the lower bound when the latent dimensionality $L$ is large. Even though the lower bound is concave for many models, its computation requires optimization over $O(L^2)$ variational parameters. Efficient reparameterization schemes can reduce the number of parameters, but give inaccurate solutions or destroy concavity, leading to slow convergence. We propose decoupled variational inference that brings the best of both worlds together. First, it maximizes a Lagrangian of the lower bound, reducing the number of parameters to $O(N)$, where $N$ is the number of data examples. The reparameterization obtained is unique and recovers maxima of the lower bound even when the bound is not concave. Second, our method maximizes the lower bound using a sequence of convex problems, each of which is parallelizable over data examples and computes gradients efficiently. Overall, our approach avoids all direct computations of the covariance, only requiring its linear projections. Theoretically, our method converges at the same rate as existing methods in the case of concave lower bounds, while remaining convergent at a reasonable rate for the non-concave case.

Mohammad Emtiyaz Khan
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
We introduce a model for bidirectional retrieval of images and sentences through a multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. In addition to a ranking objective seen in previous work, this allows us to add a new fragment alignment objective that learns to directly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments significantly improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions since the inferred inter-modal fragment alignment is explicit.

Andrej Karpathy, Armand Joulin, Fei Fei Li
Deep Learning Multi-View Representation for Face Recognition
Various factors, such as identities, views (poses), and illuminations, are coupled in face images. Disentangling the identity and view representations is a major challenge in face recognition. Existing face recognition systems either use handcrafted features or learn features discriminatively to improve recognition accuracy. This is different from the behavior of the human brain. Intriguingly, even without accessing 3D data, humans can not only recognize face identity, but can also imagine face images of a person under different viewpoints given a single 2D image, making face perception in the brain robust to view changes. In this sense, the human brain has learned and encoded 3D face models from 2D images. To take this instinct into account, this paper proposes a novel deep neural net, named multi-view perceptron (MVP), which can untangle the identity and view features and, at the same time, infer a full spectrum of multi-view images given a single 2D face image. The identity features of MVP achieve superior performance on the MultiPIE dataset. MVP is also capable of interpolating and predicting images under viewpoints that are unobserved in the training data.

Zhenyao Zhu, Ping Luo, Xiaogang Wang
Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning
The combination of modern Reinforcement Learning and Deep Learning approaches holds the promise of making significant progress on challenging applications requiring both rich perception and policy-selection. The Arcade Learning Environment (ALE) provides a set of Atari games that represent a useful benchmark set of such applications. Existing approaches to Atari game play fall into two categories. Monte-Carlo Tree Search methods exploit access to the hidden state of the game provided by ALE, and use it to simulate trajectories and plan for the action in every current state. Model-free approaches use only observations of the game screen and learn representations of policies using some function approximation method. A recent breakthrough in combining Q-learning with deep learning, called DQN, achieves the best model-free agent thus far. The planning-based approaches still achieve far higher scores than the best model-free approaches, but they exploit information that is not available to human players, and they are orders of magnitude slower than needed for real-time play. Our main goal in this work is to build a better real-time Atari game playing agent than DQN. The central idea is to use the slow planning-based agents to provide training data for a deep-learning architecture capable of real-time play. We propose three new agents based on this idea and show that they outperform DQN.

Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard Lewis, Xiaoshi Wang
Deep Networks with Internal Selective Attention through Feedback Connections
Traditional convolutional neural networks (CNN) are stationary and feedforward. They neither change their parameters during evaluation nor use feedback from higher to lower layers. Real brains, however, do. So does our Deep Attention Selective Network (dasNet) architecture. DasNet’s feedback structure can dynamically alter its convolutional filter sensitivities during classification. It harnesses the power of sequential processing to improve classification performance, by allowing the network to iteratively focus its internal attention on some of its convolutional filters. Feedback is trained through direct policy search in a huge million-dimensional parameter space, through scalable natural evolution strategies (SNES). On the CIFAR-10 and CIFAR-100 datasets, dasNet outperforms the previous state-of-the-art model.

Marijn Stollenga, Jonathan Masci, Faustino Gomez, Jürgen Schmidhuber
Deep Recursive Neural Networks for Compositionality in Language
Recursive neural networks comprise a class of architectures that can operate on structural input. They have previously been successfully applied to model compositionality in natural language with parse tree based structural representations. Even though these architectures are deep in structure, they lack the capacity for hierarchical representation that exists in conventional deep feed-forward networks as well as in recently investigated deep recurrent neural networks. In this work we introduce a new architecture, the deep recursive neural network (deep RNN), which is constructed by stacking multiple recursive layers. We evaluate the proposed model on the task of fine-grained sentiment classification. Our results show that deep RNNs outperform their shallow counterparts with the same number of parameters. Furthermore, our approach outperforms previous baselines, including a multiplicative variant of RNN as well as the recently introduced paragraph vectors, and achieves new state-of-the-art results on the task. We provide exploratory analyses of the effect of multiple layers and show that they capture different aspects of composition in language.

Ozan Irsoy, Claire Cardie
Deep Symmetry Networks
The chief difficulty in object recognition is that objects' classes are obscured by a large number of extraneous sources of variability, such as pose and part deformation. These sources of variation can be represented by symmetry groups, sets of composable transformations that preserve object identity. Convolutional neural networks (convnets) achieve a degree of translational invariance by computing feature maps over the translation group, but cannot handle other groups. As a result, these groups' effects have to be approximated by small translations, which often requires augmenting datasets and leads to high sample complexity. In this paper, we introduce deep symmetry networks (symnets), a generalization of convnets that forms feature maps over arbitrary symmetry groups. Symnets use kernel-based interpolation to tractably tie parameters and pool over symmetry spaces of any dimension. Like convnets, they are trained with backpropagation. The composition of feature transformations through the layers of a symnet provides a new approach to deep learning. Experiments on NORB and MNIST-rot show that symnets over the affine group greatly reduce sample complexity relative to convnets by better capturing the symmetries in the data.

Robert Gens, Pedro Domingos
Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning
We analyze new online gradient descent algorithms for distributed systems with large delays between gradient computations and the corresponding updates. Using insights from adaptive gradient methods, we develop algorithms that adapt not only to the sequence of gradients, but also to the precise update delays that occur. We first give an impractical algorithm that achieves a regret bound that precisely quantifies the impact of the delays. We then analyze AdaptiveRevision, an algorithm that is efficiently implementable and achieves comparable guarantees. The key algorithmic technique is appropriately and efficiently revising the learning rate used for previous gradient steps. Experimental results show that when the delays grow large (1000 updates or more), our new algorithms perform significantly better than standard adaptive gradient methods.

Brendan McMahan, Matthew Streeter
Dependent nonparametric trees for dynamic hierarchical clustering
Hierarchical clustering methods offer an intuitive and powerful way to model a wide variety of data sets. However, the assumption of a fixed hierarchy is often overly restrictive when working with data generated over a period of time: We expect both the structure of our hierarchy, and the parameters of the clusters, to evolve with time. In this paper, we present a distribution over collections of time-dependent, infinite-dimensional trees that can be used to model evolving hierarchies, and present an efficient and scalable algorithm for performing approximate inference in such a model. We demonstrate the efficacy of our model and inference algorithm on both synthetic data and real-world document corpora.

Kumar Dubey, Qirong Ho, Sinead Williamson, Eric Xing
Deterministic Symmetric Positive Semidefinite Matrix Completion
We consider the problem of recovering a full symmetric, positive semidefinite (SPSD) matrix from a subset of its entries, possibly corrupted by noise. In contrast to previous matrix recovery work, we drop the assumption of a random sampling of entries in favor of a deterministic sampling of principal minors of the matrix. We develop necessary conditions for the recovery of an SPSD matrix by any method under this assumption, and then present an algorithm that can recover the exact, low-rank matrix when the minimal necessary conditions are met. The proposed algorithm is naturally generalized to the problem of noisy matrix recovery, and we provide worst-case bounds on reconstruction error for this scenario. Finally, we demonstrate the algorithm's utility on noise-free and noisy datasets.

William Bishop, Byron Yu
Dimensionality Reduction with Subspace Structure Preservation
Modeling data as being sampled from a union of independent subspaces has been widely applied to a number of real world applications. However, dimensionality reduction approaches that theoretically preserve this independence assumption have not been well studied. Our key contribution is to show that $2K$ projection vectors are sufficient for the independence preservation of any $K$ class data sampled from a union of independent subspaces. It is this non-trivial observation that we use for designing our dimensionality reduction technique. In this paper, we propose a novel dimensionality reduction algorithm that theoretically preserves this structure for a given dataset. We support our theoretical analysis with empirical results on both synthetic and real world data, achieving state-of-the-art results compared to popular dimensionality reduction techniques.

Devansh Arpit, Ifeoma Nwogu, Venu Govindaraju
Discovering Structure in High-Dimensional Data Through Correlation Explanation
We introduce a method to learn a hierarchy of successively more abstract representations of complex data based on optimizing an information-theoretic objective. Intuitively, the optimization searches for the simplest set of factors that can explain the correlations in the data as measured by multivariate mutual information. The method is unsupervised, requires no model assumptions, and scales linearly with the number of variables which makes it an attractive approach for very high dimensional systems. We demonstrate that Correlation Explanation (CorEx) automatically discovers meaningful structure for data from diverse sources including personality tests, DNA, and human language.

Greg Ver Steeg, Aram Galstyan
Discovering, Learning and Exploiting Relevance
In this paper we consider the problem of learning online which information to consider when making sequential decisions. We formalize this as a contextual multi-armed bandit problem where a high dimensional ($D$-dimensional) context vector arrives at a learner, which needs to select an action to maximize its expected reward at each time step. Each dimension of the context vector is called a type. We assume that there exists an unknown relation between actions and types, called the relevance relation, such that the reward of an action only depends on the contexts of the relevant types. When the relation is a function, i.e., the reward of an action only depends on the context of a single type, and the expected reward of an action is Lipschitz continuous in the context of its relevant type, we propose an algorithm that achieves $\tilde{O}(T^{\gamma})$ regret with a high probability, where $\gamma=2/(1+\sqrt{2})$. Our algorithm achieves this by learning the unknown relevance relation, whereas prior contextual bandit algorithms that do not exploit the existence of a relevance relation will have $\tilde{O}(T^{(D+1)/(D+2)})$ regret. Our algorithm alternates between exploring and exploiting; it does not require reward observations in exploitations, and it guarantees with a high probability that actions with suboptimality greater than $\epsilon$ are never selected in exploitations. Our proposed method can be applied to a variety of learning applications including medical diagnosis, recommender systems, popularity prediction from social networks, network security, etc., where at each instance of time vast amounts of different types of information are available to the decision maker, but the effect of an action depends only on a single type.
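
To make the two regret rates quoted above concrete, the following sketch compares only their exponents: $\gamma = 2/(1+\sqrt{2}) \approx 0.828$ against $(D+1)/(D+2)$. It ignores the constants and logarithmic factors hidden by the $\tilde{O}$ notation, so it is an arithmetic illustration under that simplification rather than a statement about actual regret; the function names are hypothetical.

```python
import math


def relevance_exponent():
    """Exponent gamma = 2 / (1 + sqrt(2)) quoted in the abstract."""
    return 2.0 / (1.0 + math.sqrt(2.0))


def generic_exponent(d):
    """Exponent (D + 1) / (D + 2) quoted for algorithms that ignore relevance."""
    return (d + 1) / (d + 2)


def smallest_dimension_with_larger_generic_exponent(max_d=1000):
    """Smallest D in [1, max_d] with (D + 1) / (D + 2) > gamma, or None."""
    gamma = relevance_exponent()
    for d in range(1, max_d + 1):
        if generic_exponent(d) > gamma:
            return d
    return None


gamma = relevance_exponent()
crossover = smallest_dimension_with_larger_generic_exponent()
for d in (1, 2, 4, 10, 100):
    print(d, round(generic_exponent(d), 4), round(gamma, 4))
```

Since $(D+1)/(D+2)$ tends to 1 as $D$ grows while $\gamma$ does not depend on $D$, the exponent comparison favors the relevance-based rate once the context dimension is moderately large.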

Cem Tekin, Mihaela van der Schaar
Discriminative Metric Learning by Neighborhood Gerrymandering
We formulate the problem of metric learning for k nearest neighbor classification as a large margin structured prediction problem, with a latent variable representing the choice of neighbors and the task loss directly corresponding to classification error. We describe an efficient algorithm for exact loss augmented inference, and a fast gradient descent algorithm for learning in this model. The objective drives the metric to establish neighborhood boundaries that benefit the true class labels for the training points. Our approach, reminiscent of gerrymandering (redrawing of political boundaries to provide advantage to certain parties), is more direct in its handling of optimizing classification accuracy than those previously proposed. In experiments on a variety of data sets our method is shown to achieve excellent results compared to the current state of the art in metric learning.

Shubhendu Trivedi, David McAllester, Greg Shakhnarovich
Discriminative Unsupervised Feature Learning with Convolutional Neural Networks
Current methods for training convolutional neural networks depend on large amounts of labeled examples for supervised training. In this paper we present an approach for training a convolutional neural network using only unlabeled data. We train the network to discriminate between a set of surrogate classes. Each surrogate class is formed by applying a variety of transformations to a randomly sampled 'seed' image patch. We find that this simple feature learning algorithm is surprisingly successful when applied to visual object recognition. The feature representation learned by our algorithm achieves classification results matching or outperforming the current state-of-the-art for unsupervised learning on several popular datasets (STL-10, CIFAR-10, Caltech-101).

Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, Thomas Brox
Distance-based Network Recovery under Feature Correlation
We present an inference method for Gaussian Graphical Models when only pairwise distances of n objects are observed. Formally, this is a problem of estimating an n x n covariance matrix from the Mahalanobis distances d_MH(x_i, x_j), where objects x_i live in a latent feature space. We solve the estimation problem in fully Bayesian fashion by integrating the Matrix-Normal distribution over a Matrix-Gamma prior. The resulting Matrix-T distribution enables network recovery even under strongly correlated features. Hereby, we generalize TiWnet, which assumes Euclidean distances with strict feature independence. In spite of the greatly increased flexibility, our model neither loses statistical power nor entails more computational cost. We argue that the extension is highly relevant as it yields significantly better results in both synthetic and real-world experiments. We demonstrate this by a successful application to biological pathways in cancer patients.

David Adametz, Volker Roth
Distributed Balanced Clustering via Mapping Coresets
Large-scale clustering of data points in metric spaces is an important problem in mining big data sets. For many applications, we face explicit or implicit size constraints for each cluster, which leads to the problem of clustering under capacity constraints or the ``balanced clustering'' problem. Although the balanced clustering problem has been widely studied, developing a theoretically sound distributed algorithm remains an open problem. In the present paper we develop a general framework based on ``mapping coresets'' to tackle this issue. Our technique results in the first distributed approximation algorithms for balanced clustering problems for a wide range of clustering objective functions such as k-center, k-median, and k-means.

MohammadHossein Bateni, Aditya Bhaskara, Silvio Lattanzi, Vahab Mirrokni
Distributed Context-Aware Bayesian Posterior Sampling via Expectation Propagation
In this paper we propose a distributed parallel inference algorithm for large-scale Bayesian posterior estimation. Our procedure is based on combining both Markov chain Monte Carlo sampling (MCMC) and Expectation Propagation (EP). Specifically, each node in the cluster locally performs MCMC sampling based on its subset of data and a context prior. At the same time, statistics are collected from the samples and passed around among the nodes, updating the context. We find such context awareness beneficial for improving estimation accuracy. Furthermore, our algorithm bears very low communication cost and is naturally asynchronous. We demonstrate the advantages of our algorithm through empirical studies on Bayesian logistic regression.

Minjie Xu, Yee Whye Teh, Jun Zhu, Bo Zhang
Distributed Estimation, Information Loss and Exponential Families
Distributed learning of probabilistic models from multiple data repositories with minimum communication is increasingly important. We study a simple communication-efficient learning framework that first calculates the local maximum likelihood estimates (MLE) based on the data subsets, and then combines the local MLEs to achieve the best possible approximation to the global MLE, based on the whole dataset jointly. We study the statistical properties of this framework, showing that the loss of efficiency compared to the global setting relates to how much the underlying distribution families deviate from full exponential families, drawing connection to the theory of information loss by Fisher, Rao and Efron. We show that the "full-exponential-family-ness" represents the lower bound of the error rate of arbitrary combinations of local MLEs, and is achieved by a KL-divergence-based combination method but not by a more common linear combination method. We also study the empirical properties of the KL and linear combination methods, showing that the KL method significantly outperforms linear combination in practical settings with issues such as model misspecification, non-convexity, and heterogeneous data partitions.

Qiang Liu, Alex Ihler
Distributed Parameter Estimation in Probabilistic Graphical Models
This paper presents foundational theoretical results on distributed parameter estimation for undirected probabilistic graphical models. It introduces a general condition on composite likelihood decompositions of these models which guarantees the global consistency of distributed estimators, provided the local estimators are consistent.

Yariv Mizrahi, Misha Denil, Nando de Freitas
Distributed Power-law Graph Computing: Theoretical and Empirical Analysis
Typically, a large-scale natural graph follows a skewed power law. In distributed graph-structured computations, the skewness usually leads to poor partitioning, which causes high communication cost and workload imbalance. Thus, graph partitioning~(GP) is a challenging issue. To tackle this challenge, we introduce degree-based hashing techniques into GP via vertex-cut. Accordingly, we develop a novel GP approach called \textit{PowerLore}. PowerLore is attractive because it naturally makes use of the skewed degree distribution. In addition, we provide a theoretical analysis. Experiments on several large skewed graphs further show that \textit{PowerLore} outperforms the state-of-the-art baselines in both decreasing communication costs and guaranteeing good balance.

Cong Xie, Ling Yan, Wu-Jun Li, Zhihua Zhang
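
To make the idea of degree-based hashing for vertex-cut partitioning concrete, here is a minimal Python sketch. It is not the authors' implementation: the rule of hashing the lower-degree endpoint of each edge, the toy hash `key % num_machines`, and the names `degree_based_hash_partition` and `replication_factor` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def degree_based_hash_partition(edges, num_machines):
    """Assign each edge to a machine by hashing its lower-degree endpoint.

    Assumption: this particular rule is illustrative, not taken from the
    abstract. Vertex ids are assumed to be non-negative integers.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    assignment = {}
    for u, v in edges:
        # Hash the endpoint with the smaller degree (ties go to u), so that
        # high-degree vertices are the ones replicated across machines.
        key = u if degree[u] <= degree[v] else v
        assignment[(u, v)] = key % num_machines  # toy hash function
    return assignment


def replication_factor(assignment):
    """Average number of machines that hold a replica of each vertex."""
    replicas = defaultdict(set)
    for (u, v), machine in assignment.items():
        replicas[u].add(machine)
        replicas[v].add(machine)
    return sum(len(machines) for machines in replicas.values()) / len(replicas)


# A star graph: hub 0 connected to leaves 1..6.
star_edges = [(0, leaf) for leaf in range(1, 7)]
star_assignment = degree_based_hash_partition(star_edges, num_machines=3)
```

On the star graph, only the hub is replicated on several machines while every leaf stays on a single machine, and the edges are spread evenly; this is the kind of behavior a degree-aware vertex-cut aims for on skewed graphs.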
Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models
Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.

Yarin Gal, Mark van der Wilk, Carl Rasmussen
Diverse Randomized Agents Vote to Win
We investigate the power of voting among diverse, randomized software agents. With teams of computer Go agents in mind, we develop a novel theoretical model of two-stage noisy voting that builds on recent work in machine learning. This model allows us to reason about a collection of agents with different biases (determined by the first-stage noise models), which, furthermore, apply randomized algorithms to evaluate alternatives and produce votes (captured by the second-stage noise models). We analytically demonstrate that a uniform team, consisting of multiple instances of any single agent (with different random seeds), must make a significant number of mistakes, whereas a diverse team converges to perfection as the number of agents grows. Our experiments, which pit teams of computer Go agents against strong agents, provide evidence for the effectiveness of voting when agents are diverse.

Albert Jiang, Leandro Marcolino, Ariel Procaccia, Tuomas Sandholm, Nisarg Shah, Milind Tambe
Diverse Sequential Subset Selection for Supervised Video Summarization
Video summarization is a challenging problem with great application potential. Whereas prior approaches, largely unsupervised in nature, focus on sampling useful frames (or subshots) and assembling them as summaries, we consider video summarization as a supervised subset selection problem. Our idea is to teach the system to learn from human-created summaries how to select informative and diverse subsets to best meet evaluation metrics derived from human quality perception. To this end, we propose the sequential determinantal point process (seqDPP), a new probabilistic model for diverse sequential subset selection. Our novel seqDPP heeds the inherent sequential structures in video data, thus overcoming the deficiency of the standard DPP in treating video frames as randomly permutable items. Meanwhile, seqDPP retains the power of modeling diverse subsets. Our extensive empirical studies on summarizing videos from 3 datasets demonstrate the superior performance of our method, not only to unsupervised methods but also to naive applications of the standard DPP model.

Boqing Gong, Wei-Lun Chao, Kristen Grauman, Fei Sha
Divide-and-Conquer Learning by Anchoring a Conical Hull
We reduce a broad class of machine learning problems, usually addressed by EM or sampling, to the problem of finding the $k$ extremal rays spanning the conical hull of a data point set. These $k$ ``anchors'' lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the $k$ anchors, we propose a novel divide-and-conquer learning scheme ``DCA'' that distributes the problem to $\mathcal O(k\log k)$ same-type sub-problems on different low-D random hyperplanes, each of which can be solved by any solver. For the 2D sub-problem, we present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other methods to check whether a point is covered in a conical hull, which improves algorithm design in multiple dimensions and brings significant speedup to learning. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, then show its competitive performance and scalability over other methods on rich datasets.

Tianyi Zhou, Jeffrey Bilmes, Carlos Guestrin
Do Convnets Learn Correspondence?
Convolutional neural nets trained from massive labeled datasets \cite{ImageNet} have substantially improved the state-of-the-art in image classification \cite{Krizhevsky} and object detection \cite{RossJeff}. However, visual understanding often entails establishing correspondence on a finer level than object category. Given their large pooling regions and training from whole-image labels, it is not clear whether convnets derive their success from an accurate correspondence model which could be used for localization. In this paper, we study the effectiveness of convnet activation features for tasks requiring correspondence. We present evidence that convnet features localize at a much finer scale than their receptive field sizes, that they can be used to perform intraclass alignment as well as conventional hand-engineered features, and that they outperform conventional features in keypoint prediction on objects from PASCAL VOC 2011 \cite{pascal}.

Jonathan Long, Ning Zhang, Trevor Darrell
Do Deep Nets Really Need to be Deep?
Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we show that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow neural nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional architectures. Our success in training shallow neural nets to mimic deeper models suggests that there may be better algorithms for training shallow nets than those currently available.

Jimmy Ba, Rich Caruana
Dynamic Topic Modeling via Rank Factor Analysis
We propose a semi-parametric and dynamic rank factor model for topic modeling, capable of (1) discovering the time-evolving importance of topics, (2) learning contemporary multi-scale dependence structures, and (3) providing topic- and word-correlations as a byproduct. The high-dimensional, time-evolving ordinal/rank observations (such as word counts), after an arbitrary monotone transformation, are well accommodated through an underlying dynamic sparse factor model. The framework naturally admits heavy-tailed innovations, capable of inferring abrupt temporal jumps in the importance of topics. The proposed framework provides an alternative to dynamic and correlated topic modeling, appropriate for ordinal time series analysis. Posterior inference is performed through straightforward Gibbs sampling, based on the forward-filtering backward-sampling algorithm. Moreover, an efficient data subsampling scheme is leveraged to speed up inference on massive datasets. The modeling framework is illustrated on two real datasets: the US State of the Union Address and the JSTOR collection from Science.

Lin Du, Shaobo Han, Lawrence Carin, Esther Salazar
Effective Deep Face Representation Comes from both Identification and Verification Tasks
Developing effective feature representations which reduce intra-personal variations while enlarging inter-personal differences is the key challenge of face recognition. In this paper, we show that it can be well solved with deep learning and using both face identification and verification signals as supervision. The Deep Identification-Verification features (DeepIVs) are learned with carefully designed deep convolutional networks. The face identification task increases the inter-personal variations by drawing DeepIVs extracted from different identities apart, while the face verification task reduces the intra-personal variations by pulling DeepIVs extracted from the same identity together, both of which are essential to face recognition. The learned DeepIVs can be well generalized to new identities unseen in the training data. On the challenging LFW dataset, 99.15% verification accuracy is achieved. Compared with the best deep learning result on LFW, the error rate has been significantly reduced by 67%. The code will be released to the public.

Yi Sun, Xiaogang Wang, Xiaoou Tang
Efficient Inference of Continuous Markov Random Fields with Polynomial Potentials
In this paper, we prove that every multivariate polynomial with even degree can be decomposed into a sum of convex and concave polynomials. Motivated by this property, we exploit the concave-convex procedure to perform inference on continuous Markov random fields with polynomial potentials. In particular, we show that the concave-convex decomposition of polynomials can be expressed as a sum-of-squares optimization, which can be efficiently solved via semidefinite programming. We demonstrate the effectiveness of our approach in the context of 3D reconstruction, shape from shading and image denoising, and show that our approach significantly outperforms existing approaches in terms of efficiency as well as the quality of the retrieved solution.

Shenlong Wang, Alex Schwing, Raquel Urtasun
Efficient Minimax Signal Detection on Graphs
Several problems such as network intrusion, community detection, and disease outbreak can be described by observations attributed to nodes or edges of a graph. In these applications, the presence of an intrusion, community or disease outbreak is characterized by novel observations on some unknown connected subgraph. These problems can be formulated in terms of optimization of suitable objectives on connected subgraphs, a problem which is generally computationally difficult. We overcome the combinatorics of connectivity by embedding connected subgraphs into linear matrix inequalities (LMI). Computationally efficient tests are then realized by optimizing convex objective functions subject to these LMI constraints. We prove, by means of a novel Euclidean embedding argument, that our tests are nearly minimax optimal for exponential families of distributions and for graphs satisfying a polynomial growth property. We prove that the internal conductance of the connected subgraph family plays a fundamental role in characterizing detectability. We then experiment with synthetic and real datasets to demonstrate different features of our method.

Venkatesh Saligrama, Jing Qian
Efficient Minimax Strategies for Square Loss Games
We consider online prediction problems where the loss between the prediction and the outcome is measured by the squared Euclidean distance and its generalization, the squared Mahalanobis distance. We derive the minimax solutions for the case where the prediction and action spaces are the simplex (this setup is sometimes called the Brier game) and the $\ell_2$ ball (this setup is related to Gaussian density estimation). We show that in both cases the value of each sub-game is a quadratic function of a simple statistic of the state, with coefficients that can be efficiently computed using an explicit recurrence relation. The resulting deterministic minimax strategy and randomized maximin strategy are linear functions of the statistic.

Alan Malek, Wouter Koolen, Peter Bartlett
Efficient Optimization for Average Precision SVM
The accuracy of information retrieval systems is often measured using average precision (AP). Given a set of positive (relevant) and negative (non-relevant) samples, the parameters of a retrieval system can be estimated using the AP-SVM framework, which minimizes a regularized convex upper bound on the empirical AP loss. However, the high computational complexity of loss-augmented inference, which is required for learning an AP-SVM, prohibits its use on large training datasets. To alleviate this deficiency, we propose three complementary approaches. The first approach guarantees an asymptotic decrease in the computational complexity of loss-augmented inference by exploiting the problem structure. The second approach takes advantage of the fact that we do not require a full ranking during loss-augmented inference. This helps us to avoid the expensive step of sorting the negative samples according to their individual scores. The third approach approximates the AP loss over all samples by the AP loss over difficult samples (for example, those that are incorrectly classified by a binary SVM), while ensuring the correct classification of the remaining samples. Using the PASCAL VOC action classification dataset, we show that our approaches provide significant speed-ups during training without degrading the test accuracy of AP-SVM.

Pritish Mohapatra, C.V. Jawahar, M. Pawan Kumar
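
For readers unfamiliar with the quantity being optimized, the following Python sketch computes average precision (AP) for a scored set of positive and negative samples, using the standard definition (the mean, over positives, of the precision at each positive's rank). It is only an illustration of the metric, not the AP-SVM training procedure; the function names `average_precision` and `ap_loss` are hypothetical, and the full sort used here is exactly the kind of step the abstract's second approach seeks to avoid during loss-augmented inference.

```python
def average_precision(scores, labels):
    """AP of the ranking obtained by sorting samples by decreasing score.

    labels[i] is 1 for a positive (relevant) sample and 0 for a negative one.
    """
    num_positives = sum(labels)
    if num_positives == 0:
        raise ValueError("AP is undefined without positive samples")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / num_positives


def ap_loss(scores, labels):
    """Empirical AP loss: one minus the average precision."""
    return 1.0 - average_precision(scores, labels)


# Positives end up at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.
example_ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

A ranking that places every positive above every negative has AP equal to 1 and AP loss equal to 0.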
Efficient Partial Monitoring with Prior Information
Partial monitoring is a general model for online learning with limited feedback: a learner chooses actions in a sequential manner while an opponent chooses outcomes. In every round, the learner suffers some loss and receives some feedback based on the action and the outcome. The goal of the learner is to minimize her cumulative loss. Applications range from dynamic pricing to label-efficient prediction to dueling bandits. In this paper, we assume that we are given some prior information about the distribution based on which the opponent generates the outcomes. We propose BPM, a family of new efficient algorithms whose core is to track the outcome distribution with an ellipsoid centered around the estimated distribution. We show that our algorithm provably enjoys near-optimal regret rate for locally observable partial-monitoring problems against stochastic opponents. As demonstrated with experiments on synthetic as well as real-world data, the algorithm outperforms previous approaches, even for very uninformed priors, with an order of magnitude smaller regret and lower running time.

Hastagiri Vanchinathan, Gabor Bartok, Andreas Krause
Efficient Sampling for Learning Sparse Additive Models in High Dimensions
We consider the problem of learning sparse additive models, i.e., functions of the form: $f(\mathbf{x}) = \sum_{l \in S} \phi_{l}(x_l)$, $\mathbf{x} \in \mathbb{R}^d$ from point queries of $f$. Here $S$ is an unknown subset of coordinate variables with $|S| = k \ll d$. Assuming $\phi_l$'s to be smooth, we propose a set of points at which to sample $f$ and an efficient randomized algorithm that recovers a \textit{uniform approximation} to each unknown $\phi_l$. We provide a rigorous theoretical analysis of our scheme along with sample complexity bounds. Our algorithm utilizes recent results from compressive sensing theory along with a novel convex quadratic program for recovering robust uniform approximations to univariate functions, from point queries corrupted with arbitrary bounded noise. Lastly we theoretically analyze the impact of noise -- either arbitrary but bounded, or stochastic -- on the performance of our algorithm.

Hemant Tyagi, Bernd Gärtner, Andreas Krause
Efficient Structured Matrix Rank Minimization
We study the problem of finding structured low-rank matrices using nuclear norm regularization where the structure is encoded by a linear map. In contrast to most known approaches for linearly structured rank minimization, we do not (a) use the full SVD; nor (b) resort to augmented Lagrangian techniques; nor (c) solve linear systems per iteration. Instead, we formulate the problem differently so that it is amenable to a generalized conditional gradient method, which results in a practical improvement with low per iteration computational cost. Numerical results show that our approach significantly outperforms state-of-the-art competitors in terms of running time, while effectively recovering low rank solutions in stochastic system realization and 2-D compressed sensing problems.

Adams Wei Yu, Wanli Ma, Yaoliang Yu, Jaime Carbonell, Suvrit Sra
Efficient learning by implicit exploration in bandit problems with side observations
We consider a partial observability model capturing online learning problems where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot be always computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.

Tomáš Kocák, Gergely Neu, Michal Valko, Remi Munos
Efficiently Iterative Weighted Majority Voting for Crowdsourcing with Performance Guarantee
A key component of crowdsourcing is to aggregate noisy crowd inputs into final answers of high quality. Many commonly used aggregation methods such as expectation maximization (EM) based approaches are essentially weighted majority methods, in which the inputs from workers are combined together such that each worker is weighted according to the estimated worker reliability. In this paper, we consider the weighted majority voting method for multi-class labeling in crowdsourcing. By studying error rate bounds of weighted majority voting, we derive a bound-optimal aggregation rule, which can be shown to be close to the Bayes optimal rule. Based on the derived aggregation rule, we propose an intuitive and easy-to-implement algorithm called iterative weighted majority voting (IWMV) and give a performance guarantee on its first iteration. Compared to the traditional EM approach, IWMV is more robust to noise in estimating worker reliabilities and model misspecification. Experimental results on both simulation and real data show that IWMV performs at least on par with the state-of-the-art methods, and it has low computation cost (around one hundred times faster than the state-of-the-art methods).

Hongwei Li, Bin Yu
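
The general pattern of iterating between weighted voting and reliability estimation can be sketched in a few lines of Python. This is a hedged illustration rather than the IWMV algorithm as specified by the authors: the weight rule `num_classes * accuracy - 1` (which gives zero weight to a worker who is no better than random guessing), the stopping rule, and the function name are assumptions introduced here.

```python
def iterative_weighted_majority_vote(votes, num_classes, num_iterations=10):
    """Alternate between weighted voting and re-estimating worker weights.

    votes[w][i] is worker w's label (an int in range(num_classes)) for item i,
    or None if the worker did not label that item.
    """
    num_workers = len(votes)
    num_items = len(votes[0])
    weights = [1.0] * num_workers  # the first pass is plain majority voting
    labels = None

    for _ in range(num_iterations):
        # Aggregation step: each item gets the label with the largest
        # total weight (ties go to the smallest class index).
        new_labels = []
        for i in range(num_items):
            tally = [0.0] * num_classes
            for w in range(num_workers):
                if votes[w][i] is not None:
                    tally[votes[w][i]] += weights[w]
            new_labels.append(max(range(num_classes), key=lambda c: tally[c]))
        if new_labels == labels:
            break  # the aggregated labels stopped changing
        labels = new_labels

        # Reliability step: a worker's accuracy is estimated as agreement
        # with the current aggregated labels (illustrative assumption).
        for w in range(num_workers):
            answered = [i for i in range(num_items) if votes[w][i] is not None]
            if answered:
                agree = sum(votes[w][i] == labels[i] for i in answered)
                accuracy = agree / len(answered)
            else:
                accuracy = 1.0 / num_classes
            weights[w] = num_classes * accuracy - 1.0
    return labels, weights


# Two workers who agree on every item and one who always disagrees.
example_votes = [
    [0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
]
example_labels, example_weights = iterative_weighted_majority_vote(
    example_votes, num_classes=2
)
```

In this toy run the third worker ends up with a negative weight, so their votes count against the labels they propose.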
Elementary Estimators for Graphical Models
We propose a class of closed-form estimators for sparsity-structured graphical models, expressed as exponential family distributions, under high-dimensional settings. Our approach builds on observing the precise manner in which the classical graphical model MLE ``breaks down'' under high-dimensional settings. Our estimator uses a carefully constructed, well-defined and closed-form backward map, and then performs thresholding operations to ensure the desired sparsity structure. We provide a rigorous statistical analysis that shows that surprisingly our simple class of estimators recovers the same asymptotic convergence rates as those of the $\ell_1$-regularized MLEs that are much more difficult to compute. We corroborate this statistical performance, as well as significant computational advantages via simulations of both discrete and Gaussian graphical models.

Eunho Yang, Aurelie Lozano, Pradeep Ravikumar
Encoding High Dimensional Local Features by Sparse Coding Based Fisher Vectors
Deriving from the gradient vector of a generative model of local features, Fisher vector coding (FVC) has been identified as an effective coding method for image classification. Most, if not all, FVC implementations employ the Gaussian mixture model (GMM) to characterize the generation process of local features. This choice has proved to be sufficient for traditional low dimensional local features, e.g., SIFT; and typically, good performance can be achieved using a mixture of only a few hundred Gaussian distributions. However, the same number of Gaussian distributions could become insufficient to model the feature space spanned by higher dimensional local features, which have become popular recently. In order to improve the modeling capacity for high dimensional features, it turns out to be inefficient and computationally impractical to simply increase the number of Gaussian distributions. In this paper, we propose a generation process in which each local feature is drawn from a Gaussian distribution whose mean vector is sampled from a subspace. With a certain approximation, the resulting model can be converted to a sparse coding procedure and the learning/inference problems can be readily solved by standard sparse coding methods. By calculating the gradient vector of the proposed model, we derive a new Fisher vector encoding strategy, termed Sparse Coding based Fisher Vector Coding (SCFVC). Moreover, we adopt the recently developed Deep Convolutional Neural Network (CNN) descriptor as high dimensional local features and implement image classification with the proposed SCFVC. Our experimental evaluations demonstrate that our method not only significantly outperforms traditional GMM based Fisher vector encoding but also achieves state-of-the-art performance in generic object recognition, indoor scene and fine-grained image classification problems.

Lingqiao Liu, Chunhua Shen, Lei Wang, Anton van den Hengel, Chao Wang
Estimation with Norm Regularization
Analysis of estimation error and associated structured statistical recovery based on norm regularized regression, e.g., Lasso, needs to consider four aspects: the norm, the loss function, the design matrix, and the noise vector. This paper presents generalizations of such estimation error analysis on all four aspects, compared to the existing literature. We characterize the restricted error set, establish relations between error sets for the constrained and regularized problems, and present an estimation error bound applicable to {\em any} norm. Precise characterizations of the bound are presented for a variety of noise vectors, design matrices, including sub-Gaussian, anisotropic, and dependent samples, and loss functions, including least squares and generalized linear models. Gaussian widths, as a measure of size of suitable sets, and associated tools play a key role in our generalized analysis.

Arindam Banerjee, Farideh Fazayeli, Vidyashankar Sivakumar
Exact Post Model Selection Inference for Marginal Screening
We develop a framework for post model selection inference, via marginal screening, in linear regression. At the core of this framework is a result that characterizes the exact distribution of linear functions of the response $y$, conditional on the model being selected (``condition on selection'' framework). This allows us to construct valid confidence intervals and hypothesis tests for regression coefficients that account for the selection procedure. In contrast to recent work in high-dimensional statistics, our results are exact (non-asymptotic) and require no eigenvalue-like assumptions on the design matrix $X$. Furthermore, the computational cost of marginal regression, constructing confidence intervals and hypothesis testing is negligible compared to the cost of linear regression, thus making our methods particularly suitable for extremely large datasets. Although we focus on marginal screening to illustrate the applicability of the condition on selection framework, this framework is much more broadly applicable. We show how to apply the proposed framework to several other selection procedures including orthogonal matching pursuit and marginal screening+Lasso.

Jason Lee, Jonathan Taylor
Exclusive Feature Learning on Arbitrary Structures
Group lasso is widely used to enforce structural sparsity, achieving sparsity at the inter-group level. In this paper, we propose a new formulation called ``exclusive group lasso'', which brings out sparsity at the intra-group level in the context of feature selection. The proposed exclusive group lasso is applicable to any feature structure, whether overlapping or non-overlapping. We analyze the properties of exclusive group lasso, and propose an effective iteratively re-weighted algorithm to solve the corresponding optimization problem with rigorous convergence analysis. We show applications of exclusive group lasso for uncorrelated feature selection. Extensive experiments on both synthetic and real-world datasets indicate the good performance of the proposed methods.

Deguang Kong, Ryohei Fujimaki, Ji Liu, Feiping Nie, Chris Ding
Expectation Backpropagation: parameter-free training of multilayer neural networks with real and discrete weights
Multilayer Neural Networks (MNNs) are commonly trained using gradient descent-based methods, such as BackPropagation (BP). Inference in probabilistic graphical models is often done using variational Bayes methods, such as Expectation Propagation (EP). We show how an EP based approach can also be used to train deterministic MNNs. Specifically, we approximate the posterior of the weights given the data using a “mean-field” factorized distribution, in an online setting. Using online EP and the central limit theorem we find an analytical approximation to the Bayes update of this posterior, as well as the resulting Bayes estimates of the weights and outputs. Despite a different origin, the resulting algorithm, Expectation BackPropagation (EBP), is very similar to BP. However, it has several additional advantages: (1) Training is parameter-free, given initial conditions (prior) and the MNN architecture. This is useful for large-scale problems, where parameter tuning is a major challenge. (2) The weights can be restricted to have discrete values. This is especially useful for implementing trained MNNs in precision limited hardware chips, thus improving their speed and energy efficiency by several orders of magnitude. We test the EBP algorithm numerically in seven binary text classification tasks. In all tasks, except one, EBP outperforms: (1) standard BP with the optimal constant learning rate (2) previously reported state of the art. Interestingly, in these cases, EBP-trained MNNs with binary weights perform better than MNNs with real weights - if we average the MNN output over the ensemble of weight configurations (using the inferred approximate posterior). Such a “probabilistic” binary MNN could be realized in hardware with synaptic weights implemented as stochastic switches.

Daniel Soudry, Ron Meir, Itay Hubara
Expectation-Maximization for Learning Determinantal Point Processes
A determinantal point process (DPP) is a probabilistic model of set diversity compactly parameterized by a positive semi-definite kernel matrix. To fit a DPP to a given task, we would like to learn the entries of its kernel matrix by maximizing the log likelihood of the available data. However, log likelihood is non-convex in the entries of the kernel matrix, and this learning problem is conjectured to be NP-hard. Thus, previous work has instead focused on more restricted convex learning settings: learning only a single weight for each row of the kernel matrix, or learning weights for a linear combination of DPPs with fixed kernel matrices. In this work we propose a novel algorithm for learning the full kernel matrix. By changing the kernel parameterization from matrix entries to eigenvalues and eigenvectors, and then lower-bounding the likelihood in the manner of expectation-maximization algorithms, we obtain an elegant and effective optimization procedure. We test our method on both synthetic data and a real-world product recommendation task. On the latter we achieve relative gains of up to 10.9% in test log likelihood compared to the naive approach of maximizing likelihood by projected gradient ascent on the entries of the kernel matrix.
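
To make the objective concrete, here is a minimal sketch of a DPP log likelihood. It assumes the common L-ensemble form, P(Y) = det(L_Y) / det(L + I); the abstract only says "kernel matrix", so this parameterization, the function names, and the toy data are assumptions for illustration. It evaluates the likelihood only and does not implement the EM procedure described above.

```python
import itertools
import numpy as np

def dpp_log_prob(L, subset):
    # log P(Y) = log det(L_Y) - log det(L + I), assuming an L-ensemble DPP.
    n = L.shape[0]
    _, log_norm = np.linalg.slogdet(L + np.eye(n))
    if len(subset) == 0:
        return -log_norm  # the determinant of the empty submatrix is 1
    idx = np.asarray(subset, dtype=int)
    sign, log_det = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return log_det - log_norm if sign > 0 else -np.inf

def dpp_log_likelihood(L, subsets):
    # Sum of log probabilities over the observed subsets.
    return sum(dpp_log_prob(L, Y) for Y in subsets)

# Hypothetical positive definite kernel on 4 items and a few observed subsets.
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
L = B @ B.T + 0.1 * np.eye(4)
data = [(0, 2), (1,), (0, 1, 3)]
print(dpp_log_likelihood(L, data))

# Sanity check: the probabilities of all 2^4 subsets sum to one.
total = sum(np.exp(dpp_log_prob(L, Y))
            for r in range(5)
            for Y in itertools.combinations(range(4), r))
print(total)
```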

Jennifer Gillenwater, Alex Kulesza, Emily Fox, Ben Taskar
Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation
We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy, but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2×, while keeping the accuracy within 1% of the original model.

Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus
Exponential Concentration of a Density Functional Estimator
We analyse a plug-in estimator for a large class of integral functionals of one or more continuous probability densities. This class includes important families of entropy, divergence, mutual information, and their conditional versions. For densities on the d-dimensional unit cube [0,1]^d that lie in a beta-Holder smoothness class, we prove our estimator converges at the rate O(n^(1/(beta+d))). Furthermore, we prove that the estimator obeys an exponential concentration inequality about its mean, whereas most previous related results have bounded only expected error of estimators. Finally, we apply our bounds to the important case of Renyi conditional mutual information.

Shashank Singh, Barnabas Poczos
Extracting Certainty from Uncertainty: Transductive Pairwise Classification from Pairwise Similarities
In this work, we study the problem of transductive pairwise classification from pairwise similarities (the pairwise similarities are usually derived from some side information instead of the underlying class labels). The goal of transductive pairwise classification from pairwise similarities is to infer the pairwise class relationships, to which we refer as pairwise labels, between all examples given a subset of class relationships for a small set of examples, to which we refer as labeled examples. We propose a very simple yet effective algorithm that consists of two simple steps: the first step is to complete the sub-matrix corresponding to the labeled examples and the second step is to reconstruct the label matrix from the completed sub-matrix and the provided similarity matrix. Our analysis shows that, under several mild preconditions, we can recover the label matrix with a small error if the top eigen-space that corresponds to the largest eigenvalues of the similarity matrix covers well the column space of the label matrix and is subject to a low coherence, and the number of observed pairwise labels is sufficiently large. We demonstrate the effectiveness of the proposed algorithm by several experiments.

Tianbao Yang
Extracting Latent Structure From Multiple Interacting Neural Populations
Developments in neural recording technology are rapidly enabling the recording of populations of neurons in multiple brain areas simultaneously, as well as the identification of the types of neurons being recorded (e.g., excitatory vs. inhibitory). There is a growing need for statistical methods to study the interaction among multiple, labeled populations of neurons. Rather than attempting to identify direct interactions between neurons (where the number of interactions grows with the number of neurons squared), we propose to extract a smaller number of latent variables from each population and study how the latent variables interact. Specifically, we propose extensions to probabilistic canonical correlation analysis (pCCA) to capture the temporal structure of the latent variables, as well as to distinguish within-population dynamics from across-population interactions (termed Group Latent Auto-Regressive Analysis, gLARA). We then applied these methods to populations of neurons recorded simultaneously in visual areas V1 and V2, and found that gLARA provides a better description of the recordings than pCCA. This work provides a foundation for studying how multiple populations of neurons interact and how this interaction supports brain function.

Joao Semedo, Amin Zandvakili, Adam Kohn, Christian Machens, Byron Yu
Extremal Mechanisms for Local Differential Privacy
Local differential privacy has been proposed as a strong measure of privacy under data collection scenarios, where individuals are willing to share their data but are concerned about revealing sensitive information. Both data providers and data collectors want to maximize the benefit of statistical inference performed on the data but at the same time need to protect the privacy of the participating individuals. We address a general problem of utility maximization under local differential privacy. Our main result is a characterization of the combinatorial structure of the optimal solution resulting in a family of extremal mechanisms we call staircase mechanisms. These data-dependent mechanisms add to the few basic privatization schemes known in the literature (e.g., the exponential mechanism and noise-adding mechanisms). Finally, we show that two simple staircase mechanisms (the binary mechanism and the randomized response) are optimal in the high and low privacy regimes, respectively. Our results also characterize lower bounds in differential privacy, providing some new lower bounds as well as recovering some known results. As a motivating example, we provide optimal mechanisms and show that the effective sample size reduces from $n$ to $\varepsilon^2 n$ under differential privacy in the context of hypothesis testing.
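
For readers unfamiliar with randomized response, the following is a minimal sketch of the classical binary version, assuming each individual holds a single bit and reports it truthfully with probability $e^\varepsilon / (1 + e^\varepsilon)$. The function names and the simulated data are hypothetical, and the mechanisms studied in the paper may be more general than this two-outcome case.

```python
import numpy as np

def truth_probability(epsilon):
    # Probability of reporting the true bit in binary randomized response.
    return np.exp(epsilon) / (1.0 + np.exp(epsilon))

def randomized_response(bits, epsilon, rng):
    # Keep each bit with probability p, flip it otherwise. The likelihood
    # ratio p / (1 - p) equals exp(epsilon), matching epsilon-local privacy.
    keep = rng.random(bits.shape) < truth_probability(epsilon)
    return np.where(keep, bits, 1 - bits)

def debiased_mean(reports, epsilon):
    # E[report] = (1 - p) + b * (2p - 1), so invert that affine map.
    p = truth_probability(epsilon)
    return (reports.mean() - (1.0 - p)) / (2.0 * p - 1.0)

rng = np.random.default_rng(0)
true_bits = (rng.random(200_000) < 0.3).astype(int)  # simulated private bits
reports = randomized_response(true_bits, epsilon=1.0, rng=rng)
estimate = debiased_mean(reports, epsilon=1.0)
print(true_bits.mean(), estimate)
```

The collector only ever sees `reports`; the debiasing step recovers the population mean at the cost of extra variance, which grows as `epsilon` shrinks.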

Peter Kairouz, Sewoong Oh, Pramod Viswanath
Extreme bandits
In many areas of medicine, security, and life sciences we want to allocate limited resources to different channels in order to detect extreme values. In this paper, we study an efficient way to allocate these resources sequentially under limited feedback. While sequential design of experiments is well studied in bandit theory, the most commonly optimized property is the regret with respect to the maximum mean reward. However, in other problems such as network intrusion detection, we are interested in detecting the most extreme value output by the sources. Therefore, in our work we study extreme regret which measures the efficiency of an algorithm compared to the oracle policy selecting the channel with the heaviest tail. We propose the ExtremeHunter algorithm, provide its analysis, and demonstrate its efficiency on synthetic and real-world experiments.

Alexandra Carpentier, Michal Valko
Factoring Variations in Natural Images with Deep Gaussian Mixture Models
Generative models can be seen as the Swiss army knives of machine learning, as many problems can be written probabilistically as a function of the distribution of the data, including prediction, reconstruction, imputation and simulation. One of the most promising directions for unsupervised learning may lie in Deep Learning methods, given their success in supervised learning. However, one of the current problems with deep unsupervised learning methods is that they are often harder to scale. As a result there are some easier, more scalable shallow methods, such as the Gaussian Mixture Model and the Student-t Mixture Model, that remain surprisingly competitive. In this paper we propose a new scalable deep generative model for images, called the Deep Gaussian Mixture Model, that is a straightforward but powerful generalization of GMMs to multiple layers. The parametrization of a Deep GMM allows it to efficiently capture products of variations in natural images. We propose a new EM-based algorithm that scales well to large datasets, and we show that both the Expectation and the Maximization steps can easily be distributed over multiple machines. In our density estimation experiments we show that deeper GMM architectures generalize better than more shallow ones, with results in the same ballpark as the state of the art.

Aaron Van den Oord, Benjamin Schrauwen
Fairness in Multi-Agent Sequential Decision-Making
We define a fairness solution criterion for multi-agent decision-making problems, where agents have local interests. This new criterion aims to maximize the worst performance of agents while taking the overall performance into account. We develop a simple linear programming approach and a more scalable game-theoretic approach for computing an optimal fairness policy. This game-theoretic approach formulates this fairness optimization as a two-player, zero-sum game and employs an iterative algorithm for finding a Nash equilibrium, corresponding to an optimal fairness policy. We scale up this approach by exploiting problem structure and value function approximation. Our experiments on resource allocation problems show that this fairness criterion provides a more favorable solution than the utilitarian criterion, and that our game-theoretic approach is significantly faster than linear programming.

Chongjie Zhang, Julie Shah
Fast Kernel Learning for Multidimensional Pattern Extrapolation
The ability to automatically discover patterns and perform extrapolation is an essential quality of intelligent systems. Kernel methods, such as Gaussian processes, have great potential for pattern extrapolation, since the kernel flexibly and interpretably controls the generalisation properties of these methods. However, automatically extrapolating large scale multidimensional patterns is in general difficult, and developing Gaussian process models for this purpose involves several challenges. A vast majority of kernels, and kernel learning methods, currently only succeed in smoothing and interpolation. This difficulty is compounded by the fact that Gaussian processes are typically only tractable for small datasets, and scaling an expressive kernel learning approach poses different challenges than scaling a standard Gaussian process model. One faces additional computational constraints, and the need to retain significant model structure for expressing the rich information available in a large dataset. In this paper, we propose a Gaussian process approach for large scale multidimensional pattern extrapolation. We recover sophisticated out of class kernels, perform texture extrapolation, inpainting, and video extrapolation, and long range forecasting of land surface temperatures, all on large multidimensional datasets, including a problem with 383,400 training points. The proposed method significantly outperforms alternative scalable and flexible Gaussian process methods, in speed and accuracy. Moreover, we show that a distinct combination of expressive kernels, a fully non-parametric representation, and scalable inference which exploits existing model structure, are critical for large scale multidimensional pattern extrapolation.

Andrew Wilson, Elad Gilboa, John Cunningham, Arye Nehorai
Fast Prediction for Large-Scale Kernel Machines
Kernel machines such as kernel SVM and kernel ridge regression usually construct high quality models; however, their use in real-world applications remains limited due to the high prediction cost. In this paper, we present two novel insights for improving the prediction efficiency of kernel machines. First, we show that by adding “pseudo landmark points” to the classical Nystrom kernel approximation in an elegant way, we can significantly reduce the prediction error without much additional prediction cost. Second, we provide a new theoretical analysis on bounding the error of the solution computed by using the Nystrom kernel approximation method, and show that the error is related to the weighted k-means objective function where the weights are given by the model computed from the original kernel. This theoretical insight suggests a new landmark point selection technique for the situation where we have knowledge of the original model. Based on these two insights, we provide a divide-and-conquer framework for improving the prediction speed. First, we divide the whole problem into smaller local sub-problems to reduce the problem size. In the second phase, we develop a kernel approximation based fast prediction approach within each subproblem. We apply our algorithm to real world large-scale classification and regression datasets, and show that the proposed algorithm is consistently and significantly better than other competitors. For example, on the Covertype classification problem, our algorithm achieves more than 10000 times speedup over the full kernel SVM, and a two-fold speedup over the state-of-the-art LDKL approach, in terms of prediction time, while obtaining much higher prediction accuracy (95.2% vs. 89.53%).
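
For context, the classical Nystrom approximation that the abstract builds on can be sketched as follows: with landmark columns C and the landmark-by-landmark block W, the kernel matrix is approximated by C W^+ C^T. This is a sketch of the classical method only, assuming an RBF kernel and evenly spaced landmarks; it does not include the pseudo landmark points or the k-means-based landmark selection, and all names and data are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Pairwise RBF kernel exp(-gamma * ||a - b||^2) between rows of A and B.
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_approximation(X, landmark_idx, gamma):
    # Classical Nystrom: K is approximated by C W^+ C^T.
    C = rbf_kernel(X, X[landmark_idx], gamma)  # n x m block of kernel columns
    W = C[landmark_idx]                        # m x m landmark block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))       # toy data
landmarks = np.arange(0, 30, 3)    # 10 evenly spaced landmark indices
K = rbf_kernel(X, X, gamma=0.5)
K_approx = nystrom_approximation(X, landmarks, gamma=0.5)
print(np.linalg.norm(K - K_approx) / np.linalg.norm(K))
```

Prediction then only needs kernel evaluations against the m landmarks instead of all n training points, which is where the prediction-time savings come from.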

Cho-Jui Hsieh, Si Si, Inderjit Dhillon
Fast Sampling-Based Inference in Balanced Neuronal Networks
Multiple lines of evidence support the notion that the brain performs probabilistic inference in multiple cognitive domains, including perception and decision making. There is also evidence that probabilistic inference may be implemented in the brain through the (quasi-)stochastic activity of neural circuits, producing samples from the appropriate posterior distributions. However, time becomes a fundamental bottleneck in such sampling-based probabilistic representations: the quality of inferences depends on how fast the neural circuit generates new, uncorrelated samples from its stationary distribution (the posterior). We explore this bottleneck in a simple, linear-Gaussian latent variable model, in which posterior sampling can be achieved by linear stochastic neural network dynamics. The well-known Langevin sampling (LS) recipe, so far the only sampling algorithm for continuous variables of which a neural implementation has been suggested, naturally fits into this dynamical framework. However, we first show analytically and through simulations that the symmetry of the synaptic weight matrix implied by LS yields critically slow mixing when the posterior is high-dimensional. Next, using methods from control theory, we construct and inspect networks that are optimally fast, and hence orders of magnitude faster than LS, non-symmetric, and in which neurons may even split into separate classes of excitatory and inhibitory cells (Dale's law). In these networks, strong – but transient – selective amplification of external noise sources generates the spatially correlated activity fluctuations prescribed by the posterior. Intriguingly, although a detailed balance of excitation and inhibition is dynamically maintained, detailed balance of Markov chain steps in the resulting sampler is violated, consistent with recent findings on how irreversibility in other domains can overcome the speed limitation of random walks.

Guillaume Hennequin, Laurence Aitchison, Mate Lengyel
Fast Training of Pose Detectors in the Fourier Domain
In many datasets, the samples are related by a known image transformation, such as rotation, or a repeatable non-rigid deformation. This applies to both datasets with the same objects under different viewpoints, and datasets augmented with virtual samples. Such structured datasets possess a high degree of redundancy, since geometrically-induced transformations should preserve intrinsic properties of the objects. Likewise, ensembles of classifiers related by a geometric transformation, which are useful for pose estimation, should also share many characteristics. We show that by assuming that this transformation is norm-preserving and cyclic, many redundancies can be eliminated in closed-form by training in the Fourier domain. With the same technique, we can also learn several pose classifiers related by a transformation simultaneously at no extra cost. Our experiments show that training a sliding-window object detector and pose estimator can be sped up by orders of magnitude, for transformations as diverse as planar rotation, the walking motion of pedestrians, and out-of-plane rotations of cars.

João F. Henriques, Pedro Martins, Rui Caseiro, Jorge Batista
Feature Cross-Substitution in Adversarial Classification
The success of machine learning, particularly in supervised settings, has led to numerous attempts to apply it in adversarial settings such as spam and malware detection. The core challenge in this class of applications is that adversaries are not static data generators, but make a deliberate effort to evade the classifiers deployed to detect them. We investigate both the problem of modeling the objectives of such adversaries, as well as the algorithmic problem of accounting for rational, objective-driven adversaries. In particular, we demonstrate severe shortcomings of feature reduction in adversarial settings using several natural adversarial objective functions, an observation that is particularly pronounced when the adversary is able to substitute across similar features (for example, replace words with synonyms or replace letters in words). We offer a simple heuristic method for making learning more robust to feature cross-substitution attacks. We then present a more general approach based on mixed-integer linear programming with constraint generation, which implicitly trades off overfitting and feature selection in an adversarial setting using a sparse regularizer along with an evasion model. Our approach is the first method for combining an adversarial classification algorithm with a very general class of models of adversarial classifier evasion. We show that our algorithmic approach significantly outperforms state-of-the-art alternatives.

Bo Li, Yevgeniy Vorobeychik
Feedback Detection for Live Predictors
A predictor that is deployed in a live production system may perturb the features it uses to make predictions. Such a feedback loop can occur, for example, when a model that predicts a certain type of behavior ends up causing the behavior it predicts, thus creating a self-fulfilling prophecy. In this paper we analyze statistical feedback detection as a causal inference problem, and introduce a local randomization scheme that can be used to detect non-linear feedback in real-world problems. We conduct a pilot study for our proposed methodology using a predictive system currently deployed as a part of a search engine.

Stefan Wager, Nick Chamandy, Omkar Muralidharan, Amir Najmi
Finding a sparse vector in a subspace: Linear sparsity using alternating directions
We consider the problem of recovering the sparsest vector in a subspace $\mathcal{S} \subseteq \mathbb{R}^p$ with $\dim(\mathcal{S}) = n$. This problem can be considered a homogeneous variant of the sparse recovery problem, and finds applications in sparse dictionary learning, sparse PCA, and other problems in signal processing and machine learning. Simple convex heuristics for this problem provably break down when the fraction of nonzero entries in the target sparse vector substantially exceeds $1/\sqrt{n}$. In contrast, we exhibit a relatively simple nonconvex approach based on alternating directions, which provably succeeds even when the fraction of nonzero entries is $\Omega(1)$. To our knowledge, this is the first practical algorithm to achieve this linear scaling. This result assumes a planted sparse model, in which the target sparse vector is embedded in an otherwise random subspace. Empirically, our proposed algorithm also succeeds in more challenging data models arising, e.g., from sparse dictionary learning.

Qing Qu, Ju Sun, John Wright
Flexible Transfer Learning under Support and Model Shift
Transfer learning algorithms are used when one has sufficient training data for one supervised learning task (the source/training domain) but only very limited training data for a second task (the target/test domain) that is similar but not identical to the first. Previous work on transfer learning has focused on relatively restricted settings, where specific parts of the model are considered to be carried over between tasks. Recent work on covariate shift focuses on matching the marginal distributions on observations $X$ across domains. Similarly, work on target/conditional shift focuses on matching marginal distributions on labels $Y$ and adjusting conditional distributions $P(X|Y)$, such that $P(X)$ can be matched across domains. However, covariate shift assumes that the support of test $P(X)$ is contained in the support of training $P(X)$, i.e., the training set is richer than the test set. Target/conditional shift makes a similar assumption for $P(Y)$. Moreover, not much work on transfer learning has considered the case when a few labels in the test domain are available. Also little work has been done when all marginal and conditional distributions are allowed to change while the changes are smooth. In this paper, we consider a general case where both the support and the model change across domains. We transform both $X$ and $Y$ by a location-scale shift to achieve transfer between tasks. Since we allow more flexible transformations, the proposed method yields better results on both synthetic data and real-world data.

Xuezhi Wang, Jeff Schneider
From Large-Scale Object Classifiers to Large-Scale Object Detectors: An Adaptation Approach
A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNN) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose a Deep Detection Adaptation (DDA) algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach.

Judy Hoffman, Sergio Guadarrama, Eric Tzeng, Jeff Donahue, Trevor Darrell, Kate Saenko, Ross Girshick
From MAP to Marginals: Variational Inference in Bayesian Submodular Models
Submodular optimization has found many applications in machine learning and beyond. We carry out the first systematic investigation of inference in probabilistic models defined through submodular functions, generalizing regular pairwise MRFs and Determinantal Point Processes. In particular, we present L-Field, a variational approach to general log-submodular and log-supermodular distributions based on sub- and supergradients. We obtain both lower and upper bounds on the log-partition function, which enables us to compute probability intervals for marginals, conditionals and marginal likelihoods. We also obtain fully factorized approximate posteriors, at the same computational cost as ordinary submodular optimization. Our framework results in convex problems for optimizing over differentials of submodular functions, which we show how to optimally solve. We provide theoretical guarantees of the approximation quality with respect to the curvature of the function. We further establish natural relations between our variational approach and the classical mean-field method. Lastly, we empirically demonstrate the accuracy of our inference scheme on several submodular models.

Josip Djolonga, Andreas Krause
Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation
Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits) and more. However, we currently have little understanding of how such information constraints fundamentally affect our performance, independent of the learning problem semantics. For example, are there learning problems where \emph{any} algorithm which has a small memory footprint (or can use any bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this paper, we describe how a single set of results implies positive answers to the above, for several different settings.

Ohad Shamir
Gaussian Process Volatility Model
The prediction of time-changing variances is an important task in the modeling of financial data. Standard econometric models are often limited as they assume rigid functional relationships for the evolution of the variance. Moreover, functional parameters are usually learned by maximum likelihood, which can lead to overfitting. To address these problems we introduce GP-Vol, a novel non-parametric model for time-changing variances based on Gaussian Processes. This new model can capture highly flexible functional relationships for the variances. Furthermore, we introduce a new online algorithm for fast inference in GP-Vol. This method is much faster than current offline inference procedures and it avoids overfitting problems by following a fully Bayesian approach. Experiments with financial data show that GP-Vol performs significantly better than current standard alternatives.

Yue Wu, José Miguel Hernández-Lobato, Zoubin Ghahramani
General Stochastic Networks for Classification
We extend general stochastic networks (GSNs) to supervised learning of representations. In particular, we introduce a hybrid training objective considering a generative and discriminative cost function governed by a trade-off parameter λ. We use a new variant of network training involving noise injection, i.e. walkback training, to jointly optimize multiple network layers. Neither additional regularization constraints, such as l1, l2 norms or dropout variants, nor pooling- or convolutional layers were added. Nevertheless, we are able to obtain state-of-the-art performance on the MNIST dataset, without using permutation invariant digits and outperform baseline models on sub-variants of the MNIST and rectangles dataset significantly.

Matthias Zöhrer, Franz Pernkopf
General Table Completion using a Bayesian Nonparametric Model
Even though heterogeneous databases can be found in a broad variety of applications, there exists a lack of tools for estimating missing data in such databases. In this paper, we provide an efficient and robust table completion tool, based on a Bayesian nonparametric latent feature model. In particular, we propose a general observation model for the Indian buffet process (IBP) adapted to mixed continuous (real-valued and positive real-valued) and discrete (categorical, ordinal and count) observations. Then, we propose an inference algorithm that scales linearly with the number of observations. Finally, our experiments over five real databases show that the proposed approach provides more robust and accurate estimates than the standard IBP and the Bayesian probabilistic matrix factorization with Gaussian observations.

Isabel Valera, Zoubin Ghahramani
Generalized Dantzig Selector: Application to the k-support norm
We propose a Generalized Dantzig Selector (GDS) for linear models, in which any norm encoding the parameter structure can be leveraged for estimation. We investigate both computational and statistical aspects of the GDS. Based on the conjugate proximal operator, a flexible inexact ADMM framework is designed for solving the GDS, and non-asymptotic high-probability bounds are established on the estimation error, which rely on the Gaussian widths of the unit norm ball and of a suitable set encompassing the estimation error. Further, we consider a non-trivial example of the GDS using the $k$-support norm. We derive an efficient method to compute the proximal operator for the $k$-support norm since existing methods are inapplicable in this setting. For statistical analysis, we provide upper bounds for the Gaussian widths needed in the GDS analysis, yielding the first statistical recovery guarantee for estimation with the $k$-support norm. The experimental results confirm our theoretical analysis.

Soumyadeep Chatterjee, Sheng Chen, Arindam Banerjee
Generalized Higher-Order Orthogonal Iteration for Tensor Decomposition and Completion
Low-rank tensor estimation has been frequently applied in many real-world problems. Despite successful applications, existing Schatten 1-norm minimization (SNM) methods may become very slow or even inapplicable for large-scale problems. To address this difficulty, we propose an efficient and scalable tensor Schatten 1-norm minimization method for simultaneous tensor decomposition and completion, with a much lower computational complexity. We first establish the equivalence between the Schatten 1-norm of a low-rank tensor and that of its core tensor. Then the Schatten 1-norm of the core tensor is used to replace that of the whole tensor, which leads to a much smaller-scale matrix SNM problem. Finally, an efficient algorithm with a rank-increasing scheme is developed to solve the proposed problem with a convergence guarantee. Extensive experimental results show that our method is usually more accurate than the state-of-the-art methods, and is orders of magnitude faster.

Yuanyuan Liu, Fanhua Shang, Wei Fan, James Cheng, Hong Cheng
Generalized Unsupervised Manifold Alignment
In this paper, we propose a Generalized Unsupervised Manifold Alignment (GUMA) method to build the connections between different but correlated datasets without any known correspondences. Based on the assumption that datasets of the same theme usually have similar manifold structures, GUMA is formulated as an explicit integer optimization problem considering the structure matching and preserving criteria, as well as the feature comparability of the corresponding points in the mutual embedding space. The main benefits of this model include: (1) fully unsupervised matching without any pre-specified correspondences; (2) simultaneous discovery and alignment of manifold structures; (3) an efficient iterative projection algorithm for finding the counterparts without computations in all permutation cases. Experimental results on dataset matching and real-world applications demonstrate the effectiveness and the practicability of our manifold alignment method.

Zhen Cui, Hong Chang, Shiguang Shan, Xilin Chen
Generative Adversarial Nets
We propose a new framework for estimating generative models via adversarial nets, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
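The minimax game described in the Generative Adversarial Nets abstract above is commonly written as the value function below. The abstract itself gives no formula, so the symbols $V$, $p_{\text{data}}$, $p_z$ and $p_g$ are notation assumed here for illustration.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% For a fixed G with induced sample distribution p_g, the maximizing D is
D^{*}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}
```

When $p_g = p_{\text{data}}$ the second expression reduces to $D^{*}(x) = 1/2$, which matches the abstract's statement that D equals 1/2 everywhere at the solution.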
Gibbs-type Indian Buffet Processes
We investigate a class of feature allocation models that generalize the Indian buffet process and are parameterized by Gibbs-type random measures. Two existing classes are contained as special cases: the original two-parameter Indian buffet process (corresponding to the Dirichlet process) and the stable-beta (aka three-parameter) Indian buffet process (corresponding to the Pitman--Yor process). Asymptotic behavior of the Gibbs-type partitions, such as power-laws holding for the number of latent clusters, translates into power-law behavior for this class of Gibbs-type feature allocation models. We derive a number of properties of these models, discuss inference in this expressive modeling class, and elucidate the subtle differences between this super-class and its better known subclasses.

Creighton Heaukulani, Daniel Roy
Global Sensitivity Analysis for MAP Inference in Graphical Models
We study the sensitivity of a MAP configuration of a discrete probabilistic graphical model with respect to perturbations of its parameters. These perturbations are global, in the sense that simultaneous perturbations of all the parameters (or any chosen subset of them) are allowed. Our main contribution is an exact algorithm that can check whether the MAP configuration is robust with respect to given perturbations. Its complexity is essentially the same as that of obtaining the MAP configuration itself, so it can be promptly used with minimal effort. We use our algorithm to identify the largest global perturbation that does not induce a change in the MAP configuration, and we successfully apply this robustness measure in two practical scenarios: the prediction of facial action units with posed images and the classification of multiple real public data sets. A strong correlation between the proposed robustness measure and accuracy is verified in both scenarios.

Jasper De Bock, Cassio de Campos, Alessandro Antonucci
Graph Clustering With Missing Data: Convex Algorithms and Analysis
We consider the problem of finding clusters in an unweighted graph, when the graph is partially observed. We analyze two programs based on the convex optimization approach for low-rank matrix recovery using nuclear norm minimization: one that works for dense graphs, and one that works for both sparse and dense graphs but requires some a priori knowledge of the total cluster size. For the commonly used Stochastic Block Model, we obtain \emph{explicit} bounds on the parameters of the problem (size and sparsity of clusters, the amount of observed data) and the regularization parameter that characterize the success and failure of the programs. We corroborate our theoretical findings through extensive simulations. We also run our algorithm on a real data set obtained from crowdsourcing an image classification task on Amazon Mechanical Turk, and observe significant performance improvement over traditional methods such as k-means.

Ramya Korlakai Vinayak, Samet Oymak, Babak Hassibi
Graphical Models for Recovering Probabilistic and Causal Queries from Missing Data
We address the problem of deciding whether a causal or probabilistic query is estimable from data corrupted by missing entries, given a model of the missingness process. We extend the results of Mohan et al. (2013) by presenting more general conditions for recovering probabilistic queries of the form P(y|x) and P(y,x) as well as causal queries of the form P(y|do(x)). We show that causal queries may be recoverable even when the factors in their identifying estimands are not recoverable. Specifically, we derive graphical conditions for recovering causal effects of the form P(y|do(x)) when Y and its missingness mechanism are not d-separable. Finally, we apply our results to problems of attrition and characterize the recovery of causal effects from data corrupted by attrition.

Karthika Mohan, Judea Pearl
Greedy Subspace Clustering
We consider the problem of subspace clustering: given points that lie on or near the {\em union} of many low-dimensional linear subspaces, recover the subspaces. To this end, one first identifies sets of points close to the same subspace and uses the sets to estimate the subspaces. We provide new simple and efficient algorithms for this problem. Our theoretical results show that our algorithms are guaranteed to achieve exact (perfect) clustering performance, under certain conditions that are weaker than those considered in the standard statistical literature. Simulation results show that our algorithms outperform existing algorithms on both synthetic and real data.

Dohyung Park, Constantine Caramanis, Sujay Sanghavi
Grouping-Based Low-Rank Video Completion and 3D Reconstruction
Extracting the 3D shape of deforming objects in monocular videos, a task known as non-rigid structure-from-motion (NRSfM), has so far been studied only on synthetic datasets and controlled environments, in which objects are pre-segmented. Typically, full-length temporal correspondences are assumed, or the objects exhibit limited rotations and occlusions. In order to integrate NRSfM into current video analysis pipelines, one needs to consider as input realistic, and thus incomplete, tracking, and perform spatio-temporal grouping to segment the objects from their surroundings. Furthermore, NRSfM needs to be robust to noise in both segmentation and tracking, e.g., drifting, segmentation "leaking", optical flow "bleeding", etc. In this paper, we make a first attempt towards this goal, and propose a method that combines dense flow tracking, motion trajectory clustering and NRSfM for 3D reconstruction of objects in videos. For each trajectory cluster, we compute multiple reconstruction hypotheses by minimizing the reprojection error and the rank of the 3D shape under different rank bounds for the trajectory matrix. We show that dense trajectories are completed across occlusions and low textured regions, and 3D shape is extracted even under mild relative motion between the object and the camera, in contrast to approaches based on sparse corner trajectories. Camera rotations are recovered by considering only the rank 3 component of the trajectory matrix, as opposed to $3K$ in existing work, bypassing the problem of non-rigid Euclidean upgrade. We achieve competitive results on a public NRSfM benchmark while using fixed parameters across all the sequences and handling incomplete trajectories, in contrast to existing algorithms. We further test our method on popular video segmentation datasets. To the best of our knowledge, our method is the first to extract dense object models from realistic video datasets, such as Hollywood movies or YouTube, without object-specific priors.

Aikaterini Fragkiadaki, Marta Salas, Pablo Arbelaez, Jitendra Malik
Hardness of parameter estimation in graphical models
We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see the monograph by Wainwright and Jordan) but no proof was known. Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).

Guy Bresler, David Gamarnik, Devavrat Shah
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, neural network theory, and empirical evidence, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep or recurrent neural network training, and provide numerical evidence for its superior optimization performance.

Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Cho KyungHyun, Surya Ganguli, Yoshua Bengio
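The idea behind the saddle-free Newton method in the abstract above can be illustrated on a toy problem. The sketch below is a simplified assumption, not the authors' implementation: it rescales the gradient by the inverse of $|H|$, the Hessian with each eigenvalue replaced by its absolute value, using a full eigendecomposition that would be impractical at neural-network scale. The function name `saddle_free_newton_step`, the `damping` parameter, and the test function $f(x, y) = x^2 - y^2$ are all chosen here for illustration.

```python
import numpy as np

def saddle_free_newton_step(grad, hess, damping=1e-8):
    """Return the step -|H|^{-1} g, where |H| takes absolute eigenvalues of H."""
    eigvals, eigvecs = np.linalg.eigh(hess)      # H = V diag(eigvals) V^T
    abs_eigvals = np.abs(eigvals) + damping      # flip negative curvature, avoid /0
    return -eigvecs @ ((eigvecs.T @ grad) / abs_eigvals)

# Toy saddle (hypothetical example): f(x, y) = x^2 - y^2, saddle point at the origin.
def grad_f(p):
    return np.array([2.0 * p[0], -2.0 * p[1]])

def hess_f(p):
    return np.array([[2.0, 0.0], [0.0, -2.0]])

p = np.array([1.0, 0.5])

# Plain Newton jumps straight to the saddle point.
newton_point = p - np.linalg.solve(hess_f(p), grad_f(p))

# The saddle-free step descends along x and moves away from the saddle along y.
sfn_point = p + saddle_free_newton_step(grad_f(p), hess_f(p))

print("Newton:", newton_point)
print("Saddle-free Newton:", sfn_point)
```

On this example the plain Newton update lands on the saddle, while the rescaled update keeps the descent direction along the negative-curvature axis, which is the behaviour the abstract contrasts with gradient descent and quasi-Newton methods.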
Improved Distributed Principal Component Analysis
We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve problems such as $k$-means clustering and low rank approximation. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for $k$-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of these techniques we develop, such as input-sparsity subspace embeddings with high correctness probability with a dimension and sparsity independent of the error probability, may be of independent interest.

Yingyu Liang, Maria-Florina Balcan, Vandana Kanchanapally, David Woodruff
Improved Multimodal Deep Learning with Variation of Information
In multimodal representation learning, it is important to capture high-level associations between multiple data modalities with a compact set of latent variables. Deep learning has been successfully applied to this problem, with a common strategy being to learn joint representations that are shared between multiple modalities at the higher layer after learning several layers of modality-specific features in the lower layers. Nonetheless, there still remains the important question of how to learn a good association between multiple data modalities, in particular, to reason about or predict the missing data modalities effectively at test time. In this paper, we propose a novel multimodal representation learning objective that explicitly aims at this goal. Specifically, instead of maximum likelihood learning, we train the networks to minimize the variation of information, an information theoretic measure that computes the information distance between data modalities. We describe our method based on restricted Boltzmann machines and propose learning algorithms based on contrastive divergence and multi-prediction training. Furthermore, we propose an extension to deep networks, which we refer to as multimodal deep recurrent neural networks. In experiments, we demonstrate state-of-the-art visual-textual and visual recognition performance on the MIR-Flickr and PASCAL VOC 2007 databases.

Kihyuk Sohn, Honglak Lee
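The variation of information named in the abstract above has a standard information-theoretic definition, $VI(X; Y) = H(X \mid Y) + H(Y \mid X)$. The sketch below computes it for a small discrete joint distribution. It is only an illustration of the quantity, not the training objective used in the paper, and the function name `variation_of_information` and the example tables are assumptions made here.

```python
import numpy as np

def variation_of_information(joint):
    """VI(X;Y) = H(X|Y) + H(Y|X) in nats, for a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                   # normalize, in case counts were given
    px = joint.sum(axis=1, keepdims=True)         # marginal p(x), shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)         # marginal p(y), shape (1, ny)
    mask = joint > 0                              # skip zero cells: 0 * log 0 = 0
    pxy = joint[mask]
    px_b = np.broadcast_to(px, joint.shape)[mask]
    py_b = np.broadcast_to(py, joint.shape)[mask]
    h_x_given_y = -np.sum(pxy * np.log(pxy / py_b))
    h_y_given_x = -np.sum(pxy * np.log(pxy / px_b))
    return h_x_given_y + h_y_given_x

# Two modalities that determine each other: VI is zero.
vi_identical = variation_of_information([[0.5, 0.0], [0.0, 0.5]])

# Two independent fair coins: VI equals H(X) + H(Y) = 2 log 2.
vi_independent = variation_of_information([[0.25, 0.25], [0.25, 0.25]])

print(vi_identical, vi_independent)
```

A small VI means each modality is nearly predictable from the other, which is why minimizing it suits the missing-modality prediction setting the abstract describes.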
Incremental Clustering: The Case for Extra Clusters
The explosion in the amount of data available for analysis often necessitates a transition from batch to incremental clustering methods, which process one element at a time and typically store only a small subset of the data. In this paper, we initiate the formal analysis of incremental clustering methods focusing on the types of cluster structure that they are able to detect. We find that the incremental setting is strictly weaker than the batch model, proving that a fundamental class of cluster structures that can readily be detected in the batch setting is impossible to identify using any incremental method. Furthermore, we show how the limitations of incremental clustering can be overcome by allowing additional clusters.

Margareta Ackerman, Sanjoy Dasgupta
Incremental Local Gaussian Regression
Locally weighted regression (LWR) was created as a nonparametric method that can approximate a wide range of functions, is computationally efficient, and can learn continually from very large amounts of incrementally collected data. As an interesting feature, LWR can regress on non-stationary functions, a beneficial property, for instance, in control problems. However, it does not provide a proper generative model for function values, and existing algorithms have a variety of manual tuning parameters that strongly influence bias, variance and learning speed of the results. Gaussian (process) regression, on the other hand, does provide a generative model with rather black-box automatic parameter tuning, but it has higher computational cost, especially for big data sets and if a non-stationary model is required. In this paper, we suggest a path from Gaussian (process) regression to locally weighted regression, where we retain the best of both approaches. Using a localizing function basis and approximate inference techniques, we build a Gaussian (process) regression algorithm of increasingly local nature and similar computational complexity to LWR. Empirical evaluations are performed on several synthetic and real robot datasets of increasing complexity and (big) data scale, and demonstrate that we consistently achieve on par or superior performance compared to current state-of-the-art methods while retaining a principled approach to fast incremental regression with minimal manual tuning parameters.

Franziska Meier, Philipp Hennig, Stefan Schaal
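For readers unfamiliar with the baseline discussed in the abstract above, classical locally weighted regression fits a separate weighted linear model around each query point. The sketch below shows that textbook form only, not the incremental Gaussian-regression algorithm the paper proposes. The Gaussian weighting kernel, the `bandwidth` value (one of the manual tuning parameters the abstract mentions), and the function name `lwr_predict` are assumptions made for illustration.

```python
import numpy as np

def lwr_predict(x_query, X, y, bandwidth=0.5):
    """Predict y at x_query with a locally weighted linear fit (Gaussian weights)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    xq = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Weight each training point by its distance to the query.
    sq_dist = np.sum((X - xq) ** 2, axis=1)
    w = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    # Weighted least squares on [1, x] via a sqrt-weight rescaling.
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return float(np.concatenate([[1.0], xq]) @ beta)

# Usage on synthetic data (hypothetical example): an exactly linear target.
X_train = np.linspace(0.0, 1.0, 11)
y_train = 2.0 * X_train + 1.0
pred = lwr_predict(0.3, X_train, y_train)
print(pred)
```

Because a new weighted fit is solved per query, the local models adapt to non-stationary functions, but the result depends on the hand-chosen bandwidth.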
Inferring sparse representations of continuous signals with continuous orthogonal matching pursuit
Many signals, such as spikes recorded simultaneously from multiple neurons, may be represented as the sparse sum of translated and scaled copies of some known waveforms whose timing and amplitudes are of interest. From the aggregate signal one may seek to sort out the identities, amplitudes, and translations of the waveforms of which the signal is composed. We present a fast method for recovering these identities, amplitudes, and translations. The method proceeds first by greedily identifying rough estimates of the component waveforms and then refining the estimates, moving iteratively between these steps in a process analogous to the well known Orthogonal Matching Pursuit algorithm. We also draw on Continuous Basis Pursuit (CBP), which we extend in several ways: by selecting a subspace that optimally captures translated copies of the waveforms, replacing the convex optimization problem with a greedy framework, and moving to the Fourier domain to more precisely estimate time shifts. We test the resulting method, which we call Continuous Orthogonal Matching Pursuit, on simulated data, where it shows gains over CBP in both speed and accuracy.

Karin Knudson, Jacob Yates, Alexander Huk, Jonathan Pillow
Inferring synaptic conductances from spike trains with a biophysically inspired point process model
A popular approach to neural characterization describes neural responses in terms of a cascade of linear and nonlinear stages: a linear filter to describe stimulus integration, followed by a nonlinear function to convert the filter output to spike rate. However, real neurons respond to stimuli in a manner that depends on the nonlinear integration of excitatory and inhibitory synaptic inputs. Here we introduce a biophysically inspired point process model that explicitly incorporates stimulus-induced changes in synaptic conductance in a dynamical model of neuronal membrane potential. Our work makes two important contributions. First, on a theoretical level, it offers a novel interpretation of the popular generalized linear model (GLM) for neural spike trains. We show that the classic GLM is a special case of our conductance-based model in which the stimulus linearly modulates excitatory and inhibitory conductances in an equal and opposite “push-pull” fashion. Our model can therefore be viewed as a direct extension of the GLM in which we relax these constraints; the resulting model can exhibit shunting as well as hyperpolarizing inhibition, and time-varying changes in both gain and membrane time constant. Second, on a practical level, we show that our model provides a tractable model of spike responses in early sensory neurons that is both more accurate and more interpretable than the GLM. Most importantly, we show that we can accurately infer intracellular synaptic conductances from extracellularly recorded spike trains. We validate these estimates using direct intracellular measurements of excitatory and inhibitory conductances in parasol retinal ganglion cells. We show that the model fit to extracellular spike trains can predict excitatory and inhibitory conductances elicited by novel stimuli with nearly the same accuracy as a model trained directly with intracellular conductances.

Kenneth Latimer, E. J. Chichilnisky, Fred Rieke, Jonathan Pillow
Information-based learning by agents in unbounded state spaces
The idea that animals might use information-driven planning to explore an unknown environment and build an internal model of it has been proposed for quite some time. Recent work has demonstrated that agents using this principle can efficiently learn models of probabilistic environments with discrete, bounded state spaces. However, animals and robots are commonly confronted with unbounded environments. To address this more challenging situation, we study information-based learning strategies of agents in unbounded state spaces using non-parametric Bayesian models. Specifically, we demonstrate that the Chinese Restaurant Process model is able to solve this problem and that an Empirical Bayes version is able to efficiently explore bounded and unbounded worlds by relying on little prior information.

Shariq Mobin, James Arnemann, Fritz Sommer
Iterative Neural Autoregressive Distribution Estimator NADE-k
Training of the neural autoregressive density estimator (NADE) can be viewed as doing one step of probabilistic inference on missing values in data. We propose a new model that extends this inference scheme to multiple steps, arguing that it is easier to learn to improve a reconstruction in $k$ steps rather than to learn to reconstruct in a single inference step. The proposed model is an unsupervised building block for deep learning that combines the desirable properties of NADE and multi-predictive training: (1) Its test likelihood can be computed analytically, (2) it is easy to generate independent samples from it, and (3) it uses an inference engine that is a superset of variational inference for Boltzmann machines. The proposed NADE-k has state-of-the-art performance in density estimation on the two datasets tested.

Tapani Raiko, Yao Li, Cho KyungHyun, Yoshua Bengio
Joint Task Learning via Deep Neural Networks with Application to Generic Object Extraction
This paper investigates how to extract objects-of-interest without relying on hand-crafted features or sliding-window approaches, with the aim of jointly solving two sub-tasks: (i) rapidly localizing salient objects in images, and (ii) accurately segmenting the objects based on the localizations. We present a general joint task learning framework, in which each task (either object localization or object segmentation) is tackled via a multi-layer convolutional neural network, and the two networks work collaboratively to boost performance. In particular, we propose to incorporate latent variables bridging the two networks in a joint optimization manner. The first network directly predicts the positions and scales of salient objects from raw images, and the latent variables adjust the object localizations to feed the second network that produces pixelwise object masks. An EM-type method is then studied for the joint optimization, iterating between two steps: (i) using the two networks, it estimates the latent variables via an MCMC-based sampling method; (ii) it optimizes the parameters of the two networks jointly by back propagation, with the latent variables fixed. Extensive experiments demonstrate that our joint learning framework significantly outperforms state-of-the-art methods in both accuracy and efficiency (e.g., 1000 times faster than competing approaches).

Xiaolong Wang, Liliang Zhang, Zhujin Liang, Liang Lin
Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.

Jonathan Tompson, Arjun Jain, Yann LeCun, Christoph Bregler
Just-In-Time Learning for Fast and Flexible Inference
Much of the research in machine learning has centered around the search for inference algorithms that are both general-purpose and efficient. The problem is extremely challenging and general inference remains computationally expensive. We seek to address this problem by observing that in most specific applications of a model, we typically only need to perform a small subset of all possible inference computations. Motivated by this, we introduce just-in-time learning, a framework for fast and flexible inference that learns to speed up inference at run-time. Through a series of experiments, we show how this framework can allow us to combine the flexibility of sampling with the efficiency of deterministic message-passing.

S. M. Ali Eslami, Daniel Tarlow, Pushmeet Kohli, John Winn
Kernel Mean Estimation via Spectral Filtering
This paper leverages spectral filtering to construct shrinkage estimators of a mean element, known as a kernel mean, in a reproducing kernel Hilbert space (RKHS). We show that in theory there exists a wide class of shrinkage strategies that improve upon the standard empirical estimator. We adopt the spectral filtering approach as one such strategy. The proposed estimators allow us to incorporate meaningful information about the RKHS when estimating the kernel mean in a flexible and efficient manner. Moreover, based on the RKHS-valued regression perspective of the proposed estimators, our theoretical analysis also reveals a fundamental connection to the supervised learning framework. Our estimators are simple to implement and can outperform existing ones in terms of both quality and computational complexity.

Krikamol Muandet, Bharath Sriperumbudur, Bernhard Schoelkopf
Large Scale Canonical Correlation Analysis with Iterative Least Squares
Canonical Correlation Analysis (CCA) is a widely used statistical tool with both well established theory and favorable performance for a wide range of machine learning problems. However, computing CCA for huge datasets can be very slow since it involves implementing QR decomposition or singular value decomposition of huge matrices. In this paper we introduce L-CCA, an iterative algorithm which can compute CCA fast on huge sparse datasets. Theory on both the asymptotic convergence and finite time accuracy of L-CCA is established. The experiments also show that L-CCA outperforms other fast CCA approximation schemes on two real datasets.

Yichao Lu, Dean Foster
Large-Margin Convex Polytope Machine
We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive datasets, and augment it with a heuristic procedure to avoid sub-optimal local minima. Our experimental evaluations of the CPM on large-scale datasets from distinct domains (MNIST handwritten digit recognition, text topic, and web security) demonstrate that the CPM trains models faster, sometimes several orders of magnitude, than state-of-the-art similar approaches and kernel-SVM methods while achieving comparable or better classification performance. Our empirical results suggest that, unlike prior similar approaches, we do not need to control the number of sub-classifiers (sides of the polytope) to avoid overfitting.

Alex Kantchelian, Michael Tschantz, Ling Huang, Peter Bartlett, Anthony Joseph, J. D. Tygar
Latent Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification
We present the Latent Case Model (LCM), a general framework for Bayesian case-based reasoning (CBR) and prototype classification and clustering. LCM brings the intuitive power of CBR to a Bayesian generative framework. The LCM learns prototypes, the "quintessential" observations that best represent clusters in a dataset, by performing joint inference on cluster labels, prototypes and important features. Simultaneously, LCM pursues sparsity by learning subspaces, the sets of features that play important roles in the characterization of the prototypes. The prototype and subspace representation provides quantitative benefits in interpretability while preserving classification accuracy. Human subject experiments verify statistically significant improvements to participants' understanding when using explanations produced by LCM, compared to those given by prior art.

Been Kim, Cynthia Rudin, Julie Shah
Latent Support Measure Machines for Bag-of-Words Data Classification
In many classification problems, the input is represented as a set of features, e.g. the bag-of-words (BoW) representation of documents. Support vector machines (SVMs) are widely used tools for such classification problems. The performance of SVMs is generally determined by whether kernel values between data points can be defined properly. However, SVMs for the BoW representation have a major weakness in that the co-occurrence of different but semantically similar words cannot be reflected in the kernel calculation. To overcome this weakness, we propose a kernel-based discriminative classifier for BoW data, which we call the latent support measure machine (latent SMM). With the latent SMM, a latent vector is associated with each vocabulary term, and each document is represented as a distribution of the latent vectors for words appearing in the document. To represent the distributions efficiently, we use the framework of kernel embeddings that holds high order moment information of distributions. Then the latent SMM finds a separating hyperplane that maximizes margins between distributions of different classes while estimating latent vectors for words so as to improve the classification performance. In the experiments, we show that the latent SMM achieves state-of-the-art accuracy on BoW text classification and is robust to its own hyper-parameters.

Yuya Yoshikawa, Tomoharu Iwata, Hiroshi Sawada
Learning Chordal Markov Networks by Dynamic Programming
We present an algorithm for finding a chordal Markov network that maximizes any given decomposable scoring function. The algorithm is based on a recursive characterization of clique trees, and it runs in $O(4^n)$ time for $n$ vertices. On an eight-vertex benchmark instance, our implementation turns out to be about ten million times faster than a recently proposed, constraint satisfaction based algorithm (Corander et al., NIPS 2013). Within a few hours, it is able to solve instances up to $18$ vertices, and beyond if we restrict the maximum clique size. We also study the performance of a recent integer linear programming algorithm (Bartlett and Cussens, UAI~2013). Our results suggest that, unless we bound the clique sizes, currently only the dynamic programming algorithm is guaranteed to solve instances with around $15$ or more vertices.

Kustaa Kangas, Mikko Koivisto, Teppo Niinimäki
Learning From Weakly Supervised Data by The Expectation Loss SVM (e-SVM) algorithm
In many situations we have some measurement of confidence on "positiveness" in a binary label. The "positiveness" is a continuous value whose range is a bounded interval. It quantifies the affiliation of each training sample to the completely positive samples. We propose a novel learning algorithm called expectation loss SVM (e-SVM) that can deal well with the situation where only the "positiveness" instead of a binary label of each training sample is available. Our e-SVM algorithm can also be easily extended to learn segment classifiers under weak supervision where the exact value of the positiveness for each sample is unobserved. In the experiments, we show that the e-SVM algorithm can handle the segment proposal classification task well under both strong supervision (e.g. the pixel level annotations are available) and weak supervision (e.g. only bounding box annotations are available), and outperforms the alternative approaches. We further validate this method on two major tasks of computer vision: semantic segmentation and object detection. Our method achieves state-of-the-art performance in both tasks.

Jun Zhu, Junhua Mao, Alan Yuille
Learning Mixed Multinomial Logit Model from Ordinal Data
Motivated by generating personalized recommendations using ordinal (or preference) data, we study the question of learning a mixture of MultiNomial Logit (MNL) models, a parameterized class of distributions over permutations, from partial ordinal or preference data (e.g. pair-wise comparisons). Despite its long standing importance across disciplines including social choice, operations research and revenue management, little is known about this question. In the case of a single MNL model (no mixture), computationally and statistically tractable learning from pair-wise comparisons is feasible. However, even learning a mixture of two MNL models is infeasible in general. Given this state of affairs, we seek conditions under which it is feasible to learn the mixture model in a manner that is both computationally and statistically efficient. To that end, we present a sufficient condition as well as an efficient algorithm for learning mixed MNL models from partial preferences/comparisons data. In particular, a mixture of $r$ MNL components over $n$ objects can be learnt using samples whose size scales polynomially in $n$ and $r$ (concretely, $n^3 r^{3.5} \log^4 n$, with $r \ll n^{2/7}$ when the model parameters are sufficiently {\em incoherent}). The algorithm has two phases: first, learn the pair-wise marginals for each component using tensor decomposition; second, learn the model parameters for each component using RankCentrality introduced by Negahban et al. In the process of proving these results, we obtain a generalization of existing analysis for tensor decomposition to a more realistic regime where only partial information about each sample is available.

Sewoong Oh, Devavrat Shah
Learning Mixtures of Submodular Functions for Image Collection Summarization
We address the problem of image collection summarization by learning mixtures of submodular functions. We argue that submodularity is very natural to this problem, and we show that a number of previously used scoring functions are submodular, a property not explicitly mentioned in these publications. We provide classes of submodular functions capturing the necessary properties of summaries, namely coverage, likelihood, and diversity. To learn mixtures of these submodular functions as scoring functions, we formulate summarization as a supervised learning problem using large-margin structured prediction. Furthermore, we introduce a novel evaluation metric, which we call V-ROUGE, for automatic summary scoring. While a similar metric called ROUGE has been successfully applied to document summarization [14], no such metric was known for quantifying the quality of image collection summaries. We provide a new dataset consisting of 14 real-world image collections along with many human-generated ground truth summaries collected using Mechanical Turk. We also extensively compare our method with previously explored methods for this problem and show that our learning approach outperforms all competitors on this new dataset. This paper provides, to our knowledge, the first systematic approach for quantifying the problem of image collection summarization, along with a new dataset of image collections and human summaries.

Sebastian Tschiatschek, Rishabh Iyer, Haochen Wei, Jeffrey Bilmes
Learning Multiple Tasks in Parallel with a Shared Annotator
We introduce a new multi-task framework, in which $K$ online learners share a single annotator with limited bandwidth. On each round, each of the $K$ learners receives an input and makes a prediction about the label of that input. Then, a shared (stochastic) mechanism decides which of the $K$ inputs will be annotated. The learner that receives the feedback (label) may update its prediction rule, and we proceed to the next round. We develop an online algorithm for multi-task binary classification that learns in this setting, and bound its performance in the worst-case setting. Additionally, we show that our algorithm can be used to solve two bandit problems: contextual bandits, and dueling bandits with context, both allowed to decouple exploration and exploitation. An empirical study with OCR data, vowel prediction (VJ project) and document classification shows that our algorithm outperforms other algorithms, one of which uses uniform allocation, and essentially achieves more (accuracy) for the same labour of the annotator.

Haim Cohen, Koby Crammer
Learning Shuffle Ideals Under Restricted Distributions
The class of shuffle ideals is a fundamental sub-family of regular languages. The shuffle ideal generated by a restricted string set $U$ is the collection of all strings containing some $u \in U$ as a (not necessarily contiguous) subsequence. Angluin et al. have shown learning a shuffle ideal is a computationally infeasible task under general distributions, but in a positive direction principal shuffle ideals, a special subclass where $U$ contains only one string, are PAC learnable under the uniform distribution. In this paper, we study the general class of shuffle ideals. We present positive results on the PAC learnability of shuffle ideals under some popular string distributions, including position-wise independent and identical distributions, Markovian distributions and product distributions. A discussion on learning shuffle ideals under general distributions using heuristic methods is also provided.

Dongqu Chen
Learning Time-Varying Coverage Functions
Coverage functions are an important class of discrete functions that capture laws of diminishing returns. In this paper, we propose a new problem of learning time-varying coverage functions which arise naturally from applications in social network analysis, machine learning, and algorithmic game theory. We develop a novel parametrization of the time-varying coverage function by illustrating the connections with counting processes. We present an efficient algorithm to learn the parameters by maximum likelihood estimation, and provide a rigorous theoretic analysis of its sample complexity. Empirical experiments from information diffusion in social network analysis demonstrate that with few assumptions about the underlying diffusion process, our method performs significantly better than existing approaches on both synthetic and real world data.

Nan Du, Yingyu Liang, Maria-Florina Balcan, Le Song
Learning a Concept Hierarchy from Multi-labeled Documents
While topic models can discover patterns of word usage in large corpora, it is difficult to meld this unsupervised structure with noisy, human-provided labels. We present a model---Label to Hierarchy---that can induce a hierarchy of human labels and the topics associated with those labels. The model is robust enough to account for missing labels from untrained, disparate annotators and provide an interpretable summary of an otherwise unwieldy label set. We present results on held-out word prediction and the ability of our model to predict sets of labels for unseen documents.

Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik, Jonathan Chang
Learning convolution filters for inverse covariance estimation of neural network connectivity
We consider the problem of inferring direct neural network connections from calcium imaging time series. Inverse covariance estimation has proven to be a fast and accurate method for learning macro- and micro-scale network connectivity in the brain, and in a recent Kaggle Connectomics competition inverse covariance was the main component of several top ten solutions, including our own and the winning team's algorithm. However, the accuracy of inverse covariance estimation is highly sensitive to signal preprocessing of the calcium fluorescence time series. Furthermore, brute force optimization methods such as grid search and coordinate ascent over signal processing parameters are time intensive: learning may take several days, and parameters that optimize one network may not generalize to networks with different size and parameters. In this paper we show how inverse covariance estimation can be dramatically improved using a simple convolution filter prior to applying sample covariance. Furthermore, these signal processing parameters can be learned quickly using a supervised optimization algorithm. In particular, we maximize a binomial log-likelihood loss function with respect to a convolution filter of the time series and the inverse covariance regularization parameter. Our proposed algorithm is relatively fast on networks the size of those in the competition (1000 neurons), producing AUC scores with similar accuracy to the winning solution in training time under 2 hours on a CPU. Prediction on new networks of the same size is carried out in less than 15 minutes, the time it takes to read in the data and write out the solution.

George Mohler
Learning on graphs using Orthonormal Representation is Statistically Consistent
Existing research \cite{reg} suggests that embedding graphs on a unit sphere can be beneficial in learning labels on the graph vertices. Unfortunately, current analysis tools do not enable us to choose the right embedding over the unit sphere. \emph{Orthonormal representation} of graphs, a class of embeddings over the unit sphere, was introduced by Lovasz \cite{lovasz_shannon}. In this paper we show that there exist orthonormal representations which are statistically consistent over a large class of graphs, including power law and random graphs. This result is achieved by extending the notion of consistency designed in the inductive setting to graph transduction. As part of the analysis we derive Rademacher complexity measures on graphs which relate to structural properties of the graph ($\chi(G)$). We also relate labeled sample complexity to $\vartheta(G)$, which gives novel insights on the sample complexity and density of graphs. In the multiview setting it is a well-known heuristic that each view be described by a graph and the graphs be combined by a convex combination of Laplacians \cite{lap_mv1}. The analysis presented here easily extends to multiple graph transduction and helps develop a sound statistical understanding, previously unavailable, of the multiview setting.

Rakesh Shivanna, Chiranjib Bhattacharyya
Learning the Learning Rate for Prediction with Expert Advice
Most standard algorithms for prediction with expert advice depend on a parameter called the learning rate. This learning rate needs to be large enough to fit the data well, but small enough to prevent overfitting. For the exponential weights algorithm, a sequence of prior work has established theoretical guarantees for higher and higher data-dependent tunings of the learning rate, which allow for increasingly aggressive learning. But in practice such theoretical tunings often still perform worse (as measured by their regret) than ad hoc tuning with an even higher learning rate. To close the gap between theory and practice we introduce an approach to learn the learning rate. Up to a factor that is at most (poly)logarithmic in the number of experts and the inverse of the learning rate, our method performs as well as if we knew the empirically best learning rate from a large range that includes both conservative small values and values that are much higher than those for which formal guarantees were previously available.

Wouter Koolen, Tim van Erven, Peter Grunwald
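As a rough illustration of the setting in the abstract above, the following sketch runs the standard exponential weights algorithm with a fixed learning rate and reports its regret. It is not the authors' method for learning the learning rate; the function names (`exponential_weights`, `regret`) and the toy loss sequence are assumptions made for this example only.

```python
import math


def exponential_weights(loss_rounds, eta):
    """Run exponential weights with a fixed learning rate eta.

    loss_rounds: list of per-round loss vectors, one loss per expert.
    Returns (cumulative expected loss of the learner,
             list of cumulative losses of the experts).
    """
    n_experts = len(loss_rounds[0])
    cum_losses = [0.0] * n_experts
    learner_loss = 0.0
    for losses in loss_rounds:
        # Weights are proportional to exp(-eta * cumulative loss).
        # Subtracting the minimum keeps exp() numerically stable.
        smallest = min(cum_losses)
        weights = [math.exp(-eta * (c - smallest)) for c in cum_losses]
        total = sum(weights)
        probs = [w / total for w in weights]
        # The learner suffers the expected loss under its current weights.
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        cum_losses = [c + l for c, l in zip(cum_losses, losses)]
    return learner_loss, cum_losses


def regret(loss_rounds, eta):
    """Learner's cumulative loss minus that of the best expert in hindsight."""
    learner_loss, cum_losses = exponential_weights(loss_rounds, eta)
    return learner_loss - min(cum_losses)


if __name__ == "__main__":
    # Hypothetical toy data: expert 0 is always right, expert 1 always wrong.
    rounds = [[0.0, 1.0]] * 20
    for eta in (0.1, 1.0, 5.0):
        print(eta, regret(rounds, eta))
```

On easy data like this toy sequence a larger learning rate gives lower regret, which is the kind of gap between conservative and aggressive tunings that the abstract refers to.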
Learning to Optimize via Information-Directed Sampling
We propose information-directed sampling -- a new algorithm for online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between the square of expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. For the widely studied Bernoulli and linear bandit models, we demonstrate simulation performance surpassing popular approaches, including upper confidence bound algorithms, Thompson sampling, and knowledge gradient. Further, we present simple analytic examples illustrating that information-directed sampling, due to the way it measures information gain, can dramatically outperform upper confidence bound algorithms and Thompson sampling.

Daniel Russo, Benjamin Van Roy
Learning to Search in Branch and Bound Algorithms
Branch and Bound is a widely used method in combinatorial optimization, including mixed integer programming, structured prediction and MAP inference. While most work has been focused on developing problem-specific techniques, little is known about how to systematically design the node searching strategy on a branch and bound tree. We address the key challenge of learning an adaptive node searching order for any class of problem solvable by branch and bound. The node searching strategies are learned by imitation learning using a simple oracle that takes the shortest path to the optimal solution. We apply our algorithm to linear programming based branch and bound for solving integer programs. We compare our method with standard heuristic searching baselines and a very efficient commercial mixed integer programming solver, Gurobi. We demonstrate that our approach achieves better solutions faster on four mixed integer linear programming libraries.

He He, Hal Daume, Jason Eisner
Learning to Think Like a Drug Dealer: Efficient Optimization Against Unknown Attackers
Game-theoretic algorithms for physical security have made an impressive real-world impact. In order to build a game model, though, the payoffs of potential attackers for various outcomes must be estimated; inaccurate estimates can lead to significant inefficiencies. We design an algorithm that optimizes the security agency's strategy with no prior information, by observing the attacker's responses to randomized deployments of resources and learning his priorities. In contrast to previous work, our algorithm requires a number of queries that is polynomial in the representation of the game.

Avrim Blum, Nika Haghtalab, Ariel Procaccia
Learning with Fredholm Kernels
In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation. Our approach fits naturally into the kernel framework and can be interpreted as constructing new data-dependent kernels, which we call Fredholm kernels. We proceed to discuss the "noise assumption" for semi-supervised learning and provide evidence, both theoretical and experimental, that Fredholm kernels can effectively utilize unlabeled data under the noise assumption. We demonstrate that methods based on Fredholm learning show very competitive performance in the standard semi-supervised learning setting.

Qichao Que, Mikhail Belkin, Yusu Wang
Learning with Pseudo-Ensembles
We formalize the notion of a pseudo-ensemble, which comprises a possibly infinite collection of child models spawned from some parent model by perturbing it according to some noise process. For example, dropout (Hinton et al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We define pseudo-ensembles, which involve perturbation in model-space, and examine their relation to standard ensemble methods and existing notions of robustness, which focus on perturbation in observation-space. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process which generates it. In the fully-supervised setting our regularizer matches the performance of dropout. Unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We conclude with a case study in which we transform the Recursive Neural Tensor Network of (Socher et al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.

Phil Bachman, Ouais Alsharif, Doina Precup
Local Decorrelation For Improved Detection
Even with the advent of more sophisticated, data-hungry methods, boosted decision trees remain extraordinarily successful for fast rigid object detection, achieving top accuracy on numerous datasets. While effective, most boosted detectors use decision trees with orthogonal (single feature) splits, and the topology of the resulting decision boundary may not be well matched to the natural topology of the data. Given highly correlated data, decision trees with oblique (multiple feature) splits can be effective. Use of oblique splits, however, comes at considerable computational expense. Inspired by recent work on discriminative decorrelation of HOG features, we instead propose an efficient feature transform that removes correlations in local neighborhoods. The result is an overcomplete but locally decorrelated representation ideally suited for use with orthogonal decision trees. In fact, orthogonal trees with our locally decorrelated features outperform oblique trees trained over the original features at a fraction of the computational cost. The overall improvement in accuracy is dramatic: on the Caltech Pedestrian Dataset, we reduce false positives nearly tenfold over the previous state-of-the-art.

Woonhyun Nam, Piotr Dollar, Joon Hee Han
Local Linear Convergence of Forward--Backward under Partial Smoothness
In this paper, we consider the Forward--Backward proximal splitting algorithm to minimize the sum of two proper closed convex functions, one of which has a Lipschitz--continuous gradient while the other is partly smooth relative to an active manifold $\mathcal{M}$. We propose a unified framework in which we show that the Forward--Backward (i) correctly identifies the active manifold $\mathcal{M}$ in a finite number of iterations, and then (ii) enters a local linear convergence regime that we characterise precisely. This explains the typical behaviour that has been observed numerically for many problems encompassed in our framework, including the Lasso, the group Lasso, the fused Lasso and the nuclear norm regularization to name a few. These results may have numerous applications including in signal/image processing, sparse recovery and machine learning.

Jingwei Liang, Jalal Fadili, Gabriel Peyré
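To make the Forward--Backward iteration in the abstract above concrete, here is a minimal sketch for the Lasso, $\min_x \frac{1}{2}\|Ax-b\|_2^2 + \lambda\|x\|_1$, where the forward step is a gradient step on the smooth term and the backward step is soft-thresholding. The function names, the dense list-of-lists matrix format and the fixed iteration count are assumptions for illustration; the step size is assumed to be supplied by the caller and to lie in $(0, 2/L)$, with $L$ the largest eigenvalue of $A^\top A$.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |.| applied to a scalar."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def forward_backward_lasso(A, b, lam, step, n_iters=200):
    """Forward-Backward (proximal gradient) iterations for the Lasso.

    A: matrix as a list of rows, b: list of targets,
    lam: l1 regularization weight, step: gradient step size.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(n_iters):
        # Forward step: gradient of 0.5 * ||Ax - b||^2 is A^T (Ax - b).
        residual = [
            sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
            for i in range(n_rows)
        ]
        grad = [
            sum(A[i][j] * residual[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        # Backward step: prox of step * lam * ||.||_1 is soft-thresholding.
        x = [
            soft_threshold(x[j] - step * grad[j], step * lam)
            for j in range(n_cols)
        ]
    return x


if __name__ == "__main__":
    # Hypothetical toy problem with a diagonal design matrix.
    A = [[2.0, 0.0], [0.0, 1.0]]
    b = [3.0, 0.5]
    print(forward_backward_lasso(A, b, lam=1.0, step=0.25))
```

In this toy problem the second coordinate is set exactly to zero and stays there, a small instance of the iterates settling on the support (the active manifold for the $\ell_1$ norm) after finitely many steps.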
Localized Data Fusion for Kernel k-Means Clustering with Application to Cancer Biology
In many modern applications from, for example, bioinformatics and computer vision, samples have multiple feature representations coming from different data sources. Multiview learning algorithms try to exploit all of this available information to obtain a better learner in such scenarios. In this paper, we propose a novel multiple kernel learning algorithm that extends kernel k-means clustering to the multiview setting, which combines kernels calculated on the views in a localized way to better capture sample-specific characteristics of the data. We demonstrate the better performance of our localized data fusion approach on a human colon and rectal cancer data set by clustering patients. Our method finds more relevant prognostic patient groups than global data fusion methods when we evaluate the results with respect to three commonly used clinical biomarkers.

Mehmet Gonen, Adam Margolin
Low Rank Approximation Lower Bounds in Row-Update Streams
We study low-rank approximation in the streaming model in which the rows of an $n \times d$ matrix $A$ are presented one at a time in an arbitrary order. At the end of the stream, the streaming algorithm should output a $k \times d$ matrix $R$ so that $\|A-AR^{\dagger}R\|_F^2 \leq (1+\epsilon)\|A-A_k\|_F^2$, where $A_k$ is the best rank-$k$ approximation to $A$. A deterministic streaming algorithm of Ghashami and Phillips (SODA, 2014), building upon an earlier algorithm of Liberty (KDD, 2013), provides such a streaming algorithm using $O(dk/\epsilon)$ words of space. A natural question is if smaller space is possible. We give an almost matching lower bound of $\Omega(dk/\epsilon)$ bits of space, even for randomized algorithms which succeed only with constant probability. Our lower bound matches the upper bound of Ghashami and Phillips up to the word size, improving on a simple $\Omega(dk)$ space lower bound.

David Woodruff
Low-Rank Time-Frequency Synthesis
Many single-channel signal decomposition techniques rely on a low-rank factorization of a time-frequency transform. In particular, nonnegative matrix factorization (NMF) of the spectrogram -- the magnitude of the short-time Fourier transform (STFT) -- has been considered in many audio applications. In this setting, the Itakura-Saito NMF technique proposed by Fevotte et al. was shown to underlie a generative Gaussian composite model (GCM) of the STFT, a step forward from more empirical approaches based on ad-hoc transform and divergence specifications. Still, the GCM is not yet a generative model of the raw signal itself, but only of its STFT. The work presented in this paper fills in this ultimate gap by proposing a novel signal synthesis model with low-rank time-frequency structure. In particular, our new approach opens doors to multi-resolution representations that were not possible in the traditional NMF setting. We describe two expectation-maximization algorithms for estimation in the new model and report audio signal processing results with music decomposition and speech enhancement.

Cédric Févotte, Matthieu Kowalski
Low-dimensional models of neural population activity in sensory cortical circuits
Neural responses in visual cortex are influenced by visual stimuli and by ongoing spiking activity in local circuits. An important challenge in computational neuroscience is to develop models that can account for both of these features in large multi-neuron recordings and to reveal how stimulus representations interact with and depend on cortical dynamics. Here we introduce a statistical model of neural population activity that integrates a nonlinear receptive field model with a latent dynamical model of ongoing cortical activity. This model captures the temporal dynamics, effective network connectivity in large population recordings, and correlations due to shared stimulus drive as well as common noise. Moreover, because the nonlinear stimulus inputs are mixed by the ongoing dynamics, the model can account for a relatively large number of idiosyncratic receptive field shapes with a small number of nonlinear inputs to a low-dimensional latent dynamical model. We introduce a fast estimation method using online expectation maximization with Laplace approximations. Inference scales linearly in both population size and recording duration. We apply this model to multi-channel recordings from primary visual cortex and show that it accounts for a large number of individual neural receptive fields using a small number of nonlinear inputs and a low-dimensional dynamical model.

Evan Archer, Urs Köster, Jonathan Pillow, Jakob Macke
MCMC Sampling in HDPs using Sub-Clusters
We develop a sampling technique for Hierarchical Dirichlet process models. The parallel algorithm builds upon [Chang&Fisher 2013] by proposing large split and merge moves based on learned sub-clusters. The additional global split and merge moves drastically improve convergence in the experimental results. Furthermore, we discover that cross-validation techniques do not adequately determine convergence, and that previous sampling methods converge more slowly than was previously expected.

Jason Chang, John Fisher III
Magnitude-sensitive preference formation
Our understanding of the neural computations that underlie the ability of animals to choose among options has advanced through a synthesis of computational modeling, brain imaging and behavioral choice experiments. Yet, there remains a gulf between theories of preference learning and accounts of the real, economic choices that humans face in daily life, choices that are usually between some amount of money and an item. In this paper, we develop a theory of magnitude-sensitive preference learning that permits an agent to rationally infer its preferences for items compared with money options of different magnitudes. We show how this theory yields classical and anomalous supply-demand curves and predicts choices for a large panel of risky lotteries. Accurate replications of such phenomena without recourse to utility functions suggest that the theory proposed is both psychologically realistic and econometrically viable.

Nisheeth Srivastava, Ed Vul, Paul Schrater
Message Passing Inference for Large Scale Graphical Models with High Order Potentials
To keep up with the Big Data challenge, parallelized algorithms based on dual decomposition have been proposed to perform inference in Markov random fields. Despite this parallelization, current algorithms struggle when the energy has high order terms and the graph is densely connected. In this paper we propose a partitioning strategy followed by a novel message passing algorithm which is able to exploit pre-computations to only update the high-order factors when passing messages across machines. We demonstrate the effectiveness of our approach on the task of joint layout and semantic segmentation estimation from single images, and show that our approach is orders of magnitude faster than current methods.

Jian Zhang, Alex Schwing, Raquel Urtasun
Metric Learning for Temporal Sequence Alignment
In this paper, we propose to learn a Mahalanobis distance to perform alignment of multivariate time series. The learning examples for this task are time series for which the true alignment is known. We cast the alignment problem as a structured prediction task, and propose realistic losses between alignments for which the optimization is tractable. We provide experiments on real data in the audio to audio context, where we show that the learning of a similarity measure leads to improvements in the performance of the alignment task. We also propose to use this metric learning framework to perform feature selection and, from basic audio features, build a combination of these with better performance for the alignment.

Rémi Lajugie, Damien Garreau, Francis Bach, Sylvain Arlot
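The alignment step that the abstract above builds on can be sketched as dynamic time warping under a Mahalanobis distance. This is only an illustration with a fixed, hand-chosen metric matrix `M`; learning `M` from aligned examples, which is the paper's contribution, is not shown. The function names and the toy sequences are assumptions for this example.

```python
def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for list vectors."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))


def dtw_cost(seq_a, seq_b, M):
    """Cost of the best monotone alignment between two multivariate series.

    seq_a, seq_b: lists of feature vectors; M: positive semidefinite matrix.
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = mahalanobis_sq(seq_a[i - 1], seq_b[j - 1], M)
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in seq_a only
                cost[i][j - 1],      # advance in seq_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
    b = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
    print(dtw_cost(a, b, identity))
```

A matrix `M` with a zero on the diagonal ignores the corresponding feature entirely, which hints at how a learned metric can double as feature selection, as the abstract suggests.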
Mind the Nuisance: Gaussian Process Classification using Privileged Noise
The learning with privileged information setting has recently attracted a lot of attention within the machine learning community, as it allows the integration of additional knowledge into the training process of a classifier, even when this comes in the form of a data modality that is not available at test time. Here, we show that privileged information can naturally be treated as noise in the latent function of a Gaussian Process classifier (GPC). That is, in contrast to the standard GPC setting, the latent function is not just a nuisance but a feature: it becomes a natural measure of confidence about the training data by modulating the slope of the GPC sigmoid likelihood function. Extensive experiments on public datasets show that the proposed GPC method using privileged noise, called GPC+, improves over a standard GPC without privileged knowledge, and also over the current state-of-the-art SVM-based method, SVM+. Moreover, we show that advanced neural networks and deep learning methods can be compressed as privileged information.

Daniel Hernandez-Lobato, Viktoriia Sharmanska, Kristian Kersting, Christoph Lampert, Novi Quadrianto
Minimax-optimal Inference from Partial Rankings
This paper studies the problem of inferring a global preference based on the partial rankings provided by many users over different subsets of items according to the Plackett-Luce model. A question of particular interest is how to optimally assign items to users for ranking and how many item assignments are needed to achieve a target estimation error. For a given assignment of items to users, we first derive an oracle lower bound of the estimation error that holds even for the more general Thurstone models. Then we show that the Cram\'er-Rao lower bound and our upper bounds inversely depend on the spectral gap of the Laplacian of an appropriately defined comparison graph. When the system is allowed to choose the item assignment, we propose a random assignment scheme. Our oracle lower bound and upper bounds imply that it is minimax-optimal up to a logarithmic factor among all assignment schemes and the lower bound can be achieved by the maximum likelihood estimator as well as popular rank-breaking schemes that decompose partial rankings into pairwise comparisons. The numerical experiments corroborate our theoretical findings.

Bruce Hajek, Sewoong Oh, Jiaming Xu
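The rank-breaking idea mentioned in the abstract above, decomposing a partial ranking into pairwise comparisons, can be sketched as follows. This shows full breaking (every ordered pair within a ranking is extracted), which is one possible scheme and not necessarily the one analysed in the paper; the function names and the toy rankings are assumptions for this example.

```python
from collections import Counter
from itertools import combinations


def rank_break(ranking):
    """Decompose one ranking (best item first) into (winner, loser) pairs."""
    return [
        (ranking[i], ranking[j])
        for i, j in combinations(range(len(ranking)), 2)
    ]


def pairwise_win_counts(rankings):
    """Count how often each ordered pair (winner, loser) occurs overall."""
    counts = Counter()
    for ranking in rankings:
        counts.update(rank_break(ranking))
    return counts


if __name__ == "__main__":
    # Hypothetical partial rankings from two users over different item subsets.
    rankings = [["a", "b", "c"], ["b", "a"]]
    print(rank_break(rankings[0]))
    print(pairwise_win_counts(rankings))
```

The resulting pairwise counts are the kind of input that pairwise-comparison estimators consume.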
Model-based Reinforcement Learning and the Eluder Dimension
We consider the problem of learning to optimize an unknown Markov decision process (MDP). We show that, if the MDP can be parameterized within some known function class, we can obtain regret bounds that scale with the dimensionality, rather than cardinality, of the system. We characterize this dependence explicitly as $\tilde{O}(\sqrt{d_K d_E T})$ where $T$ is time elapsed, $d_K$ is the Kolmogorov dimension and $d_E$ is the \emph{eluder dimension}. This represents the first unified framework for model-based reinforcement learning and provides state of the art guarantees in several important settings. Moreover, we present a simple and computationally efficient algorithm \emph{posterior sampling for reinforcement learning} (PSRL) that satisfies these bounds.

Ian Osband, Benjamin Van Roy
Modeling sequences with a predictive gating network
We propose an approach to modeling time series in terms of the transformations that take one frame to the next. To this end we show how a bi-linear model of transformations, such as a gated autoencoder, can be turned into a predictive model, by training it to predict a future observation from the current observation and an inferred transformation using backprop-through-time. We also show how a stack of multiple layers of this predictive bilinear model can learn to represent complicated time series, such as videos of objects rotating in 3-D, and that it can outperform existing predictive models, including standard recurrent neural networks, in terms of prediction accuracy.

Vincent Michalski, Roland Memisevic, Kishore Konda
Mondrian Forests: Efficient Online Random Forests
Ensembles of randomized decision trees, usually referred to as random forests, are widely used for classification and regression tasks in machine learning and statistics. Random forests achieve competitive predictive performance and are computationally efficient to train and test, making them excellent candidates for real-world prediction tasks. The most popular random forest variants (such as Breiman's random forest and extremely randomized trees) operate on batches of training data. Online methods are now in greater demand. Existing online random forests, however, require more training data than their batch counterparts to achieve comparable predictive performance. In this work, we use Mondrian processes (Roy and Teh, 2009) to construct ensembles of random decision trees we call Mondrian forests. Mondrian forests can be grown in an incremental/online fashion and, remarkably, the distribution of online Mondrian forests is the same as that of batch Mondrian forests. Mondrian forests achieve competitive predictive performance comparable with existing online random forests and periodically re-trained batch random forests, while being more than an order of magnitude faster, thus representing a better computation vs accuracy tradeoff.

Balaji Lakshminarayanan, Daniel Roy, Yee Whye Teh
Multi-Class Deep Boosting
We present new ensemble learning algorithms for multi-class classification. Our algorithms can use as a base classifier set the family of deep decision trees or other rich or complex families and yet benefit from strong generalization guarantees. We give new data-dependent learning bounds for convex ensembles in the multi-class classification setting expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. These bounds are finer than existing ones both thanks to an improved dependency on the number of classes and, more crucially, by virtue of a more favorable complexity term expressed as an average of the Rademacher complexities based on the ensemble’s mixture weights. We introduce and discuss several new multi-class ensemble algorithms benefiting from these guarantees, prove positive results for the H-consistency of several of them, and report the results of experiments showing that their performance compares favorably with that of multi-class versions of AdaBoost and Logistic Regression.

Vitaly Kuznetsov, Mehryar Mohri, Umar Syed
Multi-Resolution Cascades for Multiclass Object Detection
An algorithm for learning fast multiclass object detection cascades is introduced. It produces multi-resolution (MRes) cascades, whose early stages are binary target vs. non-target detectors that eliminate false positives, whose late stages are multiclass classifiers that finely discriminate target classes, and whose middle stages have intermediate numbers of classes, determined in a data-driven manner. This MRes structure is achieved with a new structurally biased boosting algorithm (SBBoost). SBBoost extends previous multiclass boosting approaches, whose boosting mechanisms are shown to implement two complementary data-driven biases: 1) the standard bias towards examples difficult to classify, and 2) a bias towards difficult classes. It is shown that structural biases can be implemented by generalizing this class-based bias, so as to encourage the desired MRes structure. This is accomplished through a generalized definition of multiclass margin, which includes a set of bias parameters. SBBoost is a boosting algorithm for maximization of this margin. It can also be interpreted as a standard multiclass boosting algorithm augmented with margin thresholds or a cost-sensitive boosting algorithm with costs defined by the bias parameters. A stage adaptive bias policy is then introduced to determine bias parameters in a data-driven manner. This is shown to produce MRes cascades that have a high detection rate and are computationally efficient. Experiments on multiclass object detection show improved performance over previous solutions.

Mohammad Saberian, Nuno Vasconcelos
Multi-Scale Spectral Decomposition of Massive Graphs
Computing the $k$ dominant eigenvalues and eigenvectors of massive graphs is a key operation in numerous machine learning applications; however, popular solvers suffer from slow convergence, especially when $k$ is reasonably large. In this paper, we propose and analyze a novel multi-scale spectral decomposition method (MSEIGS), which first clusters the graph into smaller clusters whose spectral decomposition can be computed efficiently and independently. We show theoretically as well as empirically that the union of all clusters’ subspaces has significant overlap with the dominant subspace of the original graph, provided that the graph is clustered appropriately. Thus, eigenvectors of the clusters serve as good initializations to a block Lanczos algorithm that is used to compute the spectral decomposition of the original graph. We further use hierarchical clustering to speed up the computation and adopt a fast early termination strategy to compute quality approximations. Our method outperforms widely used solvers in terms of convergence speed and approximation quality. Furthermore, our method is naturally parallelizable and exhibits significant speedups in shared-memory parallel settings. For example, on a graph with more than 82 million nodes and 3.6 billion edges, MSEIGS takes less than 3 hours on a single-core machine while Randomized SVD takes more than 6 hours, for obtaining a similar approximation of the top-50 eigenvectors. Using 16 cores, we can reduce this time to less than 40 minutes.

Si Si, Donghyuk Shin, Inderjit Dhillon, Beresford Parlett
Multi-Step Stochastic ADMM in High Dimensions: Applications to Sparse Optimization and Matrix Decomposition
In this paper, we consider a multi-step version of the stochastic ADMM method with efficient guarantees for high-dimensional problems. We first analyze the simple setting, where the optimization problem consists of a loss function and a single regularizer (e.g. sparse optimization), and then extend to the multi-block setting with multiple regularizers and multiple variables (e.g. matrix decomposition into sparse and low rank components). For the sparse optimization problem, our method achieves the minimax bound of $O(s\log d/T)$ for $s$-sparse problems in $d$ dimensions in $T$ steps, and is thus unimprovable by any method up to constant factors. For the matrix decomposition problem with a general loss function, we analyze the multi-step ADMM with multiple blocks. We establish an $O(1/T)$ rate and efficient scaling as the size of the matrix grows. For natural noise models (e.g. independent noise), our regret achieves the minimax rate. Thus, we establish tight convergence guarantees for multi-block ADMM in high dimensions. Experiments show that for both sparse optimization and matrix decomposition problems, our algorithm outperforms the state-of-the-art methods.

Hanie Sedghi, Anima Anandkumar, Edmond Jonckheere
Multi-scale Graphical Models for Spatio-Temporal Processes
Learning the dependency structure between spatially distributed observations of a spatio-temporal process is an important problem in many fields such as geology, geophysics, atmospheric sciences, oceanography, etc. However, estimation of such systems is complicated by the fact that they exhibit dynamics at multiple scales of space and time arising due to a combination of diffusion and convection/advection. As we show, time-series graphical models based on vector auto-regressive processes are inefficient in capturing such multi-scale structure. In this paper, we present a hierarchical graphical model with physically derived priors that better represents the multi-scale character of these dynamical systems. We also propose algorithms to efficiently estimate the interaction structure from data. We demonstrate results on a general class of problems arising in exploration geophysics by discovering graphical structure that is physically meaningful and provide evidence of its advantages over alternative approaches.

Firdaus Janoos, Huseyin Denli, Niranjan Subrahmanya
Multilabel Structured Output Learning with Random Spanning Trees of Max-Margin Markov Networks
We show that the usual score function for conditional Markov networks can be written as the expectation over the scores of their spanning trees. We also show that a small random sample of these output trees can attain a significant fraction of the margin obtained by the complete graph and we provide conditions under which we can perform tractable inference. The experimental results confirm that practical learning is scalable to realistic datasets using this approach.

Mario Marchand, Hongyu Su, Emilie Morvant, Juho Rousu, John Shawe-Taylor
Multiscale Fields of Patterns
We describe a general framework for representing and learning high-order image models that can be used in a variety of applications. The approach involves modeling local patterns in a multiscale representation of an image. Local properties of a coarse image capture non-local properties of the original image. In the case of binary images local properties are defined in terms of binary patterns observed over small neighborhoods around each pixel. With the multiscale representation we capture the frequency of patterns observed at different scales of an image pyramid. Our framework leads to expressive priors that depend on a relatively small number of parameters. For inference and learning we use MCMC methods based on block sampling with large blocks. We evaluate the approach with two example applications. One involves contour detection. The other involves estimation of segmentation masks.

Pedro Felzenszwalb, John Oberlin
Multitask learning meets tensor factorization: task imputation via convex optimization
We study a multitask learning problem in which each task is parametrized by a weight vector and indexed by a pair of indices, e.g., (consumer, time). The weight vectors can be collected into a tensor and the (multilinear-)rank of the tensor controls the amount of sharing of information among tasks. Two types of convex relaxations have recently been proposed for the tensor multilinear rank. However, we argue that neither of them is optimal in the context of multitask learning, in which the dimensions or multilinear rank are typically inhomogeneous. We propose a new norm, which we call the scaled latent trace norm, and analyze the excess risk of all three norms. The results apply to various settings including matrix and tensor completion, multitask learning and multilinear multitask learning. Both the theory and experiments support the advantage of the new norm when the tensor is not equal-sized and we do not know a priori which mode is low rank.

Kishan Wimalawarne, Masashi Sugiyama, Ryota Tomioka
Multivariate Regression with Calibration
We propose a new method named calibrated multivariate regression (CMR) for fitting high dimensional multivariate regression models. Compared to existing methods, CMR calibrates the regularization for each regression task with respect to its noise level so that it is simultaneously tuning-insensitive and achieves an improved finite-sample performance. Computationally, we develop an efficient smoothed proximal gradient algorithm which has a worst-case iteration complexity $O(1/\epsilon)$, where $\epsilon$ is a pre-specified numerical accuracy. Theoretically, we prove that CMR achieves the optimal rate of convergence in parameter estimation. We illustrate the usefulness of CMR by thorough numerical simulations and show that CMR consistently outperforms other high dimensional multivariate regression methods. We also apply CMR to a brain activity prediction problem and find that CMR is as competitive as the handcrafted model created by human experts.

Han Liu, Lie Wang, Tuo Zhao
Multivariate f-divergence Estimation With Confidence
The problem of f-divergence estimation is important in the fields of machine learning, information theory, and statistics. While several divergence estimators exist, relatively few have known convergence properties. In particular, even for those estimators whose MSE convergence rates are known, the asymptotic distributions are unknown. We establish the asymptotic normality of a recently proposed ensemble estimator of f-divergence between two distributions from a finite number of samples. This estimator has an MSE convergence rate of O(1/T), is simple to implement, and performs well in high dimensions. This theory enables us to perform divergence-based inference tasks such as testing equality of pairs of distributions based on empirical samples. We experimentally validate our theoretical results and, as an illustration, use them to empirically bound the best achievable classification error.

Kevin Moon, Alfred Hero
Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms
Let $p$ be an unknown and arbitrary probability distribution over $[0,1)$. We consider the problem of \emph{density estimation}, in which a learning algorithm is given i.i.d. draws from $p$ and must (with high probability) output a hypothesis distribution that is close to $p$. The main contribution of this paper is a highly efficient density estimation algorithm for learning using a variable-width histogram, i.e., a hypothesis distribution with a piecewise constant probability density function. In more detail, for any $k$ and $\eps$, we give an algorithm that makes $\tilde{O}(k/\eps^2)$ draws from $p$, runs in $\tilde{O}(k/\eps^2)$ time, and outputs a hypothesis distribution $h$ that is piecewise constant with $O(k \log^2(1/\eps))$ pieces. With high probability the hypothesis $h$ satisfies $\dtv(p,h) \leq C \cdot \opt_k(p) + \eps$, where $\dtv$ denotes the total variation distance (statistical distance), $C$ is a universal constant, and $\opt_k(p)$ is the smallest total variation distance between $p$ and any $k$-piecewise constant distribution. The sample size and running time of our algorithm are both optimal up to logarithmic factors. The ``approximation factor'' $C$ that is present in our result is inherent in the problem, as we prove that no algorithm with sample size bounded in terms of $k$ and $\eps$ can achieve $C < 2$ regardless of what kind of hypothesis distribution it uses.

Siu On Chan, Ilias Diakonikolas, Rocco Servedio, Xiaorui Sun
Near-Optimal-Sample Estimators for Spherical Gaussian Mixtures
Many important distributions are high dimensional, and often they can be modeled as Gaussian mixtures. We derive the first sample-efficient polynomial-time estimator for high-dimensional spherical Gaussian mixtures. Based on intuitive spectral reasoning, it approximates mixtures of $k$ spherical Gaussians in $d$ dimensions to within $\ell_1$ distance $\epsilon$ using $\mathcal{O}({dk^9(\log^2 d)}/{\epsilon^4})$ samples and $\mathcal{O}_{k,\epsilon}(d^3\log^5 d)$ computation time. Conversely, we show that any estimator requires $\Omega\bigl({dk}/{\epsilon^2}\bigr)$ samples, hence the algorithm's sample complexity is nearly optimal in the dimension. The implied time-complexity factor $\mathcal{O}_{k,\epsilon}$ is exponential in $k$, but much smaller than previously known. In the process of deriving these results, we also construct a simple estimator for one-dimensional Gaussian mixtures that uses $\widetilde{\mathcal{O}}({k}/{\epsilon^2})$ samples and $\widetilde{\mathcal{O}}(({k}/{\epsilon})^{3k+1})$ computation time, and demonstrate a faster algorithm for finding which out of many distributions roughly minimizes the $\ell_1$ distance to an unknown underlying distribution.

Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, Ashkan Jafarpour
Near-optimal sample compression for nearest neighbors
We present the first sample compression algorithm for nearest neighbors with non-trivial performance guarantees. We complement these guarantees by demonstrating almost matching hardness lower bounds, which show that our bound is nearly optimal. Our result yields new insight into margin-based nearest neighbor classification in metric spaces and allows us to significantly sharpen and simplify existing bounds. Some encouraging empirical results are also presented.

Lee-Ad Gottlieb, Aryeh Kontorovitch, Pinhas Nisnevitch
Neural Word Embedding as Implicit Matrix Factorization
We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs (shifted by a global constant). Inspired by the result, we derive a novel association measure, Shifted Positive PMI, which, when used in a sparse word-context matrix, improves results on two word similarity tasks and one of two analogy tasks. In cases where dense, low-dimensional vectors are preferred, exact factorization with SVD can, in many cases, achieve solutions that are on par with or slightly superior to SGNS's solutions for word similarity tasks. On analogy questions, SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS's factorization.

Omer Levy, Yoav Goldberg
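
The Shifted Positive PMI measure named in the abstract above can be illustrated with a minimal sketch. It assumes the shift is $\log k$ for $k$ negative samples, i.e. each cell is $\max(\mathrm{PMI}(w,c) - \log k, 0)$; the function name `sppmi`, the dictionary representation, and the toy counts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def sppmi(pair_counts, k=1):
    """Shifted Positive PMI for each observed (word, context) pair.

    pair_counts maps (word, context) -> co-occurrence count.
    k is the assumed number of negative samples; the shift is log(k).
    """
    total = sum(pair_counts.values())
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n

    shift = math.log(k)
    result = {}
    for (word, context), n in pair_counts.items():
        pmi = math.log(n * total / (word_counts[word] * context_counts[context]))
        value = max(pmi - shift, 0.0)
        if value > 0.0:  # keep the matrix sparse: store only positive cells
            result[(word, context)] = value
    return result

# Toy counts, invented purely for illustration.
toy_counts = {
    ("cat", "purrs"): 8,
    ("cat", "the"): 2,
    ("dog", "barks"): 8,
    ("dog", "the"): 2,
}
sppmi_k1 = sppmi(toy_counts, k=1)
sppmi_k5 = sppmi(toy_counts, k=5)
```

With these toy counts, the uninformative context "the" has PMI 0 and is dropped, and a larger shift (`k=5`) removes every cell, which shows how the shift controls sparsity.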
Neurons as Monte Carlo Samplers: Bayesian Inference and Learning in Spiking Networks
We propose a two-layer spiking network capable of performing approximate inference and learning for a hidden Markov model. The lower layer sensory neurons detect noisy measurements of hidden world states. The higher layer neurons with recurrent connections infer a posterior distribution over world states from spike trains generated by sensory neurons. We show how such a neuronal network with synaptic plasticity can implement a form of Bayesian inference similar to Monte Carlo methods such as particle filtering. Each spike in the population of inference neurons represents a sample of a particular hidden world state. The spiking activity across the neural population approximates the posterior distribution of hidden state. The model provides a functional explanation for the Poisson-like noise commonly observed in cortical responses. Uncertainties in spike times provide the necessary variability for sampling during inference. Unlike previous models, the hidden world state is not observed by the sensory neurons, and the temporal dynamics of the hidden state is unknown. We demonstrate how this network can sequentially learn the hidden Markov model using a spike-timing dependent Hebbian learning rule and achieve power-law convergence rates.

Yanping Huang, Rajesh Rao
New Rules for Domain Independent Lifted MAP Inference
We present two new rules for lifting MAP inference in a large class of Markov Logic Network (MLN) models. We identify equivalence classes of variables which have at most a single variable appearing in any given formula and are referred to as single occurrence equivalence classes. Our first inference rule states that MAP inference over the original theory can be equivalently formulated over a reduced theory where single occurrence classes have been reduced to unary sized domains. Our approach is domain independent when every equivalence class in the theory is single occurrence. The MAP solution in such cases is found at an extreme, i.e., when every grounding of a predicate takes the same (true/false) value. Our second inference rule states that any formula which becomes a tautology at extreme assignments can be removed from the theory for the purpose of MAP inference when the remaining theory is single occurrence. This includes many difficult-to-lift formulas such as symmetry and transitivity. Experiments over benchmark MLNs validate that our approach results in superior performance and highly scalable solutions compared to the state of the art.

Happy Mittal, Prasoon Goyal, Vibhav Gogate, Parag Singla
Nonparametric Bayesian inference on multivariate exponential families
We develop a model by choosing the maximum entropy distribution from the set of models satisfying certain smoothness and independence criteria; we show that inference on this model generalizes local kernel estimation to the context of Bayesian inference on stochastic processes. Our model enables Bayesian inference in contexts when standard techniques like Gaussian process inference are too expensive or difficult to apply. Exact inference on this model is possible for any likelihood function from the exponential family. Inference is then highly efficient, requiring only $O(\log N)$ time and $O(N)$ space at run time. We demonstrate our algorithm on several problems and show quantifiable improvement in both speed and performance relative to models based on the Gaussian process.

William Vega-Brown, Marek Doniec, Nicholas Roy
Nonparametric Pairwise Similarity for Clustering
Pairwise clustering methods partition the data space into clusters by the pairwise similarity between data points. The success of pairwise clustering largely depends on the pairwise similarity function defined over the data points, and kernel similarity is broadly used. In this paper, we present a novel pairwise clustering framework by bridging the gap between clustering and multi-class classification. This pairwise clustering framework learns an unsupervised nonparametric classifier from unlabeled data, and searches for the optimal partition of the data by minimizing the generalization error of the learned classifier associated with the data partitions. Modeling the underlying distribution of the data by nonparametric kernel density estimation, the generalization error bound for the unsupervised nonparametric classifier is the sum of pairwise terms between the data points, which are defined as the nonparametric pairwise similarity for the purpose of clustering. In the case of a uniform distribution, this nonparametric similarity induced by the unsupervised plug-in classifier exhibits a well-known form of kernel similarity. We also prove that the generalization error bound for the unsupervised plug-in classifier is asymptotically equal to the weighted volume of the cluster boundary for Low Density Separation, a widely used criterion for semi-supervised learning and clustering.

Yingzhen Yang, Feng Liang, Shuicheng Yan, Zhangyang Wang, Thomas Huang
Object Localization based on Structural SVM using Privileged Information
We propose a structured prediction algorithm for object localization based on Support Vector Machines (SVMs) using privileged information. Privileged information provides useful high-level knowledge for image understanding and facilitates learning a reliable model even with a small number of training examples. In our setting, we assume that such information is available only at training time since it may be difficult to obtain from visual data accurately without human supervision. Our goal is to improve performance by incorporating privileged information into an ordinary learning framework and adjusting model parameters for better generalization. We tackle the object localization problem with a novel structural SVM using privileged information, where an alternative loss-augmented inference procedure is employed to handle the term in the objective function corresponding to privileged information. We apply the proposed algorithm to the Caltech-UCSD Birds 200-2011 dataset, and obtain encouraging results suggesting further investigation into the benefit of privileged information in structured prediction.

Jan Feyereisl, Suha Kwak, Jeany Son, Bohyung Han
On Integrated Clustering and Outlier Detection
We model the joint clustering and outlier detection problem using an extension of the facility location formulation. The advantages of combining clustering and outlier selection include: (i) the resulting clusters tend to be compact and semantically coherent, (ii) the clusters are more robust against data perturbations, and (iii) the outliers are contextualised by the clusters and more interpretable. We provide a practical subgradient-based algorithm for the problem and also study the theoretical properties of the algorithm in terms of approximation and convergence. Extensive evaluation on synthetic and real data sets attests to both the quality and scalability of our proposed method.

Lionel Ott, Linsey Pang, Fabio Ramos, Sanjay Chawla
On Iterative Hard Thresholding Methods for High-dimensional M-Estimation
The use of M-estimators in generalized linear regression models in high dimensional settings requires risk minimization with hard $L_0$ constraints. Of the known methods, the class of projected gradient descent (also known as iterative hard thresholding (IHT)) methods is known to offer the fastest and most scalable solutions. However, the current state-of-the-art is only able to analyze these methods in very restrictive settings which do not hold in high dimensional statistical models. In this work we bridge this gap by providing the first analysis for IHT-style methods in the high dimensional statistical setting. Our bounds are tight and match known minimax lower bounds. Our results rely on a general analysis framework that enables us to analyze several popular hard thresholding style algorithms (such as HTP, CoSaMP, SP) in the high dimensional regression setting. Finally, we extend our analysis to the problem of low-rank matrix recovery.

Prateek Jain, Ambuj Tewari, Purushottam Kar
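
As a rough illustration of the projected gradient (IHT) scheme mentioned in the abstract above, the following minimal sketch alternates a gradient step with a projection onto $s$-sparse vectors. The least-squares loss, the step size, the iteration count, and the names `hard_threshold` and `iht_least_squares` are assumptions made for this example; it is not the specific variant analyzed by the authors.

```python
def hard_threshold(vector, s):
    """Keep the s largest-magnitude entries of vector and zero out the rest."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:s])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]

def iht_least_squares(X, y, s, step, iterations):
    """Projected gradient descent on the mean squared error with an L0 constraint."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iterations):
        # Residuals X w - y for the current iterate.
        residual = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        # Gradient of (1 / 2n) * ||X w - y||^2 with respect to w.
        gradient = [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(d)]
        # Gradient step followed by projection onto s-sparse vectors.
        w = hard_threshold([w[j] - step * gradient[j] for j in range(d)], s)
    return w

# Toy problem (invented for illustration): a scaled identity design and a
# 2-sparse signal [3.0, 0.0, -1.5, 0.0] observed with small perturbations.
X_toy = [[2.0, 0.0, 0.0, 0.0],
         [0.0, 2.0, 0.0, 0.0],
         [0.0, 0.0, 2.0, 0.0],
         [0.0, 0.0, 0.0, 2.0]]
y_toy = [6.0, 0.2, -3.0, -0.1]
w_hat = iht_least_squares(X_toy, y_toy, s=2, step=1.0, iterations=5)
```

On this toy problem the thresholding step discards the two small perturbation-driven coordinates and retains the two large ones.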
On Multiplicative Multitask Feature Learning
We investigate a general framework of multiplicative multitask feature learning which decomposes each task's model parameters into a multiplication of two components. One of the components is used across all tasks and the other component is task-specific. Several previous methods have been proposed as special cases of our framework. We study the theoretical properties of this framework when different regularization conditions are applied to the two decomposed components. We prove that this framework is mathematically equivalent to the widely used multitask feature learning methods that are based on a joint regularization of all model parameters, but with a more general form of regularizers. Further, an analytical formula is derived for the across-task component as related to the task-specific component for all these regularizers, leading to a better understanding of the shrinkage effect. Study of this framework motivates new multitask learning algorithms. We propose two new learning formulations by varying the parameters in the proposed framework. Empirical studies have been performed that reveal the relative advantages of these different learning formulations by comparing with the state of the art, which provides instructive insights into the feature learning problem with multiple tasks.

Xin Wang, Jinbo Bi, Shipeng Yu, Jiangwen Sun
On Prior Distributions and Approximate Inference for Structured Variables
We present a general framework for constructing prior distributions with structured variables. The prior is defined as the information projection of a base distribution onto distributions supported on the constraint set of interest. In cases where this projection is intractable, we propose a family of parameterized approximations indexed by subsets of the domain. We further analyze the special case of sparse structure. While the optimal prior is intractable in general, we show that approximate inference using convex subsets is tractable, and is equivalent to maximizing a submodular function subject to cardinality constraints. As a result, inference using greedy forward selection provably achieves within a factor of (1-1/e) of the optimal objective value. Our work is motivated by the predictive modeling of high-dimensional functional neuroimaging data. For this task, we employ the Gaussian base distribution induced by local partial correlations and consider the design of priors to capture the domain knowledge of sparse support. Experimental results on simulated data and high dimensional neuroimaging data show the effectiveness of our approach in terms of support recovery and predictive accuracy.

Oluwasanmi Koyejo, Rajiv Khanna, Joydeep Ghosh, Russell Poldrack
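
The greedy forward selection referred to in the abstract above can be sketched generically for a monotone submodular objective under a cardinality constraint, the setting in which the classical (1-1/e) guarantee applies. The set-coverage objective, the candidate sets, and the function names below are hypothetical stand-ins; the paper's actual objective arises from its prior construction and is not reproduced here.

```python
def greedy_forward_selection(candidates, objective, budget):
    """Pick up to `budget` items, each time adding the largest marginal gain."""
    selected = []
    remaining = list(candidates)
    for _ in range(budget):
        if not remaining:
            break
        current = objective(selected)
        best = max(remaining, key=lambda c: objective(selected + [c]) - current)
        if objective(selected + [best]) - current <= 0:
            break  # no remaining candidate improves the objective
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy monotone submodular objective (invented for illustration): the number
# of distinct elements covered by the chosen sets.
toy_sets = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}

def coverage(chosen):
    covered = set()
    for name in chosen:
        covered |= toy_sets[name]
    return len(covered)

chosen = greedy_forward_selection(list(toy_sets), coverage, budget=2)
```

Here the greedy rule first takes the largest set and then the set that adds the most new elements, which for this toy instance happens to cover every element.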
On Sparse Gaussian Chain Graph Models
In this paper, we address the problem of learning the structure of Gaussian chain graph models in a high-dimensional space. Chain graph models are generalizations of undirected and directed graphical models that contain a mixed set of directed and undirected edges. While the problem of sparse structure learning has been studied extensively for Gaussian graphical models and more recently for conditional Gaussian graphical models (CGGMs), there has been little previous work on the structure recovery of Gaussian chain graph models. We consider linear regression models and a re-parameterization of the linear regression models as CGGMs as building blocks of chain graph models. We argue that when the goal is to recover model structures, there are many advantages of using CGGMs as chain component models over linear regression models, including convexity of the optimization problem, computational efficiency, recovery of structured sparsity, and ability to leverage the model structure for semi-supervised learning. We demonstrate our approach on simulated and genomic datasets.

Calvin McCarter, Seyoung Kim
On the Computational Efficiency of Training Neural Networks
It is well-known that neural networks are computationally hard to train. On the other hand, in practice, modern day neural networks are trained efficiently using SGD and a variety of tricks that include different activation functions (e.g. ReLU), over-specification (i.e., train networks which are larger than needed), and regularization. In this paper we revisit the computational complexity of training neural networks from a modern perspective. We provide both positive and negative results, some of which yield new provably efficient and practical algorithms for training neural networks.

Roi Livni, Shai Shalev-Shwartz, Ohad Shamir
On the Information Theoretic Limits of Learning Ising Models
We provide a general framework for computing lower-bounds on the sample complexity of recovering the underlying graphs of Ising models, given i.i.d. samples. While there have been recent results for specific graph classes, these involve fairly extensive technical arguments that are specialized to each specific graph class. In contrast, in our paper, we isolate two key graph-structural ingredients that can then be used to specify sample complexity lower-bounds. Presence of these structural properties makes the graph class hard to learn. We derive corollaries of our main result that not only recover existing recent results, but also provide lower bounds for novel graph classes not considered till now. We also extend our framework to the random graph setting, and derive corollaries for Erdos-Renyi graphs in a certain dense setting.

Rashish Tandon, Karthikeyan Shanmugam, Pradeep Ravikumar, Alexandros Dimakis
On the Number of Linear Regions of Deep Neural Networks
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.

Guido Montufar, Razvan Pascanu, Cho KyungHyun, Yoshua Bengio
On the Statistical Consistency of Plug-in Classifiers for Non-decomposable Performance Measures
We study consistency properties of algorithms for non-decomposable performance measures that cannot be expressed as a sum of losses on individual data points, such as the F-measure used in information retrieval and several other performance measures used in class imbalance settings. While there has been much work on designing algorithms for such performance measures, there has been limited understanding of the theoretical properties of these algorithms. Recently, Ye et al. (2012) showed consistency results for two algorithms that optimize the F-measure, but their results apply only to an idealized setting, where precise knowledge of the underlying probability distribution (in the form of the `true' posterior class probability) is available to a learning algorithm. In this work, we consider plug-in algorithms that learn a classifier by applying an empirically determined threshold to a suitable `estimate' of the class probability obtained from training data, and provide a general methodology to show consistency of these methods for any non-decomposable performance measure that can be expressed as a continuous function of true positive rate (TPR) and true negative rate (TNR), and for which the Bayes optimal classifier is obtained by thresholding the class probability function suitably. We use this template to derive consistency results for plug-in algorithms for the F-measure and for the geometric mean of TPR and precision; to our knowledge, these are the first such results for these performance measures. In addition, for continuous distributions, we show consistency of plug-in algorithms for any performance measure that is a continuous and monotonically increasing function of TPR and TNR. Experimental results confirm our theoretical findings.

Harikrishna Narasimhan, Rohit Vaish, Shivani Agarwal
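
The plug-in approach described in the abstract above, thresholding an estimated class probability at an empirically chosen value, can be sketched as follows. The sketch assumes the class probability estimates are already available as a list, and it picks the threshold by maximizing the empirical F-measure over the observed scores; this selection rule, the function names, and the toy data are illustrative assumptions rather than the authors' exact procedure.

```python
def f_measure(labels, predictions):
    """F1 score for binary labels and predictions in {0, 1}."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def choose_threshold(probabilities, labels):
    """Return the candidate threshold with the highest empirical F-measure."""
    best_threshold, best_score = None, -1.0
    for t in sorted(set(probabilities)):
        predictions = [1 if p >= t else 0 for p in probabilities]
        score = f_measure(labels, predictions)
        if score > best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score

# Toy class probability estimates and labels, invented for illustration.
toy_probabilities = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
threshold, score = choose_threshold(toy_probabilities, toy_labels)
```

Note that the selected threshold need not be 0.5: on this toy data a lower threshold recovers all positives at the cost of one false positive and gives the best F-measure.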
On the relations of LFPs & Neural Spike Trains
One of the goals of neuroscience is to identify neural networks that correlate with important behaviors, environments, or genotypes. This work proposes a strategy for identifying neural networks characterized by time- and frequency-dependent connectivity patterns, using convolutional dictionary learning that links spike-train data to local field potentials (LFPs) across multiple areas of the brain. Analytical contributions are: (i) modeling dynamic relationships between LFPs and spikes; (ii) describing the relationships between spikes and LFPs, by analyzing the ability to predict LFP data from one region based on spiking information from across the brain; and (iii) development of a clustering methodology that allows inference of similarities in neurons from multiple regions. Results are based on data sets in which spike and LFP data are recorded simultaneously from up to 16 brain regions in a mouse.

David Carlson, Jana Schaich Borg, Kafui Dzirasa, Lawrence Carin
Online Decision-Making in General Combinatorial Spaces
In many settings, one must make sequential decisions in some combinatorial space, without knowing in advance the cost of decisions on each trial; the goal is to minimize the total regret over some sequence of trials relative to the best fixed decision in hindsight. Such online combinatorial decision problems have been studied mostly in settings where elements of the decision space are represented by Boolean vectors and costs are linear in this representation. In this paper, we study a general setting where costs are linear in any suitable low-dimensional vector representation of elements of the decision space. We give a general algorithm for such problems that we call low-dimensional online mirror descent (LDOMD) and analyze its regret in various settings; the algorithm generalizes both the Component Hedge algorithm of Koolen et al. (2010) that applies to Boolean representations, and a recent algorithm of Suehiro et al. (2012) that applies to settings where decisions are represented as vertices of a submodular base polytope. Our study emphasizes the role of the convex polytope arising from the vector representation of the decision space; we study several examples of such polytopes, including both 0-1 polytopes arising from Boolean representations and more general polytopes. Finally, we apply our algorithm to an online transportation problem; the associated transportation polytopes generalize the Birkhoff polytope of doubly stochastic matrices, and the resulting algorithm generalizes the PermELearn algorithm of Helmbold and Warmuth (2009) that applies to online permutation learning.

Arun Rajkumar, Shivani Agarwal
Online Optimization for Max-Norm Regularization
The max-norm regularizer has been extensively studied in the last decade as it promotes low-rank estimation of the underlying data. However, max-norm regularized problems are typically formulated and solved in a batch manner, which prevents them from processing big data due to possible memory bottlenecks. In this paper, we propose an online algorithm for solving max-norm regularized problems that is scalable to large problems. In particular, we consider the matrix decomposition problem as an example, although our analysis can also be applied to other problems such as matrix completion. The key technique in our algorithm is to reformulate the max-norm into a matrix factorization form, consisting of a basis component and a coefficient component. In this way, we can solve for the optimal basis and coefficients alternately. We prove that the basis produced by our algorithm converges to a stationary point asymptotically. Experiments demonstrate encouraging results for the effectiveness and robustness of our algorithm.

Jie Shen, Huan Xu, Ping Li
Online Prediction with Bradley-Terry Models and Logistic Models
We consider an online density estimation problem under the Bradley-Terry model, which determines the probability of the order between any pair in a set of $n$ teams. An annoying issue is that the loss function is not convex. A standard way to avoid the non-convexity is to change variables so that the new loss function w.r.t. the new variables is convex. But then the radius of the new domain might be huge or unknown in general, in which case standard algorithms such as OGD and ONS have suboptimal regret bounds. We propose two algorithms with regret $O(\ln T)$. Our first algorithm achieves the best regret so far and can be applied to online logistic regression models as well. As a result, we solve an open problem posed by McMahan and Streeter. Our second algorithm has a weaker regret bound, but it works without knowledge of the radius.

Issei Matsumoto, Kohei Hatano, Eiji Takimoto
Online and Stochastic Gradient methods for Non-decomposable Loss Functions
Modern applications in sensitive domains such as biometrics and medicine frequently require the use of \emph{non-decomposable} loss functions such as precision$@k$, F-measure etc. Compared to point loss functions such as hinge-loss, these offer much more fine grained control over prediction, but at the same time present novel challenges in terms of algorithm design and analysis. In this work we initiate a study of online learning techniques for such non-decomposable loss functions with an aim to enable incremental learning as well as design scalable solvers for batch problems. To this end, we propose an online learning framework for such loss functions. Our model enjoys several nice properties, chief amongst them being the existence of efficient online learning algorithms with sublinear regret and online to batch conversion bounds. Our model is a provable extension of existing online learning models for point loss functions. We instantiate two popular losses, namely precision$@k$ and pAUC in our model and prove sublinear regret bounds for both of them. Our proofs require a novel structural lemma over ranked lists which may be of independent interest. We then develop scalable stochastic gradient descent solvers for non-decomposable loss functions. We show that for loss functions satisfying a certain uniform convergence property (that includes precision$@k$ and pAUC), our methods provably converge to the empirical risk minimizer. We use extensive experimentation on real life and benchmark datasets to establish that our method can be orders of magnitude faster than a recently proposed cutting plane method.

Harikrishna Narasimhan, Prateek Jain, Purushottam Kar
Online combinatorial optimization with stochastic decision sets and adversarial losses
Most work on sequential learning assumes a fixed set of actions that are available all the time to choose from. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments that can be blocked or goods that are out of stock. In this paper we study learning algorithms that are able to deal with stochastic availability of such unreliable composite actions. We propose and analyze algorithms based on the Follow-The-Perturbed-Leader prediction method for several learning settings differing in the feedback provided to the learner. Our algorithms rely on a novel loss estimation technique that we call Counting Awake Times. We deliver regret bounds for our algorithms for the previously studied full information and (semi-)bandit settings, as well as a natural middle point between the two that we call the restricted information setting. A special consequence of our results is a significant improvement of the best known performance guarantees achieved by an efficient algorithm for the sleeping bandit problem with stochastic availability. Finally, we evaluate our algorithms empirically and show their improvement over the known approaches.

Gergely Neu, Michal Valko
Optimal Neural Codes for Control and Estimation
Agents acting in the natural world aim at selecting appropriate actions based on noisy and partial sensory observations. Many behaviors leading to decision making and action selection in a closed loop setting are naturally phrased within a control theoretic framework. Within the framework of optimal control theory, one is usually given a cost function which is minimized by selecting a control law based on the observations. While in standard control settings the sensors are assumed fixed, biological systems often gain from the extra flexibility of optimizing the sensors themselves. However, this sensory adaptation is geared towards control rather than perception, as is often assumed. In this work we show that sensory adaptation for control differs from sensory adaptation for perception, even for simple control setups. This implies, consistently with recent experimental results, that when studying sensory adaptation, it is essential to account for the task being performed.

Alex Susemihl, Ron Meir, Manfred Opper
Optimal prior-dependent neural population coding under shared input noise
The brain uses population codes to form distributed, noise-tolerant representations of sensory and motor variables. Recent work has explored the optimality of such codes in order to understand the principles governing population codes found in the brain. However, the majority of this literature considers either conditionally independent neurons or neurons with noise governed by a stimulus-independent covariance matrix. Here we analyze population coding under a simple alternative model in which latent "input noise" corrupts the stimulus prior to encoding by the population. This provides a convenient and tractable description for "irreducible" uncertainty that cannot be overcome by adding neurons, and induces stimulus-dependent correlations that mimic certain aspects of the correlations observed in real populations. We examine prior-dependent, Bayesian optimal coding in such populations using exact analyses of cases in which the posterior is exactly or approximately Gaussian. These analyses extend previous results on independent Poisson population codes, many of which relied on approximate formulas involving Fisher information, and yield an analytic expression for squared loss and a tight upper bound for mutual information. We show that, for homogeneous populations, optimal tuning curve width depends on the prior and the loss function, and grows with the amount of input noise. Finally, we examine the implications for the second order statistics of multi-neuron responses, for arbitrarily shaped tuning curves, and compare them to the correlations observed in multi-neuron data.

Agnieszka Grabska-Barwinska, Jonathan Pillow
Optimal rates for $k$-NN density and mode estimation
We present two related contributions, each of independent interest: (1) high-probability finite sample rates for $k$-NN density estimation, and (2) practical mode estimators -- based on $k$-NN -- which are provably minimax-optimal under surprisingly general distributional conditions.

Samory Kpotufe, Sanjoy Dasgupta
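
For readers unfamiliar with the estimator named in the abstract above, the following is a minimal sketch of a textbook $k$-NN density estimate in one dimension, together with a simple mode estimate that picks the sample point with the highest estimated density. It is an illustration under stated assumptions, not the authors' procedure: the function names `knn_density` and `knn_mode` are hypothetical, the data are one-dimensional, and no claim is made that this sketch attains the rates discussed in the paper.

```python
def knn_density(x, sample, k):
    """Textbook k-NN density estimate at x for 1-D data: k / (n * 2 * r_k)."""
    if not 1 <= k <= len(sample):
        raise ValueError("k must be between 1 and len(sample)")
    # r_k is the distance from x to its k-th nearest sample point.
    distances = sorted(abs(x - s) for s in sample)
    r_k = distances[k - 1]
    if r_k == 0.0:
        raise ValueError("k-th neighbor coincides with x; estimate is unbounded")
    # In one dimension the ball of radius r_k is an interval of length 2 * r_k.
    return k / (len(sample) * 2.0 * r_k)


def knn_mode(sample, k):
    """Return the sample point with the largest k-NN density estimate.

    When evaluated at a sample point, the point counts as its own nearest
    neighbor, so k should be at least 2 for distinct data.
    """
    return max(sample, key=lambda s: knn_density(s, sample, k))
```

For example, `knn_mode([0.0, 0.9, 1.0, 1.1, 5.0], 3)` selects the point in the middle of the tight cluster, since its third-nearest neighbor is closest.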
Optimistic Planning in Markov Decision Processes Using a Generative Model
We consider the problem of online planning in a Markov decision process with discounted rewards for any given initial state. We consider the PAC sample complexity problem of computing, with probability $1-\delta$, an $\epsilon$-optimal action using the smallest possible number of calls to the generative model (which provides reward and next-state samples). We design an algorithm, called StOP (for Stochastic-Optimistic Planning), based on the "optimism in the face of uncertainty" principle. StOP can be used in the general setting, requires only a generative model, and enjoys a complexity bound that only depends on the local structure of the MDP.

Balázs Szörényi, Gunnar Kedenburg, Remi Munos
Optimization Methods for Sparse Pseudo-Likelihood Graphical Model Selection
Sparse high dimensional graphical model selection is a popular topic in contemporary machine learning. To this end, various useful approaches have been proposed in the context of $\ell_1$ penalized estimation in the Gaussian framework. Though many of these approaches are demonstrably scalable and have leveraged recent advances in convex optimization, they still depend on the Gaussian functional form. To address this gap, a convex pseudo-likelihood based partial correlation graph estimation method (CONCORD) has been recently proposed. This method uses cyclic coordinate-wise minimization of a regression based pseudo-likelihood, and has been shown to have robust model selection properties in comparison with the Gaussian approach. In direct contrast to the parallel work in the Gaussian setting, however, this new convex pseudo-likelihood framework has not leveraged the extensive array of methods that have been proposed in the machine learning literature for convex optimization. In this paper, we address this crucial gap by proposing two proximal gradient methods (CONCORD-ISTA and CONCORD-FISTA) for performing $\ell_1$-regularized inverse covariance matrix estimation in the pseudo-likelihood framework. We present timing comparisons with coordinate-wise minimization and demonstrate that our approach yields tremendous payoffs for $\ell_1$-penalized partial correlation graph estimation outside the Gaussian setting, thus yielding the fastest and most scalable approach for such problems. We undertake a theoretical analysis of our approach and rigorously demonstrate convergence, and also derive rates thereof.

Sang-Yun Oh, Onkar Dalal, Kshitij Khare, Bala Rajaratnam
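
As background for the proximal gradient methods mentioned in the abstract above, here is a minimal sketch of a generic ISTA iteration for an $\ell_1$-regularized objective: a gradient step on the smooth part followed by soft-thresholding. It is not CONCORD-ISTA; the smooth term used in the usage note is a toy quadratic chosen for illustration, and the names `soft_threshold`, `ista`, `grad`, `step`, and `lam` are assumptions of this sketch.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    out = []
    for vi in v:
        if vi > t:
            out.append(vi - t)
        elif vi < -t:
            out.append(vi + t)
        else:
            out.append(0.0)
    return out


def ista(grad, x0, step, lam, iters=100):
    """Generic ISTA: minimize g(x) + lam * ||x||_1 given the gradient of g.

    `grad` maps a list of floats to the gradient of the smooth part g;
    `step` is a fixed step size assumed small enough for convergence.
    """
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # Gradient step on the smooth part, then the l1 proximal step.
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, g)], step * lam)
    return x
```

As a usage example, with the toy smooth term $g(x) = \frac{1}{2}\|x - c\|^2$ (gradient $x - c$), a unit step size, and $c = (3, -0.5, -2)$, the iteration settles at the soft-thresholded vector: `ista(lambda x: [xi - ci for xi, ci in zip(x, [3.0, -0.5, -2.0])], [0.0, 0.0, 0.0], 1.0, 1.0)` gives `[2.0, 0.0, -1.0]`.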
Optimizing F-Measures by Cost-Sensitive Classification
We present a theoretical analysis of F-measures for binary, multiclass and multilabel classification. These performance measures are non-linear, but in many scenarios they are pseudo-linear functions of the per-class false negative/false positive rate. Based on this observation, we present a general reduction of F-measure maximization to cost-sensitive classification with unknown costs. We then propose an algorithm with provable guarantees to obtain an approximately optimal classifier for the F-measure by solving a series of cost-sensitive classification problems. The strength of our analysis is to be valid on any dataset and any class of classifiers, extending the existing theoretical results on F-measures, which are asymptotic in nature. We present numerical experiments to illustrate the relative importance of weighting and thresholding when learning linear classifiers on various F-measure optimization tasks.

Shameem Puthiya Parambath, Nicolas Usunier, Yves Grandvalet
Oracle Sparse PCA and Its Inference
In this paper, we study the estimation of the $k$-dimensional sparse principal subspace of covariance matrix $\Sigma$ in the high-dimensional setting. We aim to recover the oracle principal subspace solution, i.e., the principal subspace estimator obtained assuming the true support is known a priori. To this end, we propose a family of estimators based on the semidefinite relaxation of sparse PCA with novel regularizations. In particular, under a weak assumption on the magnitude of the population projection matrix, one estimator within this family exactly recovers the true support with high probability, has rank exactly $k$, and attains a $\sqrt{s/n}$ statistical rate of convergence with $s$ being the subspace sparsity level and $n$ the sample size. We also derive the asymptotic distribution of this estimator for statistical inference. Compared to existing support recovery results for sparse PCA, our approach does not hinge on the spiked covariance model or the limited correlation condition. As a complement to the first estimator that enjoys the oracle property, we prove that another estimator within the family achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA, even when the previous assumption on the magnitude of the projection matrix is violated. We validate the theoretical results by numerical experiments on synthetic datasets.

Quanquan Gu, Zhaoran Wang, Han Liu
Orbit Regularization
We propose a general framework for regularization based on group majorization. In this framework, a group is defined to act on the parameter space and an orbit is fixed; to control complexity, the model parameters are confined to lie in the convex hull of this orbit (the orbitope). Common regularizers are recovered as particular cases, and a connection is revealed between the recent sorted $\ell_1$-norm and the hyperoctahedral group. We derive the properties a group must satisfy for being amenable to optimization with conditional and projected gradient algorithms. Finally, we suggest a continuation strategy for orbit exploration, presenting simulation results for the symmetric and hyperoctahedral groups.

Renato Negrinho, Andre Martins
PAC-Bayesian AUC classification and scoring
We develop a scoring and classification procedure based on the PAC-Bayesian approach and the AUC (Area Under Curve) criterion. We focus initially on the class of linear score functions. We derive PAC-Bayesian non-asymptotic bounds for two types of prior for the score parameters: a Gaussian prior, and a spike-and-slab prior; the latter makes it possible to perform feature selection. One important advantage of our approach is that it is amenable to powerful Bayesian computational tools. We derive in particular a Sequential Monte Carlo algorithm, as an efficient method which may be used as a gold standard, and an Expectation-Propagation algorithm, as a much faster but approximate method. We also extend our method to a class of non-linear score functions, essentially leading to a nonparametric procedure, by considering a Gaussian process prior.

James Ridgway, Pierre Alquier, Nicolas Chopin, Feng Liang
PEWA: Patch-based Exponentially Weighted Aggregation for image denoising
Patch-based methods have been widely used for noise reduction in recent years. In this paper, we propose a general statistical aggregation method which combines image patches denoised with several commonly-used algorithms. We show that weakly denoised versions of the input image obtained with standard methods can serve to compute an efficient patch-based aggregated estimator. Any collection of denoising methods is allowed. In our approach, we evaluate Stein's Unbiased Risk Estimator (SURE) of each denoised candidate image patch and use this information to compute the exponential weighted aggregation (EWA) estimator. The resulting approach (PEWA) is based on MCMC sampling and has a nice statistical foundation while producing denoising results that are comparable to the current state-of-the-art. We demonstrate the performance of the denoising algorithm on real images and we compare the results to several competitive methods.

Charles Kervrann
Parallel Direction Method of Multipliers
We consider the problem of minimizing block-separable convex functions subject to linear constraints. While the Alternating Direction Method of Multipliers (ADMM) for two-block linear constraints has been intensively studied both theoretically and empirically, ADMM for multiple blocks is still largely open. In this paper, we propose a randomized block coordinate method named Parallel Direction Method of Multipliers (PDMM) to solve the optimization problems with multi-block linear constraints. PDMM randomly updates some blocks in parallel, behaving like parallel randomized block coordinate descent. We establish the global convergence and the iteration complexity for PDMM with constant step size. We also show that PDMM can do randomized block coordinate descent on overlapping blocks, which is still an open problem in randomized block coordinate descent. Experimental results show that PDMM performs better than state-of-the-art methods in two applications, robust principal component analysis and overlapping group lasso.

Huahua Wang, Arindam Banerjee
Parallel Double Greedy Submodular Maximization
Many machine learning problems can be reduced to the maximization of submodular functions. Although well understood in the serial setting, the parallel maximization of submodular functions remains an open area of research with recent results only addressing monotone functions. The optimal algorithm for maximizing the more general class of non-monotone submodular functions was introduced by Buchbinder et al. and follows a strongly serial double-greedy logic and program analysis. In this work, we propose two methods to parallelize the double-greedy algorithm. The first, coordination-free approach emphasizes speed at the cost of a weaker approximation guarantee. The second, concurrency control approach guarantees a tight 1/2-approximation, at the quantifiable cost of additional coordination and reduced parallelism. As a consequence we explore the tradeoff space between guaranteed performance and objective optimality. We implement and evaluate both algorithms on multi-core hardware and billion-edge graphs, demonstrating both the scalability and tradeoffs of each approach.

Xinghao Pan, Stefanie Jegelka, Joseph Gonzalez, Joseph Bradley, Michael Jordan
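
To make the serial double-greedy logic referenced in the abstract above concrete, here is a minimal sketch of the deterministic serial variant attributed to Buchbinder et al. It is not the parallel algorithm of this paper. The deterministic variant sketched here carries a 1/3 guarantee, while the 1/2 guarantee belongs to the randomized variant. The helper `cut_value` and the choice of an unweighted graph cut as the non-monotone submodular function are assumptions made for illustration.

```python
def cut_value(edges, subset):
    """Number of edges with exactly one endpoint in `subset` (a graph cut)."""
    return sum(1 for u, v in edges if (u in subset) != (v in subset))


def double_greedy(f, ground_set):
    """Deterministic serial double greedy for unconstrained maximization of f.

    Maintains a growing set x (starting empty) and a shrinking set y
    (starting as the full ground set). Each element is either added to x
    or removed from y, whichever has the larger marginal gain.
    """
    x = set()
    y = set(ground_set)
    for u in ground_set:
        gain_add = f(x | {u}) - f(x)
        gain_remove = f(y - {u}) - f(y)
        if gain_add >= gain_remove:
            x.add(u)
        else:
            y.remove(u)
    # After every element has been processed, x and y coincide.
    return x
```

On the path graph with edges `[(0, 1), (1, 2)]`, calling `double_greedy(lambda s: cut_value([(0, 1), (1, 2)], s), [0, 1, 2])` keeps the two endpoints and drops the middle vertex, which cuts both edges. The strictly sequential dependence of each decision on the current `x` and `y` is what makes the procedure hard to parallelize.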
Parallel Feature Selection Inspired by Group Testing
This paper presents a parallel feature selection method for classification that scales up to very high dimensions and large data sizes. Our original method is inspired by group testing theory, under which the feature selection procedure consists of a collection of randomized tests to be performed in parallel. Each test corresponds to a subset of features, for which a scoring function may be applied to measure the relevance of the features in a classification task. We develop a general theory providing sufficient conditions under which true features are guaranteed to be correctly identified. Superior performance of our method is demonstrated on a challenging relation extraction task from a very large data set that has both redundant features and a sample size on the order of millions. We present comprehensive comparisons with state-of-the-art feature selection methods on a range of data sets, for which our method exhibits competitive performance in terms of running time and accuracy. Moreover, it also yields substantial speedup when used as a pre-processing step for most other existing methods.

Yingbo Zhou, Ce Zhang, Utkarsh Porwal, Hung Ngo, XuanLong Nguyen, Christopher Re, Venu Govindaraju
Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization
Consider the problem of minimizing the sum of a smooth (possibly non-convex) and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solve this problem is the block coordinate descent (BCD) method whereby at each iteration only one variable block is updated while the remaining variables are held fixed. With recent advances in multi-core parallel processing technology, it is desirable to parallelize the BCD method by allowing multiple blocks to be updated simultaneously at each iteration of the algorithm. In this work, we propose an inexact parallel BCD approach where at each iteration, a subset of the variables is updated in parallel by minimizing convex approximations of the original objective function. We investigate the convergence of this parallel BCD method for both randomized and cyclic variable selection rules. We analyze the asymptotic and non-asymptotic convergence behavior of the algorithm for both convex and non-convex objective functions. The numerical experiments suggest that for a special case of the Lasso minimization problem, the cyclic block selection rule can outperform the randomized rule.

Meisam Razaviyayn, Mingyi Hong, Zhi-Quan Luo, Jong-Shi Pang
Partition-wise Linear Models
Region-specific linear models have been widely used in many practical applications due to their non-linear but highly interpretable model representations. One of the key challenges is the non-convexity of simultaneous optimization of regions and region-specific models. This paper proposes novel convex region-specific linear models, which we refer to as partition-wise linear models. Our key ideas are 1) assigning linear models not to regions but to partitions (region-specifiers) and representing region-specific linear models by linear combinations of partition-specific models, and 2) optimizing regions via partition selection from a large number of given partition candidates by using convex sparsity-inducing structured regularizations. In addition to the initialization-free globally-optimal solution, our convex formulation makes it possible to derive a generalization bound and to use advanced optimization techniques such as proximal methods and decomposition of the proximal map. Experimental results demonstrate that our models perform better than, or at least competitively with, state-of-the-art region-specific or locally linear models.

Hidekazu Oiwa, Ryohei Fujimaki
Permutation Diffusion Maps (PDM) with Application to the Image Association Problem in Computer Vision
Consistently matching keypoints across images, and the related problem of finding clusters of nearby images, are critical components of various tasks in Computer Vision, including Structure from Motion (SfM). Unfortunately, occlusion and large repetitive structures tend to mislead most currently used matching algorithms, leading to characteristic pathologies in the final output. The same problem, albeit in slightly different form, also occurs in statistical genomics, where partial fragments of the genome need to be globally assembled using pairwise information. In this paper we introduce a new method, Permutation Diffusion Maps (PDM), to solve the matching problem, as well as a related new affinity measure, derived using ideas from harmonic analysis on the symmetric group. We show that just by using it as a preprocessing step to existing SfM pipelines, PDM can greatly improve reconstruction quality on difficult datasets.

Deepti Pachauri, Risi Kondor, Gautam Sargur, Vikas Singh
Positive Curvature and Hamiltonian Monte Carlo
The Jacobi metric introduced in mathematical physics can be used to analyze Hamiltonian Monte Carlo (HMC). In a geometrical setting, each step of HMC corresponds to a geodesic on a Riemannian manifold with a Jacobi metric. Our calculation of the sectional curvature of this HMC manifold allows us to see that it is positive in cases such as sampling from a high dimensional multivariate Gaussian with small perturbation. We show that positive curvature can be used to prove theoretical concentration results for HMC Markov chains.

Christof Seiler, Simon Rubinstein-Salzedo, Susan Holmes
Pre-training of Recurrent Neural Networks via Linear Autoencoders
We propose a pre-training technique for recurrent neural networks based on linear autoencoder networks for sequences, i.e. linear dynamical systems modelling the target sequences. We start by giving a closed form solution for the definition of the optimal weights of a linear autoencoder given a training set of sequences. This solution, however, is computationally very demanding, so we suggest a procedure to get an approximate solution for a given number of hidden units. The weights obtained for the linear autoencoder are then used as initial weights for the input-to-hidden connections of a recurrent neural network, which is then trained on the desired task. Using four well known datasets of sequences of polyphonic music, we show that the proposed pre-training approach is highly effective, since it substantially improves on the state-of-the-art results on all the considered datasets.

Luca Pasa, Alessandro Sperduti
Predicting Useful Neighborhoods for Lazy Local Learning
Lazy local learning methods train a classifier "on the fly" at test time, using only a subset of the training instances that are most relevant to the novel test example. The goal is to tailor the classifier to the properties of the data surrounding the test example. Existing methods assume that the instances most useful for building the local model are strictly those closest to the test example. However, this fails to account for the fact that the success of the resulting classifier depends on the full \emph{distribution} of selected training instances. Rather than simply gather the test example's nearest neighbors, we propose to predict the subset of training data that is jointly relevant to training its local model. We develop an approach to discover patterns between queries and their "good" neighborhoods using large-scale multi-label classification with compressed sensing. Given a novel test point, we estimate both the composition and size of the training subset likely to yield an accurate local model. We demonstrate the approach on image classification tasks on SUN and aPascal and show it outperforms traditional global and local approaches.

Aron Yu, Kristen Grauman
Primitives for Dynamic Big Model Parallelism
When training large machine learning models with many variables or parameters, a single machine is often inadequate since the model may be too large to fit in memory, while training can take a long time even with stochastic updates. A natural recourse is to turn to distributed cluster computing, in order to harness additional memory and processors. However, naive, unstructured parallelization of ML algorithms can make inefficient use of distributed memory, while failing to obtain proportional convergence speedups --- or can even result in divergence. We develop a framework of primitives for dynamic model-parallelism, STRADS, in order to explore partitioning and update scheduling of model variables in distributed ML algorithms --- thus improving their memory efficiency while presenting new opportunities to speed up convergence without compromising inference correctness. We demonstrate the efficacy of model-parallel algorithms implemented in STRADS versus popular implementations for Topic Modeling, Matrix Factorization and Lasso.

Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth Gibson, Eric Xing
Probabilistic Differential Dynamic Programming
We present a data-driven, probabilistic trajectory optimization framework for general unknown stochastic nonlinear systems, called Probabilistic Differential Dynamic Programming (PDDP). PDDP takes into account model uncertainty for dynamics and policy by using Gaussian processes (GP). Rooted in the Dynamic Programming (DP) principle and second-order local approximations of the value function, PDDP features a “safe exploration” scheme to avoid highly inaccurate model approximations. In contrast to gradient-based policy search methods, PDDP learns a linear, time-varying feedback policy locally. We demonstrate the effectiveness and efficiency of the proposed algorithm on two nontrivial tasks. Compared to a state-of-the-art policy search method and a variation of the proposed framework, PDDP offers a superior combination of data-efficiency, policy learning speed, and scalability.

Yunpeng Pan, Evangelos Theodorou
Probabilistic low-rank matrix completion on finite alphabets
The task of reconstructing a matrix given a sample of observed entries is known as the \emph{matrix completion problem}. Such a consideration arises in a wide variety of problems, including recommender systems, collaborative filtering, dimensionality reduction, image processing, quantum physics or multi-class classification, to name a few. Most works have focused on recovering an unknown real-valued low-rank matrix from a random sub-sample of its entries. Here, we investigate the case where the observations take a finite number of values, corresponding for example to ratings in recommender systems or labels in multi-class classification. We also consider a general sampling scheme (not necessarily uniform) over the matrix entries. The performance of a nuclear-norm penalized estimator is analyzed theoretically. More precisely, we derive bounds for the Kullback-Leibler divergence between the true and estimated distributions. For practical use, we also propose an efficient algorithm based on lifted coordinate gradient descent in order to tackle potentially high dimensional settings.

Jean Lafond, Olga Klopp, Eric Moulines, Joseph Salmon
Projecting Markov Random Field Parameters for Fast Mixing
Markov chain Monte Carlo (MCMC) algorithms are simple and extremely powerful techniques to sample from almost arbitrary distributions. The flaw in practice is that it can take a large and/or unknown amount of time to converge to the stationary distribution. This paper gives sufficient conditions to guarantee that univariate Gibbs sampling on Markov Random Fields (MRFs) will be fast mixing, in a precise sense. Further, an algorithm is given to project onto this set of fast-mixing parameters in the Euclidean norm. Following recent work, we give an example use of this to project in various divergence measures, comparing univariate marginals obtained by sampling after projection to common variational methods and Gibbs sampling on the original parameters.

Xianghang Liu, Justin Domke
Projective dictionary pair learning for pattern classification
Discriminative dictionary learning (DL) has been widely studied in various pattern classification problems. Most of the existing DL methods aim to learn a synthesis dictionary to represent the input signal while enforcing the representation coefficients and/or representation residual to be discriminative. However, the $\ell_0$ or $\ell_1$-norm sparsity constraint on the representation coefficients adopted in many DL methods makes the training and testing phases time consuming. We propose a new discriminative DL framework, namely projective dictionary pair learning (DPL), which learns a synthesis dictionary and an analysis dictionary jointly to achieve the goal of signal representation and discrimination. Compared with conventional DL methods, the proposed DPL method can not only greatly reduce the time complexity in the training and testing phases, but also lead to very competitive accuracies in a variety of visual classification tasks.

Shuhang Gu, Lei Zhang, Wangmeng Zuo, Xiangchu Feng
Provable Tensor Factorization with Missing Data
We study the problem of low-rank tensor factorization in the presence of missing data. We ask the following question: how many sampled entries do we need, to efficiently and exactly reconstruct a tensor with a low-rank orthogonal decomposition? We propose a novel alternating minimization based method which iteratively refines estimates of the singular vectors. We show that under certain standard assumptions, our method can recover a three-mode $n\times n\times n$ dimensional rank-$r$ tensor exactly from $O(n^{3/2} r^5 \log^4 n)$ randomly sampled entries. In the process of proving this result, we solve two challenging sub-problems for tensors with missing data. First, in the process of analyzing the initialization step, we prove a generalization of a celebrated result by Szemerédi et al. on the spectrum of random graphs. Next, we prove global convergence of alternating minimization with a good initialization. Simulations suggest that the dependence of the sample size on dimensionality $n$ is indeed tight.

Prateek Jain, Sewoong Oh
Proximal Quasi-Newton for Computationally Intensive $\ell_1$-regularized $M$-estimators
We consider the class of optimization problems arising from computationally intensive $\ell_1$-regularized $M$-estimators, where the function or gradient values are very expensive to compute. A particular instance of interest is the $\ell_1$-regularized MLE for learning Conditional Random Fields (CRFs), which are a popular class of statistical models for varied structured prediction problems such as sequence labelling, alignment, and classification with label taxonomy. $\ell_1$-regularized MLEs for CRFs are particularly expensive to optimize since computing the gradient values requires an expensive inference step. In this work, we propose the use of a carefully constructed proximal quasi-Newton algorithm for such computationally intensive $M$-estimation problems, where we employ an aggressive active set selection technique. In a key contribution of the paper, we show that our proximal quasi-Newton algorithm is provably \emph{super-linearly convergent}, even in the absence of strong convexity, by leveraging a restricted variant of strong convexity. In our experiments, the proposed algorithm converges considerably faster than the current state of the art on the problems of sequence labeling and hierarchical classification.

Kai Zhong, En-Hsu Yen, Inderjit Dhillon, Pradeep Ravikumar
QUIC & DIRTY: A Quadratic Approximation Approach for Dirty Statistical Models
In this paper, we develop a family of algorithms for optimizing “superposition-structured” or “dirty” statistical estimators for high-dimensional problems involving the minimization of the sum of a smooth loss function with a hybrid regularization. Most of the current approaches are first-order methods, including proximal gradient or Alternating Direction Method of Multipliers (ADMM). We propose a new family of second-order methods where we approximate the loss function using a quadratic approximation. The superposition-structured regularizer then leads to a subproblem that can be efficiently solved by alternating minimization. We propose a general active subspace selection approach to speed up the solver by utilizing the low-dimensional structure given by the regularizers, and provide convergence guarantees for our algorithm. Empirically, we show that our approach is more than 10 times faster than state-of-the-art first-order approaches for the latent variable graphical model selection problems and multi-task learning problems when there is more than one regularizer. For these problems, our approach appears to be the first algorithm that can extend active subspace ideas to multiple regularizers.

Cho-Jui Hsieh, Peder Olsen, Inderjit Dhillon, Pradeep Ravikumar, Stephen Becker
Quantized Kernel Learning for Feature Matching
Matching local visual features is a crucial problem in computer vision, and its accuracy greatly depends on the choice of similarity measure. As it is generally very difficult to design by hand a similarity or a kernel perfectly adapted to the data of interest, learning it automatically with as few assumptions as possible is preferable. However, available techniques for kernel learning suffer from several limitations, such as restrictive parameterization or scalability. In this paper, we introduce a simple and flexible family of non-linear kernels which we refer to as Quantized Kernels (QK) and present how to learn them efficiently. In essence, QKs are arbitrary kernels in the index space of a data quantizer, leading to piecewise constant similarities in the original feature space. Quantization allows us to compress features and keep the learning tractable. As a result, we obtain state-of-the-art matching performance on a standard benchmark dataset with just a few bits to represent each feature dimension. Our method also provides the explicit low-dimensional feature mapping that grants access to Euclidean geometry.

Danfeng Qin, Xuanli Chen, Matthieu Guillaumin, Luc Van Gool
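To make the phrase "arbitrary kernels in the index space of a data quantizer" in the Quantized Kernels abstract concrete, here is a minimal one-dimensional sketch. The bin edges `EDGES` and the index-space kernel matrix `K` are hypothetical values chosen for illustration; in the paper both are learned from data, and the learning procedure is not shown here.

```python
from bisect import bisect_right

# Hypothetical quantizer: three edges split the real line into 4 cells (2 bits).
EDGES = [0.25, 0.5, 0.75]

# Hypothetical kernel on cell indices: K[i][j] = 0.5 ** |i - j|.
# This matrix is symmetric positive semidefinite, so it is a valid kernel.
K = [[0.5 ** abs(i - j) for j in range(4)] for i in range(4)]


def quantize(x):
    """Map a scalar feature to the index of the cell it falls in."""
    return bisect_right(EDGES, x)


def quantized_kernel(x, y):
    """Similarity depends only on the cell indices of x and y."""
    return K[quantize(x)][quantize(y)]


if __name__ == "__main__":
    # Both pairs fall in cells (0, 2), so the similarity is identical:
    print(quantized_kernel(0.10, 0.60), quantized_kernel(0.20, 0.70))
```

Because the similarity is a table lookup on cell indices, it is constant on each pair of cells, which is the "piecewise constant" behaviour the abstract refers to.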
Quantized Nonparametric Estimation
A central result in statistical theory is Pinsker's theorem, which characterizes the minimax rate in the normal means model of nonparametric estimation. In this paper, we present an extension to Pinsker's theorem where estimation is carried out under storage or communication constraints. In particular, we place limits on the number of bits used to encode an estimator, and analyze the excess risk in terms of this constraint, the signal size, and the noise level. We give sharp upper and lower bounds for the case of a Euclidean ball, which establishes the Pareto-optimal minimax tradeoff between storage and risk in this setting.

Yuancheng Zhu, John Lafferty
Ranking via Robust Binary Classification
We propose RoBiRank, a ranking algorithm that is motivated by observing a close connection between evaluation metrics for learning to rank and loss functions for robust classification. The algorithm shows a very competitive performance on standard benchmark datasets against other representative algorithms in the literature. Further, in large scale problems where explicit feature vectors and scores are not given, our algorithm can be efficiently parallelized across a large number of machines; for a task that requires 386,133 x 49,824,519 pairwise interactions between items to be ranked, our algorithm finds solutions that are of dramatically higher quality than those found by a state-of-the-art competitor algorithm, given the same amount of wall-clock time for computation.

Hyokun Yun, Parameswaran Raman, S. Vishwanathan
Real-Time Decoding of an Integrate and Fire Encoder
Neuronal encoding models range from the detailed biophysically-based Hodgkin-Huxley model, to the statistical linear time-invariant model specifying firing rates in terms of the extrinsic signal. Decoding the former becomes intractable, while the latter does not adequately capture the nonlinearities present in the neuronal encoding system. For use in practical applications, we wish to record the output of neurons, namely spikes, and decode this signal fast in order to drive a machine, for example a prosthetic device. Here, we introduce a causal, real-time decoder of the biophysically-based Integrate and Fire encoding neuron model. We show that the upper bound of the real-time reconstruction error decreases polynomially in time, and that the L2 norm of the error is bounded by a constant that depends on the density of the spikes, as well as the bandwidth and the decay of the input signal. We numerically validate the effect of these parameters on the reconstruction error.

Shreya Saxena, Munther Dahleh
Recursive Context Propagation Network for Semantic Scene Labeling
We propose a deep feed-forward neural network architecture for pixel-wise semantic scene labeling. It uses a novel recursive-neural network architecture for global context propagation, referred to as rCPN. It first maps the local features into a semantic space followed by a bottom-up aggregation of local information into a global contextual feature of the entire image. Then a top-down propagation of the aggregated global contextual information takes place that enhances the contextual information of each local feature. Therefore, the information from every location in the image is propagated to every other location. Experimental results on Stanford background and SIFT Flow datasets show that the proposed method outperforms previous approaches in terms of accuracy. It is also orders of magnitude faster than previous methods and takes only 0.07 seconds on a GPU for pixel-wise labeling of a $256 \times 256$ image starting from raw RGB pixel values, given the super-pixel mask that takes an additional 0.3 seconds using an off-the-shelf implementation.

Abhishek Sharma, Oncel Tuzel, Ming-Yu Liu
Recursive Inversion Models for Permutations
We develop a new exponential family probabilistic model for permutations that can capture hierarchical structure, and that has the well known Mallows and generalized Mallows models as subclasses. We describe how one can do parameter estimation and propose an approach to structure search for this class of models. We provide experimental evidence that this added flexibility both improves predictive performance and enables a deeper understanding of collections of permutations.

Chris Meek, Marina Meila
Repeated Contextual Auctions with Strategic Buyers
Motivated by the setting of real-time advertising exchanges, we analyze the problem of pricing inventory in a repeated posted-price auction. We consider both the cases of a truthful and strategic buyer, where the former makes decisions myopically and the latter may act in a fashion to maximize long-term surplus. Unlike in previous work, we assume a buyer's valuation of a good is a function of a context vector that describes the good being sold. This is a crucial aspect of many real-world ad exchanges that allows buyers to target their bids and is a scenario that previous works have not addressed in the presence of strategic buyers. We seek to minimize the strategic regret of the algorithm, that is, the revenue lost compared to the revenue that would be made if the valuation function were known in advance. We present an algorithm that we show is able to achieve no-strategic-regret, i.e. a vanishing per-round strategic regret as the number of rounds increases, in the presence of both truthful and strategic buyers who discount future surplus.

Kareem Amin, Afshin Rostamizadeh, Umar Syed
Reputation-based User Filtering in Crowd-sourcing Systems
In this paper, we study the vote aggregation problem in crowd-sourcing systems to infer the true labels of objects in the face of incorrect votes cast by users. Unlike most prior work which has examined this problem under the random voting paradigm, we consider a much broader class of {\em adversarial} users with no specific assumptions on their voting pattern. Our key contribution is the design of a computationally efficient reputation algorithm to identify and filter out these adversarial users in crowd-sourcing systems. Our algorithm uses the concept of optimal semi-matchings in conjunction with user penalties based on label disagreements to identify a reputation score for every user. We provide strong theoretical guarantees across a broad spectrum of {\em adversarial} voting strategies including non-random errors, spammers and the extreme case of sophisticated liars where we analyze the worst-case behavior of our algorithm. Finally, we show that our reputation algorithm can significantly improve the accuracy of existing vote aggregation algorithms in real-world crowd-sourcing datasets.

Ashwin Venkataraman, Srikanth Jagabathula, Lakshminarayanan Subramanian
Restricted Boltzmann machines modeling human choice
We extend the multinomial logit model to represent some of the empirical phenomena that are frequently observed in the choice made by humans. These phenomena include the similarity effect, the attraction effect, and the compromise effect. We formally quantify the strength of these phenomena that can be represented by our choice model, which illuminates the flexibility of our choice model. We then show that our choice model can be represented as a restricted Boltzmann machine and that its parameters can be learnt effectively from data. Our numerical experiments with real data of human choice suggest that we can train our choice model in such a way that it represents the typical phenomena of choice.

Takayuki Osogami, Makoto Otsuka
Robust Bayesian Max-Margin Clustering
We present max-margin Bayesian clustering (BMC), a general and robust framework that incorporates the max-margin criterion into Bayesian clustering models, as well as two concrete models of BMC to demonstrate its flexibility and effectiveness in dealing with different clustering tasks. The Dirichlet process max-margin Gaussian mixture is a nonparametric Bayesian clustering model that relaxes the underlying Gaussian assumption of Dirichlet process Gaussian mixtures by incorporating max-margin posterior constraints, and is able to infer the number of clusters from data. We further extend the ideas to present a max-margin clustering topic model, which can learn the latent topic representation of each document while at the same time clustering documents in the max-margin fashion. Extensive experiments are performed on a number of real datasets, and the results indicate superior clustering performance of our methods compared to related baselines.

Changyou Chen, Jun Zhu, Xinhua Zhang
Robust Kernel Density Estimation by Scaling and Projection in Hilbert Space
While robust parameter estimation has been well studied in parametric density estimation, there has been little investigation into robust density estimation in the nonparametric setting. We present a robust version of the popular kernel density estimator (KDE). As with other estimators, a robust version of the KDE is useful since sample contamination is a common issue with datasets. What ``robustness'' means for a nonparametric density estimate is not straightforward and is a topic we explore in this paper. To construct a robust KDE we scale the traditional KDE and project it to its nearest weighted KDE in the $L^2$ norm, yielding the scaled and projected KDE (SPKDE). Because the squared $L^2$ norm penalizes point-wise errors superlinearly, this causes the weighted KDE to allocate more weight to high density regions. We demonstrate the robustness of the SPKDE with numerical experiments and a consistency result which shows that asymptotically the SPKDE recovers the uncontaminated density under sufficient conditions on the contamination.

Robert Vandermeulen, Clayton Scott
Robust Logistic Regression and Classification
We consider logistic regression with arbitrary outliers in the covariate matrix. We propose a new robust logistic regression algorithm, called RoLR, that estimates the parameter through a simple linear programming procedure. We prove that RoLR is robust to a constant fraction of adversarial outliers. To the best of our knowledge, this is the first result with performance guarantees on estimating a logistic regression model when the covariate matrix is corrupted. Besides regression, we apply RoLR to solving binary classification problems where a fraction of training samples are corrupted.

Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan
Robust Tensor Decomposition with Gross Corruption
In this paper, we study the statistical performance of convex tensor decomposition with gross corruption. The observations are noisy realizations of the superposition of a low-rank tensor $\mathcal{W}^*$ and an entrywise sparse corruption tensor $\mathcal{E}^*$. Unlike conventional noise with bounded variance in previous convex tensor decomposition analysis, the magnitude of the gross corruption can be arbitrarily large. We show that under certain conditions, the true low-rank tensor as well as the sparse corruption tensor can be recovered simultaneously. Our theory yields nonasymptotic Frobenius error bounds for each tensor. We show through numerical experiments that our theory can precisely predict the scaling behavior in practice.

Quanquan Gu, Huan Gui, Jiawei Han
Rounding-based Moves for Metric Labeling
Metric labeling is a special case of energy minimization for pairwise Markov random fields. The energy function consists of arbitrary unary potentials, and pairwise potentials that are proportional to a given metric distance function over the label set. Popular methods for solving metric labeling include (i) move-making algorithms, which iteratively solve a minimum st-cut problem; and (ii) the linear programming (LP) relaxation based approach. In order to convert the fractional solution of the LP relaxation to an integer solution, several randomized rounding procedures have been developed in the literature. We consider a large class of parallel rounding procedures, and design move-making algorithms that closely mimic them. We prove that the multiplicative bound of a move-making algorithm exactly matches the approximation factor of the corresponding rounding procedure for an arbitrary distance function. Our analysis includes all known results for move-making algorithms as special cases.

M. Pawan Kumar
SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives
In this work we introduce a new fast incremental gradient method SAGA, in the spirit of SAG, SDCA, MISO and SVRG. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.

Aaron Defazio, Francis Bach, Simon Lacoste-Julien
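The SAGA abstract mentions a stored-gradient incremental method with a proximal step on the regulariser but does not spell out the update. The sketch below uses the standard SAGA step on a toy one-dimensional lasso problem, minimising $\frac{1}{n}\sum_i \frac{1}{2}(a_i x - b_i)^2 + \lambda |x|$. The data, the function names, and the step size $1/(3L)$ are assumptions made for illustration, not values taken from the paper's experiments.

```python
import random


def soft_threshold(w, t):
    """Proximal operator of t * |x| (the l1 regulariser)."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def saga_lasso_1d(a, b, lam, steps=5000, seed=0):
    """SAGA on (1/n) * sum_i 0.5 * (a_i * x - b_i)**2 + lam * |x|."""
    n = len(a)
    lipschitz = max(ai * ai for ai in a)   # smoothness constant of each term
    gamma = 1.0 / (3.0 * lipschitz)        # assumed step size
    rng = random.Random(seed)
    x = 0.0
    # Table of the most recently evaluated gradient of each term.
    table = [ai * (ai * x - bi) for ai, bi in zip(a, b)]
    avg = sum(table) / n
    for _ in range(steps):
        j = rng.randrange(n)
        new_grad = a[j] * (a[j] * x - b[j])
        # Gradient estimate: fresh gradient minus stored one, plus table mean.
        w = x - gamma * (new_grad - table[j] + avg)
        x = soft_threshold(w, gamma * lam)  # proximal step on the regulariser
        avg += (new_grad - table[j]) / n    # keep the running mean in sync
        table[j] = new_grad
    return x


if __name__ == "__main__":
    a = [1.0, 2.0, 1.5, 1.0, 2.0]
    b = [2.0, 3.0, 1.0, 0.5, 4.0]
    print(saga_lasso_1d(a, b, lam=0.5))
```

For this toy problem the minimiser has a closed form, so the output can be checked against $(\frac{1}{n}\sum_i a_i b_i - \lambda) / (\frac{1}{n}\sum_i a_i^2)$ whenever that value is positive.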
Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature
We propose a novel sampling framework for inference in probabilistic models: an active learning approach that converges more quickly (in wall-clock time) than Markov chain Monte Carlo (MCMC) benchmarks. The central challenge in probabilistic inference is numerical integration, to average over ensembles of models or unknown (hyper-)parameters (for example to compute marginal likelihood or a partition function). MCMC has provided approaches to numerical integration that deliver state-of-the-art inference, but can suffer from sample inefficiency and poor convergence diagnostics. Bayesian quadrature techniques offer a model-based solution to such problems, but their uptake has been hindered by prohibitive computation costs. We introduce a warped model for probabilistic integrands (likelihoods) that are known to be non-negative, permitting a cheap active learning scheme to optimally select sample locations. Our algorithm is demonstrated to offer faster convergence (in seconds) relative to simple Monte Carlo and annealed importance sampling on both synthetic and real-world examples.

Tom Gunter, Mike Osborne, Roman Garnett, Philipp Hennig, Stephen Roberts
Scalable Kernel Methods via Doubly Stochastic Gradients
The general perception is that kernel methods are not scalable, and neural nets are the methods of choice for nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called ``doubly stochastic functional gradients''. Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve the problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. We show that a function produced by this procedure after $t$ iterations converges to the optimal function in the reproducing kernel Hilbert space at rate $O(1/t)$, and achieves a generalization performance of $O(1/\sqrt{t})$. This double stochasticity also allows us to avoid keeping the support vectors and to implement the algorithm in a small memory footprint, which is linear in the number of iterations and independent of data dimension. Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show that our method can achieve competitive performance to neural nets in datasets such as 8 million handwritten digits from MNIST, 2.3 million energy materials from MolecularSpace, and 1 million photos from ImageNet.

Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, Le Song
Scalable Non-linear Learning with Adaptive Polynomial Expansions
Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff ability compares very favorably against strong baselines.

Alekh Agarwal, Alina Beygelzimer, Daniel Hsu, John Langford, Matus Telgarsky
Scale Adaptive Blind Deblurring
The presence of noise and small scale structures usually leads to large kernel estimation errors in blind image deblurring empirically, if not a total failure. We present a scale space perspective on blind deblurring algorithms, and introduce a cascaded scale space formulation for blind deblurring. This new formulation suggests a natural approach robust to noise and small scale structures through tying the estimation across multiple scales and balancing the contributions of different scales automatically by learning from data. The proposed formulation also allows handling non-uniform blur with a straightforward extension. Experiments are conducted on both a benchmark dataset and real-world images to validate the effectiveness of the proposed method. One surprising finding based on our approach is that blur kernel estimation is not necessarily best at the finest scale.

Haichao Zhang, Jianchao Yang
Scaling-up Importance Sampling for Markov Logic Networks
Markov Logic Networks (MLNs) are weighted first-order logic templates for generating large (ground) Markov networks. Lifted inference algorithms for them bring the power of logical inference to probabilistic inference. These algorithms operate as much as possible at the compact first-order level, grounding or propositionalizing the MLN only when necessary. As a result, lifted inference algorithms can be much more scalable than propositional algorithms that operate directly on the much larger ground network. Unfortunately, existing lifted inference algorithms suffer from two interrelated problems, which severely affect their scalability in practice. First, for most real-world MLNs having complex structure, they are unable to exploit symmetries and end up grounding most atoms in the MLN (the grounding problem). Second, they suffer from the {\em evidence problem}, which arises because evidence breaks symmetries, severely diminishing the power of lifted inference. Here, we address both problems by presenting a scalable, lifted importance sampling-based approach that never grounds the full MLN. Specifically, we show how to scale up the two main steps in importance sampling: sampling and weight computation. Scalable sampling is achieved by using an informed, easy-to-sample proposal distribution derived from a compressed MLN-representation. Fast weight computation is achieved by only visiting a small subset of sampled groundings of each formula instead of all of its possible groundings. We show that our new algorithm yields an asymptotically unbiased estimate. Our experiments on several MLNs clearly demonstrate the promise of our approach.

Deepak Venugopal, Vibhav Gogate
Self-Adaptable Patterns for Feature Coding
In object recognition, feed-forward networks are hierarchical, and at each level, features are extracted and encoded, followed by a pooling step. Within this processing pipeline, the common trend is to learn the feature coding patterns, often referred to as codebook entries, filters, or over-complete basis. Recently, an approach that does not use these patterns has been shown to obtain very promising results. This is the tensor representation (TR). In this paper, we analyze TR as a coding-pooling scheme, and we find that TR automatically adapts the feature coding patterns to the input features. From this finding, we are able to bring common concepts of coding-pooling schemes to TR, such as feature quantization. This allows us to obtain significant accuracy improvements of TR in standard benchmarks of image classification, namely Caltech101 and VOC07.

Xavier Boix, Gemma Roig, Luc Van Gool
Self-Paced Learning with Diversity
Self-paced learning (SPL) is a recently proposed learning regime inspired by the learning process of humans and animals that gradually incorporates easy to more complex samples into training. Existing methods are limited in that they ignore an important aspect in learning: diversity. To incorporate this information, we propose an approach called self-paced learning with diversity (SPLD) which formalizes the preference for both easy and diverse samples into a general regularizer. This regularization term is independent of the learning objective, and thus can be easily generalized into various learning tasks. Albeit non-convex, the optimization of the variables included in this SPLD regularization term for sample selection can be globally solved in linearithmic time. We demonstrate that our method significantly outperforms the conventional SPL on three real-world datasets. Specifically, SPLD achieves the best MAP so far reported in the literature on the Hollywood2 and Olympic Sports datasets.

Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, Alexander Hauptmann
Semi-Separable Hamiltonian Monte Carlo for Inference in Bayesian Hierarchical Models
Sampling from hierarchical Bayesian models is often difficult for MCMC methods, because of the strong correlations between the model parameters and the hyperparameters. Recent Riemannian manifold Hamiltonian Monte Carlo (RMHMC) methods have significant potential advantages in this setting, but are computationally expensive. We introduce a new RMHMC method, which we call semi-separable Hamiltonian Monte Carlo, which uses a specially designed mass matrix that allows the joint Hamiltonian over model parameters and hyperparameters to decompose into two simpler Hamiltonians. This structure is exploited by a new integrator which we call the alternating blockwise leapfrog algorithm. The resulting method can mix faster than simpler Gibbs sampling while being simpler and more efficient than previous instances of RMHMC.

Yichuan Zhang, Charles Sutton
Sensory Integration and Density Estimation
The integration of partially redundant information from multiple sensors is a standard computational problem for agents interacting with the world. In man and other primates, integration has been shown psychophysically to be nearly optimal in the sense of error minimization. An influential generalization of this notion of optimality is that populations of multisensory neurons should retain all the information from their unisensory afferents about the underlying, common stimulus [1]. More recently, it was shown empirically that a neural network trained to perform latent-variable density estimation, with the activities of the unisensory neurons as observed data, satisfies the information-preservation criterion, even though the model architecture was not designed to match the true generative process for the data [2]. We prove here an analytical connection between these seemingly different tasks, density estimation and sensory integration; that the former implies the latter for the model used in [2]; but that this does not appear to be true for all models.

Joseph Makin, Philip Sabes
Separable Deep Convolutional Neural Network for Image Deconvolution
Many fundamental image problems involve deconvolution operators. However, the real world blur degradation seldom complies with an ideal linear convolution model due to camera noise, saturation, and image compression, to name a few. Instead of perfectly modeling these outliers, which are rather challenging from a generative perspective, we develop a deep convolutional neural network system to capture the characteristics of degradation that can hardly be enumerated traditionally. We found that direct application of existing deep neural networks fails on this task, and instead establish a connection between traditional optimization-based deconvolution schemes and a neural network architecture. It features an effective pipeline for robust deconvolution against all artifacts. Our network contains two sub-networks, both trained in a supervised manner with reasonable initialization. They yield the best performance on non-blind image deconvolution compared to previous generative-model based methods.

Li Xu, Jimmy Ren, Jiaya Jia, Ce Liu
Sequential Monte Carlo for Graphical Models
We propose a new framework for how to use sequential Monte Carlo (SMC) algorithms for inference in probabilistic graphical models (PGM). Via a sequential decomposition of the PGM we find a sequence of auxiliary distributions defined on a monotonically increasing sequence of probability spaces. By targeting these auxiliary distributions using SMC we are able to approximate the full joint distribution defined by the PGM. One of the key merits of the SMC sampler is that it provides an unbiased estimate of the partition function of the model. We also show how it can be used within a particle Markov chain Monte Carlo framework in order to construct high-dimensional block-sampling algorithms for general PGMs.

Christian Naesseth, Fredrik Lindsten, Thomas Schon
SerialRank: Spectral Ranking using Seriation
We describe a seriation algorithm for ranking a set of n items given pairwise comparisons between these items. Intuitively, the algorithm assigns similar rankings to items that compare similarly with all others. It does so by constructing a similarity matrix from pairwise comparisons, using seriation methods to reorder this matrix and construct a ranking. We first show that this spectral seriation algorithm recovers the true ranking when all pairwise comparisons are observed and consistent with a total order. We then show that ranking reconstruction is still exact even when some pairwise comparisons are corrupted or missing, and that seriation based spectral ranking is more robust to noise than other scoring methods. An additional benefit of the seriation formulation is that it allows us to solve semi-supervised ranking problems. Experiments on both synthetic and real datasets demonstrate that seriation based spectral ranking achieves competitive and in some cases superior performance compared to classical ranking methods.

Fajwel Fogel, Alexandre D'Aspremont, Milan Vojnovic
Shape and Illumination from Shading using the Generic Viewpoint Assumption
The Generic Viewpoint Assumption (GVA) states that the position of the viewer or the light in a scene is not special. Thus, any estimated parameters from an observation should be stable under small perturbations such as object, viewpoint or light rotations. The GVA has been analyzed and quantified in previous works, but has not been put to practical use in actual vision tasks. In this paper, we show how to utilize the GVA to estimate shape and illumination from a single shading image, without the use of any other priors. We propose a novel linearized Spherical Harmonics (SH) shading model which enables us to obtain a computationally efficient form of the GVA term. Together with a data term, we build a model whose unknowns are shape and SH illumination. The model parameters are estimated using the Alternating Direction Method of Multipliers embedded in a multi-scale estimation framework. In this prior free framework, we obtain competitive shape and illumination estimation results under a variety of models and lighting conditions.

Dilip Krishnan, William Freeman, Daniel Zoran
Shaping Social Activity by Incentivizing Users
Events in an online social network can be categorized roughly into endogenous events, where users just respond to the actions of their neighbors within the network, or exogenous events, where users take actions due to drives external to the network. How much external drive should be provided to each user, such that the network activity can be steered towards a target state? In this paper, we model social events using multivariate Hawkes processes, which can capture both endogenous and exogenous event intensities, and derive a time dependent linear relation between the intensity of exogenous events and the overall network activity. Exploiting this connection, we develop a convex optimization framework for determining the required level of external drive in order for the network to reach a desired activity level. We experimented with event data gathered from Twitter, and show that our method can steer the activity of the network more accurately than alternatives.

Mehrdad Farajtabar, Nan Du, Isabel Valera, Le Song, Manuel Gomez Rodriguez, Hongyuan Zha
Signal Aggregate Constraints in Additive Factorial HMMs, with Application to Energy Disaggregation
Blind source separation problems are difficult because they are inherently unidentifiable, yet the entire goal is to identify meaningful sources. We introduce a way of incorporating domain knowledge into this problem, called signal aggregate constraints (SACs). SACs encourage the total signal for each of the unknown sources to be close to a specified value. This is based on the observation that the total signal often varies widely across the unknown sources, and we often have a good idea of what total values to expect. We incorporate SACs into an additive factorial hidden Markov model (AFHMM) to formulate the energy disaggregation problem, where only one mixture signal is assumed to be observed. A convex quadratic program for approximate inference is employed to recover the source signals. On a real-world energy disaggregation data set, we show that the use of SACs dramatically improves the original FHMM, and significantly improves over a recent state-of-the-art approach.

Mingjun Zhong, Nigel Goddard, Charles Sutton
Simple MAP Inference via Low-Rank Relaxations
We focus on the problem of maximum a posteriori (MAP) inference in Markov random fields with binary variables and pairwise interactions. For this common subclass of inference tasks, we consider low-rank relaxations that interpolate between the discrete problem and its full-rank semidefinite relaxation, followed by randomized rounding. We develop new theoretical bounds studying the effect of rank, showing that as the rank grows, the relaxed objective increases but saturates, and that the fraction in objective value retained by the rounded discrete solution decreases. In practice, we show two algorithms for optimizing the low-rank objectives which are simple to implement, enjoy ties to the underlying theory, and outperform existing approaches on benchmark MAP inference tasks.

Roy Frostig, Sida Wang, Percy Liang, Christopher Manning
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Stochastic gradient descent algorithms for training linear and kernel predictors are gaining more and more importance, thanks to their scalability. While various methods have been proposed to speed up their convergence, the model selection phase is often ignored. In fact, theoretical works usually make assumptions, for example, on prior knowledge of the norm of the optimal solution, while in practice validation methods remain the only viable approach. In this paper, we propose a new kernel-based stochastic gradient descent algorithm that performs model selection while training, with no parameters to tune, nor any form of cross-validation. The algorithm builds on recent advances in online learning theory for unconstrained settings, to estimate over time the right regularization in a data-dependent way. Optimal rates of convergence are proved under standard smoothness assumptions on the target function, using the range space of the fractional integral operator associated with the kernel.

Francesco Orabona
Smoothed Gradients for Stochastic Variational Inference
The field of statistical machine learning has seen rapid progress in complex hierarchical Bayesian models. In Stochastic Variational Inference (SVI), the inference problem is mapped to an optimization problem which is solved using stochastic gradients. While this scheme was shown to scale up to massive datasets, the intrinsic noise of the stochastic gradient impedes a fast convergence of the algorithm. Inspired by gradient averaging methods from the field of stochastic optimization, we propose a variance reduction scheme tailored to SVI by averaging successively over the sufficient statistics of the local variational parameters. Its simplicity comes at the price of rendering the stochastic gradient biased. We show that we can eliminate large parts of the bias while obtaining the same variance reduction as in simple gradient averaging schemes. We explore the tradeoff between variance and bias on the example of Latent Dirichlet Allocation.

Stephan Mandt, David Blei
Sparse Dependent Bayesian Structure Learning
In many problem settings, vectors of regression weights are not merely sparse, but dependent in such a way that non-zero weights tend to cluster together. We refer to this local form of dependency as ''region sparsity''. Region sparsity differs from ''group sparsity'' in that it does not require a prior specification of groups, only a notion of distance between coefficients, such as one finds in spatial and time series regression problems. Classical methods for sparse regression, such as automatic relevance determination and the lasso, do not exploit such dependencies, and effectively model regression weights as independent. Here we introduce a hierarchical model for smooth, region-sparse weight tensors. Our approach represents a hierarchical extension of Tipping's relevance determination framework, in which we use a transformed Gaussian process to describe the dependencies between the prior variances of nearby weights. To simultaneously impose smoothness, we employ a structured model of the prior variances of Fourier coefficients, which prunes high frequencies and encourages estimates to be sparse in two bases simultaneously. We develop efficient approximate inference methods and show substantial improvements over comparable methods (e.g., group lasso and smooth RVM) on both simulated and real datasets from brain imaging.

Anqi Wu, Mijung Park, Jonathan Pillow, Oluwasanmi Koyejo
Sparse Multi-Task Reinforcement Learning
In multi-task reinforcement learning (MTRL), the objective is to simultaneously learn multiple tasks and exploit their similarity to improve the performance w.r.t.\ single-task learning. In this paper we investigate the case when all the tasks can be accurately represented in a linear approximation space using the same small subset of the original (large) set of features. This is equivalent to assuming that the weight vectors of the task value functions are \textit{jointly sparse}, i.e., the set of their non-zero components is small and it is shared across tasks. Building on existing results in multi-task regression, we develop two multi-task extensions of the fitted $Q$-iteration algorithm. While the first algorithm assumes that the tasks are jointly sparse in the given representation, the second one learns a transformation of the features in an attempt to find a sparser representation. For both algorithms we provide a sample complexity analysis and numerical simulations.

Daniele Calandriello, Alessandro Lazaric, Marcello Restelli
Sparse PCA via Covariance Thresholding
In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension $n\times p$ and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here that the principal components $\bv_1,\dots,\bv_r$ have at most $k$ non-zero entries, and study the high-dimensional regime in which $p$ is of the same order as $n$. In an influential paper, Johnstone and Lu introduced a simple algorithm that estimates the support of the principal vectors $\bv_1,\dots,\bv_r$ by the largest entries in the diagonal of the empirical covariance. This method can be shown to succeed with high probability if $k\le C_1\sqrt{n/\log p}$, and to fail with high probability if $k\ge C_2 \sqrt{n/\log p}$ for two constants $0 < C_1,C_2 < \infty$. Despite a considerable amount of work over the last ten years, no practical algorithm exists with provably better support recovery guarantees. Here we analyze a covariance thresholding algorithm that was recently proposed by Krauthgamer, Nadler and Vilenchik. We confirm empirical evidence presented by these authors and rigorously prove that the algorithm succeeds with high probability for $k$ of order $\sqrt{n}$. Recent conditional lower bounds suggest that it might be impossible to do significantly better. The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before.

Yash Deshpande, Andrea Montanari
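The diagonal-based baseline attributed to Johnstone and Lu in the abstract above can be sketched in a few lines. This is only an illustration of that baseline, not the covariance thresholding algorithm the paper analyzes. The function name `diagonal_support`, the toy data, and the choice to treat the data as already centered are assumptions made for the example.

```python
def diagonal_support(X, k):
    """Estimate a size-k support from the diagonal of the empirical covariance.

    X is a list of n rows with p entries each, assumed already centered
    (an assumption of this sketch). The j-th diagonal entry of the
    empirical covariance is then the mean square of column j.
    """
    n, p = len(X), len(X[0])
    diag = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    # Keep the indices of the k largest diagonal entries.
    top = sorted(range(p), key=lambda j: diag[j], reverse=True)[:k]
    return sorted(top)


# Toy data: columns 0 and 2 carry the signal, columns 1 and 3 are small noise.
X = [
    [3.0, 0.1, -2.0, 0.2],
    [-3.0, 0.2, 2.0, -0.1],
    [2.5, -0.1, -2.5, 0.1],
]
support = diagonal_support(X, k=2)
```

Here `support` contains the two high-variance coordinates. Because the rule looks only at individual variances, it ignores off-diagonal covariance entries.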
Sparse Random Feature Algorithm as Coordinate Descent in Hilbert Space
In this paper, we propose a Sparse Random Feature algorithm, which learns a sparse non-linear predictor by minimizing an $\ell_1$-regularized objective function over the Hilbert space induced by the kernel function. By interpreting the algorithm as Randomized Coordinate Descent in the infinite-dimensional space, we show that the proposed approach converges to a solution within $\eps$-precision of that of the exact kernel method by drawing $O(1/\eps)$ random features, in contrast to the $O(1/\eps^2)$-type convergence achieved by Monte-Carlo analysis in the current Random Feature literature. In our experiments, the Sparse Random Feature algorithm obtains a sparse solution that requires less memory and prediction time while maintaining comparable performance on regression and classification tasks. Meanwhile, as an approximate solver for the infinite-dimensional $\ell_1$-regularized problem, the randomized approach converges to a better solution than the Boosting approach when the greedy step of Boosting cannot be performed exactly.

En-Hsu Yen, Ting-Wei Lin, Shou-De Lin, Pradeep Ravikumar, Inderjit Dhillon
Spectral Clustering of graphs with the Bethe Hessian
Spectral clustering is a standard approach to label nodes on a graph by studying the (largest or lowest) eigenvalues of a symmetric real matrix such as the adjacency matrix or the Laplacian. Recently, it has been argued that using instead a more complicated, non-symmetric and higher dimensional operator, related to the non-backtracking walk on the graph, leads to improved performance in detecting clusters, and even to optimal performance for the stochastic block model. Here, we propose to use instead a simpler object, a symmetric real matrix known as the Bethe Hessian operator, or deformed Laplacian. We show that this approach combines the performance of the non-backtracking operator, thus detecting clusters all the way down to the theoretical limit in the stochastic block model, with the computational, theoretical and memory advantages of real symmetric matrices.

Alaa Saade, Florent Krzakala, Lenka Zdeborova
Spectral Learning of Mixture of Hidden Markov Models
In this paper, we propose a learning approach for the Mixture of Hidden Markov Models (MHMM) based on the method of moments. Doing so allows us to take advantage of a computational complexity that is independent of the number of data instances, thus making this model appropriate for analyzing large data sets. However, it is not possible to directly learn an MHMM using the output of existing learning approaches due to a permutation ambiguity. Instead, we show that even in the presence of estimation noise, it is possible to resolve this ambiguity using the spectral properties of a global transition matrix. We demonstrate the validity of our approach on both synthetic and real data.

Cem Subakan, Johannes Traa, Paris Smaragdis
Spectral Methods for Indian Buffet Process Inference
The Indian Buffet Process is a versatile statistical tool for modeling distributions over binary matrices. We provide an efficient spectral algorithm as an alternative to costly Variational Bayes and sampling-based algorithms. We derive a novel tensorial characterization of the moments of the Indian Buffet Process proper and for two of its applications. We give a computationally efficient iterative inference algorithm, concentration of measure bounds, and reconstruction guarantees. Our algorithm provides superior accuracy and cheaper computation than a comparable Variational Bayesian approach on a number of reference problems.

Hsiao-Yu Tung, Alex Smola
Spectral Methods for Supervised Topic Models
Supervised topic models simultaneously model the latent topic structure of large collections of documents and a response variable associated with each document. Existing inference methods are based on either variational approximation or Monte Carlo sampling. This paper presents a novel spectral decomposition algorithm to recover the parameters of supervised latent Dirichlet allocation (sLDA) models. The Spectral-sLDA algorithm is provably correct and computationally efficient. We prove a sample complexity bound and subsequently derive a necessary condition for the identifiability of sLDA. Thorough experiments on a diverse range of synthetic and real-world datasets verify the theory and demonstrate the practical effectiveness of the algorithm.

Yining Wang, Jun Zhu
Spectral k-Support Norm Regularization
The $k$-support norm has successfully been applied to sparse vector prediction problems. We observe that it belongs to a wider class of norms, which we call the box-norms. Within this framework we derive an efficient algorithm to compute the proximity operator of the squared norm, improving upon the original method for the $k$-support norm. We extend the norms from the vector to the matrix setting and introduce the spectral $k$-support norm. We study its properties and show that it is closely related to the multitask learning cluster norm. We apply the norms to real and synthetic matrix completion datasets. Our findings indicate that spectral $k$-support norm regularization gives state-of-the-art performance, consistently improving over trace norm regularization and the matrix elastic net.

Andrew McDonald, Massimiliano Pontil, Dimitris Stamos
Speeding-up Graphical Model Optimization via a Coarse-to-fine Cascade of Pruning Classifiers
We propose a general and versatile framework that significantly speeds up graphical model optimization while maintaining excellent solution accuracy. The proposed approach relies on a multi-scale pruning scheme that is able to progressively reduce the solution space by use of a novel strategy based on a coarse-to-fine cascade of learnt classifiers. We thoroughly experiment with classic computer vision related MRF problems, where our framework consistently yields a significant speed-up (with respect to the most efficient inference methods) and obtains a more accurate solution than directly optimizing the MRF.

Bruno CONEJO, Nikos Komodakis, Sebastien Leprince, Jean Philippe Avouac
Spike Frequency Adaptation Implements Anticipative Tracking in Continuous Attractor Neural Networks
To extract motion information, the brain needs to compensate for time delays that are ubiquitous in neural signal transmission and processing. Here we propose a simple yet effective mechanism to implement anticipative tracking in neural systems. The proposed mechanism utilizes the property of spike-frequency adaptation (SFA), a feature widely observed in neuronal responses. We employ continuous attractor neural networks (CANNs) as the model to describe the tracking behaviors in neural systems. Incorporating SFA, a CANN exhibits intrinsic mobility, manifested by the ability of the CANN to hold self-sustained travelling waves. In tracking a moving stimulus, the interplay between the external drive and the intrinsic mobility of the network determines the tracking performance. Interestingly, we find that the regime of anticipation effectively coincides with the regime where the intrinsic speed of the travelling wave exceeds that of the external drive. Depending on the SFA amplitudes, the network can achieve either perfect tracking, with zero-lag to the input, or perfect anticipative tracking, with a constant leading time to the input. Our model successfully reproduces experimentally observed anticipative tracking behaviors, and sheds light on our understanding of how the brain processes motion information in a timely manner.

Yuanyuan Mi, C. C. Alan Fung, K. Y. Michael Wong, Si Wu
Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm
We improve a recent guarantee of Bach and Moulines on the linear convergence of SGD for smooth and strongly convex objectives, reducing a quadratic dependence on the strong convexity to a linear dependence. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence on average smoothness, dominating previous results, and more broadly discuss how importance sampling for SGD can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods.

Deanna Needell, Rachel Ward, Nathan Srebro
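The randomized Kaczmarz algorithm mentioned in the abstract above can be sketched as follows for a consistent linear system $Ax = b$. This is a minimal standard-library illustration, not the authors' code. The function name, the toy system, and the choice to sample rows in proportion to their squared norms are assumptions for the example. Each update projects the iterate onto the hyperplane of one sampled equation, which can be read as a stochastic gradient step on a least-squares objective with a non-uniform sampling distribution.

```python
import random


def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Approximately solve a consistent system A x = b.

    Rows are sampled with probability proportional to their squared
    Euclidean norm (rows are assumed to be non-zero), and the iterate
    is projected onto the hyperplane of the sampled equation.
    """
    rng = random.Random(seed)
    weights = [sum(a * a for a in row) for row in A]  # squared row norms
    x = [0.0] * len(A[0])
    for _ in range(iters):
        i = rng.choices(range(len(A)), weights=weights)[0]
        row = A[i]
        residual = b[i] - sum(a * xj for a, xj in zip(row, x))
        step = residual / weights[i]
        x = [xj + step * a for xj, a in zip(x, row)]
    return x


# Toy consistent system with a known solution.
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, -2.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x_hat = randomized_kaczmarz(A, b)
```

On this small well-conditioned system the iterate `x_hat` ends up very close to `x_true`. For an inconsistent system the iterates do not converge to a single point, so the sketch should not be used unchanged there.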
Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards
In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler's objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about arms (exploration) while simultaneously optimizing immediate rewards (exploitation); the price paid due to this trade-off is often referred to as the regret, and the main question is how small this price can be as a function of the horizon length T. This problem has been studied extensively when the reward distributions do not change over time; an assumption that supports a sharp characterization of the regret, yet is often violated in practical settings. In this paper, we focus on a MAB formulation which allows for a broad range of temporal uncertainties in the rewards, while still maintaining mathematical tractability. We fully characterize the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and by establishing a connection between the adversarial and the stochastic MAB frameworks.

Yonatan Gur, Assaf Zeevi, Omar Besbes
Stochastic Network Design in Bidirected Trees
Wu et al. recently introduced the problem of tree-structured stochastic network design to model the optimization of phenomena spreading away from a single source in a directed tree. We extend this framework to model phenomena that spread from many sources in both directions along the edges of a tree. Actions can be taken to increase the probability of propagation on edges, and the goal is to maximize a measure of the total amount of spread away from all sources. Our main result is that, although this problem is apparently harder than the single source problem, it is amenable to a rounded dynamic programming approach that leads to a fully polynomial-time approximation scheme (FPTAS), that is, an algorithm that can find (1-epsilon)-optimal solutions for any problem instance in time polynomial in the input size and 1/epsilon. Our algorithm outperforms competing approaches on a motivating problem from computational sustainability to remove barriers in river networks to restore the health of aquatic ecosystems.

Xiaojian Wu, Daniel Sheldon, Shlomo Zilberstein
Stochastic Proximal Gradient Descent with Acceleration Techniques
Proximal gradient descent (PGD) and stochastic proximal gradient descent (SPGD) are popular methods for solving regularized empirical risk minimization problems in machine learning and statistics. In this paper, we propose and analyze an accelerated variant of these methods in the mini-batch setting. This method incorporates two acceleration techniques: one is Nesterov's acceleration method, and the other is a variance reduction for the stochastic gradient. Accelerated proximal gradient descent (APG) and proximal stochastic variance reduction gradient (Prox-SVRG) are in a trade-off relationship. We show that our method, with the appropriate mini-batch size, achieves lower overall complexity than both APG and Prox-SVRG.

Atsushi Nitanda
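For readers unfamiliar with the baseline, plain proximal gradient descent (PGD) for an $\ell_1$-regularized objective can be sketched as below. This shows only the basic, non-accelerated, full-gradient method, not the accelerated mini-batch variant the paper proposes. The function names, the step size, and the toy quadratic loss are assumptions chosen so the answer is easy to check by hand.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient(grad, x0, lam, step, iters=200):
    """Minimize f(x) + lam * ||x||_1, where grad returns the gradient of f.

    Each iteration takes a gradient step on the smooth part f and then
    applies the l1 proximal operator (soft-thresholding) coordinate-wise.
    """
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [soft_threshold(xi - step * gi, step * lam) for xi, gi in zip(x, g)]
    return x


# Toy smooth part: f(x) = 0.5 * ||x - c||^2, whose gradient is x - c.
c = [3.0, -0.5, 1.5]
x_star = proximal_gradient(lambda x: [xi - ci for xi, ci in zip(x, c)],
                           x0=[0.0, 0.0, 0.0], lam=1.0, step=0.5)
```

For this separable toy problem the minimizer is the soft-thresholding of `c` at level `lam`, namely `[2.0, 0.0, 0.5]`, and `x_star` converges to it.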
Stochastic variational inference for hidden Markov models
Variational inference algorithms have proven successful for Bayesian analysis in large data settings, with recent advances using stochastic variational inference (SVI). However, such methods have largely been studied in independent or exchangeable data settings. We develop an SVI algorithm to learn the parameters of hidden Markov models (HMMs) in a time-dependent data setting. The challenge arises from correlated samples when applying stochastic optimization in this setting as sampling subchains analogously to SVI introduces errors due to broken dependencies. Instead, we propose an algorithm that harnesses the memory decay of the chain to bound errors arising from edge effects. We demonstrate the effectiveness of our algorithm on synthetic experiments and a large genomics dataset where a batch algorithm is computationally infeasible.

Nick Foti, Jason Xu, Dillon Laird, Emily Fox
Streaming, Memory Limited Algorithms for Community Detection
In this paper, we consider sparse networks consisting of a finite number of non-overlapping communities, i.e. disjoint clusters, so that there is higher density within clusters than across clusters. Both the intra- and inter-cluster edge densities vanish when the size of the graph grows large, making the cluster reconstruction problem noisier and hence difficult to solve. We are interested in scenarios where the network size is very large, so that the adjacency matrix of the graph is hard to manipulate and store. The data stream model in which columns of the adjacency matrix are revealed sequentially constitutes a natural framework in this setting. For this model, we develop two novel clustering algorithms that extract the clusters asymptotically accurately. The first algorithm is {\it offline}, as it needs to store and keep the assignments of nodes to clusters, and requires a memory that scales linearly with the network size. The second algorithm is {\it online}, as it may classify a node when the corresponding column is revealed and then discard this information. This algorithm requires a memory growing sub-linearly with the network size. To construct these efficient streaming memory-limited clustering algorithms, we first address the problem of clustering with partial information, where only a small proportion of the columns of the adjacency matrix is observed and develop, for this setting, a new spectral algorithm which is of independent interest.

Se-Young Yun, Alexandre Proutiere, Marc Lelarge
Structure learning of antiferromagnetic Ising models
In this paper we investigate the computational complexity of learning the graph structure underlying a discrete undirected graphical model from i.i.d. samples. Our first result is an unconditional computational lower bound of $\Omega (p^{d/2})$ for learning general graphical models on $p$ nodes of maximum degree $d$, for the class of statistical algorithms recently introduced by Feldman et al. The construction is related to the notoriously difficult learning parities with noise problem in computational learning theory. Our lower bound shows that the $\widetilde O(p^{d+2})$ runtime required by Bresler, Mossel, and Sly's exhaustive-search algorithm cannot be significantly improved without restricting the class of models. Aside from structural assumptions on the graph such as it being a tree, hypertree, tree-like, etc., most recent papers on structure learning assume that the model has the correlation decay property. Indeed, focusing on ferromagnetic Ising models, Bento and Montanari showed that all known low-complexity algorithms fail to learn simple graphs when the interaction strength exceeds a number related to the correlation decay threshold. Our second set of results gives a class of repelling (antiferromagnetic) models that have the \emph{opposite} behavior: very strong repelling allows efficient learning in time $\widetilde O(p^2)$. We provide an algorithm whose performance interpolates between $\widetilde O(p^2)$ and $\widetilde O(p^{d+2})$ depending on the strength of the repulsion.

Guy Bresler, David Gamarnik, Devavrat Shah
Submodular Attribute Selection for Action Recognition in Video
In real-world action recognition problems, low-level features cannot adequately characterize the rich spatial-temporal structures in action videos. In this work, we encode actions based on attributes that describe actions as high-level concepts: \textit{e.g.}, jump forward and motion in the air. We base our analysis on two types of action attributes. One type of action attributes is generated by humans. The second type is data-driven attributes, which are learned from data using dictionary learning methods. Attribute-based representation may exhibit high variance due to noisy and redundant attributes. We propose a discriminative and compact attribute-based representation by selecting a subset of discriminative attributes from a large attribute set. Three attribute selection criteria are proposed and formulated as a submodular optimization problem. A greedy optimization algorithm is presented and guaranteed to be at least a (1-1/e)-approximation to the optimum. Experimental results on the Olympic Sports and UCF101 datasets demonstrate that the proposed attribute-based representation can significantly boost the performance of action recognition algorithms and outperform most recently proposed recognition approaches.

Jingjing Zheng, Zhuolin Jiang, Rama Chellappa, Jonathon Phillips
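The greedy selection scheme referred to in the abstract above can be illustrated on a generic monotone submodular objective. The sketch below uses set coverage under a cardinality constraint as a stand-in; the abstract does not spell out the paper's three attribute selection criteria, and they are not reproduced here. The function name and the toy "attributes" are hypothetical.

```python
def greedy_max_coverage(candidates, k):
    """Greedily pick up to k candidates to maximize the number of covered items.

    candidates maps a name to the set of items it covers. Coverage is a
    monotone submodular function, the setting in which greedy selection
    under a cardinality constraint has the classical (1 - 1/e) guarantee.
    """
    chosen, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0
        for name, items in candidates.items():
            if name in chosen:
                continue
            gain = len(items - covered)  # marginal gain of adding this candidate
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining candidate adds anything new
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered


# Hypothetical attributes, each "explaining" a set of action classes.
attributes = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 6},
}
chosen, covered = greedy_max_coverage(attributes, k=2)
```

In this toy instance the first pick is `"a"` (gain 4) and the second is `"c"` (gain 2), which together cover all six items.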
Subspace Embeddings for the Polynomial Kernel
Sketching is a powerful dimensionality reduction tool for accelerating statistical learning algorithms. However, its applicability has been limited to a certain extent since the crucial ingredient, a so-called oblivious subspace embedding, can only be applied to spaces with an explicit representation as the column span of a matrix, while in many settings learning is done in a high-dimensional space implicitly defined by a data matrix via a kernel transformation. We propose the first fast oblivious subspace embeddings that are able to embed a space induced by a non-linear kernel without explicitly mapping the data to the high-dimensional space. In particular, we propose an embedding for mappings induced by the polynomial kernel. Given an $n \times d$ input matrix $A$, let $\phi(A)$ be the application of the mapping to higher-dimension induced by a polynomial kernel to each of the rows of $A$. Using our subspace embeddings, we obtain the fastest known algorithms for computing an implicit low rank approximation of $\phi(A)$ and approximate kernel PCA, as well as doing principal component regression (PCR) with respect to an approximation of the top $k$ left singular vectors of $\phi(A)$. Our algorithms are asymptotically optimal for several settings of parameters.

Haim Avron, Huy Nguyen, David Woodruff
Testing Unfaithful Gaussian Graphical Models
We investigate the relationship between conditional independence relations and graph structure of the precision matrix in Gaussian undirected graphical models. In a Gaussian graphical model, if a node set $S$ is a node separator of nodes $u$ and $v$, then by the global Markov property the variable $X_u$ associated with the node $u$ is conditionally independent of $X_v$ given $X_S$. The opposite direction need not be true, that is, $X_u \perp X_v | X_S$ need not imply $S$ is a node separator of $u$ and $v$. When it does, the relation $X_u \perp X_v | X_S$ is called faithful. In this paper we provide a characterization of faithful relations and then provide an algorithm to test faithfulness based only on knowledge of other conditional relations of the form $X_i \perp X_j | X_S$.

De Wen Soh, Sekhar Tatikonda
The Blinded Bandit: Learning with Adaptive Feedback
We study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such a model of \emph{adaptive feedback} naturally occurs in scenarios where the environment reacts to the player's actions and requires some time to recover and stabilize after the algorithm switches actions. This motivates a variant of the multi-armed bandit problem, which we call the \emph{blinded multi-armed bandit}, in which no feedback is given to the algorithm whenever it switches arms. We develop efficient online learning algorithms for this problem and prove that they guarantee the same asymptotic regret as the optimal algorithms for the standard multi-armed bandit problem. This result stands in stark contrast to another recent result, which states that adding a switching cost to the standard multi-armed bandit makes it substantially harder to learn, and provides a direct comparison of how feedback and loss contribute to the difficulty of an online learning problem. We also extend our results to the general prediction framework of bandit linear optimization, again attaining near-optimal regret bounds.

Ofer Dekel, Elad Hazan, Tomer Koren
The Infinite Mixture of Infinite Gaussian Mixtures
Dirichlet process mixture of Gaussians (DPMG) has been used in the literature for clustering and density estimation problems. However, many real-world data exhibit cluster distributions that cannot be captured by a single Gaussian. Modeling such data sets by DPMG creates several extraneous clusters even when clusters are relatively well-defined. Herein, we present the infinite mixture of infinite Gaussian mixtures (I2GMM) for more flexible modeling of data sets with skewed and multi-modal cluster distributions. Instead of using a single Gaussian for each cluster as in the standard DPMG model, the generative model of I2GMM uses a single DPMG for each cluster. The individual DPMGs are linked together through centering of their base distributions at the atoms of a higher level DP prior. Inference is performed by a collapsed Gibbs sampler that also enables partial parallelization. Experimental results on several artificial and real-world data sets suggest the proposed I2GMM model can predict clusters more accurately than existing variational Bayes and Gibbs sampler versions of DPMG.

Halid Yerebakan, Bartek Rajwa, Murat Dundar
The Large Margin Mechanism for Differentially Private Maximization
A basic problem in the design of privacy-preserving algorithms, especially for machine learning, is the \emph{private maximization problem}: the goal is to pick an item from a universe that (approximately) maximizes a data-dependent function, all under the constraint of differential privacy. Previous algorithms for this problem are either range-dependent---i.e., their utility diminishes with the size of the universe---or only apply to very restricted function classes. This work provides the first general-purpose, range-independent algorithm for private maximization that guarantees approximate differential privacy. Its applicability is demonstrated on fundamental tasks from data mining and machine learning.

Kamalika Chaudhuri, Daniel Hsu, Shuang Song
The Linear Convergence Rate of Decomposable Submodular Function Minimization
Submodular functions describe a variety of discrete problems in machine learning, signal processing, and computer vision. However, minimizing submodular functions poses a number of algorithmic challenges. Recent work introduced an easy-to-use, parallelizable algorithm for minimizing submodular functions that decompose as the sum of ``simple'' submodular functions. Empirically, this algorithm performs extremely well, but no theoretical analysis was given. In this paper, we show that the algorithm converges linearly, and we provide upper and lower bounds on the rate of convergence. Our proof relies on the geometry of submodular polyhedra and draws on results from spectral graph theory.

Robert Nishihara, Michael Jordan, Stefanie Jegelka
Tight Bounds for Influence in Diffusion Networks and Application to Bond Percolation and Epidemiology
In this paper, we derive theoretical bounds for the long-term influence of a node in an Independent Cascade Model (ICM). We relate these bounds to the spectral radius of a particular matrix and show that the behavior is sub-critical when this spectral radius is lower than 1. More specifically, we point out that, in general networks, the sub-critical regime behaves in $O(\sqrt{n})$ where $n$ is the size of the network, and that this upper bound is met for star-shaped networks. We apply our results to epidemiology and percolation on arbitrary networks, and derive a bound for the critical value beyond which a giant connected component arises. Finally, we show empirically the tightness of our bounds for a large family of networks.

Remi Lemonnier, Kevin Scaman, Nicolas Vayatis
Tight Continuous Relaxation of the Balanced k-Cut Problem
Spectral Clustering as a relaxation of the normalized/ratio cut has become one of the standard graph-based clustering methods. Existing methods for the computation of multiple clusters, corresponding to a balanced k-cut of the graph, are either based on greedy techniques or on heuristics that have only a weak connection to the original motivation of minimizing the normalized cut. In this paper we propose a new tight continuous relaxation of any balanced k-cut problem and show that a related recently proposed relaxation is in most cases loose, leading to poor performance in practice. For the optimization of our tight continuous relaxation we propose a new algorithm for the hard sum-of-ratios minimization problem which achieves monotonic descent. Extensive comparisons show that our method beats all existing approaches for ratio cut and other balanced k-cut criteria.

Syama Sundar Yadav Rangapuram, Pramod Kaushik Mudrakarta, Matthias Hein
Tight convex relaxations for sparse matrix factorization
Based on a new atomic norm, we propose a new convex formulation for sparse matrix factorization problems in which the number of nonzero elements of the factors is assumed fixed and known. The formulation counts sparse PCA with multiple factors, subspace clustering and low-rank sparse bilinear regression as potential applications. We compute slow rates and an upper bound on the statistical dimension of the suggested norm for rank 1 matrices, showing that its statistical dimension is an order of magnitude smaller than the usual $\ell_1$-norm, trace norm and their combinations. Even though our convex formulation is in theory hard and does not lead to provably polynomial time algorithmic schemes, we propose an active set algorithm leveraging the structure of the convex problem to solve it and show promising numerical results.

Emile Richard, Guillaume Obozinski, Jean-Philippe Vert
Tightening after Relax: Minimax-Optimal Sparse PCA in Polynomial Time
We provide statistical and computational analysis of sparse Principal Component Analysis (PCA) in high dimensions. The sparse PCA problem is highly nonconvex in nature. Consequently, though its global solution attains the optimal statistical rate of convergence, such a solution is computationally intractable to obtain. Meanwhile, although its convex relaxations are tractable to compute, they yield estimators with suboptimal statistical rates of convergence. On the other hand, existing nonconvex optimization procedures, such as greedy methods, lack statistical guarantees. In this paper, we propose a two-stage sparse PCA procedure that attains the optimal principal subspace estimator in polynomial time. The main stage employs a novel algorithm named sparse orthogonal iteration pursuit, which iteratively solves the underlying nonconvex problem. However, our analysis shows that this algorithm only has desired computational and statistical guarantees within a restricted region, namely the basin of attraction. To obtain the desired initial estimator that falls into this region, we solve a convex formulation of sparse PCA with early stopping. Under an integrated analytic framework, we simultaneously characterize the computational and statistical performance of this two-stage procedure. Computationally, our procedure converges at the rate of $1/\sqrt{t}$ within the initialization stage, and at a geometric rate within the main stage. Statistically, the final principal subspace estimator achieves the minimax-optimal statistical rate of convergence with respect to the sparsity level $s^*$, dimension $d$ and sample size $n$. Our theory and method do not hinge on the spiked covariance assumption, and adapt to both non-Gaussianity and data dependency. Our analysis also illustrates an interesting phenomenon: a larger sample size can accelerate the computation of our procedure.

Zhaoran Wang, Huanran Lu, Han Liu
Time--Data Tradeoffs by Smoothing
This paper proposes a tradeoff between sample complexity and computation time that applies to statistical estimators based on convex optimization. When there is excess data, one can aggressively smooth the optimization problem to achieve accurate estimates more quickly. This work provides theoretical and experimental evidence of this tradeoff for a class of regularized linear inverse problems.

John Bruer, Joel Tropp, Volkan Cevher, Stephen Becker
Top Rank Optimization in Linear Time
Bipartite ranking aims to learn a real-valued ranking function that orders positive instances before negative instances. Recent efforts in bipartite ranking focus on optimizing ranking accuracy at the top of the ranked list. Most existing approaches either optimize task-specific metrics or extend the rank loss by placing more emphasis on the error associated with the top ranked instances, leading to a high computational cost that is super-linear in the number of training instances. We propose a highly efficient approach, titled TopPush, for optimizing accuracy at the top that has computational complexity linear in the number of training instances. We present a novel analysis that bounds the generalization error for the top ranked instances for the proposed approach. An empirical study shows that the proposed approach is highly competitive with state-of-the-art approaches and is 10-100 times faster.

Nan Li, Rong Jin, Zhi-Hua Zhou
Universal Option Models
We consider the problem of learning models of options for real-time abstract planning, in the setting where reward functions can be specified at any time and their expected returns must be efficiently computed. We introduce a new model for an option that is independent of any reward function, called the {\it universal option model (UOM)}. We prove that the UOM of an option can construct a traditional option model given a reward function, and the option-conditional return is computed directly by a single dot-product of the UOM with the reward function. We extend the UOM to linear function approximation, and we show it gives the TD solution of option returns and value functions of policies over options. We provide a stochastic approximation algorithm for incrementally learning UOMs from data and prove its consistency. We demonstrate our method in two domains. The first domain is document recommendation, where each user query defines a new reward function and a document's relevance is the expected return of a simulated random-walk through the document's references. The second domain is a real-time strategy game, where the controller must select the best game unit to accomplish dynamically-specified tasks. Our experiments show that UOMs are substantially more efficient in evaluating option returns and policies than previously known methods.

Hengshuai Yao, Csaba Szepesvari, Rich Sutton, Joseph Modayil, Shalabh Bhatnagar
Unsupervised Learning by Deep Scattering Contractions
We introduce a deep scattering network, which computes invariants with iterated contractions adapted to training data. It defines a deep convolution network model, whose contraction properties can be analyzed mathematically. A cascade of wavelet transform convolutions is computed with a multirate filter bank, and adapted with permutations. Unsupervised learning of permutations optimizes the contraction directions, by maximizing the average discriminability of training data. For Haar wavelets, it is solved with a polynomial complexity pairing algorithm. Translation and rotation invariance learning is shown with classification experiments on hand-written digits.

Xu Chen, Xiuyuan Cheng, Stephane Mallat
Using Convolutional Neural Networks to Recognize Rhythm Stimuli from Electroencephalography Recordings
Electroencephalography (EEG) recordings of rhythm perception might contain enough information to distinguish different rhythm types/genres or even identify the rhythms themselves. We apply convolutional neural networks (CNNs) to analyze and classify EEG data recorded within a rhythm perception study in Kigali, Rwanda which comprises 12 East African and 12 Western rhythmic stimuli – each presented in a loop for 32 seconds to 13 participants. We investigate the impact of the data representation and the pre-processing steps for this classification task and compare different network structures. Using CNNs, we are able to recognize individual rhythms from the EEG with a mean classification accuracy of 24.4% (chance level 4.17%) over all subjects by looking at less than three seconds from a single channel. Aggregating predictions for multiple channels, a mean accuracy of up to 50% can be achieved for individual subjects.

Sebastian Stober, Daniel Cameron, Jessica Grahn
Variational Gaussian Process State-Space Models
State-space models have been successfully used for more than fifty years in different areas of science and engineering. We present a procedure for efficient variational Bayesian learning of nonlinear state-space models based on sparse Gaussian processes. The result of learning is a tractable posterior over nonlinear dynamical systems. In comparison to conventional parametric models, we offer the possibility to straightforwardly trade off model capacity and computational cost whilst avoiding overfitting. Our main algorithm uses a hybrid inference approach combining variational Bayes and sequential Monte Carlo. We also present extensions to stochastic variational inference and online learning.

Roger Frigola, Yutian Chen, Carl Rasmussen
Weakly-supervised Discovery of Visual Pattern Configurations
The increasing prominence of weakly labeled data nurtures a growing demand for object detection methods that can cope with a minimum of supervision. We propose an approach that automatically identifies discriminative configurations of visual patterns that are characteristic of a given object class. We formulate the problem as a constrained submodular optimization problem and demonstrate the benefits of the discovered configurations in remedying mislocalizations and finding informative positive and negative training examples. Together, these contributions lead to state-of-the-art weakly-supervised detection results on the challenging PASCAL VOC dataset.

Hyun Oh Song, Yong Jae Lee, Stefanie Jegelka, Trevor Darrell
Weighted importance sampling for off-policy learning with linear function approximation
Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, \emph{weighted} importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD($\lambda$). We show empirically that our new WIS-LSTD($\lambda$) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD($\lambda$) (Yu 2010, Bertsekas \& Yu 2009).

Ashique Rupam Mahmood, Hado van Hasselt, Richard Sutton
Zero-shot recognition with unreliable attributes
In principle, zero-shot learning makes it possible to train an object recognition model simply by specifying the category's attributes. For example, if one has classifiers for generic attributes like \emph{striped}, \emph{four-legged}, and \emph{metallic}, then one can construct a classifier for the zebra category by enumerating which of those properties it possesses---even without providing training images of zebras. In practice, however, the standard zero-shot paradigm suffers because attribute predictions in novel images are hard to get right. We propose a novel random forest approach to train zero-shot models that explicitly accounts for the unreliability of attribute predictions. By leveraging statistics about each attribute's error tendencies during zero-shot training, our method obtains more robust discriminative models for the unseen classes. We further devise extensions to handle the few-shot scenario and unreliable attribute descriptions. Our results demonstrate the substantial benefit for visual category learning with zero or few training examples, a critical domain for learning rare categories or categories that are defined on the fly.

Dinesh Jayaraman, Kristen Grauman
Zeta Hull Pursuits: Learning Non-convex Data Hulls
Selecting a small informative subset from a given dataset, also called column sampling, has drawn much attention in machine learning. To incorporate structured data information into column sampling, research efforts have been devoted to the cases where data points are fitted with clusters, simplices, or general convex hulls. This paper aims to study non-convex hull learning, to which little attention has been paid in the community. In order to learn data-adaptive non-convex hulls, we propose a novel geometric approach based on a graph-theoretic measure that leverages graph cycles to characterize the structural complexity of input data points. Using this measure, we present a greedy algorithmic framework, dubbed Zeta Hulls, to perform structured column sampling. The process of pursuing a Zeta Hull involves computing a matrix inverse. To accelerate this matrix computation and reduce its space complexity as well, we construct a low-rank approximation to the adjacency graph by employing an efficient anchor graph technique. Extensive experimental results show that data representations learned by Zeta Hulls can achieve state-of-the-art accuracy in text and image classification tasks.

Yuanjun Xiong, Wei Liu, Deli Zhao, Xiaoou Tang