Bayesium Analytics

Practical Bayesian Data Analysis


March 21, 2016 by kevin@ksvanhorn.com

It’s All About Jensen’s Inequality

A recent paper proves something that runners have long suspected: GPS overestimates the distance you have traveled. This isn’t due to any algorithmic error; it is instead an unavoidable consequence of two facts:

  • The position measurements that GPS makes are noisy — there is some degree of random error to them.
  • The distance between two points is a convex function of the coordinates of the points.

A convex function is one that curves upwards. Here are some examples:

[Figure: examples of convex functions]

For a twice-differentiable function of one argument (such as the above examples), convexity means that the function has a nonnegative second derivative. A convex function of several arguments curves upward no matter what direction you follow; that is, the second derivative along any direction is nonnegative.

Jensen’s Inequality states that

  • if $f$ is a convex function
  • and $X$ is a (possibly vector-valued) random variable

then

$$\mathrm{E}[f(X)] \;>\; f(\mathrm{E}[X]).$$

(Strictly speaking, you could have $=$ instead of $>$, but only if the probability distribution for $X$ is concentrated at a single point.)

In this case, $X$ is the vector $(X_1, Y_1, X_2, Y_2)$, where $(X_1, Y_1)$ are the measured GPS coordinates for the starting point and $(X_2, Y_2)$ are the measured GPS coordinates for the ending point, and

$$f(X) = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}$$

is the calculated distance between the two points. It is straightforward to show that this distance function is convex: it is the Euclidean norm of a linear function of the coordinates, and a norm composed with a linear map is convex.

Note that $(X_1, Y_1)$ and $(X_2, Y_2)$ are noisy measurements, not the actual (imperfectly known) coordinates. If we assume that the GPS measurements, although noisy, are at least unbiased, then

$$\mathrm{E}[X] = (x_1, y_1, x_2, y_2),$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the actual coordinates. The calculated distance is $f(X)$, the actual distance is $f(\mathrm{E}[X])$, and Jensen's inequality guarantees that

$$\mathrm{E}[f(X)] \;>\; f(\mathrm{E}[X]);$$

that is, on average the distance computed from the noisy measurements overestimates the actual distance.
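
Here is a minimal simulation sketch of this effect in R. The coordinates, noise level, and number of simulated measurement pairs are arbitrary illustrative values, not numbers taken from the paper.

  # Illustrative simulation: noisy, unbiased GPS readings of two fixed points
  # inflate the expected point-to-point distance (Jensen's inequality).
  set.seed(1)

  true_start <- c(0, 0)      # actual coordinates of the starting point
  true_end   <- c(100, 0)    # actual coordinates of the ending point
  sigma      <- 5            # assumed std. dev. of the measurement error
  n_sim      <- 100000       # number of simulated measurement pairs

  meas_start <- cbind(rnorm(n_sim, true_start[1], sigma),
                      rnorm(n_sim, true_start[2], sigma))
  meas_end   <- cbind(rnorm(n_sim, true_end[1], sigma),
                      rnorm(n_sim, true_end[2], sigma))

  # Distance computed from the noisy measurements, f(X)
  calc_dist <- sqrt(rowSums((meas_end - meas_start)^2))

  true_dist <- sqrt(sum((true_end - true_start)^2))
  mean(calc_dist)   # exceeds true_dist, illustrating E[f(X)] > f(E[X])
  true_dist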


March 17, 2016 by kevin@ksvanhorn.com

Analysis of a Nootropics Survey

The Problem

Scott Alexander’s blog Slate Star Codex recently carried the results of a survey of over 850 users of nootropics (cognitive enhancers) such as caffeine, Adderall, and Modafinil. The survey asked respondents to subjectively rate each substance on a scale of 0 to 10, with 0 meaning useless, 1-4 meaning subtle effects, 5-9 meaning strong effects, and 10 meaning life-changing.

There are several difficulties in analyzing this kind of data:

  1. Discretization. Actual effects vary over a continuum, but respondents are asked to choose from a discrete set of choices.
  2. Heterogeneous scale usage. People vary in how they use the scale. Some may spread their responses out more than others. Some may tend to give higher answers than others for the same underlying effect. There may be nonlinearities in how people use the scale.
  3. Meaning. Just what do these ratings mean? How do they translate into specific effects on a person’s mind?

In this note I tackle problems (1) and (2), which are purely technical; for (3) I have no answer. On (2) I restrict my attention to bias and scaling of responses.

The analytic approach used here is a simplified version of the methodology described in this paper.

One final caveat: the survey subjects are a self-selected sample, and hence may not be representative of the general populace. One way of dealing with that issue would be to regress the nootropic effects on various subject characteristics that might be predictive of nootropic effect. I did not do this, although the survey includes questions that could be used for this purpose.

Summary of Results

I did a Bayesian analysis that used a hierarchical prior for the nootropic effects and accounted for scale usage heterogeneity and discretization of responses. The first figure shows estimates for the population mean effect ($\mu_j$ in the model described below) for each nootropic. The black point is the posterior median, the red line is the 80% posterior credible interval, and the black line is the 90% interval.

[Figure: Posterior credible intervals for nootropic effects]

The picture is considerably murkier if you look at the posterior predictive distribution for each nootropic. The effect for an individual is $\mu_j + \delta$, where $\delta$ is an individual deviation from the population mean, with nootropic-specific variance $\tau_j^2$. The next figure shows credible intervals for this individual effect, for each nootropic. These individual effects are quite uncertain: the $\tau_j$ values vary from around 1.9 to around 2.8.

[Figure: Posterior predictive credible intervals]

The Data

The data are provided as a table with one row per subject (survey respondent), and one column per nootropic. There were 36 nootropics mentioned in the survey, but subjects only gave ratings to those they had actually used. The first step is to reshape the data from this “wide” format into a “long” format with columns for subject, nootropic, and response, with each case being some subject’s experience with some nootropic (a sketch of this reshaping appears after the list below). Then

  • $i$ indexes cases;
  • $s_i$ is the subject for case $i$;
  • $j_i$ is the nootropic for case $i$;
  • $r_i$ is the rating subject $s_i$ gave for nootropic $j_i$.
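
Here is a sketch of the reshaping in R. It is not taken from the linked scripts; the data-frame name and column names ('wide', 'subject') are assumptions for illustration.

  # Sketch of the wide-to-long reshaping (assumed object and column names).
  library(tidyr)
  library(dplyr)

  # 'wide' is assumed to have one row per subject, an id column 'subject',
  # and one column per nootropic holding that subject's 0-10 rating (NA if unused).
  long <- wide %>%
    pivot_longer(cols = -subject,
                 names_to = "nootropic",
                 values_to = "rating") %>%
    filter(!is.na(rating))   # keep only nootropics the subject actually rated

  # Each row of 'long' is now one case: (subject, nootropic, rating).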

Discretization

Take each rating $r_i$ to be the binned version of a continuous latent variable $z_i$. For example, a rating of 3 means that $2.5 \le z_i < 3.5$. Similarly, a rating of 0 means $z_i < 0.5$, and a rating of 10 means $z_i \ge 9.5$.

This approach uses a fixed, equally-spaced set of breakpoints $c_r = r - \tfrac{1}{2}$, $1 \le r \le 10$; a further refinement, which I did not explore, would be to infer the breakpoints themselves, perhaps restricting them to some parametric form such as a quadratic in $r$.
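
As a concrete illustration of this binning (a small sketch in R, using the halfway breakpoints described above):

  # Breakpoints c_1, ..., c_10 at 0.5, 1.5, ..., 9.5, with -Inf and Inf at the ends.
  breaks <- c(-Inf, seq(0.5, 9.5, by = 1), Inf)

  # Map a continuous latent value z to the observed 0-10 rating.
  rating_from_latent <- function(z) {
    cut(z, breaks = breaks, labels = 0:10, right = FALSE)
  }

  rating_from_latent(c(-0.3, 3.2, 9.7))   # -> 0, 3, 10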

Scale Usage

One can expect that survey respondents will vary in how they translate their response to a nootropic into a continuous rating $z_i$. Assume that an underlying continuous response $\eta_i$ gets translated into a continuous rating via an individual bias term and scaling factor:

$$z_i = a_{s_i} + b_{s_i}\,\eta_i.$$
Hierarchical priors for the scale-usage parameters are appropriate:

$$a_s \sim \mathrm{N}(\mu_a, \sigma_a^2), \qquad \log b_s \sim \mathrm{N}(0, \sigma_b^2).$$

It probably would have made sense to use a bivariate normal prior on $a_s$ and $\log b_s$ to allow for correlations between them in the population, but I did not explore this option.

The priors on $\mu_a$, $\sigma_a$, and $\sigma_b$ are weakly informative, based on the 0 to 10 scale used:

  • Values of $\mu_a$ outside the range 0 to 10 are implausible.
  • A large value of $\sigma_a$ is implausible, as it allows quite extreme values for $a_s$ to be common.
  • A large value of $\sigma_b$ is implausible, as it means that it would be common to see a large multiplicative difference in the scaling used by two different subjects.

Nootropic effects

The effectiveness of a nootropic will vary over the population; letting $\mu_j$ be the mean effect of nootropic $j$, and $\tau_j^2$ its variance over the population, we have

$$\eta_i \sim \mathrm{N}\!\left(\mu_{j_i},\, \tau_{j_i}^2\right).$$

Since there are 36 different nootropics in the study, I used hierarchical priors for the mean effects and variances:

$$\mu_j \sim \mathrm{N}(\mu_0, \sigma_0^2), \qquad \tau_j \sim \mathrm{N}^{+}(0, \sigma_\tau^2),$$

where $\mathrm{N}^{+}$ denotes a normal distribution restricted to nonnegative values. The priors for $\mu_0$, $\sigma_0$, and $\sigma_\tau$ are again intended to be weakly informative—given the 10-point scale, values of $\mu_0$ larger than 7, values of $\sigma_0$ larger than 2, and values of $\sigma_\tau$ larger than 2 all seem implausibly extreme.

Likelihood

The normal distribution for $\eta_i$ induces a normal distribution for $z_i$, conditional on the other model variables:

$$z_i \sim \mathrm{N}\!\left(m_i,\, v_i^2\right), \qquad m_i = a_{s_i} + b_{s_i}\,\mu_{j_i}, \qquad v_i = b_{s_i}\,\tau_{j_i}.$$

The likelihood for case $i$ in the data set is then given by

$$\Pr(r_i = r) \;=\; \Phi\!\left(\frac{c_{r+1} - m_i}{v_i}\right) - \Phi\!\left(\frac{c_{r} - m_i}{v_i}\right),$$

where $\Phi$ is the CDF for the standard normal distribution, and we take $c_0 = -\infty$ and $c_{11} = +\infty$ to handle the end ratings 0 and 10.
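
A minimal sketch of this likelihood in R, for a single case. It follows the notation above and is an illustration, not an excerpt from the linked scripts.

  # Probability of observing rating r (0-10) for a case with subject bias a,
  # subject scale b, nootropic mean mu, and nootropic sd tau.
  rating_prob <- function(r, a, b, mu, tau) {
    breaks <- c(-Inf, seq(0.5, 9.5, by = 1), Inf)  # c_0, c_1, ..., c_11
    m <- a + b * mu        # mean of the latent continuous rating z
    v <- b * tau           # sd of the latent continuous rating z
    pnorm((breaks[r + 2] - m) / v) - pnorm((breaks[r + 1] - m) / v)
  }

  # Example: an unbiased subject (a = 0, b = 1) rating a nootropic with
  # population mean 4 and sd 2; the probabilities sum to 1 over ratings 0-10.
  sum(sapply(0:10, rating_prob, a = 0, b = 1, mu = 4, tau = 2))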

Estimation Scripts

The R code I used to run the estimation and produce the plots is in these three scripts, which I ran one after the other: nootropics.R, nootropics2.R, and nootropics3.R.


August 14, 2015 by kevin@ksvanhorn.com

Which Link Function — Logit, Probit, or Cloglog?

(PDF)

Introduction

A generalized linear model for binary response data has the form

$$y \sim \mathrm{Bernoulli}(p), \qquad g(p) = x^{\top}\beta,$$

where $y$ is the 0/1 response variable, $x$ is the $k$-vector of predictor variables, $\beta$ is the vector of regression coefficients, and $g$ is the link function. In the Stan modeling language this would be written, schematically, as

  y ~ bernoulli(p);
  g(p) <- dot_product(x, beta);

with g replaced by the name of a link function, and similarly for the BUGS modeling language. (Stan does not actually allow a link function on the left-hand side of an assignment; in real Stan code you apply the inverse link on the right instead, e.g. p <- inv_logit(dot_product(x, beta)). BUGS does accept forms like logit(p) <- ....)

The most common choices for the link function are

  • logit: $g(p) = \log\dfrac{p}{1-p}$;
  • probit: $g(p) = \Phi^{-1}(p)$,

    where $\Phi$ is the cumulative distribution function for the standard normal distribution; and

  • complementary log-log (cloglog): $g(p) = \log(-\log(1-p))$.

All three of these are strictly increasing, continuous functions with $g(p) \to -\infty$ as $p \to 0$ and $g(p) \to +\infty$ as $p \to 1$.

In this note we’ll discuss when to use each of these link functions.
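
For concreteness, here is a small sketch of the three link functions and their inverses, using only base R functions:

  # The three link functions and their inverses.
  logit       <- function(p) log(p / (1 - p))
  probit      <- function(p) qnorm(p)
  cloglog     <- function(p) log(-log(1 - p))

  inv_logit   <- function(eta) plogis(eta)
  inv_probit  <- function(eta) pnorm(eta)
  inv_cloglog <- function(eta) 1 - exp(-exp(eta))

  # All three map (0, 1) onto the whole real line:
  p <- c(0.01, 0.5, 0.99)
  rbind(logit = logit(p), probit = probit(p), cloglog = cloglog(p))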

Probit

The probit link function is appropriate when it makes sense to think of $y$ as obtained by thresholding a normally distributed latent variable $z$:

$$z \sim \mathrm{N}\!\left(x^{\top}\beta,\, 1\right), \qquad y = \begin{cases} 1 & \text{if } z > 0,\\ 0 & \text{otherwise.}\end{cases}$$

Defining $p = \Pr(y = 1 \mid x)$, this yields

$$p = \Pr(z > 0) = \Phi\!\left(x^{\top}\beta\right), \qquad\text{that is,}\qquad \Phi^{-1}(p) = x^{\top}\beta.$$
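
A quick simulation sketch in R checking this relationship; the value of the linear predictor is an arbitrary illustration.

  # Thresholding a latent normal variable reproduces the probit link.
  set.seed(1)
  eta <- 0.7                      # assumed value of x' * beta
  z   <- rnorm(1e6, mean = eta)   # latent variable with sd = 1
  y   <- as.integer(z > 0)

  mean(y)       # approximately ...
  pnorm(eta)    # ... equals Phi(x' * beta)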

Logit

Logit is the default link function to use when you have no specific reason to choose one of the others. There is a specific technical sense in which use of logit corresponds to minimal assumptions about the relationship between $x$ and $y$. Suppose that we describe the joint distribution for $x$ and $y$ by giving

  • the marginal distribution for $x$, and
  • the expected value of $x_j\,y$ for each predictor variable $x_j$.

Then the maximum-entropy (most spread-out, diffuse, least concentrated) joint distribution for $x$ and $y$ satisfying the above description has a pdf of the form

$$f(x, y) = \frac{1}{Z}\, h(x)\, \exp\!\left(y\, x^{\top}\beta\right)$$

for some function $h$, coefficient vector $\beta$, and normalizing constant $Z$. The conditional distribution for $y$ is then

$$\Pr(y = 1 \mid x) = \frac{\exp\!\left(x^{\top}\beta\right)}{1 + \exp\!\left(x^{\top}\beta\right)},$$

and so

$$\operatorname{logit}(p) = \log\frac{p}{1 - p} = x^{\top}\beta.$$

Cloglog

The complementary log-log link function arises when

$$y = \begin{cases} 1 & \text{if } n > 0,\\ 0 & \text{if } n = 0,\end{cases}$$

where $n$ is a count having a Poisson distribution:

$$n \sim \mathrm{Poisson}(\lambda), \qquad \log\lambda = x^{\top}\beta.$$

To see this, let $p = \Pr(y = 1 \mid x)$. Then

$$p = 1 - \Pr(n = 0) = 1 - \exp(-\lambda),$$

and so

$$\log(-\log(1 - p)) = \log\lambda = x^{\top}\beta.$$
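
A simulation sketch in R illustrating the same relationship; again, the value of the linear predictor is an arbitrary illustration.

  # The indicator of a nonzero Poisson count reproduces the cloglog link.
  set.seed(1)
  eta    <- -0.3                   # assumed value of x' * beta
  lambda <- exp(eta)               # Poisson rate
  n      <- rpois(1e6, lambda)     # latent count
  y      <- as.integer(n > 0)

  mean(y)                  # approximately ...
  1 - exp(-lambda)         # ... the model's P(y = 1)
  log(-log(1 - mean(y)))   # approximately eta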

Conclusion

In summary, here is when to use each of the link functions:

  • Use probit when you can think of $y$ as obtained by thresholding a normally distributed latent variable.
  • Use cloglog when $y$ indicates whether a count is nonzero, and the count can be modeled with a Poisson distribution.
  • Use logit if you have no specific reason to choose some other link function.

