Study of the Week: We’ll Only Scale Up the Good Ones

When it comes to education research and public policy, scale is the name of the game.

Does pre-K work? Left-leaning people (that is, people who generally share my politics) tend to be strong advocates of these programs. It’s true that, in general, it’s easier to get meaningful educational benefits from interventions in early childhood than later in life. And pre-K proponents tend to cite some solid studies that show some gains relative to peer groups, though these gains are generally modest and tend to fade out over time. Unfortunately, while some of these studies have responsible designs, many that are still cited are old, from small programs, or both.

Today’s Study of the Week, by Mark W. Lipsey, Dale C. Farran, and Kerry G. Hofer, is a much-discussed, controversial study from Tennessee’s Voluntary Prekindergarten Program. The Vanderbilt University researchers investigated the academic and social impacts of the state’s pre-K programs on student outcomes. The study we’re looking at is a randomized experimental design, which was pulled from a larger observational study. The Tennessee program, in some locales, had more applicants than available seats. Those seats were filled by random lottery, creating natural control and experimental groups.

There is one important caveat here: the students examined in the intensive portion of the research had to be selected from those whose parents gave consent. That’s about a third of the potential students. This is a potential source of bias. While the randomized design will help, what we can responsibly say is that we have random selection within the group of students whose parents opted in, but with a nonrandom distribution relative to the overall group of students attending this program. I don’t think that’s a particularly serious problem, but it’s a source of potential selection bias and something to be aware of. There’s also my persistent question about the degree to which school selection lotteries can be gamed by parents and administrators. There are lots of examples of this happening. (Here’s one at a much-lauded magnet school in Connecticut.) Most people in the research field seem not to see this as a big concern. I don’t know.

In any event, the results of the research were not encouraging. Researchers examined six identified subtests (two language, two literacy, two math) from the Woodcock-Johnson tests of cognitive ability, a well-validated and widely-used battery of tests of student academic and intellectual skills. They also looked at a set of non-cognitive abilities related to behavior, socialization, and enthusiasm for school. A predictable pattern played out. Students who attended the Tennessee pre-K program saw significant short-term gains relative to their peers who did not attend the program. But over time, the peer group caught up, and in this study in fact exceeded the test group. That is, students who attended Tennessee’s pre-K program ended up actually underperforming those who were not selected into it.

By the end of kindergarten, the control children had caught up to the TN-VPK children and there were no longer significant differences between them on any achievement measures. The same result was obtained at the end of first grade using both composite achievement measures. In second grade, however, the groups began to diverge with the TN-VPK children scoring lower than the control children on most of the measures…. In terms of behavioral effects, in the spring the first grade teachers reversed the fall kindergarten teacher ratings. First grade teachers rated the TN-VPK children as less well prepared for school, having poorer work skills in the classrooms, and feeling more negative about school.

This dispiriting outcome mimics that of the Head Start study, another much-discussed, controversial study that found similar outcomes: initial advantages for Head Start students that are lost entirely by 3rd grade.

Further study is needed, but it seems that the larger and more representative the study, the less impressive – and the less persistent – the gains from pre-K. There’s a bit of uncertainty here about whether the differences in outcomes are really the product of differences in programs or due to differences in the research itself. And I don’t pretend that this is a settled question. But it is important to recognize that the positive evidence for pre-K comes from smaller, higher-resource, more-intensive programs. Larger programs have far less encouraging outcomes.

The best guess, it seems to me, is that at scale universal pre-K programs would function more like the Tennessee system and less like the small, higher-performing programs. That’s because scaling up any major institutional venture, in a country the size of the United States, is going to entail the inevitable moderating effects of many repetitions. That is, you can build one school or one program and invest a lot of time, effort, and resources into making it as effective as possible, and potentially see significant gains relative to other schools. But it strikes me as a simple statement of the nature of reality that this intensity of effort and attention can’t scale. As Farran and Lipsey say in a Brookings Institution essay, “To assert that these same outcomes can be achieved at scale by pre-K programs that cost less and don’t look the same is unsupported by any available evidence.”

Some will immediately say, well, let’s just pay as much for large-scale pre-K as they do in the other programs and model their techniques. The $26 billion question is, can you actually do that? Can what makes these programs special actually be scaled? Is there hidden bias here that will wash out as we expand the programs? I confess I’m skeptical that we’ll see these quantitative gains under even the best scenario. I think we need to understand the inevitability of mediocrity and regression to the mean. That doesn’t mean I don’t support universal pre-kindergarten childcare. As with after-school programs, I do, for social and political reasons, not out of any conviction that they’ll change test scores much. I’d be happy to be proven wrong.

Now I don’t mean to extrapolate irresponsibly. But allow me to extrapolate irresponsibly: isn’t this precisely what we should expect with charter schools, too? We tend to see, survivorship-bias-heavy CREDO studies aside, that at scale the median charter school does little or nothing to improve on traditional public schools. We also see a number of idiosyncratic, high-intensity, high-attention charters that report better outcomes. The question you have to ask, based on how the world works, is which is more likely to be replicated at scale – the median, or the exceptions?

I’ve made this point before about Donald Trump’s favorite charter school network, Success Academy, here in New York. Let’s set aside questions of the abusive nature of the teaching that goes on in these schools. The basic charter proponent argument is that these schools succeed because they can fire bad teachers and replace them with good ones. Success Academy schools are notoriously high-stress, long-hours, low-pay affairs. This leads naturally to high teacher attrition. Luckily for the NYC-based Success Academy, New York is filled with eager young people who want to get a foothold in the city, do some do-goodering, then bail for their “real” careers later on – essentially replicating the Teach for America model. So: even if we take all of the results from such programs at face value, do you think this is a situation that can be scaled up in places that are far less attractive to well-educated, striving young workers? Can you get that kind of churn and still get the more talented candidates you say you need, at no higher cost, to come to the Ozarks or Flint, Michigan or the Native American reservations? Can you have a national profession of 3 million people, already caught in a teacher shortage, and then replicate conditions that lead to somewhere between 35% and 50% annual turnover, depending on whose numbers you trust?

And am I really being too skeptical if my assumption is to say no, you can’t?

public services are not an ATM

Built into the rhetoric of school choice is a deeply misguided vision of how public investment works.

You sometimes hear people advocating for charters or voucher programs by saying that parents just want to take “their share” of public education funds and use it to get their child an education, whether by siphoning it from traditional public schools towards charters or by cutting checks to private schools. The money should “follow the child,” to use another euphemism. But this reflects a strange and deeply conservative vision of how public spending works. There is no “your share” of public funds. There is the money that we take via taxation from everyone, which represents the pooled resources of civic society, and there is what civic society decides to spend it on via the democratic process. You might use that democratic process to create a system where some of the money goes to charter schools or private school vouchers or all manner of things I don’t approve of. But it’s not your money, no matter how much you paid into taxes. And the distinction matters.

To begin with, the constantly-repeated claim that charter schools don’t cost traditional public schools money has been proven wrong again and again. People lay out theoretical systems where they don’t, as if you could subtract one student and all of the costs associated with that student and simply shift the kid and the money to another school. But this reflects a basic failure to understand pooled costs and economies of scale. And when we go looking, that’s what we find: after years of promises that charters are not an effort to defund traditional public schools, our reality checks show they have exactly that effect. Take Chicago, where the charter school system has absolutely contributed to the fiscal crisis in the traditional public schools. Or Nashville. Or Los Angeles. I could go on.

But suppose we knew that we could extract exactly as much, dollar for dollar and student for student, from public education for each student who leaves. Would that be a wise thing to do? Not according to any conventional progressive philosophy towards government.

Do we let you take “your share” out of the public transportation system so that you can use it to defray the cost of buying your own car? Can you take “your share” out of the police budgets to hire your own private security? Can I extract my tax dollars from the public highway system I almost never use in order to build my own bike lanes? Of course not. In many cases this simply wouldn’t make sense; how can you extract your share from a building, or a bridge, or any other type of physical infrastructure? And besides: the basic progressive nature of public ownership means that we are pooling resources so that those who have the least ability to pay for their own services can benefit from the contributions of those with the most ability to pay. To advance the notion of people pulling “their” tax dollars out from public schools undermines the very conception of shared social spending. And governmental spending should require true democratic accountability; letting the Bill and Melinda Gates Foundation dictate public education policy, Mark Zuckerberg become the wholly unqualified education czar of Newark, or the Catholic church control public education dollars through voucher programs directly undermines that accountability.

So of course there’s a deep and widening split opening up within the school reform coalition, which has always been filled with self-styled progressives. There’s a major, existential disagreement at play about the basic concepts of social spending and the public good. These have been papered over for years by the missionary zeal of choice acolytes and their crisis narrative. But there was never a coherent progressive political philosophy underneath. The Donald Trump and Betsy DeVos education platform is a disaster in the making, but at least it has brought these basic conflicts into the light. These issues are not going away, nor should they, and the “progressive” ed reform movement is going to have to do a lot of soul searching.

Reporting Regression Results Responsibly

We’re in a Golden Age for access to data, which unfortunately also means we’re in a Golden Age for the potential to misinterpret data. Though the absurdity of gated academic journals persists, academic research is more accessible now than ever before. We’ve also seen a rapid growth in the use of arguments based on statistics in the popular media in the last several years. This is potentially a real boon to our ability to understand the world around us, but it carries with it all of the potential for misleading statistical arguments.

My request is pretty simple. All statistical techniques, particularly the basic parametric statistical techniques that are most likely to show up in data journalism, require the satisfaction of assumptions and checking of diagnostic measures to ensure that hidden bias isn’t misleading us. Many of these assumptions and diagnostics are ultimately judgment calls, relying on practitioners to make informed decisions about what degree of wiggle room is appropriate given the research scenario. There are, however, conventions and implied standards that people can use to guide their decisions. The most important and useful kind of check, though, is the eyes of other researchers. Given that hosting graphs, tables, and similar kinds of data online is simple and nearly free, I think that researchers and data journalists alike should provide links to their data and to the graphs and tables they use to check assumptions and diagnostic measures. In the digital era, it’s crazy that this is still a rare practice. I don’t expect to find these graphs and tables sitting square in the center of a blog post, and I expect that 90% of readers wouldn’t bother to look. But there’s nothing to risk in having them available, and transparency, accountability, and collaboration to gain.

That’s the simple part, and you can feel free to close the tab. For a little more:

What kind of assumptions and diagnostics am I talking about? Let’s consider the case of one of the most common types of parametric methods, linear regression. Whether we have a single predictor for simple linear regression or multiple predictors for multiple regression, fundamentally regression is a matter of assessing the relationship between quantitative (continuous) predictor variables and a quantitative (continuous) outcome variable. For example, we might ask how well SAT scores predict college GPA; we might ask how well age, weight, and height predict blood pressure. When someone talks about how one number predicts another, the strength of their relationship, and how we might attempt to change one by changing the other, they’re probably making an appeal to regression.
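
To make this concrete, here’s a minimal sketch of fitting a simple linear regression in Python. The SAT and GPA numbers are simulated for illustration; nothing here is real data.

```python
# A toy simple linear regression. The SAT and GPA numbers are simulated
# for illustration only -- nothing here is real data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
sat = rng.uniform(400, 1600, size=200)                    # predictor (x)
gpa = 1.0 + 0.0015 * sat + rng.normal(0, 0.4, size=200)   # outcome (y), with noise

X = sm.add_constant(sat)        # adds the intercept term
model = sm.OLS(gpa, X).fit()    # ordinary least squares

print(model.params)             # estimated intercept and slope
print(model.rsquared)           # share of y's variance explained by x
```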

The types of regression analysis, and the issues therein, are vast, and there are many technical issues at play that I’ll never understand. But I think it’s worthwhile to talk about some of the assumptions we need to check and some problems we have to look out for. Regression has come in for a fair amount of abuse lately from sticklers and skeptics, and not for no reason; it’s easy to use the techniques irresponsibly. But we’re inevitably going to ask basic questions of how X and Y predict Z, so I think we should expand public literacy about these things. I want to talk a little bit about these issues not because I think I’m qualified to teach statistics to others, or because regression is the only statistical process that we need to see assumptions and diagnostics for. Rather, I think regression is an illustrative example through which to explore why we need to check this stuff, to talk about both the power and pitfalls of public engagement with data.

There are four assumptions that need to be true to run a linear (least squares) regression: independence of observations, linearity, constancy of variance, and normality. (Some purists add a fifth, existence, which, whatever.)

Independence of Observations

This is the biggie, and it’s why doing good research can be so hard and expensive. It’s the necessary assumption that one observation does not affect another. This is the assumption that requires randomness. Remember that in statistics error, or necessary and expected variation, is inevitable, but bias, or the systematic influence on observations, is lethal.

Suppose you want to see how eating ice cream affects blood sugar level. You gather 100 students into the gym and have them all eat ice cream. You then go one by one through the students and give them a blood test. You dutifully record everyone’s values. When you get back to the lab, you find that your data does not match that of much of the established research literature. Confused, you check your data again. You use your spreadsheet software to arrange the cells by blood sugar. You find a remarkably steady progression of results running higher to lower. Then it hits you: it took you several hours to test the 100 students. The highest readings are all from the students who were first to be tested, the lowest from those who were tested last. Your data was corrupted by an uncontrolled variable, time-after-eating-to-test. Your observations were not truly independent of each other – one observation influenced another because taking one delayed taking the other. This is an example that you’d hope most people would avoid, but the history of research is the history of people making oversights that were, in hindsight, quite obvious.

Independence is scary because threats to it so often lurk out of sight. And the presumption of independence often prohibits certain kinds of analysis that we might find natural. For example, think of assigning control and test conditions to classes rather than individual students in educational research. This is often the only practical way to do it; you can’t fairly ask teachers to teach half their students with one technique and half with another. You give one set of randomly-assigned classes a new pedagogical technique, while using the old standard with your control classes. You give a pre- and post-test to both and pop both sets of results in an ANOVA. You’ve just violated the assumption of independence. We know that there are clustering effects of children within classrooms; that is, their results are not entirely independent of each other. We can correct for this sort of thing using techniques like hierarchical modeling, but first we have to recognize that those dangers exist!
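
To make the fix concrete, here’s a toy sketch of the hierarchical approach using statsmodels’ MixedLM with simulated classrooms. All the numbers are invented; the point is only the shape of the correction.

```python
# Toy sketch: correcting for classroom clustering with a random intercept.
# All numbers are simulated; the invented "class_effect" is what makes
# students within a classroom non-independent.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_classes, n_students = 20, 25
classroom = np.repeat(np.arange(n_classes), n_students)
condition = np.repeat(rng.integers(0, 2, n_classes), n_students)   # assigned per class
class_effect = np.repeat(rng.normal(0, 5, n_classes), n_students)  # shared within class
pretest = rng.normal(50, 10, n_classes * n_students)
posttest = pretest + 2 * condition + class_effect + rng.normal(0, 5, pretest.size)

df = pd.DataFrame(dict(classroom=classroom, condition=condition,
                       pretest=pretest, posttest=posttest))

# A random intercept per classroom lets students in the same class be
# correlated, instead of pretending all 500 observations are independent.
mixed = smf.mixedlm("posttest ~ pretest + condition", df,
                    groups=df["classroom"]).fit()
print(mixed.summary())
```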

Independence is the assumption that is least subject to statistical correction. It’s also the assumption that is the hardest to check just by looking at graphs. Confidence in independence stems mostly from rigorous and careful experimental design. You can check a graph of your observations (your actual data points) against your residuals (the distance between your observed values and the linear progression from your model), which can sometimes provide clues. But ultimately, you’ve just got to know your data was collected appropriately. On this one, we’re largely on our own. However, I think it’s a good idea for academic researchers to provide online access to a Residuals vs. Observations graph when they run a regression. This is very rare, currently.

Here’s a Residuals vs. Observations graph I pulled off of Google Images. This is what we want to see: snow. Clear nonrandom patterns in this plot are bad.
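
If you’d rather generate one yourself than scrape Google Images, a sketch like the following works. Note that it plots residuals against fitted values, a common variant of the same check.

```python
# Sketch: generating a residual plot for a fitted model. With well-behaved
# data, the result should look like "snow" -- no visible pattern.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 150)
y = 3 + 2 * x + rng.normal(0, 1, 150)   # a clean linear relationship

fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```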

Linearity

The name of the technique is linear regression, which means that observed relationships should be roughly linear to be valid. In other words, you want your relationship to fall along a more or less linear path as you move across the x axis; the relationship can be weaker or it can be stronger, but you want it to be more or less as strong as you move across the line. This is particularly the case because curvilinear relationships can appear to regression analysis to be no relationship. Regression is all about interpolation: if I check my data and find a strong linear relationship, and my data has a range from A to B, I should be able to check any x value within A and B and have a pretty good prediction for y. (What “pretty good” means in practice is a matter of residuals and r-squared, or the portion of the variance in y that’s explained by my xs.) If my relationship isn’t linear, my confidence in that prediction is unfounded.

Take a look at these scatter plots. Both show close to zero linear relationship according to Pearson’s product-moment coefficient:

And yet clearly, there’s something very different going on from one plot to the next. The first is true random variance; there is no consistent relationship between our x and y variables. The second is a very clear association; it’s just not a linear relationship. The degree and direction of y varying along x changes over different values for x. Failure to recognize that non-linear relationship could compel us to think that there is no relationship at all. If the violation of linearity is as clear and consistent as in this scatter plot, it can be cleaned up fairly easily by transforming the data.
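
You can reproduce the gist of those two plots with simulated data. Pearson’s r comes back near zero in both cases, even though the second relationship could hardly be stronger:

```python
# Two relationships, both with Pearson's r near zero -- but only one of
# them is actually structureless.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 500)

y_noise = rng.normal(0, 1, 500)            # pure noise: no relationship at all
y_curve = x**2 + rng.normal(0, 0.5, 500)   # strong relationship, just not linear

print(pearsonr(x, y_noise))   # r near 0, correctly
print(pearsonr(x, y_curve))   # r also near 0, completely missing the curve
```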

Regression is fairly robust to violations of linearity, and it’s worth noting that any relationship with a correlation sufficiently below 1 will be non-linear in the strict sense. But clear, consistent curves in the data can invalidate our regression analyses.

Readers could check data for linearity if scatter plots are posted for simple linear regression. For multiple regression, it’s a bit messier; you could plot every individual predictor, but I would be satisfied if you just mention that you checked linearity.

Constancy of variance

Also known by one of my very favorite ten-cent words, homoscedasticity. Constancy of variance means that, along your range of x predictors, your y varies about the same amount; it has as much spread, as much error, at one end as at the other. Remember, when I’m doing inferential statistics, I’m sampling, and sampling means sampling error – even if I’m getting quality results, I’m inevitably going to get differences in my data from one collection of samples to the next. But if our assumptions are true, we can trust that those samples will vary in predictable intervals relative to the true mean. That is, if an SAT score predicts freshman year GPA with a certain degree of consistency for students scoring 400, it should be about as consistent for students scoring 800, 1200, and 1600, even though we know that from one data set to the next, we’re not going to get the exact same values even if we assume that all of the variables of interest are the same. We just need to know that the degree to which they vary for a given x value is constant over our range.

Why is this important? Think again about interpolation. I run a regression because I want to understand a relationship between various quantitative variables, and often because I want to use my predictor variables to… predict. Regression is useful insofar as I can move along the axes of my x values and produce a meaningful, subject-to-error-but-still-useful value for y. Violating the assumption of constant variance means that you can’t predict y with equal confidence as you move around x(s); the relationship is stronger at some points than others, making you vulnerable to inaccurate predictions.

Here’s a residuals plot showing the dreaded megaphone effect: the error (size of residuals, difference between observations and results expected from the regression equation) increases as we move from low to high values of x. The relationship is strong at low values of x and much weaker at high values.

We could check homoscedasticity by having access to residual plots. Violations of constant variance can often be fixed via transformation, although it may often be easier to use techniques that are more inherently robust to this violation, such as quantile regression.
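
If you want something more formal than eyeballing the residual plot, the Breusch-Pagan test is one common check. Here’s a sketch on deliberately heteroscedastic simulated data:

```python
# Simulated "megaphone" data, where the error grows with x, and one common
# formal check for it: the Breusch-Pagan test from statsmodels.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 300)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)   # noise scale increases with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)   # a small p-value is evidence against constant variance
```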

Normality

The concept of the normal distribution is at once simple and counterintuitive, and I’ve spent a lot of my walks home trying to think of the best way to explain it. The “parametric” in parametric statistics refers to the assumption that there is a given underlying distribution for most observable data, and frequently this distribution is the normal distribution or bell curve. Think of yourself walking down the street and noticing that someone is unusually tall or unusually short. The fact that you notice is in and of itself a consequence of the normal distribution. When we think of someone as unusually tall or short, we are implicitly assuming that we will find fewer and fewer people as we move further along the extremes of the height distribution. If you see a man in North America who is 5’10”, he is above average height, but you wouldn’t bat an eye; if you see a man who is 6’3”, you might think to yourself, that’s a tall guy; when you see someone who is 6’9”, you say, wow, he is tall!, and when you see a 7-footer, you take out your cell phone. This is the central meaning of the normal distribution: that the average is more likely to occur than extremes, and that the relationship between position on the distribution and probability of occurrence is predictable.

Not everything in life is normally distributed. Poll 1,000 people and ask how much money they received in car insurance payments last year and it won’t look normal. But a remarkable number of naturally occurring phenomena are normally distributed, simply thanks to the reality of numbers and extremes, and the central limit theorem teaches us that essentially all averages are normally distributed. (That is, if I take a 100-person sample of a population for a given quantitative trait, I will get a mean; if I take another 100-person sample, I will get a similar but not exact mean, and so on. If I plot those means, they will be normal even if the overall distribution is not.)
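
This is easy to demonstrate for yourself. Here’s a sketch drawing from a sharply skewed exponential distribution:

```python
# The central limit theorem in miniature: individual draws come from a
# sharply skewed exponential distribution, but the means of 100-draw
# samples pile up in a bell around the true mean.
import numpy as np

rng = np.random.default_rng(5)
draws = rng.exponential(scale=2.0, size=(10_000, 100))   # 10,000 samples of n=100
sample_means = draws.mean(axis=1)

print(sample_means.mean())   # close to the true mean, 2.0
print(sample_means.std())    # close to 2.0 / sqrt(100) = 0.2
# A histogram of sample_means looks like a bell curve even though a
# histogram of the raw draws is sharply skewed.
```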

The assumption of normality in regression requires our data to be roughly normally distributed; in order to assess the relationship of y as it moves across x, we need to know the relative frequency of extreme observations to observations close to the mean. It’s a fairly robust assumption, and you’re never going to have perfectly normal data, but too strong of a violation will invalidate your analysis. We check normality with what’s called a qq plot. Here’s an almost-perfect one, again scraped from Google Images:

That strongly linear, nearly 45 degree angle is just what we want to see. Here’s a bad one, demonstrating the “fat tails” phenomenon – that is, too many observations clustered at the extremes relative to the mean:

Usually the rule is that unless you’ve got a really clear break from a straightish 45 degree angle, you’re probably alright. When the going gets tough, seek help from a statistician.
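
If you want to make one for your own residuals, statsmodels will draw it for you. A sketch with simulated, well-behaved data:

```python
# A Q-Q plot of regression residuals with statsmodels.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 200)
y = 1 + 0.8 * x + rng.normal(0, 1, 200)   # normal errors: should look clean

fit = sm.OLS(y, sm.add_constant(x)).fit()
sm.qqplot(fit.resid, line="45", fit=True)   # points hugging the line = good
plt.show()

# Swapping the noise for something heavy-tailed, like rng.standard_t(2, 200),
# produces the "fat tails" S-shape described above.
```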

Diagnostics

OK, so 2000 words into this thing, we’ve checked out four assumptions. Are we good? Well, not so fast. We need to check a few diagnostic measures, or what my stats instructor used to call “the laundry list.” This is a matter of investigating influence. When we run an analysis like regression, we’re banking on the aggregate power of all of our observations to help us make responsible observations and inferences. We never want to rely too heavily on individual or small numbers of observations because that increases the influence of error in our analysis. Diagnostic measures in regression typically involve using statistical procedures to look for influential observations that have too much sway over our analysis.

The first thing to say about outliers is that you want a systematic reason for eliminating them. There are entire books about the identification and elimination of outliers, and I’m not qualified to say what the best method is in any given situation. But you never want to toss an observation simply because it would help your analysis. When you’ve got that one data point that’s dragging your line out of significance, it’s tempting to get rid of it, but you want to analyze that observation for a methodology-internal justification for eliminating it. On the other hand, sometimes you have the opposite situation: your purported effect is really the product of a single or small number of influential outliers that have dragged the line in your favor (that is, to a p-value you like). Then, of course, the temptation is simply to not mention the outlier and publish anyway. Especially if a tenure review is in your future…

Some examples of influential observation diagnostics in regression include examining leverage, or outliers in your predictors that have a great deal of influence on your overall model; Cook’s Distance, which tells you how different your model would be if you deleted a given observation; DFBetas, which tell you how a given observation influences a particular parameter estimate; and more. Most modern statistical packages like SAS or R have commands for checking diagnostic measures like these. While offering numbers would be nice, I would mostly like it if researchers reassured readers that they had run diagnostic measures for regression and found acceptable results. Just let me know: I looked for outliers and influential observations and things came back fairly clean.
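
In statsmodels, for instance, all of these hang off a single influence object. A minimal sketch with simulated data (the 4/n screen for Cook’s distance is one common rule of thumb, not gospel):

```python
# The influence "laundry list" for an OLS fit, via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100)
y = 2 + 1.5 * x + rng.normal(0, 1, 100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

leverage = influence.hat_matrix_diag    # pull of each observation's x values
cooks_d, _ = influence.cooks_distance   # change in the model if dropped
dfbetas = influence.dfbetas             # per-observation effect on each coefficient

# One common rough screen: flag observations with Cook's distance above
# 4/n for a closer look.
print(np.where(cooks_d > 4 / len(x))[0])
```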

*****

Regression is just one part of a large number of techniques and applications that are happening in data journalism right now. But essentially any statistical techniques are going to involve checking assumptions and diagnostic measures. A typical ANOVA, for example, the categorical equivalent of regression, will involve checking some of the same assumptions. In the era of the internet, there is no reason not to provide a link to a brief, simple rundown of what quality controls were pursued in your analysis.

None of these things are foolproof. Sums of squares are spooky things; we get weird results as we add and remove predictors from our models. Individual predictors are strongly significant by themselves but not when added together; models are significant with no individual predictors significant; individual predictors are highly significant without model significance; the order you put your predictors in changes everything; and so on. It’s fascinating and complicated. We’re always at the mercy of how responsible and careful researchers are. But by sharing information, we raise the odds that what we’re looking at is a real effect.

This might all sound like an impossibly high bar to clear. There are so many ways things can go wrong. And it’s true that, in general, I worry that people today are too credulous towards statistical arguments, which are often advanced without sufficient qualifications. There are some questions where statistics more often mislead than illuminate. But there is a lot we can and do know. We know that age is highly predictive of height in children but not in adults; we know that there is a relationship between SAT scores and freshman year GPA; we know point differential is a better predictor of future win-loss record than past win-loss record. We can learn lots of things, but we always do it better together. So I think that academic researchers and data journalists should share their work to a greater degree than they do now. That requires a certain compromise. After all, it’s scary to have tons of strangers looking over your shoulder. So I propose that we get more skeptical and critical about statistical arguments as a media and readership, but more forgiving of individual researchers who are, after all, only human. That strikes me as a good bargain.

And one I’m willing to make myself, so please email me to point out the mistakes I’ve inevitably made in this post.

diversifying the $5 reward tier

Hey gang, first I’m sorry content has been a bit light on the main site this week. Good things are coming in bunches soon. I have been releasing archival content to all subscribers on the Patreon page at a steady clip. I wanted to let you know that I’ve decided to diversify the $5 patron content a little. It’s not so much that I’m not keeping up with the book reading – it’s been a bit tough but not bad – but rather that I’m feeling a little constrained by the review format. So I’m going to alternate between book reviews and more general cultural writing, reading recommendations, considerations of contemporary criticism, etc. There will still not be any explicitly political content, which I host on Medium.

Book reviews return this weekend at last, though, and thanks for your patience. I’ve got a number of good ones coming up. Thank you for your continued support. If you aren’t yet a Patreon patron, please consider it. Also, thanks so much for the emails, and I apologize if I haven’t gotten back to you. I’ve taken some unexpected heat lately, and the support means more than I can say.

g-reliant skills seem most susceptible to automation

This post is 100% informed speculation.

As someone who is willing to acknowledge that IQ tests measure something real, measurable, and largely persistent, I take some flak from people who are skeptical of such metrics. As someone who does not think that IQ (or g, the general intelligence factor that IQ tests purport to measure) is the be-all, end-all of human worth, I take some flak from the internet’s many excitable champions of IQ. This is one of those things where I get accused of strawmanning – “nobody thinks IQ measures everything worthwhile!” – but please believe me that long experience shows that there are an awful lot of very vocal people online who are deeply insistent that IQ measures not just raw processing power but all manner of human value. Like so many other topics, IQ seems to be subject to a widespread binarism, with most people clustered at two extremes and very few with more nuanced positions. It’s kind of exhausting.

I want to make a point that, though necessarily speculative, seems highly intuitive to me. If we really are facing an era where superintelligent AI is capable of automating a great many jobs out from under human workers, it seems to me that many g-reliant jobs are precisely the ones most likely to be automated away. If the g factor represents the ability to do raw intellectual processing, then it seems likely to me that it will become less economically valuable when such processing is offloaded to software. IQ-dominant tasks in specific domains like chess have already been conquered by task-specific AI. It doesn’t seem like a stretch to me to suggest that more obviously vocational skills will be colonized by new AI systems.

Meanwhile, contrast this with professions that are dependent on “soft” skills. Extreme IQ partisans are very dismissive of these things, often arguing that they aren’t real or that they’re just correlated with IQ anyway. But I believe that there are social, emotional, and therapeutic skills that are not validly measured by IQ tests, and these skills strike me as precisely those that AI will have the hardest time replicating. Human social interactions are incredibly complex and are barely understood by human observers who are steeped in them every day. And human beings need each other; we crave human contact and human interaction. It’s part of why people pay for human instructors in all sorts of tasks that they could learn from free online videos, why we pay three times as much for a drink at a bar as we would pay to mix it at home, why we have set up these odd edifices like coworking spaces that simply permit us to do solo tasks surrounded by other human beings. I don’t really know what’s going to happen with automation and the labor market; no one does. But that so many self-identified smart people are placing large intellectual bets on the persistent value of attributes that computers are best able to replicate seems very strange to me.

You could of course go too far with this. I don’t think that people at the very top of their games need to worry too much; research physicists, for example, probably combine high IQs and a creative/imaginative capacity we haven’t yet really captured in research. But the thing about these extremely high performers is that they’re so rare that they’re not really relevant from a big picture perspective anyway. It’s the larger tiers down, the people whose jobs are g-dependent but who aren’t part of a truly small elite, that I think should worry – maybe not that group today, but its analog 50 or 100 years from now. I mean, despite all of the “teach a kid to code” rhetoric, computer science is probably a heavily IQ-screened field and it’s silly to try and push everyone into it anyway. But even beyond that… someday it’s code that will write code.

Predictions are hard, especially about the future. I could be completely wrong. But this seems like an intuitively persuasive case to me, and yet I never hear it discussed much. That’s the problem with the popular conversation on IQ being dominated by those who consider themselves to have high IQs; they might have too much skin in the game to think clearly.

Study of the Week: Of Course Virtual K-12 Schools Don’t Work

This one seems kind of like shooting fish in a barrel, but given that “technology will solve our educational problems” is holy writ among the Davos crowd no matter what the evidence, I suppose this is worth doing.

Few people would ever come out and say this, but central to assumptions about educational technology is that human teachers are an inefficiency to be removed from the system by whatever means possible. Right now, not even the most credulous Davos type, nor the most shameless ed tech profiteer, is making the case for fully automated AI-based instruction. But attempts to dramatically increase the number of students that you can force through the capitalist pipeline at low cost – sorry, that you can help nurture and grow – are well under way, typically by using digital systems to let one teacher teach more students than you’d see in a brick-and-mortar classroom. This also cuts down on the costs of facilities, which give kids a safe and engaging place to go every day but which are expensive. So you build a virtual platform, policy types use words like “innovation” and “disrupt,” and for-profit entities start sucking up public money with vague promises of deliverance-through-digital-technology. Kids and parents get “choice,” which the ed reform movement has successfully branded as a good thing even though at scale school choice has not been demonstrated to have any meaningful relationship to improved outcomes at all.

Today’s Study of the Week, from a couple years ago, takes a look at whether these virtual K-12 schools actually, you know, work. It’s a part of the CREDO project. I have a number of issues, methodological and political, with the CREDO program generally, but I still think this is high-quality data. It’s a large data set that compares the outcomes of students in traditional public schools, brick and mortar charters, and virtual charters. The study uses a matched data method – in simple terms, comparing students from the different “conditions” who match on a variety of demographic and educational metrics in order to attempt to control for construct-irrelevant variance. This can help ameliorate some of the problems with observational studies, but bear in mind that once again, this is not the same as a true randomized controlled trial. They had to do things this way because online charter seats are not assigned via lottery. (For the record, I do not trust the randomization effects of such lotteries because of the many ways in which they are gamed, but here that’s not even an issue because there’s no lottery at all.)

The matched variables, if you’re curious (a toy sketch of the matching idea follows the list):

• Grade level
• Gender
• Race/Ethnicity
• Free or Reduced-Price Lunch Eligibility
• English Language Learner Status
• Special Education Status
• Prior test score on state achievement test
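
To give a flavor of the matching idea, here’s a toy sketch with invented data and an abbreviated variable list. CREDO’s actual matching procedure is considerably more elaborate than a naive exact match like this.

```python
# Toy sketch of a matched comparison. The data and column names are
# invented and the variable list abbreviated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
n = 2000
df = pd.DataFrame({
    "sector": rng.choice(["virtual_charter", "traditional_public"], n),
    "grade": rng.integers(3, 9, n),
    "frl_eligible": rng.integers(0, 2, n),
    "prior_score_band": rng.integers(1, 6, n),   # binned prior test score
    "growth": rng.normal(0, 1, n),               # outcome of interest
})

match_vars = ["grade", "frl_eligible", "prior_score_band"]

virtual = df[df["sector"] == "virtual_charter"]
tps = df[df["sector"] == "traditional_public"]

# Exact matching: pair students who agree on every matching variable,
# then compare outcomes within the matched pairs.
matched = virtual.merge(tps, on=match_vars, suffixes=("_virt", "_tps"))
print((matched["growth_virt"] - matched["growth_tps"]).mean())
```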

So how well do online charters work? They don’t. They don’t work. Look at this.

Please note that, though these negative effect sizes may not seem that big to you, in a context where most attempted interventions are not statistically different from zero, they’re remarkable. I invite you to look at the “days of learning lost” scale on the right of the graphic. There are only 180 days in the typical K-12 school year! This is educational malpractice. How could such a thing have been attempted with over 160,000 students without any solid evidence it could work? Because the constant sky-is-falling crisis narrative in education has created a context where people believe they are entitled to try anything, so long as their intentions are good. Crisis narratives undermine checks and balances and the natural skepticism that we should ordinarily bring to anything involving young children and public expenditure. So you get millions of dollars spent on online charter schools that leave students a full school year behind their peers.

Are policy types still going full speed ahead, working to send more and more students – and more and more public dollars – into these failed, broken online schools? Of course. Educational technology and the ed reform movement writ large cannot fail, they can only be failed, and nothing as trivial as reality is going to stand in the way.

Study of the Week: Trade Schools Are No Panacea

You will likely have encountered the common assertion that we need to send people into trade schools to address problems like college dropout rates and soft labor markets for certain categories of workers. As The Atlantic recently pointed out, the idea that we need to be sending more people to trade and tech schools has broad bipartisan, cross-ideological appeal. This argument has a lot of different flavors, but it tends to come down to the claim that we shouldn’t be sending everyone to college (I agree!) and that instead we should be pushing more people into skilled trades. Oftentimes this is encouraged as an apprenticeship model over a schooling model.

I find there’s far more in the way of narrative force behind these claims than actual proof. It just sounds good – we need to get back to making things, to helping people learn how to build and repair! But… where’s the evidence? I’ve often looked at brute-force numbers like unemployment numbers for particular professions, but it’s hard to make responsible conclusions with that kind of analysis. Well, there’s a big new study out that looks in a much more rigorous way – and the results aren’t particularly encouraging.

Today’s Study of the Week, written by Eric A. Hanushek, Guido Schwerdt, Ludger Woessmann, and Lei Zhang, looks at how workers who attend vocational schools perform relative to those who attend general education schools. Like the recent Study of the Week on the impact of universal free school breakfast, this study uses a difference-in-differences approach to explore causation, again because it’s impossible to do an experiment with this type of question – you can’t exactly tell people that your randomization has sorted them into a particular type of schooling and potentially life-long career path, after all. The primary data they use is the International Adult Literacy Survey, a very large, metadata-robust survey with demographic, education, and employment data from 18 countries, gathered from 1994 to 1998. (The authors restrict their analysis to the 11 countries that have robust vocational education systems in place.) The age of the data is unfortunate, but there’s little reason to believe that the analysis here would have changed dramatically, and the data set is so rich with variables (and thus the potential to do extensive checks for robustness and bias) that it’s a good resource. What do they find? In broad strokes, vocational/tech training helps you get a job right out of school, but hurts you as you go along later in life:

(don’t be too offended by the exclusion of women – the large change in their overall workforce participation over the period made it necessary)

Most important to our purpose, while individuals with a general education are initially (normalized to an age of 16 years) 6.9 percentage points less likely to be employed than those with a vocational education, the gap in employment rates narrows by 2.1 percentage points every ten years. This implies that by age 49, on average, individuals completing a general education are more likely to be employed than individuals completing a vocational education. Individuals completing a secondary-school equivalency or other program (the “other” category) have a virtually identical employment trajectory as those completing a vocational education.
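
The arithmetic behind that “by age 49” figure is easy to check:

```python
# Checking the quoted arithmetic: a 6.9-point initial employment deficit
# for general education, closing by 2.1 points per decade, crosses zero at:
initial_gap = 6.9    # percentage points, at the age-16 normalization
closing_rate = 2.1   # percentage points per decade
crossover_age = 16 + 10 * initial_gap / closing_rate
print(round(crossover_age, 1))   # ~48.9, i.e., roughly age 49
```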

Now, they go on to do a lot of quality controls and checks for robustness and confounds. As much of a slog as that stuff is, I recommend you check it out and start to pick it apart. Becoming a skilled reader of academic research literature really requires that you get used to picking apart the quality controls, because this is often where the juicy stuff can be found. Still, in this study the various checks and controls all support the same basic analysis: those who attend vocational schools or programs enjoy higher initial employability but go on to suffer from higher unemployment later in life.

What’s going on with these trends? The suggestion of the authors seems correct to me: vocational training is likely more specific and job-focused than general ed, which means that its students are more ready to jump right into work. But over time, technological and economic changes change which skills and competencies are valued by employers, and the general education students have been “taught to learn,” meaning that they are more adaptable and can acquire new and valuable skills.

I’m not 100% convinced that counseling more people into the trades is a bad idea. After all, the world needs people who can do these things, and early-career employability is nothing to dismiss. But the affirmative case that more trade school is a solution to long-term unemployment problems seems clearly wrong. And in fact this type of education seems to deepen one of our bigger problems in the current economy: technological change moves so fast these days that it’s hard for older workers to adapt, and they often find themselves in truly unfortunate positions. Even in trades that are less susceptible to technological change, there’s uncertainty; a lot of the traditional construction trades, for example, are very exposed to the housing market, as we learned the hard way in 2009. Do we want to use public policy to deepen these risks?

In a broader sense: it’s unclear if it’s ever a good idea to push people into a particular narrow range of occupations, because then people rush into them and… there stops being any shortage and advantage for labor. For a little while there, petrochemical engineering seemed huge. But it takes a lot of schooling to do those jobs, and then the oil market crashed. Pharmacy was the safe haven, and then word got out, a ton of people went into the field, and the labor market advantage was eroded. Also, there are limits to our understanding of how many workers we need in a given field. Some people argue there’s a teacher shortage; some insist there isn’t. Some people believe there’s a shortage of nurses; some claim there’s a glut. If you were a young student, would you want to bet your future on this uncertainty? It seems far more useful to me to try and train students into being nimble, adaptable learners than to train them for particular jobs. That has the bonus advantage of restoring the “practical” value of the humanities and arts, which have always been key aspects of learning to be well-rounded intellects.

My desires are twofold. First, that we be very careful when making claims about the labor market of the future, given the certainty that trends change. (One of my Purdue University students once told me, with a smirk, that he had intended to study Search Engine Optimization when he was in school, only to find that Facebook had eaten Google as the primary driver of many kinds of web traffic.) Second, that we stop saying “the problem is you went into X field” altogether. Individual workers are not responsible for labor market conditions. Those are the product of macroeconomic conditions – inadequate aggregate demand, outsourcing, and the merciless march of automation. What’s needed is not to try and read the tea leaves and guess which fields might reward some slice of our workforce now, but to redefine our attitude towards work and material security through the institution of some sort of guaranteed minimum income. Then, we can train students in the fields in which they have interest and talent, contribute to their human flourishing in doing so, and help shelter them from the fickleness of the economy. The labor market is not a morality play.

why universities can’t be the primary site of political organizing

This is not a political publication, but I am definitely interested in discussing campus issues in this space, and I would like to take a second and lay out some reasons why Amber A’Lee Frost is correct that the university can’t be the key site of left-wing (or any other) organizing. (If you think that idea’s a strawman, I invite you to read the Port Huron Statement.)

Please note that this is a series of empirical claims, not normative ones. I’m not saying it would be good or bad for campus to be the key site of a given movement’s organizing strategy. I’m saying that it’s not going to work, for good or bad.

There are not a lot of people on campus. There are a lot of universities out there, and you could be forgiven for overestimating the size of the student population. But NCES says there are only about 20 million students, grad and undergrad, enrolled in degree-granting post-secondary institutions. There are also about 4 million people who work in those institutions. Back of the envelope, that means that about 7.5% of the American population is regularly on campus in one capacity or another, setting aside questions of online-only education. Is 7.5% nothing? Not at all. It’s a meaningful chunk of people. But even if all of them were capable of being politically organized – which of course is far from the truth – you’re still leaving out the vast majority of the adult population.

Campus activism is seasonal. You aren’t going to hear a lot about campus protests for a few months. Why? Because of summer break. Vacation is notoriously hard on student protest groups. Why did the “campus uprising” of a few years ago fizzle out? In large measure because of Christmas break – the spring semester wasn’t nearly as active as the fall – and then summer break. Activism requires momentum and continuity of practice, and the regularity of vacation makes that quite difficult. Organizations that are careful and have strong leadership in place can take steps to adjust for this seasonal nature, but there’s just always going to be major lulls in campus organizing according to the calendar. And politics happens year-round.

College students are an itinerant population. Speaking of continuity of practice, campus political groups constantly have to replace membership and leadership because students (we hope) will eventually graduate. Again, that problem can be ameliorated with hard work and forethought by these groups, but it’s very difficult to have consistent strength of numbers and a coherent political vision when you’re seeing 100% turnover in a 5-6 year span.

Town and gown conflicts can make local organizing difficult. Sadly, many university towns are sites of tension and mutual distrust between the campus community and the locals. The degree of these tensions varies widely from campus to campus, and they can be ameliorated. In fact, making attempts to heal those divides can be the best form of campus activism. But it’s the case that the complex conflicts between colleges and the towns in which they’re housed will often make it difficult to build meaningful solidarity across the campus borders, which often serve as an invisible wall of attention and community.

Students are too busy to devote too much time to organizing. 70% of college students work. A quarter have dependent children. These students must also do all of the necessary work of being students. We should be realistic and fair with their time and recognize that a majority of students will not be able to engage politically for many hours out of the week.

College students have a natural and justifiable first-order priority of getting employed. Everyone who works is of course at risk of professional repercussions for their political engagement, but college students perhaps have a unique set of worries about being publicly politically active, particularly in the era of the internet. Nowadays, we’re all constantly building an easily-searchable, publicly-accessible archive of the things we once thought and did. This is particularly troublesome for those who have not yet gotten their first jobs and have yet to build the kind of social capital necessary to feel secure in their ability to get work with a controversial political past. It’s my impression that a lot of college students are inclined to be political but feel that they simply can’t risk it, and that’s a fear that we should respect given the modern job market.

College activism can either be a low-stakes place where students learn and grow safely, or an essential site of organizing – but it can’t be both. Oftentimes, when campus activists make mistakes (such as forcing a free yoga class for disabled students to be shut down because yoga is “cultural appropriation”), defenders will say, hey, they’re just college kids – they need a chance to screw up, to make mistakes, to be free to fail. And there’s some real truth to that. The problem is that this attitude cannot coexist with the idea that campus has to be a central site or the central site of left-wing political organizing. If what happens on campus is crucial to the broader left movement, it can’t then be called not worth worrying about; if campus organizing is a space that is largely free of consequences for young activists, then it can’t be a space where essential political work gets done. These ideas are not compatible.

Organize the campus’s workforce according to labor principles. None of this means that organizing shouldn’t take place on campus; it absolutely should. But like Frost I think that the left is far too fixated on what happens in campus spaces, likely because these spaces are some of the only areas where the left appears to hold any meaningful power. Student activists should be encouraged to engage politically in order to learn and grow, but we should not imagine that they are the necessary vanguard of the young left, given that only a third of Americans ever get a college degree. Meanwhile, we absolutely must continue to organize the campus as a workplace. (For the record, Frost is a member of a campus union, as am I.) But that organization takes place according to labor principles, not according to any special dictates of academic culture. And this returns to Frost’s basic thesis: it is the organization of labor, not of students, that must be the primary focus and goal of the American left.

correlation: neither everything nor nothing

One thing that everyone on the internet knows, about statistics, is this: correlation does not imply causation. It’s a stock phrase, a bauble constantly polished and passed off in internet debate. And it’s not wrong, at least not on its face. But I worry that the denial of the importance of correlation is a bigger impediment to human knowledge and understanding than belief in specious relationships between correlation and causation.

First, you should read two pieces on the “correlation does not imply causation” phenomenon, which has gone from a somewhat arcane notion common to research methods classes to a full-fledged meme. This piece by Greg Laden is absolute required reading on correlation and causation and how to think about both. Second, this piece by Daniel Engber does good work talking about how “correlation does not imply causation” became an overused and unhelpful piece of internet lingo.

As Laden points out, the question is really this: what does “imply” mean? The people who employ “correlation does not imply causation” as a kind of argumentative trump card are typically using “imply” in a way that nobody actually means – as a synonym for “prove.” That’s pretty far from what we usually mean by “implies”! In fact, using the typical meaning of implication, correlation sometimes implies causation, in the sense that it provides evidence for a causal relationship. In careful, rigorously conducted research, a strong correlation can offer some evidence of causation, if that correlation is embedded in a theoretical argument for how that causative relationship works. If nothing else, correlation is often the first stage in identifying relationships of interest that we might then investigate in more rigorous ways, if we can.

A few things I’d like people to think about.

There are specific reasons that an assertion of causation from correlational data might be incorrect. There is a vast literature on research methodology, across just about every research field you can imagine. Correlation-causation fallacies have been investigated and understood for a long time. Among the potential dangers is the confounding variable, where an unknown third variable drives the change in two other variables, making them appear to influence one another. This gives us the famous drownings-and-ice-cream correlation – as drownings go up, so do ice cream sales. The confounding variable, of course, is temperature.1 There are all sorts of nasty little interpretation problems in the literature. These dangers are real. But in order to have understanding, we have to actually investigate why a particular relationship is spurious. Just saying "correlation does not imply causation" doesn't do anything to actually improve our understanding. Explore why, if you want to be useful. Use the phrase as the beginning of a conversation, not a talisman.
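To make that concrete, here's a minimal sketch in Python, with entirely invented numbers, of how a confounder manufactures a correlation between two variables that have no direct influence on each other, and how controlling for the confounder makes the apparent relationship evaporate. (The variable names and coefficients are mine, chosen purely for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Invented data: temperature independently drives both series.
temperature = rng.uniform(10, 35, n)                  # daily highs, in C
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)   # sales
drownings = 0.3 * temperature + rng.normal(0, 2, n)   # incidents

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# The raw correlation looks impressive, though neither causes the other.
print(f"raw correlation: {corr(ice_cream, drownings):.2f}")

def residuals(x, z):
    """What's left of x after regressing out the confounder z."""
    slope, intercept = np.polyfit(z, x, 1)
    return x - (slope * z + intercept)

# Partial out temperature and the 'relationship' collapses to noise.
print(f"controlling for temperature: "
      f"{corr(residuals(ice_cream, temperature), residuals(drownings, temperature)):.2f}")
```

That's the whole point: the fallacy has a mechanism, and you demonstrate that a relationship is spurious by naming and testing the confounder, not by chanting the slogan.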

Correlation evidence can be essential when it is difficult or impossible to investigate a causative mechanism. Cigarette smoking causes cancer. We know that. We know it because many, many rigorous and careful studies have established that connection. It might surprise you to know that the large majority of our evidence demonstrating that relationship comes from correlational studies, rather than experiments. Why? Well, as my statistics instructor used to say – here, let's prove cigarette smoking causes cancer. We'll round up some infants, and we'll divide them into experimental and control groups, and we'll expose the experimental group to tobacco smoke, and in a few years, we'll have proven a causal relationship. Sound like a good idea to you? Me neither. We knew that cigarettes were contributing to lung cancer long before we identified what was actually happening in the human body, and we have correlational studies to thank for that. Blinded randomized controlled experimental studies are the gold standard, but they are rare precisely because they are hard, sometimes impossible. To refuse to take anything else as meaningful evidence is nihilism, not skepticism.

Sometimes what we care about is association. Consider relationships which we believe to be strong but in which we are unlikely to ever identify a specific causal mechanism. I have on my desk a raft of research showing a strong correlation between parental income and student performance on various educational metrics. It's a relationship we find in a variety of locations, across a variety of ages, and through a variety of different research contexts. This is important research, and it has stakes: it helps us to understand the power of structural advantage and contributes to political critique of our supposedly meritocratic social systems.

Suppose I were prohibited from asserting that this correlation proved anything because I couldn't prove causation. My question is this: how could I find a specific causal mechanism? The relationship is likely very complex, and in some cases, not subject to external observation by researchers at all. To refuse to consider this relationship in our knowledge making or our policy decisions because of an overly skeptical attitude towards correlational data would be profoundly misguided. Of course there are limitations and restrictions we need to keep in mind – the relationship is consistent but not universal, its effect is different for different parts of the income scale, it varies with a variety of factors. It's not a complete or simple story. But I'm still perfectly willing to say that poverty is associated with poor educational performance. That's the only reasonable conclusion from the data. That association matters, even if we can't find a specific causal mechanism.

Correlation is a statistical relationship. Causation is a judgment call. I frequently find that people seem to believe that there is some sort of mathematical proof of causation that a high correlation does not merit, some number that can be spit out by statistical packages that says "here's causation." But causation is always a matter of the informed judgment of the research community. Controlled experiments are the gold standard in that regard, but there are controlled experiments that can't prove causation and other research methods that have established causation to the satisfaction of most members of a discipline.
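If you doubt this, look at what a statistical package will actually hand you. A quick sketch, assuming scipy is available (again, the data here is invented): pearsonr returns a coefficient and a p-value, and that is the end of what the mathematics will certify.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)  # invented data with a built-in link

r, p = pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.3g}")
# The output is a coefficient and a p-value: an association and its
# statistical significance. Nothing in it says, or could say, that x
# causes y; that judgment has to come from design and theory.
```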

Human beings have the benefit of human reasoning. One of my frustrations with the "correlation does not imply causation" line is that it's often deployed in instances where no one is asserting that we've adequately proved causation. I sometimes feel as though people are trying to protect us from mistakes of reasoning that no one would actually fall victim to. In an (overall excellent) piece for the Times, Gary Marcus and Ernest Davis write, "A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it's hard to imagine there is any causal relationship between the two." That's true – it is hard to imagine! So hard to imagine that I don't think anyone would have that problem. I get the point that it's a deliberately exaggerated example, and I also fully recognize that there are some correlation-causation assumptions that are tempting but wrong. But I think that, when people warn about the dangers of drawing specious relationships, they sometimes act as if we're all dummies. No one will look at these correlations and think they're describing real causal relationships, because no one is that senseless. So why are we so afraid of that potential bad reasoning?

Those disagreeing with conclusions drawn from correlational data have a burden of proof too. This is the thing, for me, more than anything. It's fine to dispute a suggestion of causation drawn from correlational data. Just recognize that you have to actually make the case. Different people can have responsible, reasonable disagreements about statistical inferences. Both sides have to present evidence and make a rational argument drawn from theory. "Correlation does not imply causation" is the beginning of a discussion, not the end.

I consider myself on the skeptical side when it comes to Big Data, at least in certain applications. As someone who is frequently frustrated by hype and woo-woo, I'm firmly in the camp that says we need skepticism ingrained in how we think and write about statistical inquiry. I personally do think that many of the claims about Big Data applications are overblown, and I also think that the notion that we'll ever be post-theory or purely empirical is dangerously misguided. But there's no need to throw the baby out with the bathwater. New ventures dedicated to researched, data-driven writing should be greeted as a welcome development, even as we maintain a healthy criticism of them. What we need, I think, is to contribute to a communal understanding of research methods and statistics, including healthy skepticism, and there's reason for optimism in that regard. Reasonable skepticism, not unthinking rejection; a critical utilization, not a thoughtless embrace.

you learn by being taught

Forgive the relative quiet lately; I’ve been enjoying my birthday weekend and then catching up on a ton of work. There’s a bunch of good things coming this week, including the return of book reviews after a brief (and unplanned) break.

This morning I spoke to an entire public high school; I'd been invited to discuss being a product of public schools, higher ed, and success. It was very funny for me to be asked, though flattering – as I told the kids today, I would never think of myself casually as a success. Who ever thinks that way, beyond the wealthy and the deluded? But it was flattering and fun. I told them that there was no great wisdom in life, just a series of decisions before you, and hopefully with time the perspective to be able to choose better from worse. And, because I think this is important, I told them that they needed to cultivate a sense of "good enough" in their lives. At that age, they are being told constantly that they should pursue their dreams. But very few of us get what we've dreamed of, and those who do often find it's far less grand than they'd imagined. So I told them to learn and experience and enjoy and to figure out how to live in the essential disappointment of human life.

It wasn’t as much of a bummer as it sounds!

I have been reflecting on the value of teachers. I have been accused a lot, lately, of not believing that teachers matter. That's the opposite of the truth, really. I just think that this notion of casting the value of teachers in purely quantitative terms is a mistake, and a very recent one. The entire history of the Western canon, from Socrates to Aquinas to Locke to Dewey to Baldwin, contains arguments against this reduction. But this fight, to define what I mean and what I don't against the tide, is one I suspect I will always have to keep fighting, and I intend to.

Our culture celebrates autodidacts. It talks constantly of “disrupting” education. It insists always that we need to radically reshape how we teach and learn. It treats as heroic the rejection of teachers and traditional mentorship. The self-help aisle of the bookstore abounds with writers who insist that they truly learned by rejecting the typical method of education and became, instead, self-taught, self-made. It’s an unavoidable trope.

What amazes me about my own education is just how far that is from the truth for me personally. I've learned, over decades, how I learn. It's pretty simple: teachers teach me. That was true in kindergarten and it's true now that I have my doctorate. I can't tell you how often I have found myself feeling lost and ignorant, only to have patient, kind teachers take me through the familiar processes of modeling and repetition that are cornerstones of education. I think back to my graduate statistics classes, where I often felt like the slowest person in class, but where I always ended up getting there, thanks to steady and reassuring teaching. When I didn't get what I needed from class, I'd go to office hours, or I'd go to the statistics help room, where brilliant graduate students eagerly shared knowledge and experience with me. None of this is fundamentally any different from when Mrs. Gebhardt taught me to cut shapes out of paper or when Mr. Shearer taught me simple algebra or when Mr. Tucci taught me to read poetry or when Dr. Nunn taught me to write a real research paper. The process is always the same, and in every case, I have succeeded not through rejecting the authority of teachers but by accepting their help, by recognizing their superior knowledge and letting them use it to enrich my life.

Is that a contradiction of what I’ve said about the limited ability of teachers to control the outcomes of their students? I don’t think so. The question is, do you want us to have a fuller and more humane vision of what it means to learn? I do.

They say that great men see farther than others by standing on the shoulders of giants. I think most of us are enabled to see as far as others because others have collectively reached their hands down and pulled us up.