lots of fields have p-value problems, not just psychology

You have likely heard of the replication crisis, in which past research findings cannot be reproduced by other researchers using the same methods. The issue, typically, lies with the p-value, an essential but limited statistic that we use to establish statistical significance. (There are other replication problems besides the p-value, but that’s the one you read about the most.) You can read about p-values here and the replication crisis here.

These problems are often associated with the social sciences in general and the fields of psychology and education specifically. This is largely due to the inherent complexities of human-subject research, which typically involves many variables that researchers cannot control; the inability to perform true controlled experimental studies due to practical or ethical limitations; and the relatively high alpha thresholds associated with these fields, typically .05, which are necessary because effects studied in the social sciences are often weak compared to those in the natural or applied sciences.

However, it is important to be clear that the p-value problem exists in all manner of fields, including some that are among the “hardest” of scientific disciplines. In a 2016 story for Slate, Daniel Engber writes of cancer research that “much of cancer research in the lab—maybe even most of it—simply can’t be trusted. The data are corrupt. The findings are unstable. The science doesn’t work,” because of p-value and associated problems. In a 2016 article for the Proceedings of the National Academy of Sciences, Eklund, Nichols, and Knutsson found that inferences drawn from fMRI brain imaging are frequently invalid, sharing concerns voiced in a 2016 eNeuro article by Katherine S. Button about replication problems across the biomedical sciences. A 2016 paper by Eric Turkheimer, an expert in the genetic heritability of behavioral traits, discussed the ways that even replicable weak associations between genes and behavior prevent researchers from drawing meaningful conclusions about that relationship. In a 2014 article for Science, Erik Stokstad expressed concern that the ecology literature was more and more likely to report p-values, but that the effects reported were becoming weaker and weaker, and that p-values were not adequately contextualized through reference to other statistics.

Clearly, we can’t reassure ourselves that p-value problems are found only in the “soft” sciences. There is a far broader problem with basic approaches to statistical inference that affects a large number of fields. The implications of this are complex; as I have said and will say again, research nihilism is not the answer. But neither is laughing it off as a problem inherent just to those “soft” sciences. More here.

I am not, it turns out, the agent of faculty death at Brooklyn College

When I got this job, one of the excitable, obsessive boys at Lawyers Guns and Money, Erik Loomis, announced that I was now a neoliberal administrator bent on destroying the professoriate. Now, this has much less to do with my actual job and more to do with the ongoing, weird fixation the team over there has on me – Loomis got tenure the other day and immediately rushed to his blog to talk about me, which is sadder and weirder than I can say. (Bear in mind that this is a man who once mocked the quality of my education despite the fact that I got my MA at the university that employs him.) But still, having been here going on eight months now, perhaps it’s time to take stock of his accusations.

I’ve been at my current job since the end of September. I can tell you that, despite some bumps and snags, things have gone pretty well and I’m happy with the position and the job I’m doing. I can also tell you, happily, that the predictions of Erik Loomis have not come true. Not that there was ever much chance of that.

When I went to interview with Brooklyn College, I had already been on the academic job market for almost two years. Though my CV was a perfect match for the job, I was hesitant. My position is administrative, and I am on record as believing (and still believe) that there’s too much administrative hiring in the academy. (Of course, I wanted a tenure track job more than anything, but I couldn’t get one in two years of trying.) Plus, assessment is a touchy subject, an endeavor that if undertaken clumsily and without faculty oversight can indeed erode faculty control. So when I went on my campus interview I reiterated a point to the hiring committee that I had made in my Skype interview: that I would accept the position only under the condition that the job would be mutually understood to be a faculty support position. I said this to the hiring committee; I said it to the Associate Provost who would go on to be my boss. They assured me this was what they were looking for, and they have held up their end of the bargain.

Faculty control just about every step of the assessment process here. Faculty write the mission statements for their departments. Faculty devise the student learning outcomes. Faculty decide on the assessment tools they want to use. Faculty decide how best to analyze the data. Faculty ultimately decide what the data means and what changes to make because of it. Assessment involves shared governance between faculty and administrators, but the particulars of how given departments are assessed are firmly in the control of faculty, and should be.

What do I do? I take on a lot of the grunt work that faculty don’t want to. When faculty are crafting student learning outcomes, I give them advice about what I think will make learning easily measurable, when asked. When faculty choose particular tools like tests, portfolios, or surveys, I talk to them about what some of the options are and what I think is most pragmatically feasible, when asked. I research what other institutions do and lay out what are commonly thought of as best practices for a given field. I do a lot of the busywork for actual data collection and analysis – I wrangle spreadsheets, I organize shared folders on servers, I assign numbers to anonymized student work, I let department chairs know when documents have been turned in. Sometimes, I’m the one that does the stats work, again only if asked. I don’t insist on doing any of this, and there are departments here that choose to handle all of that themselves. They just send me reports when they have them, and that’s fine. Other departments have asked me to do a lot of the heavy lifting, out of concern for the workload of faculty who are already stretched far too thin by teaching and research requirements. I’m happy to help when they do. The point is that in every way that matters, it is faculty who ultimately control the assessment process. And while I work underneath the Provost’s office, I report to an Academic Assessment Council where faculty have a substantial ability to dictate policy.

Maybe the most important point is that, regardless of your take on my character or my commitment to faculty independence, I’m just not important enough here to do the kind of destructive work Loomis claimed. I don’t have that kind of power. Brooklyn College, I’m happy to say, has an unusually powerful faculty. Curricular decisions, to a rare degree, are made by the professors. It helps that it’s a public school in a state where public sector workers are powerful. We also have an activist faculty union – a union that I’m a member of. Despite Loomis’s contention that I would work against the union, I have in fact been active in the PSC from the start; I’ve attended every meeting of my own chapter since I arrived, and I’m starting to get involved in the Brooklyn College chapter too. I hope to help organize during our upcoming contract fight. In any event, trust me: no one will soon be made to bow down before my great power here, and I would never attempt to make anyone do so, given my academic beliefs, my investment in labor solidarity, and my conscience.

Put it this way: I’m sure there are many faculty members here who don’t know I exist. All-conquering administrators should be far less anonymous.

There were other complaints. Loomis says I’ll be a provost someday. But, of course, I wouldn’t ever take such a job. I know because I’m me. I have no interest in that. I could stay in this position permanently and, thanks to the benefits and a collectively bargained contract, feel pretty good about that. Or I may in the future look for other positions within CUNY, as appropriate. Who can say? But I will never be looking at executive jobs because I am not interested in doing so. Commenters insisted that I’d be overpaid, but I am in fact on the exact same salary steps as CUNY faculty of equivalent experience. That was a selling point for me: it helps to know that I share compensation levels with professors. I just can’t get tenure. So if you’re saying I’m overpaid, you’re saying that CUNY professors are overpaid, which, well, is a remarkable idea given our endless contract battles and the precarious state of our funding.

Ultimately my job is like a lot of jobs: it’s not perfect, it can be frustrating, but I can see real ways that I’m helping the larger community. Faculty that I’ve worked with have been universally cordial, and I’ve enjoyed helping them develop assessment plans for their departments. Besides: this work is going to be done. The question is whether it’s done well and whether it’s done in a way that is minimally invasive to faculty. The fact is that assessment is inevitable, particularly in large public systems. The accreditation agencies mandate (and have always mandated) regular assessment. And for reasons I won’t get into, in recent years Brooklyn College has been under immense pressure to improve our assessment efforts for accreditors. You can lament the impact of accreditation agencies but they are a fact of life. Another fact of life is that a lot of faculty simply don’t want to do the kind of work that I do. I can’t blame them! They’re already brutally overworked. That’s why my job exists, so that I can use my expertise and experience in assessment of student learning to take on some of the inevitable burden that is coming down from the college, from CUNY, from the state, from our accreditation agency, and from the feds. Is that worth the cost of my salary? I can’t possibly be the one to judge. Paying my rent depends on my believing that it is worth it. Members of this community will just have to judge for themselves.

It happens that I also think there is a profound social justice component to assessment writ large – that an American higher education system that leaves millions of students with loan debt but no degree needs to take a hard look at its learning systems to come together, as a community, and figure out how to fix things, not in a way antagonistic to faculty but with faculty as the inevitable and essential leaders of such a project. But that’s a bigger issue and one for another day.

None of this, of course, will matter to Loomis. I could have gotten a job that perfectly matched with his politics – say, Assistant Professor of Centrist Democrat Studies at Rahm Emanuel University – and he would have been mad. But it matters to me. I got a good job at a great college in a wonderful city, and I’m slowly becoming part of a community of teachers and researchers that I respect and admire. I’m thrilled to have it. It’s not perfect but I’m making the most of it. And I’m so grateful to be here.

quick and dirty: economic inequality and test scores

Is there a relationship between a country’s performance on international education benchmarks like the PISA tests and that country’s economic inequality as measured by the Gini coefficient?

[Three scatterplots of country-level PISA scores against the Gini coefficient: Math, Reading, Science]

Sure looks like it! Those are some healthy correlations there. (Lower Gini = lower inequality.) The math plot in particular is striking. I’m sure there’s noise here and if I get scolded by somebody I’ll update the post.

Of course, this invites the classic question about the arrow of causation in education: are these societies more equal because they have better education? Or are their education results better because their economies are more equal? You can probably guess what I think.

Plots by me. Data: PISA, Gini.
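For the curious, the computation behind plots like these fits in a few lines. Here is a minimal sketch, assuming a hypothetical pisa_gini.csv with one row per country and columns named gini, math, reading, and science (file and column names are mine for illustration, not the actual data files behind the plots):

```python
# Minimal sketch: Pearson correlations between country-level inequality
# and PISA scores. "pisa_gini.csv" and its column names are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("pisa_gini.csv").dropna()

for subject in ["math", "reading", "science"]:
    r, p = pearsonr(df["gini"], df[subject])
    # A negative r means more inequality travels with lower scores.
    print(f"{subject}: r = {r:.2f}, p = {p:.3f}, n = {len(df)}")
```

That is the whole analysis: one correlation per subject, which is exactly what “quick and dirty” means here.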

Update: But take care because, as Scott Alexander points out to me, measures of inequality are tricky to use in correlational studies, so don’t take this too seriously. I’m gonna expand the scope of the analysis and see what we can see, but like I said – quick and dirty, so don’t hold me to it!

from the archives: physical restraint as the least bad option

This piece originally appeared on my blog in July of 2014.

I have seen now some dozen people share this ProPublica map, about the use of restraining holds on school children, on various social networks and websites. It makes me sad, because this issue is sad. But the kind of reactions that are being provoked also make me sad, because they demonstrate the ways in which the world of sharing and likes and shallow understanding destroys nuance and creates a bogus conception of a black-and-white world.

It happens that I have some experience in this regard. For about a year and a half, I worked in a public school that had a special, segregated section for kids with severe emotional disturbance. Some of the students were significantly mainstreamed into the general ed population, but many couldn’t be, as they posed too much of a risk to other students and to themselves.

Those risks were neither hypothetical nor minor. The more severe of these cases were children who typically could not last a single school day without inflicting harm on themselves or on others. I have personally witnessed a 10-year-old lift his 40-pound desk from the floor and hurl it towards the head of another student. I have witnessed a student jump from her seat to claw and bite at another, with almost no provocation. I have seen kids go from seeming calm to punching other kids repeatedly in the back of the head without warning. The self-harm was even worse. I had to intervene when a child, frustrated with his multiplication homework, struck himself repeatedly in the face with a heavy fake gold medallion, to the point where he drew his own blood. I saw a student try to cut his own lip with safety scissors. I saw a girl tear padding from a padded wall and eat it; when she eventually had to be removed from the school via ambulance, she urinated on herself, rubbed her face with her urine, and attempted to do the same to paramedics.

Mental illness is powerful and terrible and that’s the world we live in.

Part of the response to this kind of behavior was restraint. I didn’t enjoy doing it; none of the staff did. Hated it, in fact. We were all trained in how to provide restraint as safely as possible, but that didn’t mean we were under any illusion: we knew that these techniques were uncomfortable and potentially harmful to students. Injuries to staff members were common. A fellow staff member badly broke her tailbone in the process of restraining a child, an injury that left her unable to work for a calendar year. There was something gross about the euphemism “therapeutic hold,” and we talked about the trainings with black humor. I left, after that year and a half or so, because I could not take the emotional toll. There were women there who had been working with such children for over 30 years. I couldn’t make it two. The notion that these women were somehow callous or unconcerned about these children is ludicrous and defamatory. They had dedicated their lives to helping these kids, for terribly low pay. They had to watch these kids grow up and get shipped to the middle school level where there was no similar program. And we were the last stop for these kids before the state mental health system. That was the stark choice: if it didn’t work here, the only alternatives were either special private schools, which, given that the students were overwhelmingly from poverty, were not an option at all, or commitment to the state mental health system, which most likely meant institutionalization and constant medication. Those were the stakes.

I have struggled to write about that period of my life for years, as I am still unable to adequately process the emotions I felt. I do know and will loudly say that the women (and besides me they were almost all women) who worked as teachers and paraprofessionals were an inspiration in the true sense, working quietly and without celebration to bring a little education and relief to children whom life had treated terribly. They shame me with their dedication. To see them and people like them repeatedly represented as serial abusers who don’t care if they harm children is infuriating, baseless, and wrong.

The question I have for someone like Heather Vogell, who wrote this sensationalistic and damaging piece for ProPublica, and for all of the people sharing that map with breathless outrage, is this: what alternative would you propose? I am not kidding when I tell you that dozens of times, there was no choice but to physically restrain a child. The only alternative was to allow that child to badly hurt another or him- or herself. If you think that a 7-year-old is incapable of badly harming another person, I assure you, you’re wrong. I have seen many people arguing that there is never a situation where such restraint is necessary, and all I can say is that you’re ignorant, and that your ignorance is dangerous. To say that all children can be verbally calmed in all situations is to betray a stunning lack of understanding of the reality of childhood mental illness. Vogell mentions in passing that there are situations in which restraint is necessary, then spends thousands of words ignoring that fact. At every turn where she is faced with a journalistic or stylistic choice, she opts for the most sensationalistic and unsympathetic presentation possible, minimizing the other side and failing to even pretend to have genuinely wrestled with the topic before coming to a conclusion. It’s not just that she insults thousands of nameless, faceless public servants who have no capacity to fight back or even be seen as potentially-sympathetic human beings. It’s also just lousy journalism, written for a clickbait culture, utterly credulous to one set of opinions and utterly dismissive of another. It’s an embarrassment.

Meanwhile, childhood mental illness continues to wreak its terrible havoc, and educators will be forced to make terrible choices. I hated restraining those children, but I saw with my own two eyes the incredible violence that mental illness made possible, and I do not for one minute regret properly restraining children when that was the only way to save that child or another from bodily harm. I invite Vogell, or any of the people loudly expressing their outrage, to take jobs in special education or child mental health services. You can actually get involved, you know. See it with your own eyes. Help actual human lives get a little bit better. See what choice you’re able to make when it is clear that you must intervene or allow injury to another person. But I’m afraid that takes more time and effort than launching a tweet.

Years from now, when people like Vogell are no longer wasting a second of their time thinking about physical restraint of children who are a danger to themselves and others, the women in my old program will be working, quietly and selflessly and for awful compensation, trying to help the children they are now accused of abusing.

please stop trying to get me fired for things I haven’t said and don’t believe

[image caption: just an example]

I wrote a couple pieces this week that had a pretty simple intent. They pointed out that a lot of progressive people think that any discussion of “intelligence” – a contested and socially-constructed concept, as I said in both posts – and genetics is necessarily a Charles Murray-type act of pseudoscientific racism. I pointed out something that I know, which is that there is a large and active field within the broad world of psychology that looks for connections between genetics and all measurable psychological traits, including intelligence. These people are perfectly mainstream researchers, many of whom work at some of the most prestigious institutions in the world. And I argued that it was a mistake to simply assume that genetics and intelligence is a subject that carries with it racist intent. That doesn’t mean they’re right, and it certainly doesn’t mean there’s nothing to criticize or be worried about in that research. But it simply is not accurate to say that the study of genetic origins of human traits is only found on the fringes. I encounter that work often as someone who researches educational testing.

I have written against race “science” many times. Oftentimes people ask me why; they say, hey, that stuff’s discredited. I’ve always answered the same way: in a world that’s full of antiblack racism, we can be sure that many people secretly believe that black people are less intelligent. And given that these arguments have bubbled up again and again, in mainstream publications like The New Republic and Slate, I think it’s important to forcefully rebut them. I believe good science is the best way to fight bad science. You are, again, entitled to disagree. But I have never expressed anything but total commitment to opposing racist pseudoscience. Meanwhile, someone like William Saletan wrote a week-long series explicitly endorsing those conclusions ten years ago, and yet seems to attract much less blowback than I do for opposing them.

I thought that writing 5,000 words about what I think genes do and don’t influence, about how much variation is likely attributable to genetics, and about the predictive powers and limitations of IQ would be sufficient to prevent people from deliberately misreading my post as an endorsement of race science. Sadly, that is not the case, and so of course Twitter is accusing me of believing literally the opposite of what the very first lines of the first post said.

This comes in the context of a basic fact of my life for years, which is strangers from the internet contacting people in positions of power over me, trying to hurt me because they don’t like me or my politics. Actual people have been trying to get me fired from my actual real-life job for as long as I’ve had it. I’m not really worried; I am part of a union and enjoy certain protections as a public sector employee, though nothing like the protections of tenure. But still, it is profoundly unnerving to be someone whose rent, health insurance, and life in general are threatened by an active and ongoing attempt to get me in trouble. This is not something particularly new. In grad school people emailed my professors. During the election people Tweeted at Purdue to fire me from my adjuncting gig. Publications that publish my work, no matter how uncontroversial, take shit for it. This has extended into my new life in New York despite the fact that I have tried like hell to avoid controversy. It’s exhausting to constantly have to worry that someone is lying about what you’ve argued in order to ruin you professionally.

If you’re someone who regularly uses Twitter to misrepresent what someone has said or believes, you’re messing with real people’s real lives. The shameful Matt Bruenig story should make it very clear that while social media has very little ability to cause positive change, it can really divest people of their incomes and their health insurance. The culture of dishonesty in social media – particularly left-leaning social media and especially media-industry Twitter – might seem fun or cute to the people involved. But it has teeth. I get it: media culture treats all of the professional world as a game. I still have to pay my rent.

People told me to log off Twitter; I logged off. People told me to stop blogging about politics; I stopped. I am trying to hold down a little space online where I can cultivate a small readership of people who are interested in subjects related to my expertise, which is the assessment of learning and research methods associated with that. I am not asking for much attention. I would really hate it if I simply have to stop writing online altogether, but if I’m risking the health insurance that allows me to lead a stable and functional life, then I’ll have to. And listen: I fucking hate having to write something like this. I hate having to make myself out like some sort of a victim, when I’m not. But I am asking you, please, if you are going to get mad at me or my work, get mad about things I actually believe and actually have written.

Study of the Week: Computers in the Home

A quickie today. It is fair to say that technology plays an enormous role in our educational discourse. Indeed, “technology will solve our educational problems” is a central part of the solutionism that dominates ed talk. From “teach a kid to code” to “who needs highly trained teachers when there’s Khan Academy?,” the idea that digital technology holds the key to the future of schooling is ubiquitous and unavoidable.

This is strange given that educational technology has done almost nothing but fail. Study after study has found that technology has no impact on educational metrics.

(Now, let me say upfront: this blog post is not intended as a literature review for the vast body of work on the educational impacts of technology. It is instead using a large and indicative study to discuss a broader research trend. If you would like for me to write a real literature review, my PayPal is available at the right.)

Consider having a personal computer in the home. Many would assume that this would give kids an advantage in school. After all, they could play educational software, surf the Web, get help on their homework remotely…. And yet that appears not to be the case. Published in 2013, this week’s study comes from the National Bureau of Economic Research. Written by Robert W. Fairlie and Jonathan Robinson of UC Santa Cruz, the study finds that a personal computer in the home simply makes no difference to student outcomes – not good, not bad, nothing.

The study is large (n = 1,123) and high quality. In particular, it offers the rare advantage of being a genuine randomized controlled experiment. That is, the researchers identified research subjects who, at baseline, did not own computers, and randomly assigned them to a control group (no computer) or a treatment group (given a computer). This is really not common in educational research. Typically, you’d have to do an observational/correlational study. That is, you’d try to identify research subjects, find which of them already had computers and which didn’t, and look for differences in the groups. These studies are often very useful and the best we have to go on given the nature of the questions we are likely to ask. You can’t, for example, assign poverty as a condition to some kids and not to others. (And, obviously, it would be unethical if you could.) But experiments, where researchers actually cause the difference between experimental and control groups – some methodologists say that there must be, in some sense, a physical intervention to manipulate independent variables – are the gold standard because they are the studies where we can most carefully assess cause and effect. Giving one set of kids computers certainly qualifies as a physical intervention.
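To make concrete what randomization buys you, here is a toy simulation (invented numbers, not the study’s data). Because a coin flip, rather than family income or motivation, decides who gets a computer, a simple difference in group means estimates the causal effect:

```python
# Toy randomized experiment (made-up data, not Fairlie & Robinson's).
# Randomization makes the groups comparable on average, so the difference
# in mean outcomes is an unbiased estimate of the causal effect.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 1123                                # same order of magnitude as the study
treated = rng.random(n) < 0.5           # coin-flip assignment to "computer"
ability = rng.normal(75, 10, n)         # unobserved student ability
effect = 0.0                            # true effect of a home computer
grades = ability + effect * treated + rng.normal(0, 5, n)

diff = grades[treated].mean() - grades[~treated].mean()
t, p = ttest_ind(grades[treated], grades[~treated])
print(f"difference in means: {diff:.2f}, p = {p:.3f}")
```

In an observational design, treated would instead be tangled up with ability, and the same difference in means would be biased; that is the whole case for experiments.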

And the results are clear: it just doesn’t matter. Grades, test scores, absenteeism and more… no impact. The study is accessible to a general audience, save for some discussion of their statistical controls, and I encourage you to peruse it on your own.

In its irrelevance for academic outcomes, owning a personal computer joins a whole host of other educational interventions via digital technology that have washed out completely. But hope springs eternal. I couldn’t help but laugh at this interview of Marc Andreessen in Vox, as it’s so indicative of how this conversation works. Andreessen makes outsized claims about the future impacts of technology. Timothy Lee points out that these claims have never come true in the past. Andreessen simply asserts that this will change, and Lee dutifully writes it down. That is the basic trend, always: the repeated failures of technology to make meaningful impacts on student outcomes are hand-waved away; progress is always coming, next year, or the year after that, or the next. Meanwhile, we had the internet in my junior high school classrooms in 1995. Maybe it’s time to stop waiting for technology to save us.

But then again, there are iPads to sell….

addressing some complaints

There was a lively discussion about my last post on Facebook yesterday, with a lot of enthusiastic people participating. Let me address some common complaints.

I am mad because I believe something that you expressly agree with. The most depressing response was all of the people who made claims that I had myself made in the post and then represented that as criticism. That is, dozens of people made statements that they imagined contradicted me, even though those statements were points I had made in the very piece they were attempting to critique. This problem can generally be avoided if you take the radical step of reading what you are responding to.

Things people floated as disagreements that I explicitly said in the original piece include:

  • Race is a social construct
  • Genetics do not explain all the variation in IQ tests and other quantitative measures of academic ability
  • The precise amount of variation explained by genetics is contested
  • IQ tests are not complete or comprehensive measures of human mental capacity or human worth
  • Things other than IQ are valuable and important predictors of student success
  • The definition of academic success is socially mediated and influenced by capitalism
  • There are methodological criticisms of twin studies
  • Some established researchers disagree with this line of thinking
  • More research is needed

I could go on. To each of these “criticisms,” the answer is the same: yes, I agree, that’s why I said them in the essay that you’re criticizing and clearly didn’t read.

[voice of a New England blueblood wearing a blazer with brass buttons, Nantucket Red pants, and an ascot while swirling a glass of port] “Sirrah! What are your qualifications!” My qualifications are in fact irrelevant here. I would defend them if I thought they were relevant. But I’m not doing primary research. I didn’t disappear into a lab and emerge with a new model of human cognition. I’m reading studies and books, which is what I do all day, and faithfully reporting back what I find. And what I find, and have reported, is that in the fields of genetic behaviorism and developmental psychology there is broad agreement that academic and cognitive outcomes are significantly influenced by genetics.

That’s not really a consensus. I do not agree. I’m afraid there is no system of consensus points that I could assemble to establish this point objectively. However, given how people within the relevant fields describe those fields, I find it hard to dispute that there is broad agreement. Consider this from the Plomin piece I linked to:

Finding that differences between individuals (traits, whether assessed quantitatively as a dimension or qualitatively as a diagnosis) are significantly heritable is so ubiquitous for behavioural traits that it has been enshrined as the first law of behavioural genetics. Although the pervasiveness of this finding makes it a commonplace observation, it should not be taken for granted, especially in the behavioural sciences, because this was the battleground for nature-nurture wars until only a few decades ago in psychiatry, even fewer decades ago in psychology, and continuing today in some areas such as education. It might be argued that it is no longer surprising to demonstrate genetic influence on a behavioural trait, and that it would be more interesting to find a trait that shows no genetic influence….

For some areas of behavioural research—especially in psychiatry—the pendulum has swung so far from a focus on nurture to a focus on nature that it is important to highlight a second law of genetics for complex traits and common disorders: All traits show substantial environmental influence, in that heritability is not 100% for any trait.

Or this from the Turkheimer I linked to:

I too am a behavior geneticist, so it is important to conclude this response with a “lest I be misunderstood” paragraph. It is remarkable that in this day and age there continues to be a school of thought maintaining that behavior genetics is fundamentally mistaken about even weak genetic influence, that the nearly universal findings of quantitative genetics can be dismissed because of methodological assumptions of twin studies (Joseph, 2014) or contemporary findings in epigenetics (Charney, 2012). Those arguments can be evaluated on their own terms, but my point of view must not be cited in their support. Genetic influence is real and has profound methodological implications for how human behavior is studied.

Note that many people cited Turkheimer to me as a skeptic of behavioral genetics writ large. You could take The Blank Slate, which is now rather out of date but which functions as a book-length exploration of these topics. Or you could read The Nurture Assumption, which had a new edition come out in 2009. There is a lot of literature out there for you to consider. Does this mean that nobody disagrees? Of course not! And I specifically said that there is controversy here. But the existence of dissenters does not mean that there is not broad agreement. They could all be mistaken. But I’m not mistaken for saying that this is a widespread belief in the relevant fields. Please stop saying I’m making this up.

This blog post is not an exhaustive literature review! No. That’s true. I’m afraid I don’t have it in me to conduct one when I’m never going to be able to stick it in a tenure review. (I mean, I’m not even in a tenure track job.) Luckily, other people who receive more direct professional incentives for doing literature studies have already put them together. I linked to the Plomin article because it is a very recent review that includes citations to dozens of papers that establish a long research record. I write a lot, and I enjoy it, and I remain your humble servant, but give me a break, please.

IQ tests don’t measure anything. I’m sorry, you guys, but you have to drop this one. It is not true. The predictive power of IQ tests has been replicated over and over again. If I take a group of 8-year-olds, or a group of 12-year-olds, or a group of 16-year-olds, and give them high-quality age-appropriate IQ tests, those results will be strongly predictive of various academic outcomes. Not perfectly! No one ever said they were perfectly predictive. We live in a world of variability. But in the world of social science and human research, they are remarkably well validated. If I want to know if someone will pass high school algebra, yes, IQ tests tell me something. If I want to know if someone will graduate from high school on time, yes, IQ tests tell me something. If I want to predict how selective a college someone will go to, yes, IQ tests tell me something. They also predict a number of social and life outcomes that are not academic, although generally less well than they do academic outcomes. Click the Slate link above. The evidence is out there. This one is really a kind of know-nothingism. It’s casually destructive to keep saying that without consulting the evidence.
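If it helps to see what “strongly predictive but not perfectly” looks like, here is a toy simulation (simulated values, not real test data): a correlation around .5, which is strong by social science standards, still leaves most individual variation unexplained.

```python
# Simulated illustration of "predictive but not perfect" (not real data):
# a score correlated ~0.5 with an outcome explains ~25% of the variance,
# a lot by social-science standards, while most variation stays unexplained.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
iq = rng.normal(100, 15, n)
outcome = (iq - 100) / 15 + rng.normal(0, 1.7, n)  # signal plus independent noise

r = np.corrcoef(iq, outcome)[0, 1]
print(f"r = {r:.2f}, variance explained = {r**2:.0%}")
```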

Now: is there something tautological about this? I think so, yes. Does this reflect assumptions about value and human worth that stem from capitalism and ideology and such? Yes, and I said so. Should we prefer a society where these things are less valued? Yes, and I said so. Are there strong objections to the manner of thinking that created the tests, and the hierarchical systems which we sort people into? Of course. I am a socialist in part because I want to tear down those systems. But if you want to attack IQ tests, attack their weaknesses, not their strengths. That is, don’t attack their predictive validity; attack the social and economic framework in which they are potentially destructive. That leads to the next complaint.

You invented capitalism, “meritocracy,” and test mania. Uh, I in fact did not. So many of the complaints I’ve received have been about systems and ways of thinking that I openly oppose. Yes, it’s true that it’s bad to reduce human value to test scores – but I am against that and I said so. Yes, it’s true that this kind of thinking can lead to pernicious tracking systems and restrictions on opportunity – but I’m against that and I said so. I am attempting to describe problems with a system from within that system. That does not imply my endorsement of that system, only my understanding that we are currently in it.

The fact of the matter is that American education policy is being written by people who are obsessive about quantitative metrics of academic performance generally and test scores particularly, and who believe against all evidence that all students can reach the same arbitrary performance standards. That is a recipe for disaster for our public schools. To mount an argument against this situation, I have to be able to address the problems in a way that does not preemptively assume a radical critique that people with power in our education system are unlikely to share. Does that make sense?

I cherry-picked this one study that disputes what you’re saying. You can do that. I will read those studies with interest if I haven’t already, when I find the time. As I said, repeatedly, there is still work to be done and there are still controversies here. My mind is not closed. My honest take on the extant evidence is that these dissenting studies, while interesting and valuable, are not sufficient to counter the general trend. I could be wrong. But I’m laying out a case here and particularly citing a lot of qualified people who have made the same case.

Only twin studies have shown this result. Nope.

A belief in genetic influences on academic performance is incompatible with a belief that the racial achievement gap is the product of socioeconomic inequality. Not so. Let me argue by analogy.

Suppose I wanted to study what variables impact how high people can jump. Most people would not dispute that your genetics has an impact. We are not all equal when it comes to our natural talents for physical activities like jumping. Children of high jump Olympians will tend (tend!) to jump higher than the average person. Of course, there’s also substantial non-genetic variation in play – the amount you train, your diet and nutrition, etc. To say that jumping ability is substantially genetic is not the same as saying that it is exclusively genetic. And in fact most children of Olympic high jumpers will not go on to be Olympic high jumpers themselves, just as most geniuses do not have genius parents or children even though there is a significant genetic influence on intelligence.

Now, let’s suppose that a certain portion of society – like, say, black and Hispanic people – are fitted by society with heavy weight belts at birth. These weight belts would, obviously, constrain the ability of black and Hispanic people to jump high. If you simply looked at the average heights of jumps by racial groups, you might conclude that black and Hispanic people are genetically predisposed to being bad jumpers. But of course, when you’re wearing a weight belt, it’s hard to jump high.

Now: does arguing that the weight belt is creating a perceived difference in jumping ability mean that genetic explanations are invalid? Of course not. It means that whatever genetic predisposition individuals have is being washed out by the weight belt. The existence of the weight belts is not an argument against genetic influence on jumping ability. It’s instead a non-genetic variable that produces a group difference. Were the weight belts to be removed from black and Hispanic people, there would still be substantial genetic variation between individuals in their ability to jump. We would just find that the group average had risen relative to other groups.
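For readers who like their analogies quantified, here is the weight belt as a toy simulation, with every number invented: a purely environmental penalty applied to one group produces a group-level gap even though within-group variation is driven by the same “genetic” term in both groups.

```python
# Toy version of the weight-belt analogy (all numbers invented). A purely
# environmental penalty creates a group gap; within-group variation is
# still driven identically by the "genetic" term in both groups.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
genes = rng.normal(0, 1, n)            # same distribution in both groups
belted = rng.random(n) < 0.5           # True = fitted with the "weight belt"
jump = genes + rng.normal(0, 0.5, n) - 1.0 * belted

gap = jump[~belted].mean() - jump[belted].mean()
r = np.corrcoef(genes[belted], jump[belted])[0, 1]
print(f"group gap: {gap:.2f}, within-group gene correlation: {r:.2f}")
# Set the penalty to zero and the gap vanishes; the gene correlation stays.
```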

Of course, in this silly analogy, white supremacy and its many manifestations are the weight belt. Yes, as Charles Murray types always insist, income band alone does not sufficiently explain various aspects of the racial achievement gap. But then, who ever said racial inequality is only about income gaps? Racial inequality is a profoundly multivariate phenomenon. It manifests itself in all sorts of ways. And I don’t believe that the “human biodiversity” types have come close to accounting for the influence of all of those variables.

My belief is that, if and when we remove the weight belt of white supremacy from black and Hispanic people, the racial achievement gap will disappear, and at scale we’ll see equivalent academic performance across groups. But we’ll still also see substantial variation between individuals; the racial groupings will be proportionately arranged around performance bands, but there will still be people who do better or worse in school/on IQ tests. And that variation, the evidence suggests, is significantly (but not completely, of course) influenced by genetics.

Other factors complicate and attenuate the genetic influence on IQ and academic performance. Of course they do. I said so in the piece! The presence of other variables does not imply that there is no variation influenced by genetics. Some have cited Angela Lee Duckworth and the importance of conscientiousness as a counter to my post, but Duckworth herself explicitly says IQ/g/native intelligence are also important. I never disputed that. In fact, I wrote 2000 words on a study exploring this connection literally last week!

You’re a genetic determinist! This is eugenics! No I’m really, really not, and it really, really isn’t. In fact, I loaded that post with so many caveats and qualifiers that I am absolutely amazed that people are so affronted by it. I’m making a very mild version of a generally uncontroversial argument.

Here is what I am saying. Biological children tend to resemble their biological parents in all manner of academic outcomes, and this similarity increases over the course of life. This relationship is not perfect and no one has ever claimed that it is. However, it is powerful, particularly in the context of studying human variation. In contrast, adoptive children are not much more like their adoptive parents than they are like random strangers. Identical twins reared apart are more like each other than they are like adoptive siblings; adoptive siblings are not much more like each other than they are like random strangers. These observations have proven to be durable in a variety of studies over the course of decades conducted by established researchers at respected institutions. Perhaps new evidence will cast them into doubt; we’ll see. For now I can only work based on the information available. I think that these observations have obvious and important consequences for our educational policy, and I think it’s a good idea for progressive people to think about them. Yes, they have some potentially disturbing implications. But that’s all the more reason to be able to confront them clearly and rationally as we think about what kind of society we want to be.
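For anyone curious how twin studies turn those resemblance patterns into numbers, the classic back-of-the-envelope method is Falconer’s formula. A minimal sketch, with hypothetical twin correlations chosen for illustration rather than taken from any particular study:

```python
# Falconer's classic decomposition from twin correlations. The inputs
# below are hypothetical, for illustration only; the method assumes
# additive genetics and equal environments for both twin types.
def falconer(r_mz: float, r_dz: float) -> dict:
    """Variance components from identical (MZ) and fraternal (DZ) twin correlations."""
    h2 = 2 * (r_mz - r_dz)   # heritability
    c2 = r_mz - h2           # shared (family) environment
    e2 = 1 - r_mz            # non-shared environment plus measurement error
    return {"heritability": h2, "shared_env": c2, "nonshared_env": e2}

print(falconer(r_mz=0.75, r_dz=0.45))
# -> roughly 0.6 heritability, 0.15 shared environment, 0.25 non-shared
```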

disentangling race from intelligence and genetics

Here are two things that I believe to be true:

  • Bigoted ideas about fundamental intellectual inequalities between demographic groups are wrong. Black people aren’t less intelligent than white people, women aren’t bad at science, Asian people do not have a natural facility for math, etc.
  • Genetics play a substantial role in essentially all human outcomes, including what we define as “intelligence” or academic ability.

Both of these things, I think, are true. The evidence for both seems very strong to me. And in fact it’s not hard at all to believe both of them at the same time. Yet some progressive people seem to find it almost impossible to recognize that we can believe both things at the same time.

Take this recent Vox.com piece about pseudoscientific racism. The author, Nicole Hemmer, is typical in that she seems to think that any discussion of genetics and intelligence implies racist notions of inherent inequalities between racial groups. At the very least, she does nothing to separate a belief in genetic influences on IQ from the notion that some races are inherently more intelligent than others, when those ideas must be carefully separated. Here’s a typical passage.

Murray and Herrnstein’s book, The Bell Curve, was published in 1994, generating immediate controversy for its arguments that IQ was heritable, to a significant degree, and unchangeable to that extent; that it was correlated to both race and to negative social behaviors; and that social policy should take those correlations into account.

I kept waiting for Hemmer to pull these separate claims apart and show what’s correct and what’s wrong, but she never does. Throughout the piece she moves through the claims of people like Charles Murray without bothering to identify the truths on which they then build lies. That’s perhaps understandable, as it’s easy to simply want to wash our hands of the whole thing. But that’s a mistake. That some races are genetically superior to others is a racist fiction. That IQ is significantly heritable and unchangeable is an empirical fact. On this essential intellectual task – untangling the difference between racist pseudoscience and the science of genetic influence on human psychological outcomes – Hemmer is silent. And she’s joined in that failure by far too many liberals I know, who often get visibly anxious any time genetics and intelligence are discussed at all, as if racist conclusions must necessarily follow. This is a problem.

I am, for context, not at all a genetic determinist, compared to many other people who talk about these issues. The world is filled with people who argue as if genetics is destiny. I’m largely an amateur when it comes to these questions, but I’m willing to say that I am skeptical of the confidence and universality with which some researchers assert genetic causes for human outcomes. And there are some real methodological challenges to typical procedures for identifying genetic influences. Still, as someone with a background in academic assessment and educational testing, I find it impossible to avoid the conclusion that there is significant genetic influence on essentially all measurable human traits, including academic outcomes. In particular, that IQ is significantly heritable is one of the most robust and well-replicated findings in the history of social science. That’s the reality.

If you’d like a recent study that aggregates a lot of the evidence, this by Plomin and Deary is a great place to start. If you’d like a broad overview of what genetics research has – and crucially, has not – found in recent years, I highly recommend this article by Eric Turkheimer on the weak genetic explanation, even for those without any background in psychometrics. Turkheimer is a poised and measured writer, one who has never spoken with the zealotry common to genetic behaviorists. I encourage you to read the article.

As time goes on, the evidence for the influence of genetics on individual human variation only grows. That includes intelligence and much more. Do racist conclusions necessarily follow? Not at all. Genetics is about parentage, not race. If I claim that a trait is heritable, I am making a claim about the transmission of that trait through biological parentage – mother and father to daughter and son. Extrapolating to the socially-mediated construct of race is irresponsible and unwarranted.

Simply consider the differences in the paths of genetic information we’re talking about here. While unsolved questions still abound in genetic research, the general mechanisms through which genetic information is passed down within families have been well understood for decades. We know how parents contribute genetic material to children, and we thus know how grandparents and great-grandparents influence genotype too. If we say that a particular trait runs in families, we can look through very clear lines of descent to show how genetic information is passed along. We know more or less how an individual genotype is formed, we know how various generational connections contribute different pieces of genetic data, and we know more and more about how genotype defines phenotype.

Contrast that with the construct of race. What does it mean to call two people “Asian”? The connection between, say, a third-generation Hmong American college student whose family came to Santa Barbara as refugees from Vietnam and an Indian IT specialist whose family has lived in Madurai for generations seems, uh, unclear to me. Yes, I understand that there are phenotypical markers which often (but not always) indicate closer common ancestry between individuals. But “closer” here can still mean people whose families branched off the family tree hundreds of generations ago, making the genetic connections extremely distant. Low-cost genetic testing has revealed vast complexity in the genealogy of individuals and groups, with once simple stories of descent everywhere complicated by intermixing and the tangled lines of history. (I have bad news for the alt-right: the volk does not exist and never did.) Meanwhile, the concept of race entails vastly more baggage than just genetic lineage, all of the cultural and social and linguistic and political markers that we have, as a species, decided to package with certain phenotypical markers, historically for the purpose of maintaining white supremacy. To suggest that this process of racialization must be implied by acknowledging genetic influences on individual human outcomes is, well, thinking like Charles Murray.

If nothing else, I think it’s profoundly important that everyone understands that the belief that genetics influence intelligence does not imply a belief in “scientific” racism. In fact, most of the world’s foremost experts on genetic behaviorism believe the former and not the latter.

None of this is to deny that intelligence itself is a socially-mediated concept. What we think of as intelligence is always impacted by social and economic values. When Jews began to enter elite American colleges in large numbers, those colleges suddenly discovered the importance of “character” as a part of intelligence, conveniently grafting culture-specific ideas about what it means to be intelligent onto their admissions processes in order to ensure that enough WASP men from “the right families” made it in. Right now, we favor a definition of intelligence that is high on the kind of raw abstract processing that enables one to make a living on Wall Street or in Silicon Valley. That we have disregarded emotional intelligence, social consciousness, or ethical reasoning tells you a lot about why those industries are filled with sociopathic profiteers. This does not mean that IQ testing doesn’t tell us anything meaningful; IQ tests measure consistent and durable traits and are predictive of a number of academic and social outcomes related to those traits. It does mean, though, that our decision to reward this particular set of abilities is a choice, and one that I would argue has had deeply pernicious impacts on our society. The ability to score highly on Raven’s Progressive Matrices does tell us something about the likelihood that you will pass high school algebra or be good at chess. It does not tell us your worth as a human being, as worth is a concept created by humans. We decide who has value. That we distribute that designation so stingily is a product of capitalism, not of genetics.

Nor do you have to adopt a depressing, Gattaca-style assumption that genes are destiny. Read the Plomin and Deary; read the Turkheimer. As Turkheimer points out, the strong explanation – “a gene for X” – has largely not come true. As Plomin and Deary point out, no traits are 100% heritable, with environment, opportunity, privilege, and chance all playing a role in outcomes. Besides, inherited human traits tend to be the product of the interaction between many genes. For this reason, geniuses are often the children of parents with no particularly unusual intellectual aptitude. We live in a world of variability. Nothing is certain. And again, one of our crucial social and political tasks must be to fight against the assumption that only those who can do complex equations are worthwhile human beings. No matter how hard I worked, I could never have been a research physicist; I simply do not have the facility for advanced math. Yet I maintain a stubborn belief that I have value and can contribute to the human race. So can everybody else, in their own particular ways. There are so many ways to be a good human being, but we reward very few, and to our shame. (And by the way: human quantitative processing powers are the most likely to be replaced by automation in the workplace of the future, so don’t get too comfortable, smarties.)

I also think people sometimes avoid this topic because they’re afraid it leads to conservative political conclusions. Some conservatives seem to think that too. I find that bizarre: if intellectual talent leads to financial security under capitalism, and intellectual talent is largely outside of the control of individuals, that amounts to one of the most powerful arguments for socialism I can imagine. An outcome individuals cannot control cannot morally be used to determine their basic material conditions.

In any event: as long as we value intelligence in the way we do, progressive people must be willing to be honest about the existence of inherent differences between individuals in academic traits. When we act as if good schooling and committed teachers can bring any student to the pinnacle of academic achievement, we are creating entirely unfair expectations. Meanwhile, failure to recognize the impact of genetics on academic outcomes leaves us unable to combat an increasingly rigid social hierarchy. I often ask people, what happens after we close the racial achievement gap? What becomes the task then? Precisely because I don’t believe in pseudoscientific racism, I believe that we will eventually close the racial achievement gap, if we are willing to confront socioeconomic inequality directly and with government intervention. But what happens then? We will still have a distribution of academic talent. It will simply be a distribution with proportional numbers of black people, of women, of LGBTQ people…. Does it therefore follow that those on the bottom of the talent distribution will deserve poverty, hopelessness, and marginalization? I can’t imagine how that could be perceived as a just outcome. But if progressive people fear getting involved in these discussions out of a vague sense that any link between genetics and academic ability is racist, they will not be able to help shape the future.

Liberals have flattered themselves, since the election, as the party of facts, truth tellers who are laboring against those who have rejected reason itself. And, on certain issues, I suspect they are right. But let’s be clear: the denial of the impact of genetics on human academic outcomes is fake news. It’s alternative facts. It’s not the sort of thing the reality-based community should be trafficking in. As I said, I’m not a zealot on these topics. I read critical pieces about genetic behaviorism with care. I find a lot of genetic determinists and IQ absolutists frustrating, occasionally downright creepy. And I am willing to be surprised by new evidence. But the strength of the current evidence is overwhelming. Denying that IQ and other metrics of academic and intellectual ability are substantially heritable is as contrary to scientific consensus as the denial of global warming. This belief does not at all imply belief in racist pseudoscience. It does, however, imply a willingness to trust scientific evidence in precisely the way progressive people insist we must.

Update: Do you have questions? I have answers.

Too Hot for Academic Journals: Lexical Diversity and Quality in L1 and L2 Student Essays

Today I’m printing a pilot study I wrote as a seminar paper for one of my PhD classes, a course in researching second language learning. It was one of the first times I did what I think of as real empirical research, using an actual data set. That data set came from a professor friend of mine. It was a corpus of essays written for a major test of writing in English, often used for entrance into English-language colleges and universities, and developed by a major testing company. The essays came packaged with metadata including the score they received, making them ideal for investigating the relationship between textual features and perceived quality, then as now a key interest of mine. And since the data had been used in real-world testing with high stakes for test takers, it added obvious exigence to the project. The data set was perfect – except for the very fact that it was from Big Testing Company, and thus proprietary and subject to their rules about using their data.

That’s why, when I got the data from my prof, she said “you probably won’t want to try and publish this.” She said that the process of getting permission would likely be so onerous that it wouldn’t be worth trying to send it out for review. That wasn’t a big deal, really – like I said, it was a pilot study, written for a class – but this points to broader problems with how independent researchers can vet and validate tests that are part of a big-money, high-stakes industry.

Here’s the thing: often, to use data from testing companies like Big Testing Company, you have to submit your work for their prior review at every step of the revision process. And since you will have to make several rounds of changes for most journals, and get Big Testing Company to sign off on them, you could easily find yourself waiting years and years to get published. So for this article I would have had to send them the paper, wait months for them to say if they were willing to review it and then send me revisions, make those revisions and send the paper back, wait for them to see if they’d accept my revisions, submit it to a journal, wait for the journal to get back to me with revision requests, make the revisions for the journal, then send the revised paper back to the testing company to see if they were cool with the new revisions…. It would add a whole new layer of waiting and review to an already long and frustrating process.

So I said no thanks and moved on to new projects. I suspect I’m not alone in this; grad students and pre-tenure professors, after all, have time constraints on how long the publication process can take, and that process is professionally crucial. Difficulties in obtaining data on these tests amount to a powerful disincentive for conducting research on them, which in turn leaves us with less information about them than we should have, given the roles they play in our economy. Some of these testing companies are very good about doing rigorous research on their own products – ETS is notable in this regard – but I remain convinced that only truly independent validation can give us the confidence we need to use them, especially given the stakes for students.

As for the study – please be gentle. I was a second-semester PhD student when I put this together. I was still getting my sea legs in terms of writing research articles, and I hadn’t acquired a lot of the statistical and research methods knowledge that I developed over the course of my doctoral education. This study is small-n, with only 50 observations, though the results are still significant at the .05 alpha that is typical in applied linguistics. Today I’d probably do the whole set of essays. I’d also do a full-bore regression etc. rather than just correlations. Still, I can see the genesis of a lot of my research interests in this article. Anyway, check it out if you’re interested, and please bear in mind the context of this research.

*****

Lexical Diversity and Quality in L1 and L2 Student Essays

Introduction and Rationale.

Traditionally, linguistics has recognized a broad division within the elementary composition of any language: the lexicon of words, parts of words, and idiomatic expressions that make up the basic units of that language’s meanings, and the computational system that structures them to make meaning possible. In college writing pedagogy, our general orientation is toward higher-order concerns rather than either of these two elementary systems (Faigley and Witte). College composition scholars and instructors are more likely to concern themselves with rhetorical, communicative, and disciplinary issues than with the two elementary systems, which they reasonably believe to be too remedial to be appropriate for college-level instruction. This prioritization of higher-order concerns persists despite tensions with students, who frequently focus on lower-order concerns themselves (Beach and Friedrich). Despite this resistance, one half of this division receives considerable attention in college writing pedagogy. Grammatical issues are enough of a concern that research has been continuously published concerning how to address them. Books are published that deal solely with issues of grammar and mechanics. This grudging attention persists, despite theoretical and disciplinary resistance to it, because of a perceived exigency: without adequate skills in basic English grammar and syntax, writers are unlikely to fulfill any of the higher-order requirements typical of academic writing.

In contrast, very little attention has been paid—theoretically, pedagogically, or empirically—to the lexical development of adult writers. Consideration of vocabulary is dominantly concentrated in the scholarly literature of childhood education. Here, the lack of attention is likely a combination of resistance based on the assumption that such concerns are too rudimentary to be appropriate for college instruction, as with grammatical issues, and a perceived lack of need. Grammatical errors, after all, are typically systemic—they stem from a misunderstanding or ignorance of important grammatical “moves,” which means they tend to be replicated within assignments and across assignments. A lack of depth in vocabulary, meanwhile, does not result in observable systemic failures within student texts. Indeed, because a limited vocabulary results in problems of omission rather than of commission, it is unlikely to result in identifiable error at all. A student could have a severely limited vocabulary and still produce texts that are entirely mechanically correct.

But this lack of visibility in problems of vocabulary and lexical diversity should not lead us to imagine that limited vocabulary does not represent a problem for student writers. Academic writing often functions as a kind of signaling mechanism through which students and scholars demonstrate basic competencies and shared knowledge that indicates that they are part of a given discourse community (Spack). Utilizing specialized vocabulary is a part of that. Additionally, writing instructors and others who will evaluate a given student’s work often value and privilege complexity and diversity of expression. What’s more, the use of an expansive vocabulary is typically an important element of the type of precision in writing that many within composition identify as a key part of written fluency.

Issues with vocabulary are especially important when considering second language (L2) writers. Part of the reason that most writing instructors are not likely to consider vocabulary as an element of student writing lies in the fact that, for native speakers of a given language, vocabulary is principally acquired, not learned. Most adults are already in possession of a very large vocabulary in their native language, and those who are not are unlikely to have entered college. For L2 writers, however, we cannot expect similar levels of preexisting vocabulary. Vocabulary in a second language, research suggests, is more often learned than acquired. The lexical diversity of a given L2 writer is likely influenced by all of the factors that contribute to general second language fluency, such as amount of prior instruction, quality of instruction, opportunities for immersion, exposure to native speakers, access to resources, etc. Further, because some L2 writers return to their country of origin, or otherwise frequently converse in their L1, they may lack opportunities to continue developing their vocabulary to a level equivalent to that of their L1 counterparts. In sum, the challenge of acquiring adequate vocabulary can reasonably be expected to be higher for L2 writers.

If it can be demonstrated that diversity in vocabulary in fact has a significant impact on perceptions of quality in student essays, we might be inspired to alter our pedagogy. Attention to development of vocabulary might be a necessary part of effective second language writing instruction. Such pedagogical evolution might entail formal vocabulary teaching with testing and memorization, or greater reading requirements, or any number of instruments to improve student vocabulary. But before such changes can be implemented, we first must understand whether diversity in vocabulary alters perceptions of essay quality and to what degree. This research is an attempt to contribute to that effort.

Theoretical Background.

The calculation of lexical diversity has proven difficult and controversial. The simplest method for measuring lexical diversity lies in simply counting the number of different words (NDW) that appear in a given text. (This figure is now typically referred to as types.) In some research, only words with different roots are counted, so that inflectional differences do not alter the NDW; in some research, each different type is counted separately. The problems with NDW are obvious. The figure is entirely dependent on the length of a given text. It’s impossible to meaningfully compare a text of 50 words to a text of 75 words, let alone to a text of 500 words or 3,000 words. Problems with scalability—the difficulty in making meaningful measures across texts of differing lengths—have been the most consistent issue with attempts to measure lexical diversity.

The most popular method to address this problem has been the Type-to-Token Ratio, or TTR. TTR is a simple measurement in which the number of types is divided by the number of tokens, giving a proportion between 0 and 1, with a higher figure indicating a more diverse range of vocabulary in the given sample. A large amount of research has been conducted utilizing TTR over a number of decades (see Literature Review). However, the discriminatory power of TTR, and thus its value as a descriptive statistic, has been seriously disputed. These criticisms are both empirical and theoretical in nature. Empirically, TTR has been shown in multiple studies to steadily decrease with sample size, making it impossible to use the statistic to discriminate between texts and thus losing any explanatory value (Broeder; Chen and Leimkuhler; Richards). David Malvern et al explain the theoretical reason for this observed phenomenon:

It is true that a ratio provides better comparability than the simple raw value of one quantity when the quantities in the ratio come in fixed proportion regardless of their size. For example, in the case of the density of a substance, the ratio (mass/volume) remains the same regardless of the volume from which it is calculated. Adding half as much again to the volume will add half as much to the mass… and so on. Language production is not like that, however. Adding an extra word to a language sample always increases the token count (N) but will only increase the type count (V) if the word has not been used before. As more and more words are used, it becomes harder and harder to avoid repetition and the chance of the extra word being a new type decreases. Consequently, the type count (V) in the numerator increases at a slower rate than the token count (N) in the denominator and TTR inevitably falls. (22)

This loss of discriminatory power over sample size renders TTR an ineffective measure of lexical diversity. Many transformations of TTR have been proposed to address this issue, but none of them have proven consistently satisfying as alternative measures.
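To make the problem concrete, here is a minimal Python sketch of the effect (the file name and the crude tokenizer are my own assumptions, not part of any cited study): computing TTR over ever-longer prefixes of a single text shows the ratio declining as tokens accumulate.

```python
# A minimal sketch of why TTR falls with text length; assumes a plain-text
# essay in "essay.txt" and uses a deliberately crude tokenizer.
import re

def ttr(tokens):
    """Type-to-Token Ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)

text = open("essay.txt").read()
tokens = re.findall(r"[a-z']+", text.lower())

# TTR over ever-longer prefixes of the same text almost always declines,
# because each new token raises N but only sometimes raises V.
for n in (50, 100, 200, 400):
    if n <= len(tokens):
        print(n, round(ttr(tokens[:n]), 3))
```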

One of the most promising metrics for lexical diversity is D, derived from the vocd algorithm. Developed by Malvern et al, and inspired by theoretical statistics described by Thompson and Thompson in 1915, the process used to generate D avoids the problem of sample size through reference to ideal curves. The algorithm, implemented through a computer program, draws random samples from the target text, beginning with samples of 35 tokens, then 36, and so on up to samples of 50 tokens. The TTRs of these samples are then compared to a series of ideal curves generated from the highest and lowest possible lexical diversity for a given text. The relative position derived from this curve fitting is expressed as D, a figure that increases as lexical diversity rises. Since each run draws slightly different samples, each returns a slightly different value for D; these values are averaged to reach Doptimum. See Figure 1 for a graphical representation of the curve fitting of vocd.

Figure 1. “Ideal TTR versus token curves.” Malvern et al. Lexical Diversity and Language Development, pg. 52

D has proven to be a more reliable statistic than those based on TTR, and it has not been subject to sample size issues to the same degree as other measures of lexical diversity. (See Limitations, however, for some criticisms that have been leveled against the statistic.) The calculation of D and the vocd algorithm are quite complex and go beyond the boundaries of this research. An in-depth explanation and demonstration of vocd and the generation of D, including a thorough literature review, can be found in Malvern et al’s Lexical Diversity and Language Development (2004).
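Although the full procedure is beyond the scope of this paper, a rough illustrative sketch may help make the logic concrete. This is my own simplification based on the published description (random samples of 35 to 50 tokens, mean TTRs fitted to the ideal curve TTR = (D/N)(√(1 + 2N/D) − 1)), not the CLAN implementation; the grid-search fit in particular is a stand-in for proper curve fitting.

```python
# Illustrative vocd-style estimate of D: sample tokens at sizes 35-50,
# average TTR at each size, and find the D whose ideal curve fits best.
# Assumes the text has at least 50 tokens; CLAN's actual implementation
# differs in details (e.g., it averages repeated runs to report Doptimum).
import random

def mean_ttr(tokens, size, trials=100):
    """Mean TTR over repeated random samples of a given size."""
    total = 0.0
    for _ in range(trials):
        sample = random.sample(tokens, size)
        total += len(set(sample)) / size
    return total / trials

def ideal_ttr(n, d):
    """Ideal TTR-versus-tokens curve for a given value of D."""
    return (d / n) * ((1 + 2 * n / d) ** 0.5 - 1)

def estimate_d(tokens):
    sizes = range(35, 51)
    observed = {n: mean_ttr(tokens, n) for n in sizes}
    # crude grid search for the best-fitting D (0.1-step resolution)
    best_d, best_err = None, float("inf")
    for d10 in range(10, 2001):
        d = d10 / 10
        err = sum((observed[n] - ideal_ttr(n, d)) ** 2 for n in sizes)
        if err < best_err:
            best_d, best_err = d, err
    return best_d
```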

Research Questions.

My research questions for this project are multiple.

  • How diverse is the vocabulary of L1 and L2 writers in standardized essays, as operationalized through lexical diversity measures such as D?
  • What is the relationship between quality of student writing, as operationalized through essay rating, and the diversity of vocabulary, as operationalized through measures of lexical diversity such as D?
  • Is this relationship equivalent between L1 and L2 writers? Between writers of different first languages?

Literature Review.

As noted in the Rationale section, the consideration of diversity in vocabulary in composition studies generally and in second language writing specifically has been somewhat limited, at least relative to the attention paid to strictly grammatical issues or to higher-order concerns such as rhetorical or communicative success. This lack of attention is interesting, as some standardized tests of writing that figure importantly in educational and economic success explicitly mention lexical command as an element of effective writing.

Some research has been conducted by second language researchers considering the importance of vocabulary to perceptions of writing quality. In 1995, Cheryl Engber published “The Relationship of Lexical Proficiency to the Quality of ESL Compositions.” This research involved the holistic scoring of 66 student essays and comparison to four measures of lexical diversity: lexical variation, error-free variation, percentage of lexical error, and lexical density. These measures considered not only the diversity of displayed vocabulary but also the degree to which the demonstrated vocabulary was used effectively and appropriately in its given context. Engber found that there was a robust and significant correlation between a student’s (appropriate and error-free) demonstrated lexical diversity and the rating of that student’s essay. However, the research utilized the conventional TTR measure for lexical diversity, which is flawed for the reasons previously discussed.

In 2000, Yili Li published “Linguistic characteristics of ESL writing in task-based e-mail activities.” Li’s research considered 132 emails written by 22 ESL students, which addressed a variety of tasks and contexts. These emails were subjected to linguistic feature analysis, including lexical diversity, as well as syntactic complexity and grammatical accuracy. Li found that there were slight but statistically significant differences in the lexical diversity of different email tasks (Narrative, Information, Persuasive, Expressive). She also found that lexical diversity was essentially identical between structured and non-structured writing tasks. However, she too used the flawed TTR measure for lexical diversity. In the context of the period in which these researchers conducted their studies, the use of TTR was appropriate, but its flaws have eroded the confidence we can place in such research.

The most directly and obviously useful precedent for my current research was conducted by Guoxing Yu and published in 2009 under the title “Lexical Diversity in Writing and Speaking Task Performances.” Having been published within the last several years, Yu’s research is new enough to have assimilated and reacted to the many challenges to TTR and related measures of lexical diversity. Yu’s research utilizes D as measured via the vocd algorithm that was also used in this research. Yu also correlated D with essay rating. However, Yu’s research was primarily oriented towards comparing and contrasting written lexical diversity with spoken lexical diversity and the influence of each on perceptions of fluency or quality. My own research is oriented specifically towards written communication. Additionally, Yu’s research utilized essays that were written and rated specifically for the research, to approximate the type of essays typically written for standardized tests. My own research utilizes a data set of essays that were written and rated within the administration of a real standardized test, [REDACTED] (see Research Subjects). Also in 2009, Pauline Foster and Parvaneh Tavakoli published a consideration of how narrative complexity affected certain textual features of complexity, fluency, and lexical diversity. Like Yu, Foster and Tavakoli utilized D as a measure of lexical diversity. Among other findings, their research demonstrated that the narrative complexity of a given task did not have a significant impact on lexical diversity.

Research Subjects.

For this research, I utilized the [REDACTED] archive, a database of essays that were submitted for the writing portion of the [REDACTED]. These essays were planned and composed by test takers, in a controlled environment, in 30 minutes. The essays were then rated by trained raters working for [REDACTED], holistically scored between 1 (the worst score) and 6 (the best score). Essays that earned the same rating from each rater are represented in this research through whole numbers ending in 0 (10, 20, 30, 40, 50, 60). Essays where one rater’s score was one point higher or lower than the other’s are represented through whole numbers ending in 5 (15, 25, 35, 45, 55). Essays where the raters assigned scores discordant by more than a point were rescored by [REDACTED] and are not included in this sample. According to [REDACTED], the inter-rater reliability of the [REDACTED] averages .790. A detailed explanation of the test can be found in the [REDACTED].
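One way to express the scoring scheme just described (my own illustration; the encoding rule is inferred from the description above, not taken from [REDACTED] documentation): because concordant pairs end in 0 and adjacent pairs end in 5, the combined score is simply five times the sum of the two ratings.

```python
# Hypothetical encoding of the two-rater scores described above: raters
# score 1-6; concordant pairs end in 0, pairs one point apart end in 5.
def combined_score(r1, r2):
    if abs(r1 - r2) > 1:
        raise ValueError("discordant pair: rescored and excluded from sample")
    return 5 * (r1 + r2)

assert combined_score(4, 4) == 40   # concordant: ends in 0
assert combined_score(4, 5) == 45   # one point apart: ends in 5
```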

The corpus utilized in this research includes 1,737 essays from test administrations performed in 1990. Obviously, the age of the data should give us pause. However, as the [REDACTED] writing section and standardized essay test writing have not undergone major changes in the time since then, I believe the data remains viable. (See Limitations for more.) Within the archive, test subjects are represented from four language backgrounds: English, Spanish, Arabic, and Chinese. The essays in the archive are derived from two prompts, listed below:

  • [REDACTED]
  • [REDACTED]

The [REDACTED] archive exists as a set of .TXT files that lack file-internal metadata. Instead, the essays are identified via a coded number, which must be compared against a reference list to find information on the essay topic, language background, and score. Because of this, and because of the file conversion necessary for use with the software utilized in this research (see Methods), this study utilized a small subsample of 50 essays. For this reason, this research represents a pilot study. In order to control for prompt effects, all of the included essays are drawn from the second prompt, about the writer’s preferred method of news delivery. I chose 25 essays from L1 English writers and 25 essays from L1 Chinese writers, for an n of 50. Each set of essays represents a range of scores from across the available sample. Because essays rated 10 were almost all too short to provide adequate tokens for vocd, I eliminated them. I drew five essays each at random from those rated 20, 30, 40, 50, and 60 from the Chinese L1s, for a total of 25 essays from Chinese writers. Because L1 English writers are naturally more proficient at a test of English, the English essays have a restricted range, with very few 10s, 20s, and 30s. I therefore randomly drew five essays each from those rated 35, 40, 45, 50, and 60, for a total of 25 essays from English writers.
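The draw just described amounts to stratified random sampling by score band within each L1 group. A minimal sketch, assuming a parsed reference list of (essay_id, l1, rating) records (that record format is my own assumption):

```python
# Sketch of the stratified draw: five essays per score band per L1 group.
import random

def draw_stratified(records, l1, bands, per_band=5, seed=0):
    """records: iterable of (essay_id, l1, rating) tuples."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    chosen = []
    for band in bands:
        pool = [r for r in records if r[1] == l1 and r[2] == band]
        chosen.extend(rng.sample(pool, per_band))
    return chosen

# usage, given a parsed reference list:
# chinese = draw_stratified(records, "Chinese", bands=(20, 30, 40, 50, 60))
# english = draw_stratified(records, "English", bands=(35, 40, 45, 50, 60))
```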

Methods.

This research utilized a computational linguistic approach. Due to the aforementioned problems with traditional statistics for measuring lexical diversity like TTR and NDW, I used an algorithm known as vocd to generate D, the previously discussed measure of lexical diversity that compares random samples of given texts to a series of ideal curves to determine how diverse the vocabulary of each text is. This algorithm was implemented in the CLAN (Computerized Language Analysis) software suite, a product of the CHILDES (Child Language Data Exchange System) program at Carnegie Mellon University. CLAN is freeware that provides a graphical user interface (GUI) for several programs used in typical linguistic analyses, such as frequency lists or collocations. Previously, researchers using vocd had to perform these operations through a command line system. The integration of vocd into CLAN makes it easier to use and more accessible.

CLAN uses a proprietary file format, .CHA, as the program suite was originally developed for the study of transcribed audio data that maintains information about pausing and temporal features. In order to utilize the [REDACTED] archive files, they were converted to .CHA format using CLAN’s “textin” program. They were then analyzed using the vocd program, which returned information on types (NDW), tokens (word count), TTR, and the various Ds obtained with each sample, along with a Doptimum derived by averaging them. An example of the output provided by CLAN is below in Figure 2.

Figure 2. CLAN Interface.

Once these data were obtained, averages for types, tokens, TTR, and D were generated. Then, correlation matrices were developed for Chinese L1s, English L1s, and the combined data, to find correlations between types, tokens, TTR, D, and rating. A scatter plot was developed from the combined data’s correlation of D and rating to represent that relationship graphically.
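This analysis step can be reproduced with standard tools. A sketch, with illustrative placeholder values standing in for the per-essay CLAN output (pandas computes the Pearson matrix; scipy supplies the p-value for the key D-rating correlation):

```python
# Correlation matrix and scatterplot from per-essay measures; the numbers
# below are placeholders, not values from the study.
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "types":  [120, 95, 150, 88, 132],
    "tokens": [260, 180, 340, 165, 290],
    "ttr":    [0.46, 0.53, 0.44, 0.53, 0.46],
    "d":      [61.2, 48.9, 72.5, 42.0, 66.3],
    "rating": [40, 30, 50, 25, 45],
})

print(df.corr())                          # Pearson correlation matrix
r, p = pearsonr(df["d"], df["rating"])    # key correlation with its p-value
ax = df.plot.scatter(x="d", y="rating")   # Figure 6-style scatterplot
```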

Results.

There are several relevant, statistically significant results from my research. Raw data are attached as Appendix A.

Figure 3. Chinese L1s Correlation Matrix

As can be seen in the correlation matrix of results from essays by Chinese L1 students (Figure 3), there is a moderate, significant correlation between D and an essay’s rating, at .498 and significant at p<.05. This suggests that, for Chinese L1s (and perhaps L2 students in general), the demonstration of diversity in vocabulary is an important part of perceptions of essay quality. The highest correlations are with the simple measures of types (NDW) and tokens (essay length). These correlations (while unusually high for this sample) confirm the longstanding empirical understanding that length of essay correlates powerfully with essay rating in standardized essay tests. The moderate negative correlation between TTR and tokens adds further evidence that TTR degrades with text length.

Figure 4. English L1s Correlation Matrix

The subsample of English L1s displays some similarities to the correlations found in Chinese L1s. Once again, there is a significant correlation between tokens (essay length) and rating, suggesting that writing enough remains an essential part of succeeding in a standardized essay test. The correlation between rating and D is somewhat smaller than that for Chinese L1s. This might owe to an assumed lower functional vocabulary for ESL students: we might imagine a threshold of minimum functional vocabulary that must be met before writers can demonstrate sufficient writing ability to score highly on a standardized essay test. If so, and if L1 writers are likely to possess a vocabulary at least large enough to craft effective essay answers, lexical diversity could matter more at the lower end of the quality scale. This might prove especially true in a study with a restricted, negatively skewed range of ratings for L1 writers. More investigation is needed. Unfortunately, this correlation is not statistically significant, perhaps owing to the small sample size utilized in this research.

Figure 5. Combined Data Correlation Matrix.

The combined results show similar patterns, as is to be expected. These results are encouraging for this research. The moderate, statistically significant (p<.01) correlation between rating and D demonstrates that lexical diversity is in fact an important part of student success at standardized essay tests. The high correlation between rating and both types and tokens confirms longstanding beliefs that writing a lot is the key to scoring highly on standardized essay tests. TTR’s negative correlation with tokens demonstrates again that it inevitably falls with text length; its negative correlation with rating demonstrates that it lacks value as a descriptive statistic that can contribute to understanding perceived quality.

Figure 6. Scatterplot of Doptimum-Rating Correlation for Combined Data.

This scatterplot demonstrates the key correlation in this research, between D and essay rating. A general progression from lower left to upper right, with many outliers, demonstrates the moderate correlation I previously identified. This correlation is intuitively satisfying. As I have argued, a diverse functional vocabulary is an important prerequisite of argumentative writing. However, while it is necessary, it is not sufficient; essays can be written that display many different words, without achieving rhetorical, mechanical, or communicative success. Likewise, it is possible to write an effective essay that is focused on a small number of arguments or ideas, resulting in a high rating with a low amount of demonstrated diversity in vocabulary.

Limitations.

There are multiple limitations to this research and research design.

First, my data set has limitations. While the data set comes directly from [REDACTED], and represents actual test administrations, because I did not collect the data myself, there is a degree of uncertainty about some of the details regarding its collection. For example, there is reason to believe that the native English speaking participants (who naturally were unlikely to undertake a test of English language ability like the [REDACTED]) took the test under a research or diagnostic directive. While this is standard practice in the test administration world (Fulcher 185), it might introduce construct-irrelevant variance into the sample, particularly given that those taking the test on a diagnostic or research basis might feel less pressure to perform well.

Additionally, the particular database of essays contains samples that are over 20 years old. Whether this constitutes a major limitation of this study is a matter of interpretation. While that lack of timeliness might give us pause, it is worth pointing out that neither the [REDACTED]’s writing portion nor standardized tests of writing of the type employed in the [REDACTED] have changed dramatically in the time since this data was collected. The benefit of using actual data from a real standardized essay test, in my view, outweighs the downside of the age of that data.

A final issue with the [REDACTED] archive as a dataset for this project lies in broad objections to the use of this kind of test to assess student writing. Arguments of this kind are common, and frequently convincing. However, the exigence in utilizing this kind of data remains. Tests like the TOEFL, SAT, GRE, and similar high-stakes assessments of writing are hurdles that frequently must be cleared by L1 and L2 students alike. High-stakes assessments of this type are unlikely to go away, even given our resistance to them, and so should continue to be subject to empirical inquiry.

Additionally, while D does indeed appear to be a more robust, predictive, and widely applicable measure of lexical diversity than traditional measures like NDW and TTR, it is not without problems. Recent scholarship has suggested that D, too, is subject to reduced discrimination above a certain sample size. McCarthy and Jarvis (2007), for example, use parallel sampling and comparison to a large variety of other indexes of lexical diversity to demonstrate that D’s ability to act as a unit of comparison degrades across texts that vary by more than perhaps 300 words (tokens). McCarthy and Jarvis argue that research utilizing vocd should be restricted to a “stable range” of texts comprising 100-400 words. The vast majority of essays in the [REDACTED] archive fall within this range, although many of the worst-scoring essays contain fewer than 100 words and a small handful exceed 400. Four essays in my sample do not meet the 100 word threshold, with token counts of 51, 73, 86, and 90, and one essay exceeds the 400 word threshold, with 428 tokens. Given the small number of essays outside of the stable range, I believe my research maintains validity and reliability despite McCarthy and Jarvis’s concerns.
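Enforcing that stable range before analysis is straightforward. A small sketch (the 100 and 400 token cutoffs are McCarthy and Jarvis’s; the example counts below are illustrative):

```python
# Filter essays to McCarthy and Jarvis's 100-400 token "stable range"
# before running vocd on them.
def in_stable_range(token_count, lo=100, hi=400):
    return lo <= token_count <= hi

counts = [51, 73, 86, 90, 428, 265, 310]   # tokens per essay (examples)
usable = [n for n in counts if in_stable_range(n)]
print(usable)                              # [265, 310]
```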

Directions for Further Research.

There are many ways in which this research can be improved and extended.

The most obvious direction for extending this research lies in expanding its sample size. Due to the aforementioned difficulty in incorporating the [REDACTED] archive with the CLAN software package, less than 3% of the total [REDACTED] archive was analyzed. Utilizing the whole data set, whether through automation or by hand, is an obvious next step. A further direction for this research might involve the incorporation of additional advanced metrics for lexical diversity, such as MTLD and HD-D. In a 2010 article in the journal Behavior Research Methods titled “MTLD, vocd-D, HD-D: A validation study of sophisticated approaches to lexical diversity assessment,” Philip McCarthy and Scott Jarvis advocate for the use of those three metrics together in order to increase the validity of the analysis of lexical diversity. Used together, they argue, these metrics can compensate for one another’s shortcomings. There may, however, be statistical and computational barriers to effecting this kind of statistical analysis.
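For illustration, here is a sketch of one of those metrics, MTLD, following McCarthy and Jarvis’s published description (sequential factors with a 0.72 TTR threshold, forward and backward passes averaged); this is my own simplification, not their implementation.

```python
# Illustrative MTLD: count how many times the running TTR of a sequential
# pass drops to the 0.72 threshold ("factors"), credit a partial factor
# for the leftover segment, and divide total tokens by the factor count.
def mtld_one_pass(tokens, threshold=0.72):
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        types.add(tok)
        count += 1
        if len(types) / count <= threshold:
            factors += 1.0
            types, count = set(), 0     # reset for the next factor
    if count:                           # partial credit for the remainder
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors else float("inf")

def mtld(tokens):
    """Average of forward and backward passes, per the 2010 article."""
    return (mtld_one_pass(tokens) + mtld_one_pass(tokens[::-1])) / 2
```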

Another potential avenue for extending and improving this research would be to address the age of the essays analyzed by substituting a different corpus of student essays for the [REDACTED]. This would also help with publication and flexibility of presentation, as ETS places certain restrictions on the use of its data in publication. Finding a comparable corpus will not necessarily be easy. While there are a variety of corpora available to researchers, few are as specific and as applicable to real-world testing as the [REDACTED] archive. Many publicly available corpora do not use writing specifically generated by student writers; those that do often derive their essays from a variety of tasks, genres, and assignments, limiting the reliability of statistical comparisons. Finally, there are few extant corpora that have quality ratings already assigned to individual essays, as with the [REDACTED]. Without ratings, no correlation with D (or other measures of lexical diversity) is possible. Ratings could be generated by researchers, but this would likely require the availability of funds with which to pay them.

Finally, this research could be expanded by turning from its current quantitative orientation to a mixed methods design that incorporates qualitative analysis as well. There are multiple ways in which such an expansion might be undertaken. For example, a subsample of essays might be evaluated or coded by researchers who could assess them for a qualitative analysis of their diversity or complexity of vocabulary use. This qualitative assessment of lexical diversity could then be compared to the quantitative measures. Researchers could also explore individual essays to see how lexical diversity contributes to the overall impression of that essay and its quality. Researchers could examine essays where the observed lexical diversity is highly correlative with rating, in order to explore how the diversity of vocabulary contributes to perceived quality. Or they could consider essays where the correlation is low, to show the limits of lexical diversity as a predictor of quality and to better understand how such outliers are generated.

Implications.

As discussed in the Introduction and Rationale statement, this research arose from exigence. I have identified a potential gap in instruction, the lack of attention paid to vocabulary in formal writing pedagogy for adult students. I have also suggested that this gap might be especially problematic for second language learners, who might be especially vulnerable to problems with displaying adequate vocabulary, and having access to correct terms, when compared to their L1 counterparts.

Given that this research utilized a data set drawn from a standardized test of English using timed essay writing, and that many scholars in composition dispute the validity of such tests for gauging overall writing ability, the most direct implications of this study must be restricted to those tasks. In those cases, the lessons of this research appear clear: students should try to write as much as possible in the time allotted, and they should attend to their vocabulary both in order to fill that space effectively and to be able to demonstrate a complex and diverse vocabulary. Precise methods for this kind of self-tutoring or instruction are beyond the boundaries of this research, but both direct vocabulary instruction (such as with word lists and definition quizzes) and indirect instruction (such as through reading challenging material) should be considered. As noted, this kind of evolution in pedagogy and best practices within instruction is at odds with many conventional assumptions about the teaching of writing. Some resistance is to be expected.

As for the broader notion of lexical diversity as a key feature of quality in writing, further research is needed. While this pilot study cannot provide more than a limited suggestion that demonstrations of a wide vocabulary are important to perceptions of writing quality, its findings coincide with intuition and assumptions about how writing works. Further research, in keeping with the suggestions outlined above, could be of significant benefit to students, instructors, administrators, and researchers within composition studies alike.

Works Cited

Beach, Richard, and Tom Friedrich. “Response to writing.” Handbook of writing research. New York: The Guilford Press, 2006. 222-234. Print.

Broeder, Peter, Guus Extra, and R. van Hout. “Richness and variety in the developing lexicon.” Adult language acquisition: Cross-linguistic perspectives. Vol. I: Field methods (1993): 145-163. Print.

Chen, Ye‐Sho, and Ferdinand F. Leimkuhler. “A type‐token identity in the Simon‐Yule model of text.” Journal of the American Society for Information Science 40.1 (1989): 45-53. Print.

Engber, Cheryl A. “The relationship of lexical proficiency to the quality of ESL compositions.” Journal of second language writing 4.2 (1995): 139-155. Print.

Faigley, Lester, and Stephen Witte. “Analyzing revision.” College composition and communication 32.4 (1981): 400-414. Print.

Foster, Pauline, and Parvaneh Tavakoli. “Native speakers and task performance: Comparing effects on complexity, fluency, and lexical diversity.” Language Learning 59.4 (2009): 866-896. Print.

“IELTS Handbook.” Britishcouncil.org. The British Council, 2007. Web. 1 May 2013.

Li, Yili. “Linguistic characteristics of ESL writing in task-based e-mail activities.” System 28.2 (2000): 229-245. Print.

Malvern, David, et al. Lexical diversity and language development. New York: Palgrave Macmillan, 2004. Print.

McCarthy, Philip M., and Scott Jarvis. “vocd: A theoretical and empirical evaluation.” Language Testing 24.4 (2007): 459-488. Print.

—. “MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment.” Behavior research methods 42.2 (2010): 381-392. Print.

Richards, Brian. “Type/token ratios: What do they really tell us.” Journal of Child Language 14.2 (1987): 201-209. Print.

Spack, Ruth. “Initiating ESL students into the academic discourse community: How far should we go?.” Tesol quarterly 22.1 (1988): 29-51. Print.

“TOEFL IBT.” ets.org. The Educational Testing Service, n.d. Web. 1 May 2013.

Yu, Guoxing. “Lexical diversity in writing and speaking task performances.” Applied linguistics 31.2 (2010): 236-259. Print.

if formal music education is a privilege, spread the privilege

I was encouraged by this open letter from musicians and music educators in the Guardian, responding to a deeply wrongheaded essay arguing that, since formal music education is increasingly restricted to the white and wealthy, formal music education (that is, notation and theory) is therefore somehow bad and we should stop trying to do it at all. That sounds ridiculous but look and see for yourself.

This is hardly a unique argument in education or outside of it. (A particularly ludicrous recent example is the claim that the impending shutdown of the New York L train is a good thing, because the L train is ridden by privileged hipsters, which… I can’t even begin to tell you how immensely stupid that is.) French poetry and other “impractical” majors sometimes get this routine – they are disproportionately concentrated in elite private colleges, and therefore there is supposedly something inherently decadent about studying them. Their connection to privilege somehow renders them unclean. But of course the fact that these wonderful subjects are now the province of those who have less immediate pressure to achieve independent financial stability only means that we should spread that condition.

A more equitable and humane society is one in which more people, not fewer, can spend their time on beautiful, “impractical” pursuits. Yet there are those deluded leftists who sometimes take a similar tack; why, they ask, are we funding Shakespeare in the park when there are people without warm clothes for the winter? Why pay for museums when some people go hungry? But follow this thinking long enough and you realize what they’re really saying is “poor people have no inner life.” In a just society we recognize that nobody, rich or poor, lives on bread alone. If your socialism doesn’t spread access to music and art and theater and cathedrals and tree-lined boulevards, I have no use for it or for you.

The point of privilege analysis is to spread the privileges to everyone, not to end them, and since music is the food of love, let rich and poor kids play on.