Last Updated October 20, 2016 @ernietedeschi



Ernie Tedeschi

Github repository of Stata code (start with masterrw.do)

Download LAT/USC microdata (requires free registration)

Note: The project described on this site relies on data from survey(s) administered by the Understanding America Study, which is maintained by the Center for Economic and Social Research (CESR) at the University of Southern California. The content of this site is solely the responsibility of the author and does not necessarily represent the official views of USC or UAS.


Motivation

The LA Times/USC Tracking Poll (LAT/USC) is an online longitudinal political survey of US adults with a large rolling sample relative to other election polls. It is based on methodology developed at the nonpartisan RAND Corporation and employed to accurate effect back in 2012. The poll is singular in several ways: not only does it track the same respondents frequently over time, it also asks them for their prediction of the winner and their self-assessed likelihood of voting, in addition to their personal election preference. Respondents provide each of these measures as continuous probabilities from 0 to 100 rather than binary yes/no choices. The poll is folded into the broader survey work of a respected academic research institution: the USC Dornsife Center for Economic and Social Research. The Center makes the entire individual-level microdata of the poll public, free, and frequently updated, something virtually no other poll does.

However, LAT/USC has been a relative outlier among 2016 US election polls, at times dramatically so. This has caused some observers to overlook the poll's rich underlying data and dismiss the entire survey out of hand, while others have disproportionately cited the poll, often in motivated ways.

Given the poll's unique features, any one of them is a potential factor in its frequent outlier results, but the three most discussed possibilities are as follows:

1. The poll weights respondents partially based on self-described 2012 vote; due to well-known ex post recollection bias towards the winner, this has the effect of overweighting Romney voters who now are less likely to support Clinton;

2. Separate from the weighting, the longitudinal nature of the poll means that the survey is "stuck" with a skewed sample that would have been corrected had the poll repeatedly redrawn its sample over time; or

3. The poll is picking up signal being missed by most other electoral polls this cycle.

Since hypotheses 2 and 3 are nearly impossible to assess pre-Election Day, the goal of this exercise is to test the first hypothesis by excluding 2012 vote as a target for weighting.

Procedure
The basic approach is as follows:

Step 1: Choose the target dimensions for reweighting
Step 2: Generate population proportions along each Step 1 dimension using Census surveys
Step 3: Load the LAT/USC microdata, correct for missing data, and prepare the 7-day rolling samples
Step 4: Merge in the Step 2 proportions
Step 5: Create an initial synthetic weight and iteratively adjust using the Step 2 proportions for each 7-day sample

More discussion of each step follows:

Step 1: Choose the target dimensions for reweighting
I chose these dimensions based on 1) what is available in both the LAT/USC and Census data, 2) how LAT/USC coded the categories underlying their variables, and 3) whether the variable is noncyclical and nonseasonal. Ultimately, I thought gender, race/ethnicity, age, household income, education, and state of residence covered a broad array of electorally relevant dimensions. Criterion #3 ruled out labor market variables such as employment status, since these can swing significantly on a month-to-month basis without seasonal adjustment.

LAT/USC is a large sample relative to other election polls but is very small as a sample of the whole population, which makes multi-dimensional reweighting a challenge. Too many explicit interactions between variables risk dropping weights when applied to LAT/USC. My strategy, then, is an iterative reweight, where I extract the population proportions along each single dimension and use those to iteratively constrain the synthetic new weights I create. However, the one interaction I did include at this point was between gender (2 categories) and race/ethnicity (4 categories), as this is 1) still safely low-dimensional and 2) electorally relevant; a minimal sketch of that cell appears below. Note that none of the LAT/USC demographic variables I use are continuous: LAT/USC, for example, has 5 adult age categories, 4 race/ethnicity categories, and 3 household income categories.
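In Stata terms, that interaction just amounts to collapsing the two variables into a single categorical cell, something like the minimal sketch below (the variable names gender and raceeth are placeholders, not necessarily what the LAT/USC microdata or masterrw.do actually use):

    * Sketch: build a single gender-by-race/ethnicity cell (2 x 4 = 8 categories).
    * Variable names here are illustrative placeholders.
    egen gendrace = group(gender raceeth), label
    tabulate gendrace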

Step 2: Generate population proportions along each Step 1 dimension using Census surveys
For sensitivity analysis, I use two different Census surveys of the population, to create two different sets of weights.

My preferred source is the Current Population Survey (CPS) March 2016 Annual Social and Economic Supplement (ASEC), which is a joint survey between the Census Bureau and the Bureau of Labor Statistics. The CPS is the survey used, among other things, to calculate the unemployment rate, and the ASEC is a special augmented version of the CPS used for producing the yearly income and poverty statistics.


The main upside of the CPS is that it is timely, having been conducted only seven months ago. Also, the CPS has recently made improvements in the way it collects data on household income, making it a reliable source of data on that topic in particular.


One downside of the CPS is that its sample is relatively small for this type of survey: 185,000 people in 94,000 households. However, since my procedure uses relatively low-dimension categories, I am not uncomfortable with this aspect of the CPS.


The more important consideration in my opinion is that the CPS is designed to cover only the civilian population living in households (the civilian noninstitutional population). It does not reliably sample active duty military servicemembers, and it does not include people living in institutional or group quarters at all (prisoners, residents of nursing homes, monasteries, etc.) unless another household member lists them (e.g. students living in dormitories, who are often not directly covered by the CPS but whose parents may list them as household members, effectively bringing them into the CPS). Adding these populations in produces the resident population.


For weighting purposes, what matters is the population proportion of each variable, and here the differences between the civilian noninstitutional and resident populations are often small. For example, men made up 48.3% of the adult civilian noninstitutional population in 2014 versus 48.7% of the resident population, which is not surprising given the large male skew of the active duty military and of institutional residents such as prisoners. And it bears emphasizing that a person excluded from the civilian noninstitutional population for weighting purposes is not excluded from the reweighted LAT/USC; the question is simply whether each person in LAT/USC is weighted correctly.


So for sensitivity analysis, I generate weights from an alternative Census survey: the 2014 American Community Survey (ACS). Like the CPS, the ACS collects a wealth of demographic, social, and economic data. Unlike the CPS, however, it covers the whole resident population: the household and institutional populations as well as the military. Its sample size is 2.5 million people, far larger than that of the CPS ASEC.


The downside of the ACS however is that it is released with far more of a lag: the 2014 microdata is the most recent available, but that means most of its sample responded more than 2 years ago. It is very possible that even over this short time, the demographics and, especially, income makeup of the population has changed in politically-meaningful ways. We know, for example, that median household income rose sharply in 2015; because of timing differences between the CPS ASEC and ACS, this and other dynamics may be uncaptured by the ACS.


Both the CPS ASEC and ACS extracts I use come from the Minnesota Population Center's superb IPUMS database.
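As a rough illustration of Step 2, the sketch below computes the weighted population share of each age category from an IPUMS CPS ASEC extract; the file name, the assumption that agecat has already been recoded into the LAT/USC bins, and the variable names are all placeholders rather than the actual contents of masterrw.do:

    * Sketch: weighted population shares along one dimension (age category)
    * from an IPUMS CPS ASEC extract. Names are illustrative.
    use cps_asec_2016.dta, clear
    keep if age >= 18                          // adult population only
    * assume agecat has already been recoded to match the LAT/USC age bins
    collapse (sum) pop = asecwt, by(agecat)    // sum of person weights = population
    egen double totpop = total(pop)
    gen double share_agecat = pop / totpop     // each category's population share
    keep agecat share_agecat
    save share_agecat_asec.dta, replace

The same pattern would be repeated for each reweighting dimension, and for the ACS extract using its person weight, producing one small file of target proportions per dimension per survey.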


Step 3: Load the LAT/USC microdata, correct for missing data, and prepare the 7-day rolling samples
Some LAT/USC respondents are missing data that we need for reweighting. In some cases, this data is missing for some dates but is filled in for others; in these instances I fill in the missing values with the respondent's successful responses from other dates. Of the 37,000+ unique daily responses, 85 instances still have missing data along one of the relevant dimensions. I drop these instances from the sample.
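A minimal sketch of that fill-and-drop step, assuming the demographics are numerically coded and using placeholder names (uasid for the respondent ID and illustrative names for the demographic variables):

    * Sketch: fill a respondent's missing demographics from his or her
    * non-missing responses on other dates, then drop any observations
    * still missing a reweighting variable. Names are illustrative.
    foreach v of varlist gender raceeth agecat educcat inccat statefips {
        bysort uasid (`v'): replace `v' = `v'[1] if missing(`v')
    }
    egen anymiss = rowmiss(gender raceeth agecat educcat inccat statefips)
    drop if anymiss > 0
    drop anymiss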

In theory, LAT/USC polls each respondent once every 7 days, meaning that a respondent should only be present once in each 7-day wave. In practice, many respondents appear in the data more often than once every 7 days. I ensure that each respondent appears at most once in each 7-day window and that, for each window, the appearance is his or her latest response. This shrinks the sample size of each wave by 7% on average, to 2,447 from 2,632.
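A simplified sketch of how one such 7-day window might be built and deduplicated (the date variable, respondent ID, and example window end date are placeholders; the actual rolling windows advance one day at a time):

    * Sketch: for a single 7-day window, keep only each respondent's
    * latest response within the window. Names are illustrative.
    local enddate = td(11oct2016)                  // example window end date
    preserve
    keep if inrange(date, `enddate' - 6, `enddate')
    bysort uasid (date): keep if _n == _N          // latest response per respondent
    * ... reweighting and tabulation for this window would go here ...
    restore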

Step 4: Merge in the Step 2 proportions
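The Step 2 proportion files are merged onto every LAT/USC observation by category. A minimal sketch, using the same placeholder file and variable names as the earlier sketches:

    * Sketch: attach the Step 2 target population shares to each LAT/USC
    * observation, one many-to-one merge per reweighting dimension.
    merge m:1 gendrace using share_gendrace_asec.dta, nogenerate
    merge m:1 statefips using share_statefips_asec.dta, nogenerate
    merge m:1 agecat using share_agecat_asec.dta, nogenerate
    merge m:1 educcat using share_educcat_asec.dta, nogenerate
    merge m:1 inccat using share_inccat_asec.dta, nogenerate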


Step 5: Create an initial synthetic weight and iteratively adjust using the Step 2 proportions for each 7-day sample
Here, I start out by assigning everyone an initial static weight of 10,000. Then, starting with the gender & race/ethnicity proportions, I go through each 7-day wave, see what each gender/race combination's population proportion is supposed to be based on the ACS and CPS ASEC, see what it actually is in the wave, and then adjust each individual's synthetic weight by the ratio of the target proportion to the actual proportion for that individual's gender/race/ethnicity combination. I then follow the same procedure for, in order, state of residence, age, education, and income, adjusting the synthetic weights along each dimension. Then I go back and repeat the whole process again beginning with gender/race/ethnicity, looping 100 times. Finally, I adjust every individual weight in each wave by an equal proportion to bring the total up to the adult population total of 250 million (this last step is not strictly necessary for getting the weighted average electoral preferences, but it makes the weights more interpretable).
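In code, the core of that loop is an iterative proportional adjustment over the five dimensions. The sketch below covers a single 7-day window and reuses the placeholder names from the earlier sketches; the actual implementation lives in masterrw.do:

    * Sketch: iterative reweighting for one 7-day window. synthwt starts at
    * 10,000 for everyone; share_* are the Step 2 target proportions merged
    * in at Step 4. All names are illustrative.
    gen double synthwt = 10000
    forvalues iter = 1/100 {
        foreach d in gendrace statefips agecat educcat inccat {
            * current weighted proportion of each category of dimension `d'
            egen double totwt = total(synthwt)
            bysort `d': egen double catwt = total(synthwt)
            gen double actual_`d' = catwt / totwt
            * scale each weight by the ratio of target to actual proportion
            replace synthwt = synthwt * share_`d' / actual_`d'
            drop totwt catwt actual_`d'
        }
    }
    * rescale so the weights sum to roughly the 250 million adult population
    egen double tot = total(synthwt)
    replace synthwt = synthwt * 250000000 / tot
    drop tot

Because each dimension's adjustment slightly disturbs the others, repeating the pass many times lets the weights settle toward satisfying all of the marginal targets at once.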

I also generate a composite weight that is the arithmetic average of the ACS and ASEC weights.
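Expressed the same way, and assuming the two sets of weights are stored as synthwt_asec and synthwt_acs (placeholder names), the composite is just:

    * Composite weight: simple arithmetic mean of the ASEC- and ACS-based weights
    gen double synthwt_comp = (synthwt_asec + synthwt_acs) / 2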


Results
As the figure below shows, the reweighting procedure produces a result that is far closer to the center of the polling distribution as measured by the RCP 4-way average. From July 11 to October 11, the RCP has shown an average 3.46 percentage point margin for Clinton over Trump, versus 3.42 percentage points for my baseline ASEC-reweighted LAT/USC and -1.84 percentage points for the official LAT/USC. The ACS-reweighted version shows an average margin of 2.13 percentage points for Clinton.