Wisconsin Democratic Primary Prediction - 2016

I've been updating my county-level model to predict the outcome of the Democratic primary.

I recently added a Google search trend variable that covers the seven days before the election (not including election day itself) and is equal to Sanders / (Sanders + Clinton). Sanders dominates the search trends almost everywhere (typically about 2:1), but the percent of searches that he gets still has a strong positive correlation with the outcome. As of this past hour, Sanders is getting 73% of the searches in Wisconsin, which is a strong showing.
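As a minimal sketch of the search share variable described above: the function names are hypothetical, and pooling the seven days before taking the ratio (rather than averaging daily ratios) is an assumption, not necessarily how the model does it.

```python
def search_share(sanders_interest: float, clinton_interest: float) -> float:
    """Sanders' share of two-candidate search interest: S / (S + C)."""
    total = sanders_interest + clinton_interest
    if total <= 0:
        raise ValueError("need positive combined search interest")
    return sanders_interest / total


def seven_day_share(daily_pairs):
    """Pool the last seven pre-election days, then take the share.

    daily_pairs: list of (sanders, clinton) interest values, oldest first,
    with election day itself already excluded by the caller.
    """
    window = daily_pairs[-7:]
    sanders = sum(s for s, _ in window)
    clinton = sum(c for _, c in window)
    return search_share(sanders, clinton)


# A 2:1 ratio in Sanders' favor corresponds to a share of about 0.667.
print(round(search_share(2, 1), 3))
```

On this scale the typical 2:1 ratio is a share of roughly 0.67, so Wisconsin's 73% sits above the usual level.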

I also reintroduced polls into the model, since we've had a lot of polling (I'm using the Pollster average, which includes the new Emerson and ARG polls).

Current Prediction for Wisconsin: Sanders 62.5% and Clinton 37.5%

My previous prediction was 61.5/38.5, so it looks like the search trend carries more weight in the model than the polling (which is 52/48).

County predictions (Sanders share) for the most populous counties
55009 (Brown): 64.3%
55025 (Dane): 64.6%
55035 (Eau Claire): 69.9%
55039 (Fond du Lac): 64.9%
55059 (Kenosha): 61.6%
55063 (La Crosse): 66.8%
55073 (Marathon): 62.5%
55079 (Milwaukee): 55.4%
55087 (Outagamie): 68.0%
55101 (Racine): 57.6%
55105 (Rock): 62.7%
55117 (Sheboygan): 63.0%
55127 (Walworth): 66.5%
55131 (Washington): 63.7%
55133 (Waukesha): 62.5%
55139 (Winnebago): 67.6%

Note: this model is back-tested, but that doesn't mean it will accurately predict the future. It is a work in progress and comments are welcome. If you want to discuss this or your own model, please email me.

The general approach is that I have a wide array of demographic, internet, and electoral variables. The most important variables are percent black, polling data, Facebook likes, and the 7-day (pre-election) Google search trend.

Other important variables are age, sex, education, vote results of previous elections (2000 and 2012), income, caucus (yes/no), percent Hispanic, percent Asian, percent white, percent Native American, and even the percent of commuters that use bicycles. I broke age/sex into groups (like men 18-21).

I minimize my use of state-level variables to reduce the danger of overfitting. On the other hand, I am using several weird variables (like total population) and am also using the square, log, square root, and even cube of one or more variables. I'm not sure if this is justified, but these terms are statistically significant (all at 0.01 or much better) and they narrow the model's confidence interval. Taking the log of income is fairly standard, but taking the square of percent black is strange. Still, it makes sense that going from 90% to 100% black could have a different effect than going from 50% to 60%.
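To make the point about the squared term concrete, here is a small sketch. The feature names are hypothetical, and only the two transforms named above (log of income, square of percent black) are shown.

```python
import math


def transformed_features(pct_black: float, median_income: float) -> dict:
    """Build the raw and transformed terms for one county.

    pct_black is a fraction in [0, 1]; median_income is in dollars (> 0).
    """
    return {
        "pct_black": pct_black,
        "pct_black_sq": pct_black ** 2,
        "log_income": math.log(median_income),
    }


def squared_term_change(start: float, end: float) -> float:
    """How much the squared term moves when the raw share goes start -> end."""
    return end ** 2 - start ** 2


# The same 10-point step moves the squared term more at the high end.
print(round(squared_term_change(0.90, 1.00), 2))  # 0.19
print(round(squared_term_change(0.50, 0.60), 2))  # 0.11
```

With both the linear and squared terms in the regression, the fitted effect of a 10-point change can differ depending on where in the range it happens.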

I use a linear regression to back-test these variables against previous election results at the county level.

I then apply a turnout model to estimate the number of voters in each county. The turnout model comes from a separate regression fitted on primaries only (caucuses are excluded). I multiply each county's expected turnout by its predicted Bernie share, sum the results to get the total Bernie votes for the state, and divide by the total number of expected state votes to get the predicted Bernie state percent.
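The aggregation step above is a turnout-weighted average of the county shares. A minimal sketch follows. The turnout figures are hypothetical placeholders, while the two shares are the Milwaukee and Dane predictions listed earlier.

```python
def state_share(counties):
    """Turnout-weighted state share.

    counties: list of (expected_turnout, predicted_sanders_share) pairs,
    with the share expressed as a fraction of the two-candidate vote.
    """
    total_votes = sum(turnout for turnout, _ in counties)
    if total_votes <= 0:
        raise ValueError("expected turnout must be positive")
    sanders_votes = sum(turnout * share for turnout, share in counties)
    return sanders_votes / total_votes


# Hypothetical turnouts; shares are the Milwaukee and Dane predictions.
example = [(100_000, 0.554), (50_000, 0.646)]
print(round(state_share(example), 3))  # 0.585
```

Because the weights are expected votes, an error in the turnout model shifts the state number even if every county share is right.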

County level predictions vs results

So in general my state prediction was off by 5.85%. I predicted 62.5% (of the Clinton/Sanders total) and Sanders got 56.75% of that total.

County - Prediction (Actual) - Error (prediction minus actual, in points)
55009 (Brown): 64.3% (57.3%) - 7.0
55025 (Dane): 64.6% (62.6%) - 2.0
55035 (Eau Claire): 69.9% (63.9%) - 6.0
55039 (Fond du Lac): 64.9% (57.1%) - 7.8
55059 (Kenosha): 61.6% (57.2%) - 4.4
55063 (La Crosse): 66.8% (62.8%) - 4.0
55073 (Marathon): 62.5% (58.9%) - 3.6
55079 (Milwaukee): 55.4% (48.0%) - 7.4
55087 (Outagamie): 68.0% (60.1%) - 7.9
55101 (Racine): 57.6% (50.8%) - 6.8
55105 (Rock): 62.7% (60.5%) - 2.2
55117 (Sheboygan): 63.0% (54.0%) - 9.0
55127 (Walworth): 66.5% (61.7%) - 4.8
55131 (Washington): 63.7% (54.4%) - 9.3
55133 (Waukesha): 62.5% (51.4%) - 11.1
55139 (Winnebago): 67.6% (61.2%) - 6.4

If you subtract a 5.85% state swing, the county results look better. It still looks like I over-estimated Sanders' support in Milwaukee County and the surrounding counties. I have a density variable that won't work as well when a city is combined with a suburban area (which is true of Milwaukee), but I don't think that effect was large enough to explain the miss.
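The swing adjustment can be sketched as follows, using the predicted and actual shares from the table above for a handful of counties. The function name is hypothetical. A positive result means Sanders was still over-estimated after removing the statewide swing.

```python
STATE_SWING = 5.85  # statewide miss, in points, as quoted above

# (predicted, actual) Sanders share in percent, taken from the county table.
COUNTIES = {
    "Milwaukee": (55.4, 48.0),
    "Waukesha": (62.5, 51.4),
    "Washington": (63.7, 54.4),
    "Racine": (57.6, 50.8),
    "Dane": (64.6, 62.6),
    "Rock": (62.7, 60.5),
}


def swing_adjusted_errors(counties, swing):
    """Return {county: (predicted - actual) - swing}, rounded to 2 decimals."""
    return {
        name: round((pred - actual) - swing, 2)
        for name, (pred, actual) in counties.items()
    }


for name, err in swing_adjusted_errors(COUNTIES, STATE_SWING).items():
    print(f"{name}: {err:+.2f}")
```

In this subset, Milwaukee and its neighbors stay positive after the adjustment while Dane and Rock turn negative, which matches the pattern described above.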

I'm also very puzzled as to why my model would be off by 5.85% at the state level. The polls were off by almost as much: the Pollster polling average had Sanders at 51.8% (of the Sanders/Clinton total), so it was 4.95% off.

The good news is that my real-time, county-based model for predicting the statewide swing picked up the problem early. With very few results in (1% or so), it already saw a 7.3% swing against my model. Over the course of the night, this fell to the 5.85% swing. (I also saw this late trend towards Sanders during the vote counting on March 15 in all five states.)
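One simple way to estimate a statewide swing from partial returns is a vote-weighted average of each reporting county's miss. This is only a sketch of the idea. Weighting by votes counted so far is an assumption, the function name is hypothetical, and the partial returns below are made up for illustration.

```python
def estimate_swing(reports):
    """Vote-weighted mean of (observed - predicted) over reporting counties.

    reports: list of (votes_counted, predicted_share, observed_share),
    with shares in percent. A negative result is a swing against the model.
    """
    counted = sum(votes for votes, _, _ in reports)
    if counted == 0:
        raise ValueError("no votes counted yet")
    weighted = sum(votes * (obs - pred) for votes, pred, obs in reports)
    return weighted / counted


# Made-up early returns from two counties (illustrative only).
early = [(1000, 60.0, 53.0), (3000, 64.0, 56.0)]
swing = estimate_swing(early)
print(swing)  # -7.75

# Apply the swing to the pre-election state prediction of 62.5%.
print(62.5 + swing)  # 54.75
```

As more counties report, the estimate should settle, which is consistent with the swing drifting from 7.3% to 5.85% over the night.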

Search Trends

I ran the model without the 7-day search trend and got 58.7% for Sanders. I wonder whether the search trend was biased by advertising, since the Sanders campaign may have been making large ad buys.

Final Result

Sanders got 56.794% (i.e. essentially 56.8%, reducing my error by 0.04%)