December 22, 2015

Liars, Damned Liars, Presidential Candidates

[Figure: Tree of Candidates, 22 December 2015. Names shown: Bush, Carson, Christie, Clinton, Cruz, Fiorina, Huckabee, Kasich, Obama, O'Malley, Paul, Pelosi, Rubio, Santorum, Trump]

A few months ago, I plotted out the Presidential candidates from the two major parties in terms of their truthfulness. I did this through a "tree" (really more of a "bush") that showed how close each was to all the others if one uses the Politifact Truth-o-Meter to measure "truthfulness".

As I mentioned before, Politifact gives us a summary chart for each person and a description of each of their determinations. Unfortunately, comparing the profiles isn't quite straightforward, especially if you want to compare several of them at once. That's where nerdistry comes in.

So, once again, I took the data on each politician's page, ran it through some nerd magic, and came up with the clustering. The names correspond to formally filed candidates who have more than 4 rulings on Politifact and haven't dropped out of the race, plus Barack Obama and Nancy Pelosi for reference. I color-coded them by party. You can click on any name to go to that person's Politifact page. Many of the candidates now have more data--more statements that have received a Politifact "ruling". That's how Dr. Ben Carson made it onto the tree: he was missing in August only because Politifact didn't yet have more than 4 statements by him. Now they have plenty. Several candidates dropped out. Nevertheless, if you look at the older tree, you'll see that things haven't changed much. As before, three "meaningful" clusters appeared in the data and have curves drawn around them. The differences among politicians within the same "meaningful" cluster are not worth noting. Yes, this means that, when it comes to truthfulness as measured by Politifact, Santorum, Fiorina, and Huckabee lump in with Pelosi, and Clinton (and Obama) are pretty much the same as Bush, Christie, Kasich, Paul, and Rubio.

What do the clusters show? As last time, it comes down to which end of the "True" vs. "Pants-on-Fire" profile a candidate sits on. The top left cluster (let's call it Clinton-Bush) leans more toward "True" and "Mostly True". The cluster on the right (Pelosi and her boys) prefers "Half True" and "Mostly False". The bottom cluster is heavily dominated by "False", with a dash of "Pants on Fire" and a rare "True" or "Mostly True". O'Malley is still in his own world, but he has moved relative to the tree as a whole. Last time, he was a complete outlier who couldn't be related to anyone. Now that the wacky pack at the bottom has amassed a truly monumental pile of whoppers, O'Malley has been squeezed in closer to the other two clusters than to Trump/Fiorina/Cruz.

And the take-home message? Two messages, really. First, if you agree with Politifact, the tree is a rough indication of who is more trustworthy; if you reject Politifact's conclusions, just invert the true/false interpretations. Second, you can see who resembles whom in terms of trustworthiness, and that this hasn't changed much since August. Whether you agree with Politifact or reject it, this part is consistent: politicians in the same cluster seem to share the same basic character when it comes to honesty or its lack.

Nerd Section

This is a repeat of August's methods. I used the "R" statistical language and the "cluster", "gclus", "ape", "clue", "protoclust", "multinomialCI", and "GMD" packages. Then I gathered up the names of declared candidates for US President. I did not intend to limit this to Republicans and Democrats; unfortunately, when I looked people up on Politifact, only Republicans and Democrats had more than 4 rulings. Why more than 4? A rough estimate of the "standard error" of count data is the square root of the total. The square root of 4 is 2, so a candidate with only 4 rulings would be accurate to plus or minus 2 out of 4--a 50% relative error. That's too much for my taste. This time, I had 15 candidates.
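For the curious, here's a minimal sketch of the setup. The counts matrix below is invented purely for illustration, not anyone's actual Politifact tally; on CRAN the package is capitalized "MultinomialCI".

    library(cluster)        # agnes() for flexible-beta clustering
    library(gclus)          # cluster-ordering helpers
    library(ape)            # as.phylo() for plotting trees
    library(clue)           # cl_ensemble()/cl_consensus() for the consensus tree
    library(protoclust)     # protoclust() minimax linkage
    library(MultinomialCI)  # multinomialCI() for count uncertainty

    # Hypothetical rulings-per-candidate matrix: rows are candidates,
    # columns are the six Truth-o-Meter levels. The real numbers come
    # from each candidate's Politifact page.
    counts <- rbind(
      Smith = c(10, 12, 8, 5, 3, 1),   # invented for illustration
      Jones = c( 2,  3, 6, 9, 7, 4))
    colnames(counts) <- c("True", "Mostly True", "Half True",
                          "Mostly False", "False", "Pants on Fire")

    # Rough standard error of a count is the square root of the total:
    sqrt(rowSums(counts))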

Comparing them required a distance metric. I could have assigned a score to each ruling level and then calculated an average score per candidate. While this might be tempting, it is also wrong. Why? Because that method makes a loose cannon look the same as a muddled fence-sitter. Imagine a candidate who tells only complete truths and complete whoppers. Assign scores and average, and that candidate comes out identical to one who never commits and makes only halfway statements. Such people should show up as distinct in any valid comparison.
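A toy example makes the collision concrete, using an invented score scale (True = 1.0, Half True = 0.5, Pants on Fire = 0.0):

    # Loose cannon: ten complete truths and ten complete whoppers.
    loose_cannon <- c(rep(1.0, 10), rep(0.0, 10))
    # Fence-sitter: twenty "Half True" rulings, nothing else.
    fence_sitter <- rep(0.5, 20)

    mean(loose_cannon)  # 0.5
    mean(fence_sitter)  # 0.5 -- identical averages, very different people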

Fortunately, there are other ways to handle this question. I decided to use a metric based on the chi-squared distance, which sums, over the ruling categories, the squared difference between the two counts divided by the expected value. It's used for comparing images, among other things. However, a raw chi distance depends very much on the totals, and the totals differed widely among candidates. The solution was easy: I took the relative counts (count divided by total) for each candidate.
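A minimal sketch, taking the "expected value" for each category to be the mean of the two relative frequencies (one common convention; treat the exact weighting as an assumption). It builds the pairwise distance matrix from the counts defined earlier:

    # Relative counts: each candidate's rulings as a fraction of their total.
    freqs <- counts / rowSums(counts)

    # Chi-squared distance between two frequency profiles: squared
    # differences divided by the expected (here, mean) frequency.
    chi_dist <- function(p, q) {
      expected <- (p + q) / 2
      ok <- expected > 0                 # skip categories empty for both
      sqrt(sum((p[ok] - q[ok])^2 / expected[ok]))
    }

    n <- nrow(freqs)
    D <- matrix(0, n, n, dimnames = list(rownames(freqs), rownames(freqs)))
    for (i in seq_len(n))
      for (j in seq_len(n))
        D[i, j] <- chi_dist(freqs[i, ], freqs[j, ])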

I needed one more element for my metric. Politifact does not rate every single statement someone makes; they pick and choose. Given enough statements, a profile probably presents an accurate picture, but until the count gets large, there is always some uncertainty. Fortunately, multinomialCI estimates that uncertainty. I ran the counts through multinomialCI and got a set of "errors" for each candidate. I combined these with the chi distances to obtain an "uncertainty-corrected distance" between each pair of candidates. Long story short, I divided the chi distance by the square root of the sum of the squares of the two candidates' errors. That meant a candidate with a large error (few rulings) was automatically "closer" to every other candidate, reflecting the uncertainty in that candidate's actual position.
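In code, the correction might look like this. Collapsing multinomialCI's per-category confidence intervals into one "error" per candidate is done here with the mean half-width, which is one reasonable choice among several:

    # One overall "error" per candidate: mean half-width of the 95%
    # simultaneous confidence intervals on the ruling proportions.
    ci_error <- function(x, alpha = 0.05) {
      ci <- multinomialCI(x, alpha)      # one (lower, upper) row per category
      mean((ci[, 2] - ci[, 1]) / 2)
    }
    errors <- apply(counts, 1, ci_error)

    # Uncertainty-corrected distance: chi distance (matrix D from the
    # sketch above) divided by the square root of the sum of the two
    # candidates' squared errors. Large errors (few rulings) pull a
    # candidate closer to everyone else.
    Dcorr <- D / sqrt(outer(errors^2, errors^2, "+"))
    d <- as.dist(Dcorr)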

I then created a series of hierarchical clustering trees from this set of distances. There is a good deal of argument over which tree-building method is best, so I decided to combine several. I created trees using the "nearest neighbor", "complete linkage", "UPGMA", "WPGMA", "Ward's", "Protoclust", and "Flexible Beta" methods. The "clue" package was designed to combine such trees in a rational fashion; feel free to look it up if you want to follow all the math. I used clue to create the "consensus tree", which is the structure posted above. But clue doesn't tell you how to "cut" the clusters. For that, I turned to the "elbow method".
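Before we get to the elbow, here's a minimal sketch of the consensus step, assuming clue's least-squares ("euclidean") consensus over the ensemble of trees; the exact consensus method is a detail you can vary. The seven tree methods map onto standard R calls:

    # Seven trees from the same distance matrix.
    trees <- list(
      hclust(d, method = "single"),     # nearest neighbor
      hclust(d, method = "complete"),   # complete linkage
      hclust(d, method = "average"),    # UPGMA
      hclust(d, method = "mcquitty"),   # WPGMA
      hclust(d, method = "ward.D2"),    # Ward's
      as.hclust(protoclust(d)),         # Protoclust (minimax linkage)
      as.hclust(agnes(d, diss = TRUE, method = "flexible",
                      par.method = 0.625)))  # flexible beta (beta = -0.25)

    # Combine the seven into one consensus hierarchy.
    ens   <- cl_ensemble(list = lapply(trees, as.cl_hierarchy))
    cons  <- cl_consensus(ens, method = "euclidean")
    htree <- as.hclust(cons)            # back to a plain hclust for plotting

    plot(ape::as.phylo(htree), type = "unrooted")  # the "bush" itself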

The elbow method is an old statistical rule of thumb. Any clustering can be sliced in multiple ways to say "these things fall into those groups, and finer splits don't really matter." The elbow method compares the possible cuts by plotting the number of clusters against the "variance explained" by that many clusters. I'll spare you the math; the picture is what matters. Variance explained always improves as you add clusters, so what you look for is a "scree" or "elbow": a point where the curve bends sharply. Beyond that bend, more clusters won't add enough additional explanation to be worth the cut. In this case, my elbow was at four clusters: the three I outlined, plus O'Malley.
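For the curious, here's a hand-rolled sketch of that elbow calculation (the GMD package has helpers for this, but the direct version shows the idea): compute the within-cluster sum of squared distances at each cut, convert to "variance explained", and look for the bend.

    # Total sum of squared pairwise distances (the one-cluster baseline).
    Dm  <- as.matrix(d)
    tss <- sum(Dm^2) / (2 * nrow(Dm))

    # Within-cluster sum of squares for a given cluster assignment,
    # computed straight from the pairwise distances.
    wss <- function(labels) {
      sum(sapply(unique(labels), function(g) {
        idx <- which(labels == g)
        sum(Dm[idx, idx]^2) / (2 * length(idx))
      }))
    }

    # Cut the consensus tree at every possible number of clusters.
    ks <- seq_len(nrow(Dm))
    explained <- sapply(ks, function(k) 1 - wss(cutree(htree, k)) / tss)

    # The curve rises and flattens; the sharp bend is the elbow.
    plot(ks, explained, type = "b",
         xlab = "Number of clusters", ylab = "Variance explained")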