Tree of Candidates, 12 January
A few months ago, I plotted out the Presidential candidates from the two major parties in terms of their truthfulness. I did this with a "tree" (more of a "bush") diagram based on the Politifact Truth-o-Meter. As I mentioned before, Politifact does provide individual summary charts for each person and a description of the various statements used to create the charts. Unfortunately, comparing profiles isn't straightforward, especially if you want to look at several of them at once. That's where nerdistry comes in.
In addition, a new candidate has formally entered the race since my last attempt. Thus, I again went to the data on each politician's page, ran it through some nerd magic, and came up with a new tree. I restricted myself to formally filed candidates who have more than 4 rulings on Politifact. I also included Barack Obama and Nancy Pelosi, for reference. I color-coded the names by party. Clicking on any name takes you to that person's Politifact page. Four "meaningful" clusters (see below for what "meaningful" means) appeared in the data and have curves drawn around them. The differences among politicians inside the same "meaningful" cluster are not worth noting. Yes, this means that, when it comes to truthfulness, as measured by Politifact, Santorum, Fiorina, Cruz, and Huckabee lump in with Pelosi. Clinton and Sanders (and Obama) are pretty much the same as Bush, Christie, Kasich, Paul, and Rubio.
As last time, the clusters roughly summarize which end of the "True" vs. "Pants on Fire" profile a candidate sits on. The top left cluster (let's call it Clinton-Bush) leans more to "True" and "Mostly True". The cluster on the bottom right (we can call it the Pelosi Cartel, just for giggles) tends to prefer "Half True" and "Mostly False". The bottom left cluster is heavily dominated by "False", with a dash of "Pants on Fire". O'Malley and Johnson (a Libertarian candidate) are in their own outlying cluster that is more "middling" between true and false. However, both these candidates have relatively few statements in their files.
And the take-home message? Two messages: First, if you agree with Politifact, it's a rough indication of who is more trustworthy. If you reject Politifact's conclusions, just invert the true/false interpretations. Second, you can see who resembles each other in terms of trustworthiness and that this hasn't changed much since August. Agree with or reject Politifact, this part is consistent. Politicians in the same cluster seem to have the same basic character as each other when it comes to honesty or its lack. Like I said, if you dislike Politifact, just flip the interpretation of truth.
This is a repeat of August's methods. I used the "R" statistical language and the "cluster", "gclus", "ape", "clue", "protoclust", "multinomialCI" and "GMD" packages. Then I gathered up the names of declared candidates for US President. I did not intend to limit this to only Republicans or Democrats. Unfortunately, when I looked people up on Politifact, it was only Republicans or Democrats who had more than 4 rulings. Why more than 4? A rough estimate of the "standard error" of count data is the square root of the total. The square root of 4 is 2, which means that if a candidate had 4 rulings, the accuracy was plus or minus 2, or half the total. Such a large wobble was too much for my taste. This time, I ended up with 17 candidates.
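To put numbers on that rule of thumb, here is a minimal sketch in Python (not the R used for the tree). The function name is made up for illustration, and it only applies the square-root estimate described above.

```python
import math

def relative_wobble(total_rulings):
    """Rough relative error of a count: sqrt(total) divided by total."""
    return math.sqrt(total_rulings) / total_rulings

# Print the total, its square-root "standard error", and the relative wobble.
for total in (4, 25, 100):
    print(total, math.sqrt(total), f"{relative_wobble(total):.0%}")
```

With 4 rulings the wobble is half the total; it takes 100 rulings to get it down to a tenth.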
Comparing them required a distance metric. I could have assigned scores to each ruling level and then calculated an average total per ruling. While this might be tempting, it is also wrong. Why is it wrong? Because that method would make a loose cannon the same as a muddled fence-sitter. Imagine a candidate who only tells the complete truth or complete whoppers. If you assign scores and average, this will come out being the same as a candidate who never commits but only makes halfway statements. Such people should show up as distinct in any valid comparison.
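A small Python sketch makes the problem concrete. The 5-to-0 scores and the two made-up candidates are assumptions for illustration only; they are not Politifact data.

```python
# Ruling levels from "True" down to "Pants on Fire".
LEVELS = ["True", "Mostly True", "Half True", "Mostly False", "False", "Pants on Fire"]
SCORES = [5, 4, 3, 2, 1, 0]  # hypothetical scoring scheme, one score per level

def average_score(counts):
    """Average score per ruling, given counts in the order of LEVELS."""
    total = sum(counts)
    return sum(score * count for score, count in zip(SCORES, counts)) / total

# Invented profiles: one candidate only at the extremes, one only in the middle.
loose_cannon = [10, 0, 0, 0, 0, 10]
fence_sitter = [0, 0, 10, 10, 0, 0]

print(average_score(loose_cannon), average_score(fence_sitter))
```

Both averages come out identical even though the two profiles share no ruling level at all, which is exactly the information a usable distance has to keep.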
Fortunately, there are other ways to handle this question. I decided to use a metric based on the chi distance. Chi distance is based on the square of the difference between two counts divided by the expected value. It's used for comparing pictures, among other uses. However, a raw chi distance depends very much upon the total, and the totals were very different among candidates. The solution to this was easy, of course. I just took the relative counts (count divided by total) for each candidate.
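As a rough sketch of that idea in Python: the formula below is the symmetric chi-squared form commonly used to compare histograms, applied to relative counts. Treat the exact form (the factor of one half, no square root) as an assumption, since the post does not spell out which variant was used. The example profiles are invented.

```python
def proportions(counts):
    """Turn raw ruling counts into relative counts (count divided by total)."""
    total = sum(counts)
    return [count / total for count in counts]

def chi_distance(counts_a, counts_b):
    """Symmetric chi-squared distance between two profiles of relative counts.

    ASSUMPTION: 0.5 * sum((p - q)^2 / (p + q)), skipping levels both leave empty.
    """
    p, q = proportions(counts_a), proportions(counts_b)
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

loose_cannon = [10, 0, 0, 0, 0, 10]    # only "True" and "Pants on Fire"
fence_sitter = [0, 0, 10, 10, 0, 0]    # only "Half True" and "Mostly False"
chatty_cannon = [100, 0, 0, 0, 0, 100] # same shape as loose_cannon, ten times the rulings

print(chi_distance(loose_cannon, fence_sitter))
print(chi_distance(loose_cannon, chatty_cannon))
```

The loose cannon and the fence-sitter now sit as far apart as this distance allows, while two candidates with the same profile shape sit at distance zero no matter how many rulings each has.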
I needed one more element for my metric. Politifact does not rate every single statement someone makes. They pick and choose. Eventually, if they get enough statements, their profiles probably present an accurate picture, but until they get a very large number of statements, there is always some uncertainty. Fortunately, multinomialCI estimates that uncertainty. I ran the counts through multinomialCI and got a set of "errors" for each candidate. I combined these with the chi distances to obtain an "uncertainty-corrected distance" between each pair of candidates. Long story short, this was done by dividing the chi distance by the square root of the sum of the squares of the errors. What that meant is that a candidate with a large error (few rulings) was automatically "closer" to every other candidate due to the uncertainty of that candidate's actual position.
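The correction can be sketched in Python under two loudly stated assumptions. First, the per-level "errors" here are plain normal-approximation standard errors, a stand-in for the intervals multinomialCI produces, not a reimplementation of it. Second, the errors of both candidates are pooled under the square root. All names and counts are hypothetical.

```python
import math

def proportions(counts):
    total = sum(counts)
    return [count / total for count in counts]

def chi_distance(counts_a, counts_b):
    # ASSUMPTION: symmetric chi-squared form on relative counts.
    p, q = proportions(counts_a), proportions(counts_b)
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def rough_errors(counts):
    # ASSUMPTION: sqrt(p * (1 - p) / n) per level, standing in for multinomialCI.
    total = sum(counts)
    return [math.sqrt(p * (1 - p) / total) for p in proportions(counts)]

def corrected_distance(counts_a, counts_b):
    """Chi distance divided by the root of the sum of squared errors."""
    errors = rough_errors(counts_a) + rough_errors(counts_b)
    combined = math.sqrt(sum(error ** 2 for error in errors))
    if combined == 0:
        raise ValueError("no uncertainty to divide by")
    return chi_distance(counts_a, counts_b) / combined

many_a = [20, 20, 20, 20, 10, 10]  # 100 invented rulings
many_b = [5, 15, 30, 30, 15, 5]    # 100 invented rulings
few_b = [1, 3, 6, 6, 3, 1]         # same shape as many_b, only 20 rulings

print(corrected_distance(many_a, many_b), corrected_distance(many_a, few_b))
```

The second distance is smaller than the first even though the profile shapes are identical, which is the "fewer rulings means closer to everyone" behavior described above.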
I then created a series of hierarchical clustering trees from this set of distances. There is a good deal of argument over which tree creation method is best. I decided to combine multiple methods. I created trees using "nearest neighbor", "complete linkage", "UPGMA", "WPGMA", "Ward's", "Protoclust", and "Flexible Beta" methods. The "clue" package was designed to combine such trees in a rational fashion. Feel free to look it up if you want to follow all the math. I used clue to create the "consensus tree", which is the structure I posted on my blog. But clue doesn't tell you how to "cut" the clusters. For that, I turned to the "elbow method".
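For a feel of how several linkage methods can be run on one distance table and then combined, here is a pure-Python sketch built on the Lance-Williams update formula. It covers nearest neighbor (single), complete linkage, UPGMA, WPGMA and flexible beta; Ward's and Protoclust are left out. The "consensus" step is only an average of the merge heights across trees, a crude stand-in that is not what the clue package does. The beta value of -0.25 and the toy distances are assumptions.

```python
from itertools import combinations

METHODS = ["single", "complete", "upgma", "wpgma", "flexible"]

def coefficients(method, size_i, size_j, flex_beta=-0.25):
    """Lance-Williams coefficients (alpha_i, alpha_j, beta, gamma)."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "upgma":
        both = size_i + size_j
        return size_i / both, size_j / both, 0.0, 0.0
    if method == "wpgma":
        return 0.5, 0.5, 0.0, 0.0
    if method == "flexible":
        return (1 - flex_beta) / 2, (1 - flex_beta) / 2, flex_beta, 0.0
    raise ValueError(method)

def cophenetic(dist, method):
    """Build one tree; return the height at which each pair of items first joins."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member items
    d = {frozenset(p): dist[p[0]][p[1]] for p in combinations(range(n), 2)}
    heights = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        pair = min((frozenset(p) for p in combinations(clusters, 2)), key=lambda p: d[p])
        i, j = sorted(pair)
        height = d[pair]
        for a in clusters[i]:
            for b in clusters[j]:
                heights[a][b] = heights[b][a] = height
        alpha_i, alpha_j, beta, gamma = coefficients(method, len(clusters[i]), len(clusters[j]))
        # Lance-Williams update: distance from every other cluster to the merged one.
        for k in clusters:
            if k in (i, j):
                continue
            d_ki, d_kj = d[frozenset((k, i))], d[frozenset((k, j))]
            d[frozenset((k, next_id))] = (alpha_i * d_ki + alpha_j * d_kj
                                          + beta * height + gamma * abs(d_ki - d_kj))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return heights

def consensus(dist, methods):
    """ASSUMPTION: average merge heights over all trees (not clue's algorithm)."""
    trees = [cophenetic(dist, method) for method in methods]
    n = len(dist)
    return [[sum(t[a][b] for t in trees) / len(trees) for b in range(n)] for a in range(n)]

# Toy distances: items 0 and 1 are close, items 2 and 3 are close.
toy_dist = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [1.0, 0.9, 0.2, 0.0],
]
for row in consensus(toy_dist, METHODS):
    print([round(value, 3) for value in row])
```

On the toy data every method pairs up the same items and only the height of the final merge differs, which is the kind of disagreement a consensus is meant to smooth over.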
The elbow method is an old statistical rule of thumb. Basically, any clustering has multiple ways you can slice it to say "these things fall into those groups and smaller groups don't really matter". The elbow method computes the "variance explained" by each possible way of cutting the clusters and plots it against the number of clusters. The math is not simple. What you look for is a "scree" or an "elbow". Plotted as variance explained, the line always rises as you add clusters; plotted as the variance left unexplained, it always descends. Either way, the idea is that you hope there is some point where there is a sharp bend in your line. That bend is the "elbow". More clusters beyond it won't add enough additional explanation to be worth the cut. In this case, my elbow was at four clusters.
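One simple way to find the bend without eyeballing a chart is to look for the point where the gain from adding one more cluster shrinks the most. The Python sketch below does that. The "variance explained" figures are invented so that the bend lands at four; they are not the values behind the tree, and picking the largest drop in gain is only one of several ways to define an elbow.

```python
def find_elbow(explained):
    """Return the number of clusters at the sharpest bend.

    explained[i] is the fraction of variance explained with i + 1 clusters.
    """
    # gains[i]: extra variance explained by going from i + 1 to i + 2 clusters.
    gains = [after - before for before, after in zip(explained, explained[1:])]
    # drops[i]: how much the gain shrinks right after reaching i + 2 clusters.
    drops = [gain - next_gain for gain, next_gain in zip(gains, gains[1:])]
    return drops.index(max(drops)) + 2

# Hypothetical values for 1 through 7 clusters.
explained = [0.00, 0.30, 0.58, 0.85, 0.88, 0.90, 0.92]
print(find_elbow(explained))
```

In this made-up series the second, third and fourth clusters each add well over a quarter of the variance, while the fifth adds almost nothing, so the cut goes at four.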