Tree of Candidates, 5 August 2015
In case you are a hermit who somehow has internet access, the USA has already started the run-up to the run-up to our Presidential election. Since our current Commander in Chief is a Democrat, that means that there are boatloads of Republicans trying to get the job. Just to keep things goofy, though, the Democrats have no "anointed one" who is the presumed favorite of the current President for the post.
Anyway, we've got an awful lot of "Vote for me because vote for me." already going on. If you don't know dink about the US political system, here's how this part works. Political parties in the USA are technically "private" organizations. Nevertheless, the two biggies (that would be Republicans and Democrats) have pre-elections, where they fight amongst themselves to get the support of their hardcore hardline members in the various states. These pre-elections then decide who will be the official candidates for each party.
That means that the campaigning starts early in the USA, with a lot of argle, bargle, wharr, and garble. There are all kinds of "ratings" out there that try to summarize each candidate on various axes. Conservative? Liberal? Environmental? Business? Pick a special interest and go with it. One that I like is Politifact. What they do is pick out statements made by public figures and rate them on a six-level scale, on the basis of how factual the statement is. They have a page devoted to the many people they've looked at.
You could, if you wanted, browse through every single page and get a rough idea of a politician's history. They even summarize things with a little chart for each person. Unfortunately, comparing these histories can be a little gnarled, especially if you want to compare several of them at once. That's where nerdistry comes in.
I am certain you have noticed the diagram on the right. It's called a "hierarchical clustering". I took the data on each politician's page, ran it through some nerd magic, and came up with the clustering. The names correspond to formally filed candidates who have more than 4 rulings on Politifact, plus Barack Obama and Nancy Pelosi. I color-coded them by party. Each name can be clicked to lead you to the person's Politifact page. The three "meaningful" clusters have curves drawn around them. The differences among politicians inside the same "meaningful" cluster are not worth noting.
What do the clusters show? To keep matters short, it comes down to what end of the "True" vs. "Pants on Fire" profile one comes out on. The cluster at the top of the figure leans more to "True" and "Mostly True". The bottom left cluster is fairly evenly distributed among all the answers, but tends to prefer "Half True" and "Mostly False". The bottom right cluster is dominated by "False", with a sprinkling of everything else from "Mostly True" to "Pants on Fire". Outright "True" is rare for them, though. O'Malley is just too far from anyone else to fit in.
What does this mean? If you agree with Politifact, it's a reflection of who is more trustworthy, but not in any fine-grained sense. If you reject Politifact's conclusions, just invert the true/false interpretations. What is important is that you can see who resembles whom in terms of trustworthiness. Whether you agree with or reject Politifact, that stays consistent. Politicians in the same cluster seem to have the same basic character as each other when it comes to honesty or its lack. Like I said, if you dislike Politifact, just flip the interpretation of "True" vs. not true.
If you are interested in how I came up with the "truth tree" and its "meaningful clusters", first I needed a copy of the "R" statistical language and the "cluster", "gclus", "ape", "clue", "protoclust", "MultinomialCI" and "GMD" packages. Then I gathered up the names of declared candidates for US President. I did not intend to limit this to only Republicans or Democrats. Unfortunately, when I looked people up on Politifact, it was only Republicans or Democrats who had more than 4 rulings. Why more than 4? A rough estimate of the "standard error" of count data is the square root of the total. The square root of 4 is 2, which means that if a candidate had 4 rulings, the accuracy was plus or minus 2, or half of the total. That's too much for my taste. This left me with 21 candidates.
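To put numbers on that square-root rule of thumb, here is a tiny Python sketch. It is only an illustration (the actual work was done in R), and the function name is made up for this example.

```python
import math

def rough_relative_error(total_rulings: int) -> float:
    """Square-root rule of thumb: error ~ sqrt(n), so relative error ~ sqrt(n) / n."""
    return math.sqrt(total_rulings) / total_rulings

# Show how the rough error shrinks, relative to the total, as rulings pile up.
for n in (4, 9, 25, 100):
    print(f"{n:>3} rulings: +/- {math.sqrt(n):.0f} ({rough_relative_error(n):.0%} of the total)")
```

With 4 rulings the rough error is half the total; it takes 100 rulings to get it down to a tenth.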
Comparing them required a distance metric. I could have assigned scores to each ruling level and then calculated an average total per ruling. While this might be tempting, it is also wrong. Why is it wrong? Because that method would make a loose cannon the same as a muddled fence-sitter. Imagine a candidate who only tells the complete truth or complete whoppers. If you assign scores and average, this will come out being the same as a candidate who never commits but only makes halfway statements. Such people should show up as distinct in any valid comparison.
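A quick Python sketch shows the problem. The 0-to-5 scores and the two candidates below are invented for illustration; they are not taken from Politifact or from the analysis itself.

```python
# Hypothetical scores for the six ruling levels, in this order:
# Pants on Fire, False, Mostly False, Half True, Mostly True, True.
SCORES = [0, 1, 2, 3, 4, 5]

def average_score(counts):
    """Average score per ruling, given counts in the same order as SCORES."""
    total = sum(counts)
    return sum(score * count for score, count in zip(SCORES, counts)) / total

# Two made-up candidates with 20 rulings each.
loose_cannon = [10, 0, 0, 0, 0, 10]   # only complete whoppers and complete truths
fence_sitter = [0, 0, 10, 10, 0, 0]   # only halfway statements

print(average_score(loose_cannon))
print(average_score(fence_sitter))
```

Both averages come out identical even though the two profiles share no ruling level at all, which is exactly why averaging is the wrong tool here.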
Fortunately, there are other ways to handle this question. I decided to use a metric based on the chi distance. Chi distance is based on the square of the difference between two counts divided by the expected value. It's used for comparing pictures, among other uses. However, a raw chi distance depends very much upon the total, and the totals were very different among candidates. The solution to this was easy, of course. I just took the relative counts (count divided by total) for each candidate.
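For the curious, here is a minimal Python sketch of that idea. It uses one common form of the chi (chi-squared) distance between two profiles, with the sum of the two relative counts standing in for the "expected value"; the exact variant used in the original R analysis may differ, and the function names are made up for this example.

```python
def relative_counts(counts):
    """Turn raw ruling counts into fractions of the candidate's total."""
    total = sum(counts)
    return [c / total for c in counts]

def chi_distance(counts_a, counts_b):
    """Chi-squared-style distance between two candidates' relative profiles.

    Ruling levels that neither candidate has are skipped to avoid dividing by zero.
    """
    p = relative_counts(counts_a)
    q = relative_counts(counts_b)
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

# Made-up profiles: the second has twice the rulings but the same proportions.
print(chi_distance([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12]))
# Made-up profiles with nothing in common.
print(chi_distance([10, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 10]))
```

Because the counts are converted to relative counts first, a candidate with 12 rulings and one with 120 can land at distance zero if their proportions match.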
I needed one more element for my metric. Politifact does not rate every single statement someone makes. They pick and choose. Eventually, if they get enough statements, their profiles probably present an accurate picture, but until they get a very large number of statements, there is always some uncertainty. Fortunately, multinomialCI is perfect for estimating that uncertainty. I ran the counts through multinomialCI and got a set of "errors" for each candidate. I could combine these with the chi distances to obtain an "uncertainty-corrected distance" between each pair of candidates. Long story short, this was done by dividing the chi distance by the square root of the sum of the squares of the errors. What that meant is that a candidate with a large error (few rulings) was automatically "closer" to every other candidate due to the uncertainty of that candidate's actual position.
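Here is a rough Python sketch of the correction. Big caveat: it does not reproduce multinomialCI. As a stand-in, it assumes a simple binomial standard error for each ruling level, and it assumes the errors of both candidates are pooled into one sum of squares. All function names and counts are invented for illustration.

```python
import math

def relative_counts(counts):
    total = sum(counts)
    return [c / total for c in counts]

def chi_distance(counts_a, counts_b):
    """One common chi-squared-style distance on relative counts (an assumption)."""
    p, q = relative_counts(counts_a), relative_counts(counts_b)
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def rough_errors(counts):
    """Stand-in for multinomialCI: binomial standard error per ruling level."""
    n = sum(counts)
    return [math.sqrt(p * (1 - p) / n) for p in relative_counts(counts)]

def corrected_distance(counts_a, counts_b):
    """Chi distance divided by the root of the summed squared errors of both candidates."""
    errors = rough_errors(counts_a) + rough_errors(counts_b)
    combined = math.sqrt(sum(e * e for e in errors))
    if combined == 0:
        raise ValueError("no uncertainty estimate: both profiles sit in a single level")
    return chi_distance(counts_a, counts_b) / combined

# Made-up candidates: same proportions, but one has ten times the rulings.
few_rulings = [1, 2, 3, 4, 6, 4]
many_rulings = [10, 20, 30, 40, 60, 40]
someone_else = [40, 60, 40, 30, 20, 10]

print(corrected_distance(few_rulings, someone_else))
print(corrected_distance(many_rulings, someone_else))
```

The candidate with few rulings ends up closer to the third candidate than the well-documented one does, even though their raw profiles are proportionally identical. That is the "large error pulls you toward everyone" effect described above.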
I then created a series of hierarchical clustering trees from this set of distances. There is a good deal of argument over which tree creation method is best. I decided to combine multiple methods. I created trees using "nearest neighbor", "complete linkage", "UPGMA", "WPGMA", "Ward's", "Protoclust", and "Flexible Beta" methods. The "clue" package was designed to combine such trees in a rational fashion. Feel free to look it up if you want to follow all the math. I used clue to create the "consensus tree", which is the structure I posted on my blog. But clue doesn't tell you how to "cut" the clusters. For that, I turned to the "elbow method".
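To see why combining methods is worth the bother, here is a small pure-Python sketch. It builds clusters with three of the linkage rules named above and then counts how many methods put each pair in the same cluster. That vote count is only a crude illustration of "consensus"; it is not the algorithm the clue package uses, and the one-dimensional toy positions are invented.

```python
from itertools import combinations

def agglomerate(dist, n_items, k, linkage):
    """Merge the two closest clusters until k remain. dist[(i, j)] needs i < j."""
    clusters = [frozenset([i]) for i in range(n_items)]

    def cluster_distance(a, b):
        pair_distances = [dist[(min(i, j), max(i, j))] for i in a for j in b]
        if linkage == "single":      # nearest neighbor
            return min(pair_distances)
        if linkage == "complete":    # farthest neighbor
            return max(pair_distances)
        if linkage == "average":     # UPGMA
            return sum(pair_distances) / len(pair_distances)
        raise ValueError(f"unknown linkage: {linkage}")

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

def co_clustering_votes(dist, n_items, k, methods=("single", "complete", "average")):
    """Count, for every pair of items, how many methods cluster them together."""
    votes = {pair: 0 for pair in combinations(range(n_items), 2)}
    for method in methods:
        for cluster in agglomerate(dist, n_items, k, method):
            for pair in combinations(sorted(cluster), 2):
                votes[pair] += 1
    return votes

# Toy data: six items placed on a line, distance = gap between positions.
positions = [0, 1, 2.2, 3.5, 5, 9]
dist = {(i, j): abs(positions[i] - positions[j])
        for i, j in combinations(range(len(positions)), 2)}

votes = co_clustering_votes(dist, len(positions), k=3)
for pair, count in sorted(votes.items()):
    print(pair, count)
```

On this toy data the methods agree about some pairs and split on others, which is the kind of disagreement a consensus tree is meant to smooth over.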
The elbow method is an old statistical rule of thumb. Basically, any clustering has multiple ways you can slice it to say "these things fall into those groups and smaller groups don't really matter". The elbow method computes the "variance explained" by each possible way of cutting the clusters and plots it against the number of clusters. The math is not simple. What you look for in the plot is a "scree" or an "elbow". The line always heads in one direction: the "variance explained" never falls as you add clusters, and the leftover variance never rises. The idea is that you hope there is some point where there is a sharp bend in your line. That bend is the "elbow". More clusters past it won't add enough additional explanation to be worth the cut. In this case, my elbow was at four clusters: the three I outlined plus extreme outlier O'Malley.
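If you would rather find the bend with code than by eyeball, here is one simple Python heuristic: pick the number of clusters after which the gain in "variance explained" shrinks the most. This is a sketch of one way to do it, not necessarily how the original plot was read. The curve is taken as variance explained, which rises as clusters are added; the same bend shows up as a falling line if you chart the leftover variance instead. The numbers are invented toy values, not results from the analysis.

```python
def find_elbow(explained):
    """Return the cluster count at the sharpest bend.

    explained[i] is the variance explained with i + 1 clusters.
    """
    # Gain from adding one more cluster: gains[i] is the step from i + 1 to i + 2 clusters.
    gains = [later - earlier for earlier, later in zip(explained, explained[1:])]
    # How much the gain shrinks right after reaching i + 2 clusters.
    bends = [gain - next_gain for gain, next_gain in zip(gains, gains[1:])]
    return bends.index(max(bends)) + 2

# Toy "variance explained" values for 1 through 7 clusters (made up).
toy_curve = [0.0, 0.35, 0.62, 0.88, 0.91, 0.93, 0.94]
print(find_elbow(toy_curve))
```

With these toy values the gains drop off sharply after the fourth cluster, so the function reports four.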