Tree of Candidate Truth, December 3, 2019
The Presidential Silly Season is upon us again, and the clown car has filled up. Before the serious pruning begins, and that will be soon (a few have already dropped out), I'm offering a comparison of the would-be and presumed candidates (plus one--I included Pence as a "Republican Reference" in addition to Trump) on the basis of their Politifact evaluation profiles. If you're not already aware, Politifact is a web site that rates the truthfulness of statements made by various public figures, news outlets, and selected other loudmouths. The general reaction most people have to Politifact is that it's reliable when it agrees with what they want to believe and a pack of lies when it doesn't. In my mind, that means it's probably pretty reliable, overall. It's possible to compare Politifact profiles one pair of candidates at a time. That's also tedious and not useful for most people. I have a way to compare multiple profiles at once and show the results to you.
Rather than make you wait, the results are to the right. This is a "hierarchical tree" with clustering. Colors of candidate names indicate which "cluster" each falls into. The lines show how close the candidates are to each other. The first thing you'll see is that Trump is pretty much by himself when it comes to his universe of "truth". What is interesting is that the next nearest candidate to Trump is Yang. The remaining candidates are in two clusters. The "Harris" cluster has been more willing to bend the truth, while the "Castro" cluster has generally avoided such shenanigans. Pence ended up in the middle of the Harris cluster. Why did I name the two clusters "Harris" and "Castro"? I'll explain in my methods section.
The take-home message: First, if you agree with Politifact, it's a rough indication of who is more trustworthy. If you reject Politifact's conclusions, just invert the true/false interpretations. Second, you can see which candidates resemble each other in terms of trustworthiness.
Methods
I gathered the profiles from Politifact of the candidates mentioned in Ballotpedia for the 2020 Presidential election. I used the "R" statistical language and the "ape", "clue", "cluster", "protoclust", "multinomialCI" and "GMD" packages. Unfortunately, when I looked people up on Politifact, several candidates had no Politifact profile or too few entries in the profile. I ended up with 13 candidates plus Mike Pence.
Comparing them required a distance metric. I could have assigned a score to each ruling level and then calculated an average score per ruling. While this might be tempting, it is also wrong. Why is it wrong? Because that method would make a loose cannon the same as a muddled fence-sitter. Imagine a candidate who only tells the complete truth or complete lies. If you assign scores and average, this candidate will come out the same as one who never commits but only makes halfway statements. Such people need to show up as distinct in any valid comparison.
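To make the problem concrete, here is a minimal Python sketch (not my original R code). The 0-to-5 scoring of Politifact's six ruling levels and both sets of counts are made up purely for illustration.

```python
# Hypothetical scores, worst to best (an assumption for illustration):
# Pants on Fire, False, Mostly False, Half True, Mostly True, True
SCORES = [0, 1, 2, 3, 4, 5]

def mean_score(counts):
    """Average score per ruling for one profile of ruling counts."""
    total = sum(counts)
    return sum(s * c for s, c in zip(SCORES, counts)) / total

# Made-up profiles, in the same worst-to-best order as SCORES.
loose_cannon = [10, 0, 0, 0, 0, 10]   # only Pants on Fire and True
fence_sitter = [0, 0, 10, 10, 0, 0]   # only Mostly False and Half True

# Both print 2.5: the average cannot tell these two apart.
print(mean_score(loose_cannon))
print(mean_score(fence_sitter))
```

The two profiles share no ruling level at all, yet the averaged score treats them as identical.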
Fortunately, there are other ways to handle this question. I decided to use a metric based on the "chi squared distance". For each ruling level, take the square of the difference between the two candidates' counts and divide it by the average of the two counts, then add up the results over all ruling levels. It's used for comparing histograms, and a profile like Politifact's is easy to represent with a histogram. However, a raw chi squared distance depends very much upon the totals, and the totals were very different among candidates. The solution to this was easy, of course. I just took the relative counts (count divided by total) for each candidate.
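As a rough illustration, the distance on relative counts could look like the following Python sketch. It is not my original R code. The function names are my own, the counts are made up, and skipping ruling levels where both candidates have zero is an assumption about how to avoid dividing by zero.

```python
def relative(counts):
    """Convert raw ruling counts to relative counts (count / total)."""
    total = sum(counts)
    return [c / total for c in counts]

def chi_sq_distance(p, q):
    """Sum over ruling levels of (difference squared) / (average of the two)."""
    d = 0.0
    for a, b in zip(p, q):
        avg = (a + b) / 2
        if avg > 0:  # assumption: skip levels that are empty for both profiles
            d += (a - b) ** 2 / avg
    return d

# Made-up profiles: a "loose cannon" and a "fence-sitter".
loose_cannon = [10, 0, 0, 0, 0, 10]
fence_sitter = [0, 0, 10, 10, 0, 0]

# Unlike an averaged score, this distance separates the two profiles.
print(chi_sq_distance(relative(loose_cannon), relative(fence_sitter)))

# Using relative counts removes the dependence on the totals:
# the same profile shape at ten times the volume is at distance zero.
print(chi_sq_distance(relative([1, 2, 3]), relative([10, 20, 30])))
```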
I needed one more element for my metric. Politifact does not rate every single statement someone makes. They pick and choose. Eventually, if they get enough statements, their profiles probably present an accurate picture, but until they get a very large number of statements, there is always some uncertainty. Fortunately, multinomialCI estimates that uncertainty. I ran the counts through multinomialCI and got a set of "errors" for each candidate. I combined these with the chi squared distances to obtain an "uncertainty-corrected distance" between each pair of candidates. Long story short, this was done by dividing the chi squared distance by the square root of the sum of the squares of the errors. What that meant is that a candidate with a large error (few rulings) was automatically "closer" to every other candidate due to the uncertainty of that candidate's actual position.
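Here is a hedged Python sketch of that correction, not my original R code. Two things in it are assumptions: the "errors" are a simple normal-approximation half-width for each ruling level, used as a stand-in for what multinomialCI produces, and the sum of squares runs over all ruling levels of both candidates. All counts are made up.

```python
import math

def relative(counts):
    total = sum(counts)
    return [c / total for c in counts]

def chi_sq_distance(p, q):
    d = 0.0
    for a, b in zip(p, q):
        avg = (a + b) / 2
        if avg > 0:
            d += (a - b) ** 2 / avg
    return d

def half_widths(counts, z=1.96):
    """Stand-in 'errors': normal-approximation half-widths per ruling level.
    This is an assumption, not the method multinomialCI implements."""
    n = sum(counts)
    return [z * math.sqrt((c / n) * (1 - c / n) / n) for c in counts]

def corrected_distance(counts_a, counts_b):
    """Chi squared distance divided by sqrt(sum of squared errors)."""
    d = chi_sq_distance(relative(counts_a), relative(counts_b))
    errs = half_widths(counts_a) + half_widths(counts_b)
    return d / math.sqrt(sum(e * e for e in errs))

# Made-up profiles: 'many' and 'few' have the same shape,
# but 'few' is built from one fifth as many rulings.
reference = [20, 20, 20, 20, 10, 10]
many = [40, 30, 10, 10, 5, 5]
few = [8, 6, 2, 2, 1, 1]

print(corrected_distance(reference, many))
print(corrected_distance(reference, few))  # smaller: more uncertainty
```

The uncorrected distance from the reference is the same for both profiles, so the difference between the two printed values comes entirely from the larger errors of the profile with fewer rulings.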
I then created a hierarchical tree using protoclust. Protoclust is a type of hierarchical clustering method that optimizes several factors. In addition, it calculates which of the nodes (candidates) in a cluster is the "prototype"--the node that most resembles the cluster as a whole. That's where the cluster names of "Harris" and "Castro" came from. They resemble their cluster as a whole and are good stand-ins. So, I had my tree, but that doesn't tell you how to "cut" the tree to produce clusters. For that, I turned to the "elbow method".
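For the curious, the idea behind prototypes can be sketched in a few lines of Python. This is my own simplified illustration of minimax linkage, the approach protoclust is built on, and not the package's actual code. The four items and their distances are made up: they are just points on a line.

```python
def radius(center, members, dist):
    """Largest distance from 'center' to any member of the cluster."""
    return max(dist[center][other] for other in members)

def prototype(members, dist):
    """The member whose largest distance to the others is smallest."""
    return min(members, key=lambda c: radius(c, members, dist))

def minimax_cluster(names, dist):
    """Repeatedly merge the two clusters whose union has the smallest
    minimax radius; record (height, prototype, members) for each merge."""
    clusters = [[n] for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                h = radius(prototype(merged, dist), merged, dist)
                if best is None or h < best[0]:
                    best = (h, i, j)
        h, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((h, prototype(merged, dist), merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Made-up example: four items placed on a line.
positions = {"A": 0.0, "B": 1.0, "C": 2.5, "D": 10.0}
names = list(positions)
dist = {a: {b: abs(positions[a] - positions[b]) for b in names} for a in names}

merges = minimax_cluster(names, dist)
for height, proto, members in merges:
    print(height, proto, sorted(members))
```

In this toy example, once A, B and C are grouped, B becomes the prototype because it sits in the middle: no other member of the group is farther than 1.5 from it.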
The elbow method is an old statistical rule of thumb. Basically, any clustering can be sliced in multiple ways to say "these things fall into those groups and smaller groups don't really matter". The elbow method calculates the "variance explained" by each possible number of clusters. The math is not trivial. I plotted the "variance explained" vs. the number of clusters to look for a "scree" or an "elbow", the point past which adding more clusters explains very little additional variance. In this case, my elbow was at four clusters.
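I found my elbow by looking at the plot, but the same idea can be approximated numerically. The Python sketch below uses one simple heuristic (pick the number of clusters after which the gain in variance explained shrinks the most). The heuristic is an assumption on my part, and the "variance explained" values are made up, not taken from my data.

```python
def elbow(explained):
    """explained[i] is the variance explained with i + 1 clusters.
    Returns the number of clusters at the sharpest bend in the curve."""
    # gains[i]: extra variance explained going from i + 1 to i + 2 clusters
    gains = [b - a for a, b in zip(explained, explained[1:])]
    # drops[i]: how much the gain shrinks once i + 2 clusters are reached
    drops = [g1 - g2 for g1, g2 in zip(gains, gains[1:])]
    best = max(range(len(drops)), key=lambda i: drops[i])
    return best + 2

# Made-up curve for 1 through 7 clusters.
explained = [0.0, 0.35, 0.62, 0.86, 0.90, 0.93, 0.95]
print(elbow(explained))
```

A heuristic like this is only a sanity check on the plot. With a smooth curve there may be no clear bend, and the eyeball test is still worth doing.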
The case of Mr. Yang
Is Yang really all that close to Trump? The truth is that Yang has few statements, the fewest in the data set. The fewer the statements, the higher the uncertainty. Yang's apparent closeness to Trump could be an illusion due to the high uncertainty in his "real" placement. That being said, what few statements he made did include a Pants on Fire and a False. Perhaps time will tell, or perhaps Mr. Yang will drop out before we hear much more from him.