November 7, 2019

State Tax Burdens, how to lie without technically lying.

Ranks. I hate ranks. People who love data hate ranks because ranks destroy knowledge. When I do my professional analytics, I only use ranks under two circumstances: 1) when my data is such a bizarre mess that I have to "go nonparametric" (that's the nuclear option for data analysts), or 2) when a client foolishly insists on ranks. There are other reasons people use ranks, and they boil down to 1) the people are clueless and want something simplistic, or 2) the people in question are dishonest and want to hide something. Whenever it's a political question, option 2 is most likely.

This appears to be the case with some data I found on WalletHub. It presents a table of state tax burdens for 2019. All well and good. They had no data methodology other than to lift the numbers from a group that calls itself "Tax Policy Center". A lot of other web pages did that, too, each presenting it their own way. I'm picking on WalletHub because 1) it comes up very early in Google searches on this topic, and 2) they made some goofy choices in presenting the data.

What were the goofy choices? First, when they made their choropleth map, they decided to use ranks instead of the actual tax burdens. They produced something that looked like this.

State Tax Burden, Colored by Rank

I will give credit where credit is due. WalletHub did use a sequential color scheme for sequential data instead of the extremely common and stupid practice of a diverging scheme. Not the specific color or range I'd have chosen, but at least conceptually okay. What is wrong with their presentation? If ranks were all they had, then there'd be nothing wrong with that. They had more than ranks. To visualize their data, they chose to eliminate information. That tells me the actual data didn't paint the picture they wanted, so they diddled and fiddled until they got something more to their taste. The proof of this pudding is the next, completely illegitimate, pseudo-analysis they pulled off. They classified states into "red" and "blue", without stating their criteria. Why should this matter? After all, aren't red states totally red, no matter what, and the same with blue? Perhaps if one is a complete moron, one might think that. The topic, in this case, is state taxes.

There are states in the USA that are currently neither red nor blue at the level of state government. Their state governments are split between the parties. Since the WalletHub goobers decided not to explain how they resolved such a situation into "red" or "blue", there is no way to understand what they did. I do suspect that they blindly and stupidly (yes, stupidly) applied the results of the 2016 presidential election, even though the states have gone through statewide elections since then.

In any case, they then used this division to "compare" the states' tax burdens by political allegiance. They did so by averaging the ranks, not the actual percent burdens. As expected, using ranks, they came up with an enormous difference: Red states had an "average rank" of 30.13 while blue states had 18.4. This is crap. How do I know it's crap? Because I repeated their analysis as best I could reconstruct it. I came up with the same "average ranks", but when I averaged actual percent scores, the "red" states had an average of 8.08% vs. "blue" states at 9.27%. Yes, still a difference, but the magnitude is far different. While the "average ranks" had a difference of about 64% (or 11.7 rank points), the proper analysis of actual percent burden showed a difference of 15% (1.2 percentage points)!
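
The two comparisons can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the four averages quoted above; the helper name `relative_gap` is made up for illustration, and expressing each gap relative to the smaller of the two averages is an assumption about how the percentages were figured.

```python
def relative_gap(a, b):
    """Return (absolute gap, gap as a fraction of the smaller value)."""
    low, high = sorted((a, b))
    gap = high - low
    return gap, gap / low

# The four averages quoted in the text above.
rank_gap, rank_rel = relative_gap(30.13, 18.4)      # "average ranks"
burden_gap, burden_rel = relative_gap(8.08, 9.27)   # average percent burdens

print(f"ranks:   {rank_gap:.2f} rank points, {rank_rel:.1%} relative gap")
print(f"burdens: {burden_gap:.2f} percentage points, {burden_rel:.1%} relative gap")
```

Same states, same grouping, and the apparent gap shrinks to roughly a quarter of its size once the real percentages are used.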

Without technically falsifying data, WalletHub invented a very large difference that didn't reflect reality.

Of course, there are better ways; otherwise, I'd not have written this. First, the overall presentation of ranks is simply dishonest in this situation. Nothing is gained by presenting this data as ranks, as the following maps show:

State Tax Burden, "Squashed" Range | State Tax Burden, Starting at Zero

The map on the left is the actual tax burden percents, scaled to the same color range used by WalletHub. The minimum value is the lowest tax burden (Alaska). The map on the right is the same data except scaled to a minimum value of 0% tax burden. You will notice a difference. If the data is "stretched", differences can still look big. If a natural bottom end is chosen, you see that most of the differences melt into the background.

So, what about the blue/red thing? When I consulted Ballotpedia, I got the distribution of state government political domination, Republican, Democrat, or Divided. Running these through the data gave me the following: Republican: 8.10% tax burden; Democrat: 9.39% tax burden; Divided: 8.43% tax burden. This isn't radically different from outcomes based on the red/blue WalletHub categorization, but it's a more complete picture. I won't bother averaging ranks. I'm not that dishonest.

October 22, 2019

Science, friend or foe? Hero or villain?

The data collected for the Wellcome Global Monitor, 2018, specifically the "Dataset and crosstabs for all countries" (linked on the Wellcome page), is a gold mine for looking at attitudes regarding science, medicine, and some associated questions internationally. I wanted to make a snapshot of how science was viewed, country-by-country. The Wellcome survey had two direct questions on this topic: "In general, do you think the work that scientists do benefits people like you in this country?" and "In general, do you think the work that scientists do benefits most, some, or very few people in this country?" Basically, they boil down to "Is science your friend?" and "Is science a hero?" Okay, not "hero", but let's look at it that way for the moment. A "hero" would do something that, in the end, would benefit everyone, while a "villain" only cares about his own elite, screw everyone else. Overly simplistic, but dramatic. How could I calculate this? The first question was straightforward: Subtract the percent answering "No" from that answering "Yes". I ignored "don't know". That provides the net evaluation for the country. The second had "Most", "Some", and "Very Few", in addition to "don't know". I ignored "Some" and "don't know" as "indecision". I could have lumped "Some" into either "Very Few" or "Most", but I had no basis to do either. So, I subtracted "Very Few" from "Most". I then used the resulting numbers to create the following choropleth maps:

Is science good for me?
Is science good for everyone?
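
The two net scores behind these maps are simple subtractions. Here is a minimal sketch of that calculation; the function name `net_score`, the dictionary keys, and the response percentages are all hypothetical and are not taken from the Wellcome dataset.

```python
def net_score(positive_pct, negative_pct):
    """Net evaluation: percent giving the positive answer minus percent giving the negative one."""
    return positive_pct - negative_pct

# Made-up response shares for one imaginary country (percent of respondents).
personal = {"yes": 70.0, "no": 20.0, "dont_know": 10.0}
social = {"most": 35.0, "some": 40.0, "very_few": 20.0, "dont_know": 5.0}

# "Don't know" is ignored for both questions; "Some" is also ignored for the second.
friend = net_score(personal["yes"], personal["no"])
hero = net_score(social["most"], social["very_few"])

print(f"Is science good for me?       net {friend:+.1f}")
print(f"Is science good for everyone? net {hero:+.1f}")
```

Both scores run from -100 to +100 with a meaningful zero, which is why they suit a map with a natural center.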

As you can see, in most countries, the survey participants tended to see science as something they personally benefit from. What is interesting is that, in nearly all countries, respondents were less certain that science was of overall benefit to other people. This is most dramatic in Latin America and Africa, where there is almost a consensus that, even if science is of personal benefit, those benefits do not extend to most people. Also interesting: a regression of attitudes regarding science on a personal vs. social level estimates a slope of 0.806 with an R² of 0.771, which suggests that the two attitudes reflect each other across countries, even if one might be net "positive" and the other net "negative".
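
For readers who want to see how a slope and R² like these come about, here is a minimal ordinary-least-squares sketch in plain Python. Treating the personal net score as x and the social net score as y is an assumption (the text does not say which way the regression ran), the function name `simple_ols` is made up, and the five data points are invented, so the output will not reproduce 0.806 or 0.771.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation, valid for one predictor
    return slope, intercept, r_squared

# Invented net scores for five imaginary countries.
personal_net = [60.0, 45.0, 30.0, 75.0, 50.0]
social_net = [30.0, 10.0, -5.0, 40.0, 20.0]

slope, intercept, r2 = simple_ols(personal_net, social_net)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

A negative intercept combined with a slope near 1 is what "one positive, the other negative, but moving together" looks like in numbers.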

Vaccine Attitudes Around the World

I was inspired to write this page by a post on Reddit's "Data is Beautiful" subreddit. The post in question was inspired by data collected for the Wellcome Global Monitor, 2018, specifically from the "Dataset and crosstabs for all countries" (linked on the Wellcome page). The original post on Reddit was a workmanlike visualization of part of a large and unwieldy dataset, and it deserves to be appreciated for that. However, it was not without flaw. Ordinarily, I would just have let it pass, but the three mistakes the author made are so tragically common that I decided to write a bit about them.

First, the maps in that visualization were coded on a scale where the lowest point was 50% and the highest 100%. This is an extremely common mistake made by neophytes in data visualization. In some fields, it's considered an unethical way to manipulate data. I highly doubt that the other visualization's author had any unethical desires. Instead, he succumbed to an extremely common failure of neophyte visualizers, the need to make things "look good". What's wrong with that? Suppose I have a data set that measures frequency of crime in several cities. One is at 52 incidents per 1000 people, one is at 54 incidents per 1000 people, the third is at 56 incidents per 1000 people.

Not a big difference? Not if I set my lowest value to zero, the actual theoretical minimum, but what if I set my visualization so that the minimum value I show is 51 incidents? All of a sudden a difference of 4 incidents per 1000 can be made to look gigantic. The 56-incident town can be represented by a bar that is five times as large as the 52-incident town's!
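
The effect of moving the baseline is easy to work out. This minimal sketch uses the three crime rates from the example; the function name `bar_length` is made up for illustration.

```python
def bar_length(value, baseline):
    """Visible length of a bar drawn from `baseline` up to `value`."""
    return value - baseline

rates = [52, 54, 56]  # incidents per 1000 people, from the example above

for baseline in (0, 51):
    lengths = [bar_length(r, baseline) for r in rates]
    ratio = lengths[-1] / lengths[0]  # tallest bar relative to shortest
    print(f"baseline {baseline:>2}: bar lengths {lengths}, tallest/shortest = {ratio:.2f}")
```

With the baseline at zero the tallest bar is only about 8% longer than the shortest; with the baseline at 51 it is five times as long, even though the data has not changed.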

On a percent-based visualization, the natural boundaries are 0 and 100. Any visualization based on percents should be based on that scale unless the author has an extremely good reason to deviate from that practice. "It doesn't look nice." is not an extremely good reason. "My boss wants me to." might be extremely good from a keeping-your-job perspective, but it also means that your boss really doesn't give a damn about the truth. There needs to be a sound theoretical basis for moving the goalposts of your visualization away from any natural locations.

The next mistake the author of the original visualization makes is also extremely common. It even occurs in visualizations made by "professionals". This is the use of a diverging color scheme for sequential data. WTF does that mean? Among the many kinds of data are "diverging" and "sequential". Sequential data follows a single-direction sequence, such as 0% to 100%. Diverging data diverges from some sort of meaningful center point, such as -50% to 50% (centered on 0%). A sequential color scheme follows a sequence of (usually inverse) brightness, but sometimes hue and saturation can be worked into it. What does that mean? This is what that means. There are other color systems in use, but the essence is that dark+intense means more (usually), and more means more.

With a diverging scheme, dark+intense can mean more and it can mean less, and more vs. less depends on the hue. So what? So what is that our brains "get" things differently depending on whether they are presented as sequential or diverging color schemes. We are tuned to look at the "middle" of the diverging scheme as a "natural middle", where the middle value has specific intrinsic meaning. The older visualization violates this necessary principle. Instead, 75% is the "middle", with no intrinsic meaning at all. It merely happens to be the numeric mid-point between 50% and 100%. It wasn't chosen, it just happened to fit a simple method.

The third major mistake in the presentation was the choice of colors, in and of themselves. Red vs. green may be the most popular choice of contrasting colors on the Web. It's also the worst possible choice of colors. The most common forms of colorblindness involve red and green. Two "distinct" color spots can be indistinguishable if the "difference" relies on distinguishing between red and green. To understand, you will probably need to look at some simulations of the effects of different types of colorblindness on the red-green scheme.

So, can I do better? Yes, I can, and he could have, too, had he had better information and more understanding of the neurology and psychology of perception. Fortunately, a lot of that has been distilled into an extremely useful document: Colour Schemes, by Paul Tol. A lot of research went into this document, which presents useful color schemes for qualitative, sequential, and diverging data. Anyone who is serious about using color in data presentation needs to know this document very well. Anyone who knows of and ignores it doesn't give a damn about effective color use for data presentation.

How did I do the same data better? Two ways. First, I applied an actual sequential color scheme to the sequential data. Second, I "reconfigured" the sequential data so that a diverging color scheme could be validly applied, and I applied a proper diverging scheme instead of the horribly mis-designed and overly common red/green scheme. My results are below:

Vaccines are important for children to have. (Sequential) | Vaccines are important for children to have. (Diverging from average)
Vaccines are safe. (Sequential) | Vaccines are safe. (Diverging from average)
Vaccines are effective. (Sequential) | Vaccines are effective. (Diverging from average)

What did I do? In the left column, I used a color scheme where "darker = more". The darker the color, the higher the percent of people who somewhat or strongly agreed with the statement. The hues also change along with darkness, but it's the darkness that actually imparts the message. The hue provides a bit of aesthetic enhancement to draw the eye. I could have chosen a monochromatic scheme or even a grayscale to get a similar effect. Indeed, if you convert the left column to grayscale, you'll see the same results.
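
One way to test whether any palette survives a grayscale conversion is to check that its relative luminance falls steadily from one end to the other. The sketch below uses the standard sRGB relative-luminance formula; the function names are made up, and the four hex colors are illustrative stand-ins, not the palette actually used for the maps.

```python
def relative_luminance(hex_color):
    """Relative luminance (0 = black, 1 = white) of an sRGB color such as '#D95F0E'."""
    h = hex_color.lstrip("#")
    channels = [int(h[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma curve before weighting the channels.
    linear = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def darkens_steadily(colors):
    """True if each color is strictly darker than the one before it."""
    lum = [relative_luminance(c) for c in colors]
    return all(a > b for a, b in zip(lum, lum[1:]))

# Illustrative light-to-dark ramp (hypothetical, not the maps' actual colors).
ramp = ["#FFFFE5", "#FEC44F", "#D95F0E", "#662506"]
print([round(relative_luminance(c), 3) for c in ramp])
print("sequential in grayscale:", darkens_steadily(ramp))
```

If a palette fails this check, the "darker = more" message depends on hue alone and falls apart in grayscale.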

In the right column, I created a "natural center", specifically the average of all the countries' scores from the left column of each map. I then subtracted this from an individual country's score. I was naughty when I presented this column, because I arbitrarily chose my maximum cutoff at -50% to +50% instead of -100% to +100%. Did anyone catch that before I mentioned it?
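
The recentering step can be sketched as follows. The function names, the three country labels, and their scores are hypothetical; clamping anything beyond the cutoff to the end of the scale is also an assumption, since the text only says where the cutoff was placed.

```python
def deviations_from_average(scores):
    """Subtract the average of all countries' scores from each country's score."""
    mean = sum(scores.values()) / len(scores)
    return {country: value - mean for country, value in scores.items()}

def diverging_position(deviation, limit=50.0):
    """Map a deviation to 0..1 on a diverging scale; 0.5 is the natural center."""
    clamped = max(-limit, min(limit, deviation))  # assumption: out-of-range values are clamped
    return (clamped + limit) / (2 * limit)

# Made-up percent-agree scores for three imaginary countries.
agree = {"A": 95.0, "B": 80.0, "C": 50.0}

for country, dev in deviations_from_average(agree).items():
    print(f"{country}: deviation {dev:+.1f}, scale position {diverging_position(dev):.2f}")
```

Calling `diverging_position(dev, limit=100.0)` instead shows how much paler the same map becomes on the full -100% to +100% range.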

For the next two maps, I created my own data, which I call "WTF, People?". This is the "Vaccines are important for children to have." percent minus the lower of "Vaccines are safe." or "Vaccines are effective." It represents the percent of people who think that vaccines aren't safe or effective but still think it's important for children to have them. In other words, "WTF, People?" What kind of culture does one have where you think it's important to give children vaccines that you think aren't safe or effective? This score in some ways could imply a lot about the culture of the country in question, or simply show how people in that country really do not think things through.
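
The score described above comes down to one line of arithmetic. A minimal sketch, with a made-up function name and made-up percentages for one imaginary country:

```python
def wtf_score(important, safe, effective):
    """'Important for children' percent minus the lower of the 'safe' and 'effective' percents."""
    return important - min(safe, effective)

# Hypothetical percent-agree values, not taken from the Wellcome dataset.
score = wtf_score(important=90.0, safe=60.0, effective=75.0)
print(f"WTF, People? score: {score:.1f}")
```

Here "safe" is the lower of the two, so the score is 90 minus 60.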

WTF, People? WTF, People?

Did you notice what I did? Take a look at the legends. Take a look at the numbers vs. the colors. I flipped the scale! Why did I do that? Remember how I wrote earlier that, for sequential schemes, darker is (usually) more? This is one situation where more might not be best represented by darker. Why would I think such a thing? Because I have already primed you, the audience, to also think that darker is better. We generally presume that vaccines are important, safe, and effective; thus, such statements would generally be seen as good things to agree to. Therefore, greater agreement is more desirable. However, "WTF, People?" isn't a desirable trait. A country where people can just go along with something even if it violates their own beliefs probably has multiple deeper problems. So, I represented "WTF, People?" on inverted scales, to illustrate this bias of mine. Yes, it's a bias. So is the idea that vaccines are important, safe, or effective. A bias can be true.

Anyway, I wrote this mostly to illustrate some important principles of using color to convey information, in contrast to the extremely common gross violation of those principles. At the very least, please read the Colour Schemes technical note.