November 18, 2019

When to "adjust" and when not to "adjust".

This is a cautionary tale, specifically about a "standard" principle that should not be standard and can actually be misleading. It's often stated that you have to adjust event counts by population sizes, or you will get distorted impressions. For example, if you compare the number of house fires in two regions, you need to adjust by the number of houses in each region to give a realistic impression. You don't, not if what you explicitly want to show is the actual number of events. Of course, it would be dishonest to use those raw numbers to infer risk. Risk requires local background adjustments, but you don't always want to show a risk-like outcome.

For example, if we look at migration data from the United Nations (limited to the most recent year in that dataset, the 2019 report, which reports for 2018) and adjust the net migration (immigration minus emigration) by population, we get the following outcomes, expressed as percent of population:

Net Migration per Population

Time for a quiz. Which countries gained the greatest number of people? How about this: did the USA have more or fewer people immigrating (minus those emigrating) than Qatar did? You can't tell from this map or from any of the data included with it; there is no way to answer either of those questions from it. If you want to answer those questions, which are about numbers, not proportions or percentages, you have to look at the unadjusted numbers.

Net Migration, people

Now it's easier to answer questions of "how much", "how many", and "most". Even though the USA is near the middle of the pack when it comes to per-population migration outcomes, it is the undisputed top net recipient of migration. Likewise, while China and India look like they have very little net migration effect (which is true on a per-population basis), they are among the countries losing the greatest numbers of people. They simply have the population reserve to absorb it, so the effect looks smaller. Long story short: make adjustments that are sensible for the question being asked. Both questions here are sensible, but they are different questions and can't be answered by one map.
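As a minimal sketch of the two calculations (with made-up, purely illustrative numbers, not the actual UN figures), the adjusted and unadjusted views come from the same inputs but crown different "winners":

```python
# Illustrative only: placeholder figures, not the UN 2019 migration data.
countries = {
    # name: (net_migration_people, population)
    "Big Country":   (1_000_000, 330_000_000),
    "Small Country": (120_000,   2_700_000),
}

for name, (net, pop) in countries.items():
    pct = 100.0 * net / pop   # the "adjusted" view: percent of population
    print(f"{name}: net = {net:+,} people, net/pop = {pct:+.2f}%")

# Big Country wins on raw people gained; Small Country wins on percent of
# population. Same data, two different questions.
```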

November 7, 2019

State Tax Burdens, how to lie without technically lying.

Ranks. I hate ranks. People who love data hate ranks because ranks destroy knowledge. When I do my professional analytics, I only use ranks under two circumstances: 1) when my data is such a bizarre mess that I have to "go nonparametric" (that's the nuclear option for data analysts), or 2) when a client foolishly insists on ranks. There are other reasons people use ranks; they boil down to 1) the people are clueless and want something simplistic, or 2) the people in question are dishonest and want to hide something. Whenever it's a political question, option 2 is most likely.
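Why do ranks destroy knowledge? A tiny sketch with made-up numbers shows how they keep the ordering but throw away every magnitude:

```python
# Made-up values: ranks preserve order but discard magnitude.
burdens = [5.0, 5.1, 5.2, 12.9]                          # hypothetical percentages
ranks = [sorted(burdens).index(b) + 1 for b in burdens]  # 1 = lowest value

print(list(zip(burdens, ranks)))
# [(5.0, 1), (5.1, 2), (5.2, 3), (12.9, 4)]
# The huge jump from 5.2% to 12.9% and the tiny jump from 5.0% to 5.1%
# both collapse into a single rank step.
```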

That second motive appears to be at work in some data I found on WalletHub, which presents a table of state tax burdens for 2019. All well and good. They had no data methodology other than to lift the numbers from a group that calls itself the "Tax Policy Center". A lot of other web pages did that, too, each presenting the data their own way. I'm picking on WalletHub because 1) it comes up very early in Google searches on this topic, and 2) it made some goofy choices in presenting the data.

What were the goofy choices? First, when they made their choropleth map, they decided to use ranks instead of the actual tax burdens. They produced something that looked like this.

State Tax Burden, Colored by Rank

I will give credit where credit is due. WalletHub did use a sequential color scheme instead of the extremely common and stupid practice of using a diverging scheme. Not the specific color or range I'd have chosen, but at least conceptually okay. So what is wrong with their presentation? If ranks were all they had, there'd be nothing wrong with it. But they had more than ranks: to visualize their data, they chose to eliminate information. That tells me the actual data didn't paint the picture they wanted, so they diddled and fiddled until they got something more to their taste. The proof of this pudding is the next, completely illegitimate, pseudo-analysis they pulled off. They classified states into "red" and "blue" without stating their criteria. Why should this matter? After all, aren't red states totally red no matter what, and the same with blue? Perhaps if one is a complete moron, one might think that. The topic in this case is state taxes.

There are states in the USA that are currently neither red nor blue at the level of state government; their state governments are split between the parties. Since the WalletHub goobers decided not to explain how they resolved such a situation into "red" or "blue", there is no way to understand what they did. I do suspect that they blindly and stupidly (yes, stupidly) applied the results of the 2016 presidential election, even though the states have gone through statewide elections since then.

In any case, they then used this division to "compare" the states' tax burdens by political allegiance. They did so by averaging the ranks, not the actual percent burdens. As expected, using ranks, they came up with an enormous difference: red states had an "average rank" of 30.13 while blue states had 18.4. This is crap. How do I know it's crap? Because I repeated their analysis as best I could reconstruct it. I came up with the same "average ranks", but when I averaged the actual percent burdens, the "red" states had an average of 8.08% vs. the "blue" states at 9.27%. Yes, still a difference, but the magnitude is far different. While the "average ranks" differed by 63% (11.7 rank points), the proper analysis of actual percent burden showed a difference of 15% (1.2 percentage points)!
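Here's a sketch, with made-up burden values rather than the real state data, of how averaging ranks can inflate a modest difference in the underlying percentages:

```python
# Hypothetical burdens for two small groups of "states"; not the WalletHub data.
red  = [7.9, 8.0, 8.1, 8.2]
blue = [9.1, 9.2, 9.3, 9.4]

pooled = sorted(red + blue)
rank = {v: i + 1 for i, v in enumerate(pooled)}     # rank across all "states"

def mean(xs):
    return sum(xs) / len(xs)

print(mean(red), mean(blue))                        # 8.05 vs 9.25  (~15% apart)
print(mean([rank[v] for v in red]),
      mean([rank[v] for v in blue]))                # 2.5 vs 6.5    (160% apart)
# The underlying burdens differ modestly; the "average ranks" make the gap
# look enormous because ranks space clustered values evenly apart.
```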

Without technically falsifying data, WalletHub invented a very large difference that didn't reflect reality.

Of course, there are better ways; otherwise, I'd not have written this. First, the overall presentation of ranks is simply dishonest in this situation. Nothing is gained by presenting this data as ranks, as the following maps show:

State Tax Burden, "Squashed" Range State Tax Burden, Starting at Zero

The map on the left shows the actual tax burden percentages, scaled to the same color range WalletHub used, with the minimum value set at the lowest tax burden (Alaska). The map on the right is the same data scaled to a minimum value of 0% tax burden. You will notice a difference. If the data is "stretched", differences can still look big; if a natural bottom end is chosen, you see that most of the differences melt into the background.

So, what about the blue/red thing? When I consulted Ballotpedia, I got the distribution of state government political control: Republican, Democrat, or divided. Running these through the data gave me the following: Republican: 8.10% tax burden; Democrat: 9.39% tax burden; divided: 8.43% tax burden. This isn't radically different from the outcomes based on WalletHub's red/blue categorization, but it's a more complete picture. I won't bother averaging ranks. I'm not that dishonest.
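Going back to the two maps, the only difference between them is where the color scale is anchored. A minimal sketch of that scaling choice, using matplotlib's Normalize and made-up burden values:

```python
# Made-up burden percentages; only the color-scale anchoring differs.
from matplotlib.colors import Normalize

burdens = [5.1, 6.4, 7.2, 8.1, 8.9, 9.6, 12.8]

squashed  = Normalize(vmin=min(burdens), vmax=max(burdens))  # "squashed" range
from_zero = Normalize(vmin=0.0,          vmax=max(burdens))  # anchored at 0%

for b in burdens:
    # Position along the color ramp (0 = lightest, 1 = darkest) under each scaling.
    print(f"{b:5.1f}%  squashed: {float(squashed(b)):.2f}  from zero: {float(from_zero(b)):.2f}")
# With 0% as the floor, most values land in a narrow band of similar colors,
# and the apparent drama of the "squashed" map largely disappears.
```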