October 22, 2019

Vaccine Attitudes Around the World

I was inspired to write this page by a post on Reddit's "Data is Beautiful" subreddit. The post in question was based on data collected for the Wellcome Global Monitor, 2018, specifically from the "Dataset and crosstabs for all countries" (linked on the Wellcome page). The original post on Reddit was a workmanlike visualization of part of a large and unwieldy dataset, and it deserves to be appreciated for that. However, it was not without flaw. Ordinarily, I would just have let it pass, but the three mistakes the author made are so tragically common that I decided to write a bit about them.

First, the maps in that visualization were coded on a scale where the lowest point was 50% and the highest 100%. This is an extremely common mistake made by neophytes in data visualization. In some fields, it's considered an unethical way to manipulate data. I highly doubt that the other visualization's author had any unethical intent. Instead, he succumbed to an extremely common failure of neophyte visualizers, the need to make things "look good". What's wrong with that? Suppose I have a data set that measures frequency of crime in several cities. One is at 52 incidents per 1000 people, one is at 54 incidents per 1000 people, and the third is at 56 incidents per 1000 people.

Not a big difference? Not if I set my lowest value to zero, the actual theoretical minimum, but what if I set my visualization so that the minimum value I show is 51 incidents? All of a sudden a difference of 4 incidents per 1000 can be made to look gigantic. The 56-incident town's bar rises 5 units above that baseline and the 52-incident town's bar only 1, so the first bar is five times as large as the second!
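To put numbers on that, here is a minimal sketch of the arithmetic using the three crime rates from the example. The function name bar_length is my own invention for illustration; it just measures how long a bar is drawn when the axis starts at a given baseline.

```python
def bar_length(value, baseline):
    """Drawn length of a bar when the axis starts at `baseline`."""
    return value - baseline

# Incidents per 1000 people, from the three-city example.
rates = [52, 54, 56]

for baseline in (0, 51):
    low = bar_length(rates[0], baseline)
    high = bar_length(rates[-1], baseline)
    # Ratio of the tallest bar to the shortest bar at this baseline.
    print(f"baseline {baseline}: highest bar is {high / low:.2f}x the lowest")
```

With the baseline at 0, the tallest bar is about 1.08 times the shortest. With the baseline at 51, it is 5 times the shortest, even though the data have not changed.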

On a percent-based visualization, the natural boundaries are 0 and 100. Any visualization based on percents should be based on that scale unless the author has an extremely good reason to deviate from that practice. "It doesn't look nice" is not an extremely good reason. "My boss wants me to" might be extremely good from a keeping-your-job perspective, but it also means that your boss really doesn't give a damn about the truth. There needs to be a sound theoretical basis for moving the goalposts of your visualization away from any natural locations.

The next mistake the author of the original visualization makes is also extremely common. It even occurs in visualizations made by "professionals". This is the use of a diverging color scheme for sequential data. WTF does that mean? Among the many kinds of data are "diverging" and "sequential". Sequential data follows a single-direction sequence, such as 0% to 100%. Diverging data diverges from some sort of meaningful center point, such as -50% to 50% (centered on 0%). A sequential color scheme follows a sequence of (usually inverse) brightness, but sometimes hue and saturation can be worked into it. What does that mean? This is what that means. There are other color systems in use, but the essence is that dark+intense means more (usually), and more means more.

With a diverging scheme, dark+intense can mean more and it can mean less, and more vs. less depends on the hue. So what? So what is that our brains "get" things differently depending on whether they are presented as sequential or diverging color schemes. We are tuned to look at the "middle" of the diverging scheme as a "natural middle", where the middle value has specific intrinsic meaning. The older visualization violates this necessary principle. Instead, 75% is the "middle", with no intrinsic meaning at all. It merely happens to be the numeric mid-point between 50% and 100%. It wasn't chosen, it just happened to fit a simple method.

The third major mistake in the presentation was the choice of colors, in and of themselves. Red vs. green may be the most popular choice of contrasting colors on the Web. It's also the worst possible choice of colors. The most common forms of colorblindness involve red and green. Two "distinct" color spots can be indistinguishable if the "difference" relies on distinguishing between red and green. To understand, you will probably need to look at some simulations of the effects of different types of colorblindness on the red-green scheme.

So, can I do better? Yes, I can, and he could have, too, had he had better information and more understanding of the neurology and psychology of perception. Fortunately, a lot of that has been distilled into an extremely useful document: Colour Schemes, by Paul Tol. A lot of research went into this document, which presents useful color schemes for qualitative, sequential, and diverging data. Anyone who is serious about using color in data presentation needs to know this document very well. Anyone who knows of and ignores it doesn't give a damn about effective color use for data presentation.

How did I do the same data better? Two ways. First, I applied an actual sequential color scheme to the sequential data. Second, I "reconfigured" the sequential data so that a diverging color scheme could be validly applied, and I applied a proper diverging scheme instead of the horribly mis-designed and overly common red/green scheme. My results are below:

[Maps, one row per statement, each shown as "Sequential" (left column) and "Diverging from average" (right column): "Vaccines are important for children to have."; "Vaccines are safe."; "Vaccines are effective."]

What did I do? In the left column, I used a color scheme where "darker = more". The darker the color, the higher the percent of people who somewhat or strongly agreed with the statement. The hues also change along with darkness, but it's the darkness that actually imparts the message. The hue provides a bit of aesthetic enhancement to draw the eye. I could have chosen a monochromatic scheme or even a grayscale to get a similar effect. Indeed, if you convert the left column to grayscale, you'll see the same results.
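If you want to test the grayscale claim on a palette of your own, here is a rough sketch. It assumes the Rec. 601 luma weights (0.299, 0.587, 0.114) as a stand-in for perceived brightness, and the hex values are placeholders I picked for illustration. They are not the palette used in these maps, and they are not copied from Tol's document. The function names are my own.

```python
def luma(hex_color):
    """Approximate brightness (Rec. 601 luma, 0-255) of a '#RRGGBB' color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b

def darkens_steadily(colors):
    """True if every color is darker than the one before it."""
    lumas = [luma(c) for c in colors]
    return all(a > b for a, b in zip(lumas, lumas[1:]))

# Placeholder palettes (assumptions, for illustration only).
sequential_ramp = ["#FFF7BC", "#FEC44F", "#D95F0E", "#993404"]
red_to_green = ["#D73027", "#FFFFBF", "#1A9850"]

print(darkens_steadily(sequential_ramp))  # light-to-dark ramp
print(darkens_steadily(red_to_green))     # red, pale middle, green
```

The light-to-dark ramp passes the check, so it would survive a grayscale conversion. The red-to-green palette fails it, because its two ends are about equally dark and only the hue tells them apart.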

In the right column, I created a "natural center", specifically the average of all the countries' scores from the left column of each map. I then subtracted this from each individual country's score. I was naughty when I presented this column, because I arbitrarily set my cutoffs at -50% and +50% instead of -100% and +100%. Did anyone catch that before I mentioned it?
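The recentering step can be sketched in a few lines. This is only an illustration: the country names and scores are made up, the function name diverge_from_mean is hypothetical, the average is a plain unweighted mean across countries, and clipping values to the cutoff is my assumption about how the ±50 limit would be enforced.

```python
def diverge_from_mean(scores, limit=50.0):
    """Recenter percent scores on their unweighted mean, clipped to +/- limit."""
    mean = sum(scores.values()) / len(scores)
    return {
        country: max(-limit, min(limit, score - mean))
        for country, score in scores.items()
    }

# Made-up scores (percent agreeing), not taken from the Wellcome data.
example_scores = {"Country A": 90.0, "Country B": 70.0, "Country C": 50.0}

recentered = diverge_from_mean(example_scores)
print(recentered)
```

Here the mean is 70, so the three countries land at +20, 0, and -20. Zero now has a real meaning ("exactly average"), which is what makes a diverging color scheme legitimate for this column.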

For the next two maps, I created my own data, which I call "WTF, People?". This is the "Vaccines are important for children to have." percent minus the lower of "Vaccines are safe." or "Vaccines are effective." It roughly represents the percent of people who think that vaccines aren't safe or effective but still think it's important for children to have them. In other words, "WTF, People?" What kind of culture does one have where you think it's important to give children vaccines that you think aren't safe or effective? This score in some ways could imply a lot about the culture of the country in question, or simply show how people in that country really do not think things through.
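As a formula, the score is simple enough to write out. The function name wtf_score and the sample percentages are made up for illustration; only the arithmetic comes from the description of the score.

```python
def wtf_score(important, safe, effective):
    """'Important' percent minus the lower of the 'safe' and 'effective' percents."""
    return important - min(safe, effective)

# Made-up percentages for one hypothetical country.
score = wtf_score(important=90.0, safe=70.0, effective=80.0)
print(score)
```

For this made-up country, 90% say vaccines are important but only 70% say they are safe, so the score is 20 percentage points.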

[Two maps, both titled "WTF, People?"]

Did you notice what I did? Take a look at the legends. Take a look at the numbers vs. the colors. I flipped the scale! Why did I do that? Remember how I wrote earlier that, for sequential schemes, darker is (usually) more? This is one situation where more might not be best represented by darker. Why would I think such a thing? Because I have already primed you, the audience, to also think that darker is better. We generally presume that vaccines are important, safe, and effective; thus, such statements would generally be seen as good things to agree to. Therefore, greater agreement is more desirable. However, "WTF, People?" isn't a desirable trait. A country where people can just go along with something even if it violates their own beliefs probably has multiple deeper problems. So, I represented "WTF, People?" on inverted scales, to illustrate this bias of mine. Yes, it's a bias. So is the idea that vaccines are important, safe, or effective. A bias can be true.

Anyway, I wrote this mostly to illustrate some important principles of using color to convey information, in contrast to the extremely common gross violation of those principles. At the very least, please read the Colour Schemes technical note.
