March 3, 2017

When Presentation Fights Against Information

I recently came across yet another choropleth (that's "color coded" for those of you who don't speak geek) map, this time from the League of Conservation Voters (LCV). It was part of an evaluation of the 2016 Congressional voting records on environmental issues. It presented both the House and the Senate. Since I'm talking about it on my grouchy blog, you have already guessed I have an argument with it. I'm not going to argue about whether or not their basic premise is true. A given congresscritter did vote for or against a given bill, and we can all look that up for ourselves. My quibble is with their presentation of the data. It falls short in two ways. First, they made some choices that were a bit misleading. Second, their presentation falls short in accessibility and ease of comprehension. The map on the left, below, takes their data for the House of Representatives and shows it in approximately the same colors that the League used. (If you mouseover the map, you'll get more information by state.)

[Maps. Left: House Votes by State, Colored by Fifths. Right: House Votes by State, Colored by Percent.]

The first map is an approximation of the LCV's own map. The LCV chose to arbitrarily divide the state average scores into even fifths (aka "quintiles"). This was not necessary. The second map shows what happens when, instead of arbitrarily cutting the data into fifths, you represent the actual percentage scores on the same color scale. If you notice that the two maps do resemble each other, you are not hallucinating. The point of my comparison is not that a direct representation of scale would give a drastically different result; the point is that arbitrary categorization oversimplified the results and eliminated useful nuances.

Compare the two maps to each other. In the first, everything is rigidly defined and perfectly clear-cut. The "green" states are all nicely and strictly "green" (in their particular shades), while the "red" states are nicely and strictly "red", with no variation in their "redness". Finally, the "orange" states are all staunchly in the middle. The reality, if you look at the second map, is that none of this is true. Let's look at Washington, Maine, and Illinois. According to the first map, Washington and Maine have no meaningful difference from Illinois, and Illinois and Wisconsin are the same as each other. Why? Because the entirely arbitrary cutoff used by LCV put them into different categories. In the second map, Illinois, Washington, and Maine actually are all closer to each other than Illinois is to Wisconsin or than Washington and Maine are to California. That's how the actual percentage ratings come out. The first map misrepresents these differences and similarities.

Let's look at Wisconsin vs. Michigan. In the first map, Wisconsin is a middling "orange" state while Michigan is in a "red" category. In the second map, they are very difficult to tell apart. I will not go so far as to claim that the intent of LCV is to mislead people, but the unfortunate outcome of their poor data presentation choices is misleading. The worst part is that it is immediately obvious that there was no need at all to use arbitrary categories. The data they wish to present is easily comprehended on the second map. The simplification into fifths managed to be both unnecessary and misleading. Let's put it another way: is 19% (Louisiana) closer to 24% (Indiana) or to 0% (South Dakota)? The LCV method portrays 19% as not being meaningfully different from 0% but as an entirely different category than 24%. Does that even make sense?
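To make that arithmetic concrete, here is a minimal Python sketch of equal-width binning. It assumes the cut-offs sit at 20, 40, 60, and 80 percent, which is only my reading of "even fifths"; the function name `fifth_of` is made up for illustration and is not anything the LCV published.

```python
def fifth_of(score_pct):
    """Return which fifth (1 = lowest, 5 = highest) a 0-100 score falls into.

    Assumes equal-width bins with cut-offs at 20, 40, 60, and 80.
    """
    if not 0 <= score_pct <= 100:
        raise ValueError("score must be between 0 and 100")
    # Integer-divide by the bin width; cap at 5 so a score of 100 lands in the top bin.
    return min(int(score_pct // 20) + 1, 5)


# The three scores quoted in the text above.
scores = {"South Dakota": 0, "Louisiana": 19, "Indiana": 24}
for state, pct in scores.items():
    print(f"{state}: {pct}% -> fifth {fifth_of(pct)}")

# Louisiana is 5 points from Indiana but 19 points from South Dakota,
# yet the bins put it with South Dakota.
print(abs(19 - 24), abs(19 - 0))
```

Under that assumption, Louisiana and South Dakota share a bin while Indiana lands in the next one up, even though Louisiana's score is far closer to Indiana's.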

In addition to the unnecessary and misleading use of categories, the LCV made two other fundamental and very common errors in data presentation. First, their choice of colors is almost tailor-made to ensure that people with the most common forms of color-blindness will have difficulty interpreting their graphic. For people with full color vision, red and green have excellent mutual discernibility. Unfortunately, red-green colorblindness is the most common form of colorblindness. Fortunately, there has been a good deal of research on color schemes that can be interpreted by people with multiple types of colorblindness. Much of that research on useful palettes is even available online.

The other problem is that LCV chose the wrong kind of color scheme, altogether. The data in their scorecard is "sequential". Sequential data has "low" and "high" values with no "natural" center. They chose a color scheme with two extremes (green vs. red) and imposed an arbitrary center (orange). Such a color scheme is appropriate for "divergent" data, when you actually want to emphasize how the data deviates from a specific central value. The LCV certainly does not believe that "roughly 50%" environmental friendliness is a natural center or ideal from which everything deviates. But their choice of color gives that impression. The middle of a data set is not always a "natural" center. If what you want to emphasize is that there are "bad" situations and "better" situations, and the "ideal" is the "best", then you have sequential data. For sequential data, you want to use a sequential color scheme, where greater intensity means "more". What would the LCV map look like with a proper sequential color scheme?

[Maps. Left: House Votes by State, Colored by Fifths with a Sequential Color Scheme. Right: House Votes by State, Colored by Percent with a Sequential Color Scheme.]

The map on the left is the same set of "fifths" that the LCV used, but colored according to its appropriate data type. You can immediately see at a glance which states had higher environmental voting scores for their House of Representatives members. You do not need to look at a "key" or muddle through attempting to interpret color choices with no inherent link to the rankings. This is immediately more informative than the original LCV map. However, it is still as misleading as the original LCV map.

This is corrected by the map on the right, which presents actual percentage scores by color intensity. Again, at a glance, you can see not only which states are stronger in environmental voting record but even just how much stronger or weaker they are than other states. The misleading impressions given for states like Wisconsin vs. Michigan or Washington vs. Illinois are gone. Their relationships are accurately portrayed.
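For anyone who wants to try this, coloring by intensity can be sketched as a straight-line blend between a light color and a dark one. This is only an illustration: the two endpoint colors and the function names below are placeholders I picked, not the colors or code behind the maps.

```python
def lerp_color(low_rgb, high_rgb, t):
    """Linearly interpolate between two RGB colors; t runs from 0.0 to 1.0."""
    t = max(0.0, min(1.0, t))  # clamp so out-of-range scores cannot break the ramp
    return tuple(round(lo + (hi - lo) * t) for lo, hi in zip(low_rgb, high_rgb))


def sequential_hex(score_pct, low_rgb=(247, 251, 255), high_rgb=(8, 48, 107)):
    """Map a 0-100 score to a hex color on a light-to-dark ramp.

    The endpoint colors are illustrative placeholders.
    """
    r, g, b = lerp_color(low_rgb, high_rgb, score_pct / 100)
    return f"#{r:02x}{g:02x}{b:02x}"


for pct in (0, 19, 24, 100):
    print(pct, sequential_hex(pct))
```

With a continuous ramp like this, 19% and 24% get slightly different shades instead of landing in two unrelated categories. A real map would want a palette checked for colorblind readers.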

I don't know why LCV presented their data as they did. I don't think they intended to mislead, but I can speculate a bit. First, very few people are given adequate training in proper data presentation. Use of proper color schemes for different data types is rarely taught in formal coursework. Sensitivity to colorblindness is also rarely covered in the incidental information given on data presentation. Second, arbitrary and rigid categories are comforting and comfortable. We like to think that the world is nicely split up into "our side" and "the enemy". Admitting that situations can be more gradual, and that there might not be simplistic steps between "saved" and "damned", makes us uncomfortable. The politically dedicated are very often most prone to such thinking.

However, in the end, I can't claim to actually know motivations for bad data presentation choices, but I can offer suggestions on how to more accurately and more effectively present the data.

February 20, 2017

How do the Marion County neighborhoods compare to each other?

(There are several maps and a couple of charts that may take a little time to load. If you see blank spots, don't panic, just wait. If they don't fill in, try reloading the page.)

Data can be a very good thing. Indeed, without information, we're flailing around blind. For example, we might get a feeling for how Indianapolis area neighborhoods are faring, but to get more than a feeling, we need hard data. It just so happens that the Polis Center at IUPUI has a project called SAVI. SAVI has helped a group called Indy 2020 to set up a web site called Indy Vitals. You can get a lot of information from Indy Vitals. You can get so much information that it's probably difficult to actually conclude anything from that information. Raw data can be useful, but when enough is piled up, it ends up providing confusion rather than solutions. This doesn't mean that data is bad or that Indy Vitals is bad. Neither is true. Indeed, if all you wanted to do was compare two Marion County neighborhoods, bit by bit, over multiple data points, Indy Vitals would serve you well. But what if you were interested in a bigger picture? What if you wondered how the neighborhoods of Marion County compared as groups? What are the larger similarities and differences? I'm hoping this post will help people see how the data available at Indy Vitals can be used to look at such questions.

What I Mean

I was wondering how, or even if, the neighborhoods in Indy Vitals “clustered” in any way. That is, were there actual meaningful groups of similarity vs. differences. To explore this, I first looked at a single neighborhood, let's say “Eagledale”. I discovered that, if you click a plus sign by the list of data for Eagledale, you can get more information. Part of that information is a "Data Table" tab. If you click that tab, you end up with a table of--you guessed it--data, but not just for Eagledale. You get all the neighborhoods' population, or unemployment rate, or walk score, or whichever specific data you originally clicked. It was a bit of work, but I used this to get the data for 26 different traits for 99 different neighborhoods in Marion County. I looked at it in one table, and it meant nothing at all to me. After all, it's potentially 2,574 data points.

But that's not a problem. There are a lot of things you can do to make sense out of overwhelming amounts of data. If you're a data professional, a mere 2,574 points isn't much, but the human brain isn't designed to look at that many distinct numbers and immediately see any patterns. Fortunately, there are ways to find hidden patterns. One of these ways is called “exploratory factor analysis” (EFA). What EFA does is look for how parts of a data set may be related to each other and group those parts together. This can be important because as a set of data gets larger, it is more likely that more and more categories will correlate. That is, as a value in one category goes up or down, values in other categories tend to go up or down alongside (or in opposition). This could be because there is some hidden “factor” that these data points describe. EFA allows for these factors to be guessed at in a reasonable fashion.

I will get into the nuts and bolts of the EFA after presenting results. That way, if you only want the results, you won't have to wade through a bunch of statistical chatter. The EFA identified three factors, which I titled “Discomfort”, “Density”, and “People of Color”. The first two factors are fairly well defined, but the third one included other elements as well. Again, the details come after the data. Note that I left the Airport and Park 100 regions out of the analysis. How do the neighborhoods stack up regarding these factors?

Marion County Neighborhood Factors
(More blue = lower than county average.)
(More red = higher than county average.)
(Ivory = near county average.)
[Map panels: People of Color; “Stacked” Factors]

Mapping the factors does show us that there could be patterns and clusters, but how do we find them? How do we find patterns for even one factor? It is popular among pollsters to use ranks or arbitrary groups, such as “quintiles”. A quintile is easy to create. It's just one fifth of a data set. You have the lowest, the next, etc. up to the highest. Quintiles can be useful, but much of the time they are misleading. If your factor scores are very gradual, then the boundary between one quintile and the next can be much smaller than the width of one quintile. That is, the highest score in the bottom quintile can be closer to the next quintile than it is to the lowest score in the same quintile! So, presenting information, especially social information, in quintiles can be very misleading.
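Here is a small Python sketch of that boundary problem. The scores are invented and evenly spread, purely for illustration; they are not factor scores from Indy Vitals, and `split_into_quintiles` is a name I made up.

```python
def split_into_quintiles(values):
    """Sort values and split them into five equal-sized groups (lowest first).

    For simplicity this sketch requires a count divisible by five.
    """
    ordered = sorted(values)
    if len(ordered) % 5 != 0:
        raise ValueError("this sketch needs a count divisible by five")
    size = len(ordered) // 5
    return [ordered[i * size:(i + 1) * size] for i in range(5)]


# Invented, evenly spread scores; NOT data from Indy Vitals.
scores = [50 + 0.5 * i for i in range(20)]
quintiles = split_into_quintiles(scores)

bottom, second = quintiles[0], quintiles[1]
gap_across_boundary = second[0] - bottom[-1]  # step from one quintile into the next
width_of_bottom = bottom[-1] - bottom[0]      # spread inside the bottom quintile
print(gap_across_boundary, width_of_bottom)
```

With gradual data like this, the step across the quintile boundary is smaller than the spread inside the bottom quintile, so the label says more about where the line was drawn than about the scores.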

And how do we handle combining factors? One simple way might be to just “stack” the factor scores. A lot of pollsters (like Gallup) do this, by adding. It is, after all, very simple, but it has problems. The fourth map is just such a mathematical "stacking" of the first three, where I averaged the color values. But does it mean anything? How do we evaluate a single scale that combines “discomfort”, “density”, and “people of color”? I'll make it easy: We can't. It makes no sense, at all, to just “add” those three factors like that. Yes, you can get a single number, but it means nothing. This is because each factor in an EFA is supposed to be measuring a different thing. They won't add and they can't add in any meaningful fashion.

This doesn't mean we can't combine factors in a meaningful fashion; it just means we have to use a different way than simple addition/averaging. This is where “clustering” comes in. There are a lot of ways to cluster data. For all kinds of technical reasons, I chose “k-medoids clustering”. I first tested it on each factor. What did I find? No clusters! That's right, each individual factor was so gradually spread out from lowest to highest scores that there were no meaningful clumpings or places to break it up! So, while the lowest and highest were certainly different from each other, there was no way to draw a line and say that one group of neighborhoods was different from another group (on a single factor) without being entirely arbitrary. I could just clump the neighborhoods into five roughly-equally sized groups and call it a day, but there is no rational basis for doing that. Re-read that. Remember that the next time you see some other web page or article throwing "quintiles" or other arbitrary groups at you.

The Clustering

My clustering analysis of the EFA produced a “best” solution at six clusters. I named the clusters for the cluster “medoids”. A medoid is whichever real data point comes closest to the “center” of a cluster. The medoids were North Perry, Forest Manor, Five Points, Eagle Creek, Canterbury-Chatard, and Delaware Trails.
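If the idea of a medoid is new to you, this minimal Python sketch may help. It picks the real point with the smallest total distance to all the others. Note the assumptions: the points are invented, and the sketch uses plain straight-line (Euclidean) distance to stay short, whereas the analysis itself worked from Mahalanobis distances.

```python
import math


def medoid(points):
    """Return the point whose total distance to all other points is smallest."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Invented 2-D points, for illustration only.
points = [(0, 0), (1, 1), (2, 0), (6, 6)]

# The arithmetic mean is usually not one of the actual points.
mean = tuple(sum(coord) / len(points) for coord in zip(*points))

print("mean:", mean)
print("medoid:", medoid(points))
```

The mean lands on a spot where no actual point sits, while the medoid is always one of the real points. That is why a cluster can be named after a real neighborhood.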

Clustering Results of Indy Vitals Data
[Map and chart legend, by cluster: North Perry, Forest Manor, Five Points, Eagle Creek, Canterbury-Chatard, Delaware Trails]

On the left, above, is a map of the clustered neighborhoods, color coded to cluster names. On the right is a summary of how the clusters relate to the factors. As you can see, each factor played a unique role in the clustering. Two clusters had particularly high Discomfort ranges: North Perry and Forest Manor. They appear to primarily differ from each other by People of Color. Three clusters appear to have fairly low average Discomfort: Eagle Creek, Canterbury-Chatard, and Delaware Trails. They appear to differ among each other mostly by density. The last cluster, Five Points, has an average Discomfort close to the county average, but with lower Density and People of Color.

Is there any meaning?

So, I created clusters and can describe them in terms of factors, but what does any of that mean? We have to open the lid on my analysis. This is where things start to get technical. The first step of exploratory factor analysis (EFA) is to find how the data elements relate to each other. This is usually measured by some sort of correlation coefficient. The thing about correlation is that ordinary correlation assumes a lot about the data. First, it assumes there are no missing values; every possible point needs a value. If any is missing, you have to leave that data out or “impute” (guess) a value for that point. The data I downloaded had several missing values. In many cases, these could not be imputed with any confidence, given how the data category was defined. For example, “High School Graduation Rate” was only calculated for neighborhoods that had high schools within their borders, even though quite a few kids in Marion County attend a high school not within their neighborhood borders. I deleted most of these categories.

Why most and not all? A few categories had missing values that could be reasonably imputed or otherwise accounted for. By “otherwise accounted for”, I mean deleting the entries for Airport and Park 100. I consider this acceptable for my purpose because those “neighborhoods” are far more industrial districts than neighborhoods. After I did this, only one category had missing values: “Violent Crime per 1000”, which was missing for Speedway, Lawrence, and Lawrence-Fort Ben-Oaklandon. I took values from the Neighborhood Scout web site. They are probably not as reliable as those from IMPD for the rest of the neighborhoods, but probably not too far off the mark. That source only had one number for both of the Lawrence-based neighborhoods, so I repeated it for them. This left me with a working data set of 97 neighborhoods and 22 variables (2,134 data points).

Input                           Discomfort   Density  Factor 3
Associates or Higher Degree          -1.01
Median Assessed Value                -0.97
Poverty Rate                          0.69
Tree Cover                           -0.66
Median Household Income              -0.62
Unemployment Rate                     0.60
Tax Delinquent Properties             0.60
Violent Crime per 1000                0.59
Vacancy Rate                          0.47
Median Age                           -0.44
Walk Score                                      0.97
Permeable Surface                              -0.85
Food Access                                     0.85
Housing Density                                 0.68
Non Car Commuters                               0.44
People of Color                                           0.95
Access to Quality Pre K                                   0.58
Housing Cost Burdened                                     0.52
Births with Low weight                                    0.43

The other assumption that correlation makes is that the data is “normally distributed”. I checked my data with a utility to test this. The data failed. It was not normally distributed. Ordinary correlation would not give a realistic estimate of how the data categories were interrelated.

Fortunately, there are several methods of “robust” correlation. I ended up using a method called “Orthogonalized Gnanadesikan-Kettenring”. For most people, that will mean nothing, of course, but anyone who wants to check my math will need to know that. I provide the correlation matrix if you want to look it over.

One of the quirks about EFA and related methods is that they don't automatically tell you how many factors best describe your data. You have to tell the method how many to use. There has been a lot of discussion over the decades about how to figure out how many factors to use in EFA. I chose what is called parallel analysis. (Warning: the link goes to a very technical description.) Roughly put, parallel analysis compares principal components or factor analysis outcomes for multiple factors against randomly-generated data sets of the same size. The result that produces the most factors that is still better than the all-random comparison is considered the best choice. My initial parallel analysis suggested 4 factors. However, the outcome of EFA produced one strange factor that consisted only of the Population and Jobs variables. I dropped these two variables and repeated the parallel analysis and EFA. The same remaining factors appeared. I decided that raw population and counts of jobs (not employment, just number of jobs) would not add much meaning to the analysis and went on without the two variables.

I then did my final EFA, producing three factors. The table above describes how strongly the factors relate to the variables. The numbers are called “loadings”. To make the presentation clearer, I don't report loadings that are less than 0.40, which is a commonly used cut-off. I used these loadings to guide how I named the factors. The first factor was a combination of lacking higher adult education, low median residence assessed value, higher disability rates, higher poverty rate, less tree cover, lower household income, greater unemployment, greater violent crime, etc. In short, it made sense to call this factor Discomfort, since places with such features are probably less comfortable places to live. The second factor combined a good walk score with lots of pavement, close grocery stores, dense housing, and higher rates of non-car commuters. It made sense to call this Density. The third factor was difficult to describe, since it had several different types of variables in it. I finally chose People of Color because that variable had a much stronger influence than did the others in the factor. Once I had the EFA result, I was able to use it along with my trimmed data set to produce factor scores.
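The 0.40 cut-off amounts to a simple filter, sketched below in Python. Two assumptions are baked in: that the cut-off applies to the absolute value of a loading (negative loadings such as -0.66 are reported, so that seems to be the intent), and that the two small loadings shown are invented, since the real sub-threshold values were never reported. The name `reportable` is likewise made up.

```python
CUTOFF = 0.40


def reportable(loadings, cutoff=CUTOFF):
    """Keep only loadings whose absolute value is at least the cutoff."""
    return {name: value for name, value in loadings.items() if abs(value) >= cutoff}


# Poverty Rate and Tree Cover use the loadings quoted in the table.
# The Walk Score and Food Access values here are INVENTED for illustration.
discomfort = {
    "Poverty Rate": 0.69,
    "Tree Cover": -0.66,
    "Walk Score": 0.12,
    "Food Access": -0.05,
}

print(reportable(discomfort))
```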

Once I had the factor scores, I used k-medoids clustering to create clusters. But, first, back up. It is common for factors in EFA to be correlated with each other. This is not a flaw in the EFA result, because in the real world, such correlations are common. However, to get the clustering, I still had to combine the factors. For this, I calculated pairwise “Mahalanobis distances”. While somewhat tricky to calculate, Mahalanobis distances take these correlations into account to produce a more realistic description of the data. Then I did the cluster analysis on these distances. I used a utility called pamk to discover the optimum number of k-medoids clusters for the data. This came to, as I already stated, six clusters. The chart to the right illustrates the factor scores vs. specific clusters. X = Discomfort, Y = Density, Z = People of Color. The chart can be rotated by clicking and dragging on it.
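For the curious, here is a bare-bones Python sketch of a Mahalanobis distance, the square root of (a - b)ᵀ S⁻¹ (a - b), where S is the covariance matrix. It is not the code used for the analysis: the helper names are made up, the covariance matrix in the example is invented, and a real project would lean on a statistics library instead of hand-rolled elimination.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def mahalanobis(a, b, cov):
    """Mahalanobis distance between points a and b for covariance matrix cov."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    weighted = solve(cov, diff)  # same as inverse(cov) @ diff, without forming the inverse
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


# INVENTED covariance for two positively correlated factors.
cov = [[1.0, 0.8], [0.8, 1.0]]
along = mahalanobis((0, 0), (1, 1), cov)     # moves with the correlation
against = mahalanobis((0, 0), (1, -1), cov)  # moves against the correlation
print(round(along, 3), round(against, 3))
```

Both example points are the same straight-line distance from the origin, but the one that moves against the correlation comes out much farther away. That is the sense in which the distance takes correlations into account.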

So, now what?

I'm not a social scientist. I am just a person with curiosity. I wanted to see what, if any, associations could be made with the data in the IndyVitals web site. Where you go with those associations probably depends on your outlook and ideas about the city. If anything, I hope that someone can find something more concretely useful to do with this work.

February 3, 2017

How to mislead with maps. The Gallup State Well-Being Rankings for 2016

Gallup has recently released another population survey; this time it is the 2016 State Well-Being Rankings. Gallup's accompanying map (last page of the rankings) is, as you can see, split into quintiles. If you want, you can go over there and look at their map or look at the first map below. It represents the cut-offs in approximately the same colors. (If you mouseover the map, you'll get more information by state.)

State Health Ratings, Colored by Quintile
State Health Ratings, "Squashed" Range
State Health Ratings, Full Range

This map is an excellent example of how data presentation choices mislead. People are supposed to use quintiles, quartiles, percentiles, and other such non-parametric numbers to represent either data that has a long, uneven, and strung-out range (like achievement test scores), or to group a different set of data to show how it is distributed (like wealth per quintile). It just so happens that you can look at the well-being scores for yourself in the linked report. Notice that the data is not strung-out and scattered. In fact, it is very densely-packed. It also is not explicitly linked to some other unevenly-distributed data.

The actual range goes from 58.9 to 65.2. Is a difference of about 6.3 score points worth that much of a visual difference?

How else could we represent the difference so people can get an idea of reality instead of a visual trick? The second, or "squashed scale", map does that. The "worst color" (light gray-green) is matched to a score of 58.9. The "best color" is matched to 65.2. The range between is then evenly filled in among the five color points. Look different? It does. There is some rough correspondence between the misleading map that comes from Gallup and the (somewhat) more truthful map I created, but you can now immediately see that the country is not divided into stark and extreme categories. You can also immediately see that the distances between categories are not sharply defined.

But I'm not finished. You see a third map. This is a map where the "best color" corresponds to a score of 100 (maximum theoretical possible score) and the "worst color" corresponds to 0 (minimum possible theoretical score). Changes in color now correspond to linear differences along the full possible range. Having a hard time telling the states apart? That is because the differences among them in this index really are tiny. This map shows you what that actual difference looks like in the context of the full scale.
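The difference between the "squashed" and full-range maps comes down to which endpoints you rescale against. Here is a minimal Python sketch using the 58.9 and 65.2 figures quoted above; the function name `rescale` is just something I made up for illustration.

```python
def rescale(score, low, high):
    """Place score on a 0-to-1 scale where low maps to 0 and high maps to 1."""
    return (score - low) / (high - low)


observed_low, observed_high = 58.9, 65.2  # observed range quoted in the post

for score in (observed_low, observed_high):
    squashed = rescale(score, observed_low, observed_high)  # "squashed" map
    full = rescale(score, 0, 100)                           # full-range map
    print(f"{score}: squashed={squashed:.3f}, full={full:.3f}")

# Share of the full 0-100 color ramp that the states actually occupy.
share_of_full_ramp = rescale(observed_high, 0, 100) - rescale(observed_low, 0, 100)
print(f"{share_of_full_ramp:.3f}")
```

On the squashed scale the lowest and highest states sit at opposite ends of the color ramp. On the full scale they are separated by only about 6% of it, which is why the third map looks nearly uniform.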

So, why does Gallup do this? Why do people eagerly swallow such representation of data? First, explaining Gallup. I don't work there, so this is speculation, but Gallup makes its money off controversy. Anything they publish that will stir the pot will inspire more surveys that they can sell. Likewise, presenting things in extreme ways ensures that there will be more arguments, leading to more survey commissions, leading to similar data presentation, leading to more arguments. It's a lucrative circle for Gallup.

But why do people so eagerly devour this quasi-information? First, it's simple. People like very stark, very simple things to natter on about with each other. People do not like complex and shaded descriptions. They want things to be very neatly pigeonholed, and this comforts them. In addition, people with agendas want things presented as rigidly and extremely as possible to the public, all the better to sound the panic alarm. Finally, we are often taught by society that only rigid and extreme answers can be "true". We are indoctrinated to see the world as "good" and "evil" with nothing in between. We are taught that someone who is able to see gradual differences is a "fence-sitter" or "spineless". We are told that only extremism is good--although it's only actually extremism when it's someone you don't like doing it.

I don't know if this changed the way you see the world, but I hope it helped you understand and be more critical of the "studies", "surveys", and "polls" that we are now flooded with.