Things I Write: February 2017

(There are several maps and a couple of charts that may take a little time to load. If you see blank spots, don't panic, just wait. If they don't fill in, try reloading the page.)

Data can be a very good thing. Indeed, without information, we're flailing around blind. For example, we might get a feeling for how Indianapolis area neighborhoods are faring, but to get more than a feeling, we need hard data. It just so happens that the Polis Center at IUPUI has a project called SAVI. SAVI has helped a group called Indy 2020 to set up a web site called Indy Vitals. You can get a lot of information from Indy Vitals. You can get so much information that it's probably difficult to actually conclude anything from that information. Raw data can be useful, but when enough is piled up, it ends up providing confusion rather than solutions. This doesn't mean that data is bad or that Indy Vitals is bad. Neither is true. Indeed, if all you wanted to do was compare two Marion County neighborhoods, bit by bit, over multiple data points, Indy Vitals would serve you well. But what if you were interested in a bigger picture? What if you wondered how the neighborhoods of Marion County compared as groups? What are the larger similarities and differences? I'm hoping this post will help people see how the data available at Indy Vitals can be used to look at such questions.

What I Mean

I was wondering how, or even if, the neighborhoods in Indy Vitals “clustered” in any way. That is, were there actual, meaningful groups defined by similarities and differences? To explore this, I first looked at a single neighborhood, let's say “Eagledale”. I discovered that, if you click a plus sign by the list of data for Eagledale, you can get more information. Part of that information is a "Data Table" tab. If you click that tab, you end up with a table of--you guessed it--data, but not just for Eagledale. You get all the neighborhoods' population, or unemployment rate, or walk score, or whichever specific data you originally clicked. It was a bit of work, but I used this to get the data for 26 different traits for 99 different neighborhoods in Marion County. I looked at it in one table, and it meant nothing at all to me. After all, it's potentially 2,574 data points.

But that's not a problem. There are a lot of things you can do to make sense out of overwhelming amounts of data. If you're a data professional, a mere 2,574 points isn't much, but the human brain isn't designed to look at that many distinct numbers and immediately see any patterns. Fortunately, there are ways to find hidden patterns. One of these ways is called “exploratory factor analysis” (EFA). What EFA does is look for how parts of a data set may be related to each other and group those parts together. This can be important because as a set of data gets larger, it is more likely that more and more categories will correlate. That is, as a value in one category goes up or down, values in other categories tend to go up or down alongside (or in opposition). This could be because there is some hidden “factor” that these data points describe. EFA allows for these factors to be guessed at in a reasonable fashion.

I will get into the nuts and bolts of the EFA after presenting results. That way, if you only want the results, you won't have to wade through a bunch of statistical chatter. The EFA identified three factors, which I titled “Discomfort”, “Density”, and “People of Color”. The first two factors are fairly well defined, but the third one included other elements as well. Again, the details come after the data. Note that I left the Airport and Park 100 regions out of the analysis. How do the neighborhoods stack up regarding these factors?

Marion County Neighborhood Factors (more blue = lower than county average; more red = higher than county average; ivory = near county average)

Maps: Discomfort, Density, People of Color, “Stacked” Factors

Mapping the factors does show us that there could be patterns and clusters, but how do we find them? How do we find patterns for even one factor? It is popular among pollsters to use ranks or arbitrary groups, such as “quintiles”. A quintile is easy to create. It's just one fifth of a data set. You have the lowest, the next, etc. up to the highest. Quintiles can be useful, but much of the time they are misleading. If your factor scores are very gradual, then the boundary between one quintile and the next can be much smaller than the width of one quintile. That is, the highest score in the bottom quintile can be closer to the next quintile than it is to the lowest score in the same quintile! So, presenting information, especially social information, in quintiles can be very misleading.
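To make the quintile problem concrete, here is a minimal Python sketch. The scores are made up (evenly spaced, not Indy Vitals data), and `quintile_groups` is a hypothetical helper written just for this illustration. It compares the spread inside the bottom quintile with the jump across the boundary to the next one.

```python
def quintile_groups(scores):
    """Split scores into five equal-sized groups, lowest to highest.

    Assumes the number of scores is divisible by five.
    """
    ordered = sorted(scores)
    size = len(ordered) // 5
    return [ordered[i * size:(i + 1) * size] for i in range(5)]

# Hypothetical, very gradual scores: 0.0, 0.1, ..., 9.9 (100 values).
scores = [i / 10 for i in range(100)]
groups = quintile_groups(scores)

bottom, second = groups[0], groups[1]
width_of_bottom = bottom[-1] - bottom[0]   # spread inside one quintile
gap_at_boundary = second[0] - bottom[-1]   # jump from one quintile to the next

print(f"width of bottom quintile: {width_of_bottom:.1f}")
print(f"gap at quintile boundary: {gap_at_boundary:.1f}")
```

With gradual data like this, the top of the bottom quintile sits far closer to the next quintile than to the bottom of its own group, which is exactly the situation described above.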

And how do we handle combining factors? One simple way might be to just “stack” the factor scores. A lot of pollsters (like Gallup) do this, by adding. It is, after all, very simple, but it has problems. The fourth map is just such a mathematical "stacking" of the first three, where I averaged the color values. But does it mean anything? How do we evaluate a single scale that combines “discomfort”, “density”, and “people of color”? I'll make it easy: We can't. It makes no sense, at all, to just “add” those three factors like that. Yes, you can get a single number, but it means nothing. This is because each factor in an EFA is supposed to be measuring a different thing. They won't add and they can't add in any meaningful fashion.

This doesn't mean we can't combine factors in a meaningful fashion. It just means we have to use a different way than simple addition or averaging. This is where “clustering” comes in. There are a lot of ways to cluster data. For all kinds of technical reasons, I chose “k-medoids clustering”. I first tested it on each factor. What did I find? No clusters! That's right, each individual factor was so gradually spread out from lowest to highest scores that there were no meaningful clumpings or places to break it up! So, while the lowest and highest were certainly different from each other, there was no way to draw a line and say that one group of neighborhoods was different from another group (on a single factor) without being entirely arbitrary. I could just clump the neighborhoods into five roughly equal-sized groups and call it a day, but there is no rational basis for doing that. Re-read that. Remember that the next time you see some other web page or article throwing "quintiles" or other arbitrary groups at you.

The Clustering

My clustering analysis of the EFA factor scores produced a “best” solution at six clusters. I named the clusters for the cluster “medoids”. A medoid is whichever real data point comes closest to the “center” of a cluster. The medoids were North Perry, Forest Manor, Five Points, Eagle Creek, Canterbury-Chatard, and Delaware Trails.
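For anyone who wants to see what "medoid" means in practice, here is a small Python sketch. The four points and their labels are invented for illustration, and the sketch uses plain Euclidean distance for simplicity, which is an assumption of the example and not the distance used in the analysis. The medoid is simply the member with the smallest total distance to all the other members.

```python
import math

def medoid(points):
    """Return the point whose total distance to all the points is smallest."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical (Discomfort, Density, People of Color) scores for one cluster.
cluster = {
    "A": (0.9, 0.1, 0.8),
    "B": (1.1, 0.0, 1.0),
    "C": (1.0, 0.2, 0.9),
    "D": (2.0, 1.0, 0.1),
}
names_by_point = {point: name for name, point in cluster.items()}

center = medoid(list(cluster.values()))
print("medoid:", names_by_point[center])
```

Unlike an average, the medoid is always one of the real data points, which is why a cluster can be named after it.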

Clustering Results of Indy Vitals Data

Clusters: North Perry, Forest Manor, Five Points, Eagle Creek, Canterbury-Chatard, Delaware Trails

On the left, above, is a map of the clustered neighborhoods, color coded to cluster names. On the right is a summary of how the clusters relate to the factors. As you can see, each factor played a unique role in the clustering. Two clusters had particularly high Discomfort ranges: North Perry and Forest Manor. They appear to primarily differ from each other by People of Color. Three clusters appear to have fairly low average Discomfort: Eagle Creek, Canterbury-Chatard, and Delaware Trails. They appear to differ from one another mostly by Density. The last cluster, Five Points, has an average Discomfort close to the county average, but with lower Density and People of Color.

Is there any meaning?

So, I created clusters and can describe them in terms of factors, but what does any of that mean? We have to open the lid on my analysis. This is where things start to get technical. The first step of exploratory factor analysis (EFA) is to find how the data elements relate to each other. This is usually measured by some sort of correlation coefficient. The thing about correlation is that ordinary correlation assumes a lot about the data. First, it assumes there are no missing values: every possible point needs a value. If any are missing, you have to leave that data out or “impute” (guess) a value for that point. The data I downloaded had several missing values. In many cases, these could not be imputed with any confidence, given how the data category was defined. For example, “High School Graduation Rate” was calculated only for neighborhoods that had high schools within their borders, even though quite a few kids in Marion County attend a high school not within their neighborhood borders. I deleted most of these categories.

Why most and not all? A few categories had missing values that could be reasonably imputed or otherwise accounted for. By “otherwise accounted for”, I mean deleting the entries for Airport and Park 100. I consider this acceptable for my purpose because those “neighborhoods” are far more industrial districts than neighborhoods. After I did this, only one category had missing values: “Violent Crime per 1000”, which was missing for Speedway, Lawrence, and Lawrence-Fort Ben-Oaklandon. I took values from the Neighborhood Scout web site. These are probably not as reliable as the values from IMPD for the rest of the neighborhoods, but they are probably not too far off the mark. That source only had one number for both of the Lawrence-based neighborhoods, so I repeated it for them. This left me with a working data set of 97 neighborhoods and 22 variables (2,134 data points).

Input	Discomfort	Density	Factor 3
Associates or Higher Degree	-1.01
Median Assessed Value	-0.97
Disability	0.70
Poverty Rate	0.69
Tree Cover	-0.66
Median Household Income	-0.62
Unemployment Rate	0.60
Tax Delinquent Properties	0.60
Violent Crime per 1000	0.59
Vacancy Rate	0.47
Median Age	-0.44
Walk Score		0.97
Permeable Surface		-0.85
Food Access		0.85
Housing Density		0.68
Non Car Commuters		0.44
People of Color			0.95
Access to Quality Pre K			0.58
Housing Cost Burdened			0.52
Births with Low weight			0.43

The other assumption that correlation makes is that the data is “normally distributed”. I checked my data with a utility to test this. The data failed. It was not normally distributed. Ordinary correlation would not give a realistic estimate of how the data categories were interrelated.

Fortunately, there are several methods of “robust” correlation. I ended up using a method called “Orthogonalized Gnanadesikan-Kettenring”. For most people, that will mean nothing, of course, but anyone who wants to check my math will need to know that. I provide the correlation matrix if you want to look it over.
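To give a feel for what a "robust" correlation does, here is a minimal Python sketch of the basic pairwise Gnanadesikan-Kettenring idea. It is not the orthogonalized version used for this analysis (that adds a further step to keep the whole correlation matrix well behaved), the data are invented, and using the median absolute deviation as the robust scale is an assumption of this sketch. The function names are hypothetical.

```python
from statistics import median

def mad(values):
    """Median absolute deviation: a robust stand-in for the standard deviation."""
    m = median(values)
    return median([abs(v - m) for v in values])

def gk_correlation(x, y):
    """Pairwise Gnanadesikan-Kettenring correlation (no orthogonalization).

    Assumes the robust scales involved are not zero.
    """
    sx, sy = mad(x), mad(y)
    u = [a / sx for a in x]
    v = [b / sy for b in y]
    plus = mad([a + b for a, b in zip(u, v)]) ** 2
    minus = mad([a - b for a, b in zip(u, v)]) ** 2
    return (plus - minus) / (plus + minus)

def pearson(x, y):
    """Ordinary (non-robust) correlation, for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: a perfect straight line, except for one wild value.
x = list(range(1, 11))
y = [2 * a + 1 for a in x]
y[-1] = -50

ordinary = pearson(x, y)
robust = gk_correlation(x, y)
print(f"ordinary: {ordinary:.2f}, robust: {robust:.2f}")
```

In this made-up example a single wild value is enough to flip the ordinary correlation negative, while the robust estimate still reflects the pattern in the other nine points.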

One of the quirks about EFA and related methods is that they don't automatically tell you how many factors best describe your data. You have to tell the method how many to use. There has been a lot of discussion over the decades about how to figure out how many factors to use in EFA. I chose what is called parallel analysis. (Warning: the link goes to a very technical description.) Roughly put, parallel analysis compares principal components or factor analysis outcomes for multiple factors against randomly generated data sets of the same size. The result that produces the most factors that is still better than the all-random comparison is considered the best choice. My initial parallel analysis suggested 4 factors. However, the outcome of EFA produced one strange factor that consisted only of the Population and Jobs variables. I dropped these two variables and repeated the parallel analysis and EFA. The same remaining factors appeared. I decided that raw population and counts of jobs (not employment, just number of jobs) would not add much meaning to the analysis and went on without the two variables.

I then did my final EFA, producing three factors. The table above describes how strongly the factors relate to the variables. The numbers are called “loadings”. To make the presentation clearer, I don't report loadings that are less than 0.40 in absolute value, which is a commonly used cut-off. I used these loadings to guide how I named the factors. The first factor was a combination of lacking higher adult education, low median residence assessed value, higher disability rates, higher poverty rate, less tree cover, lower household income, greater unemployment, greater violent crime, etc. In short, it made sense to call this factor Discomfort, since places with such features are probably less comfortable places to live. The second factor combined a good walk score with lots of pavement, close grocery stores, dense housing, and higher rates of non-car commuters. It made sense to call this Density. The third factor was difficult to describe, since it had several different types of variables in it. I finally chose People of Color because that variable had a much stronger influence than did the others in the factor. Once I had the EFA result, I was able to use it along with my trimmed data set to produce factor scores.

Once I had the factor scores, I used k-medoids clustering to create clusters. But, first, back up. It is common for factors in EFA to be correlated with each other. This is not a flaw in the EFA result, because in the real world, such correlations are common. However, to get the clustering, I still had to combine the factors. For this, I calculated pairwise “Mahalanobis distances”. While somewhat tricky to calculate, Mahalanobis distances take these correlations into account to produce a more realistic description of the data. Then I did the cluster analysis on these distances. I used a utility called pamk to discover the optimum number of k-medoids clusters for the data. This came to, as I already stated, six clusters. The chart to the right illustrates the factor scores vs. specific clusters. X = Discomfort, Y = Density, Z = People of Color. The chart can be rotated by clicking and dragging on it.
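Here is a rough Python sketch of why Mahalanobis distance matters when variables are correlated. It is cut down to two made-up variables so the matrix inverse can be written out by hand. The real analysis used three factors, and the data and function names below are hypothetical.

```python
import math

def covariance_2d(points):
    """Sample covariance terms (sxx, sxy, syy) for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between p and q, given 2x2 covariance terms."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - q[0], p[1] - q[1]
    # d^T * inverse(cov) * d, written out for the 2x2 case.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)

# Hypothetical scores on two strongly correlated variables.
data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
cov = covariance_2d(data)

start = (3.0, 3.0)
along = (4.0, 4.0)    # moves with the overall trend
across = (4.0, 2.0)   # moves against the overall trend

d_along = mahalanobis_2d(start, along, cov)
d_across = mahalanobis_2d(start, across, cov)
print(f"along the trend: {d_along:.2f}, against the trend: {d_across:.2f}")
```

Both moves are the same length on an ordinary ruler, but the move against the trend comes out much "farther" because it is far more unusual given how the two variables normally travel together.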

So, now what?

I'm not a social scientist. I am just a person with curiosity. I wanted to see what, if any, associations could be made with the data in the IndyVitals web site. Where you go with those associations probably depends on your outlook and ideas about the city. If anything, I hope that someone can find something more concretely useful to do with this work.

Gallup has recently released another population survey. This time it is the 2016 State Well-Being Rankings. Gallup's accompanying map (last page of the rankings) is, as you can see, split into quintiles. If you want, you can go over there and look at their map or look at the first map below. It represents the cut-offs in approximately the same colors. (If you mouse over the map, you'll get more information by state.)

State Health Ratings, Colored by Quintile

State Health Ratings, "Squashed" Range

State Health Ratings, Full Range

This map is an excellent example of how data presentation choices mislead. People are supposed to use quintiles, quartiles, percentiles, and other such non-parametric numbers either to represent data that has a long, uneven, and strung-out range (like achievement test scores) or to group a different set of data to show how it is distributed (like wealth per quintile). It just so happens that you can look at the well-being scores for yourself in the linked report. Notice that the data is not strung-out and scattered. In fact, it is very densely packed. It also is not explicitly linked to some other unevenly distributed data.

The actual range goes from 58.9 to 65.2. Is a difference of about 6.3 score points worth that much of a visual difference?

How else could we represent the difference so people can get an idea of reality instead of a visual trick? The second, or "squashed scale", map does that. The "worst color" (light gray-green) is matched to a score of 58.9. The "best color" is matched to 65.2. The range between is then evenly filled in among the five color points. Look different? It does. There is some rough correspondence between the misleading map that comes from Gallup and the (somewhat) more truthful map I created, but you can now immediately see that the country is not divided into stark and extreme categories. You can also immediately see that the distances between categories are not sharply defined.

But I'm not finished. You see a third map. This is a map where the "best color" corresponds to a score of 100 (maximum theoretical possible score) and the "worst color" corresponds to 0 (minimum theoretical possible score). Changes in color now correspond to linear differences along the full possible range. Having a hard time telling the states apart? That is because the differences among them in this index really are tiny. This map shows you what that actual difference looks like in the context of the full scale.
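The difference between the second and third maps comes down to one line of arithmetic. Here is a small Python sketch, where `scale_position` is a hypothetical helper that places a score on a color scale running from 0.0 (worst color) to 1.0 (best color). Only the scores 58.9 and 65.2 come from the rankings; everything else is illustration.

```python
def scale_position(score, low, high):
    """Linear position of a score between low (0.0) and high (1.0)."""
    return (score - low) / (high - low)

LOWEST, HIGHEST = 58.9, 65.2  # lowest and highest state scores

# Second map: the color scale is squashed onto the observed range.
squashed_worst = scale_position(LOWEST, LOWEST, HIGHEST)
squashed_best = scale_position(HIGHEST, LOWEST, HIGHEST)

# Third map: the color scale covers the full theoretical range, 0 to 100.
full_worst = scale_position(LOWEST, 0.0, 100.0)
full_best = scale_position(HIGHEST, 0.0, 100.0)

print(f"squashed scale: {squashed_worst:.3f} to {squashed_best:.3f}")
print(f"full scale:     {full_worst:.3f} to {full_best:.3f}")
```

On the squashed scale the states spread across the entire color range. On the full scale they all land between 0.589 and 0.652, a band covering only about 6% of the available colors.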

So, why does Gallup do this? Why do people eagerly swallow such representation of data? First, explaining Gallup. I don't work there, so this is speculation, but Gallup makes its money off controversy. Anything they publish that will stir the pot will inspire more surveys that they can sell. Likewise, presenting things in extreme ways ensures that there will be more arguments, leading to more survey commissions, leading to similar data presentation, leading to more arguments. It's a lucrative circle for Gallup.

But why do people so eagerly devour this quasi-information? First, it's simple. People like very stark, very simple things to natter on about with each other. People do not like complex and shaded descriptions. They want things to be very neatly pigeonholed, and this comforts them. In addition, people with agendas want things presented as rigidly and extremely as possible to the public, all the better to sound the panic alarm. Finally, we are often taught by society that only rigid and extreme answers can be "true". We are indoctrinated to see the world as "good" and "evil" with nothing in between. We are taught that someone who is able to see gradual differences is a "fence-sitter" or "spineless". We are told that only extremism is good--although it's only actually extremism when it's someone you don't like doing it.

I don't know if this changed the way you see the world, but I hope it helped you understand and be more critical of the "studies", "surveys", and "polls" that we are now flooded with.

Things I Write

February 20, 2017

How do the Marion County neighborhoods compare to each other?

February 3, 2017

How to mislead with maps. The Gallup State Well-Being Rankings for 2016