(There are several maps and a couple of charts that may take a little time to load. If you see blank spots, don't panic, just wait. If they don't fill in, try reloading the page.)
Data can be a very good thing. Indeed, without information, we're flailing around blind. For example, we might get a feeling for how Indianapolis area neighborhoods are faring, but to get more than a feeling, we need hard data. It just so happens that the Polis Center at IUPUI has a project called SAVI. SAVI has helped a group called Indy 2020 to set up a web site called Indy Vitals. You can get a lot of information from Indy Vitals. You can get so much information that it's probably difficult to actually conclude anything from that information. Raw data can be useful, but when enough is piled up, it ends up providing confusion rather than solutions. This doesn't mean that data is bad or that Indy Vitals is bad. Neither is true. Indeed, if all you wanted to do was compare two Marion County neighborhoods, bit by bit, over multiple data points, Indy Vitals would serve you well. But what if you were interested in a bigger picture? What if you wondered how the neighborhoods of Marion County compared as groups? What are the larger similarities and differences? I'm hoping this post will help people see how the data available at Indy Vitals can be used to look at such questions.
What I Mean
I was wondering how, or even if, the neighborhoods in Indy Vitals “clustered” in any way. That is, were there meaningful groups of neighborhoods that resembled each other and differed from the rest? To explore this, I first looked at a single neighborhood, let's say “Eagledale”. I discovered that, if you click a plus sign by the list of data for Eagledale, you can get more information. Part of that information is a "Data Table" tab. If you click that tab, you end up with a table of--you guessed it--data, but not just for Eagledale. You get every neighborhood's population, or unemployment rate, or walk score, or whichever specific data you originally clicked. It was a bit of work, but I used this to get the data for 26 different traits for 99 different neighborhoods in Marion County. I looked at it in one table, and it meant nothing at all to me. After all, it's potentially 2,574 data points.
But that's not a problem. There are a lot of things you can do to make sense out of overwhelming amounts of data. If you're a data professional, a mere 2,574 points isn't much, but the human brain isn't designed to look at that many distinct numbers and immediately see any patterns. Fortunately, there are ways to find hidden patterns. One of these ways is called “exploratory factor analysis” (EFA). What EFA does is look for how parts of a data set may be related to each other and group those parts together. This can be important because, as a set of data gets larger, it becomes more likely that more and more categories will correlate. That is, as a value in one category goes up or down, values in other categories tend to go up or down alongside (or in opposition). This could be because there is some hidden “factor” that these data points describe. EFA allows for these factors to be guessed at in a reasonable fashion.
I will get into the nuts and bolts of the EFA after presenting results. That way, if you only want the results, you won't have to wade through a bunch of statistical chatter. The EFA identified three factors, which I titled “Discomfort”, “Density”, and “People of Color”. The first two factors are fairly well defined, but the third one included other elements as well. Again, the details come after the data. Note that I left the Airport and Park 100 regions out of the analysis. How do the neighborhoods stack up regarding these factors?
[Maps: Marion County Neighborhood Factors. Four panels: Discomfort, Density, People of Color, and “Stacked” Factors. Color key: more blue = lower than county average; more red = higher than county average; ivory = near county average.]
Mapping the factors does show us that there could be patterns and clusters, but how do we find them? How do we find patterns for even one factor? It is popular among pollsters to use ranks or arbitrary groups, such as “quintiles”. A quintile is easy to create. It's just one fifth of a data set. You have the lowest, the next, etc. up to the highest. Quintiles can be useful, but much of the time they are misleading. If your factor scores are very gradual, then the boundary between one quintile and the next can be much smaller than the width of one quintile. That is, the highest score in the bottom quintile can be closer to the next quintile than it is to the lowest score in the same quintile! So, presenting information, especially social information, in quintiles can be very misleading.
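To make the quintile problem concrete, here is a minimal Python sketch. The scores are invented (evenly spaced values standing in for gradually changing factor scores), not Indy Vitals data, and the `quintiles` helper is just an illustration that assumes the number of scores divides evenly by five.

```python
# Hypothetical, evenly spaced "factor scores" for 20 neighborhoods.
# These are made-up numbers, not Indy Vitals data.
scores = [round(-1.9 + 0.2 * i, 1) for i in range(20)]

def quintiles(values):
    """Split values into five equal-sized groups, lowest to highest.

    Assumes len(values) is a multiple of 5 (illustration only).
    """
    ordered = sorted(values)
    size = len(ordered) // 5
    return [ordered[i * size:(i + 1) * size] for i in range(5)]

groups = quintiles(scores)
bottom, second = groups[0], groups[1]

# Width of the bottom quintile: lowest score to highest score inside it.
width_within = bottom[-1] - bottom[0]
# Gap across the boundary: top of the bottom quintile to bottom of the next.
gap_across = second[0] - bottom[-1]

print(f"width within bottom quintile: {width_within:.1f}")
print(f"gap across the boundary:      {gap_across:.1f}")
```

With scores this gradual, the neighborhood at the top of the bottom quintile sits closer to the next quintile than to the bottom of its own group, even though the labels say otherwise.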
And how do we handle combining factors? One simple way might be to just “stack” the factor scores. A lot of pollsters (like Gallup) do this, by adding. It is, after all, very simple, but it has problems. The fourth map is just such a mathematical "stacking" of the first three, where I averaged the color values. But does it mean anything? How do we evaluate a single scale that combines “discomfort”, “density”, and “people of color”? I'll make it easy: We can't. It makes no sense, at all, to just “add” those three factors like that. Yes, you can get a single number, but it means nothing. This is because each factor in an EFA is supposed to be measuring a different thing. They won't add and they can't add in any meaningful fashion.
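A tiny Python illustration of why a stacked number hides more than it shows. The three neighborhoods and their scores are hypothetical, chosen only to make the arithmetic obvious.

```python
# Hypothetical standardized factor scores: (Discomfort, Density, People of Color).
# These numbers are invented to make the point; they are not Indy Vitals results.
profiles = {
    "Neighborhood A": (1.5, -1.5, 0.0),
    "Neighborhood B": (-1.5, 0.0, 1.5),
    "Neighborhood C": (0.0, 0.0, 0.0),
}

# "Stacking" by averaging the three scores into one number.
stacked = {name: sum(scores) / len(scores) for name, scores in profiles.items()}

for name, scores in profiles.items():
    print(f"{name}: factors={scores}  stacked={stacked[name]:.2f}")
```

All three get the same stacked value even though their factor profiles have nothing in common, so the single number cannot tell them apart.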
This doesn't mean we can't combine factors in a meaningful fashion. It just means we have to use a different way than simple addition or averaging. This is where “clustering” comes in. There are a lot of ways to cluster data. For all kinds of technical reasons, I chose “k-medoids clustering”. I first tested it on each factor. What did I find? No clusters! That's right, each individual factor was so gradually spread out from lowest to highest scores that there were no meaningful clumpings or places to break it up! So, while the lowest and highest were certainly different from each other, there was no way to draw a line and say that one group of neighborhoods was different from another group (on a single factor) without being entirely arbitrary. I could just clump the neighborhoods into five roughly equal-sized groups and call it a day, but there is no rational basis for doing that. Re-read that. Remember that the next time you see some other web page or article throwing "quintiles" or other arbitrary groups at you.
My clustering analysis of the EFA produced a “best” solution at six clusters. I named the clusters for the cluster “medoids”. A medoid is whichever real data point comes closest to the “center” of a cluster. The medoids were North Perry, Forest Manor, Five Points, Eagle Creek, Canterbury-Chatard, and Delaware Trails.
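For anyone who wants the medoid idea spelled out, here is a minimal Python sketch. The points are hypothetical two-factor scores, and the sketch uses plain straight-line (Euclidean) distance for simplicity, which is an assumption of the example rather than a description of the analysis.

```python
import math

def medoid(points):
    """Return the actual point with the smallest total distance to all others."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)

# Hypothetical (Discomfort, Density) scores for one small cluster.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.4, 0.3), (3.0, 3.0)]

center = medoid(cluster)
mean_point = tuple(sum(c) / len(cluster) for c in zip(*cluster))

print("medoid:", center)      # always one of the original points
print("mean:  ", mean_point)  # usually not one of the original points
```

The practical difference from an ordinary average is that a medoid is a real neighborhood, which is why a cluster can be named after it.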
[Figure: Clustering Results of Indy Vitals Data. Left: map of neighborhoods color coded by cluster. Right: summary of how the clusters relate to the factors. Legend: North Perry, Forest Manor, Five Points, Eagle Creek, Canterbury-Chatard, Delaware Trails.]
On the left, above, is a map of the clustered neighborhoods, color coded to cluster names. On the right is a summary of how the clusters relate to the factors. As you can see, each factor played a unique role in the clustering. Two clusters had particularly high Discomfort ranges: North Perry and Forest Manor. They appear to differ from each other primarily by People of Color. Three clusters appear to have fairly low average Discomfort: Eagle Creek, Canterbury-Chatard, and Delaware Trails. They appear to differ from each other mostly by Density. The last cluster, Five Points, has an average Discomfort close to the county average, but with lower Density and People of Color.
Is there any meaning?
So, I created clusters and can describe them in terms of factors, but what does any of that mean? We have to open the lid on my analysis. This is where things start to get technical. The first step of exploratory factor analysis (EFA) is to find how the data elements relate to each other. This is usually measured by some sort of correlation coefficient. The thing about correlation is that ordinary correlation assumes a lot about the data. First, it assumes there are no missing values: every possible point needs a value. If any are missing, you have to leave that data out or “impute” (guess) a value for that point. The data I downloaded had several missing values. In many cases, these could not be imputed with any confidence, given how the data category was defined. For example, “High School Graduation Rate” was only calculated for neighborhoods that had high schools within their borders, even though quite a few kids in Marion County attend a high school outside their neighborhood borders. I deleted most of these categories.
Why most and not all? A few categories had missing values that could be reasonably imputed or otherwise accounted for. By “otherwise accounted for”, I mean deleting the entries for Airport and Park 100. I consider this acceptable for my purpose because those “neighborhoods” are industrial districts far more than they are neighborhoods. After I did this, only one category had missing values: “Violent Crime per 1000”, which was missing for Speedway, Lawrence, and Lawrence-Fort Ben-Oaklandon. I took values for these from the Neighborhood Scout web site. They are probably not as reliable as the IMPD values for the rest of the neighborhoods, but probably not too far off the mark. That source only had one number for both of the Lawrence-based neighborhoods, so I repeated it for them. This left me with a working data set of 97 neighborhoods and 22 variables (2,134 data points).
|Variable||Loading|
|Associates or Higher Degree||-1.01|
|Median Assessed Value||-0.97|
|Median Household Income||-0.62|
|Tax Delinquent Properties||0.60|
|Violent Crime per 1000||0.59|
|Non Car Commuters||0.44|
|People of Color||0.95|
|Access to Quality Pre K||0.58|
|Housing Cost Burdened||0.52|
|Births with Low weight||0.43|
The other assumption that correlation makes is that the data is “normally distributed”. I checked my data with a utility to test this. The data failed. It was not normally distributed. Ordinary correlation would not give a realistic estimate of how the data categories were interrelated.
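As an illustration of what a normality check looks like, here is a small Python sketch of one common test, the Jarque-Bera statistic. This is a stand-in chosen for the example, not necessarily the utility used above, and the data are simulated, right-skewed values rather than anything from Indy Vitals.

```python
import random
import statistics

def jarque_bera(values):
    """Jarque-Bera statistic: large values suggest non-normal data."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

# 5% critical value of a chi-square distribution with 2 degrees of freedom.
CRITICAL_5_PERCENT = 5.991

rng = random.Random(0)
# Simulated, strongly right-skewed values (think incomes or crime rates).
skewed = [rng.expovariate(1.0) for _ in range(500)]

jb = jarque_bera(skewed)
print(f"JB = {jb:.1f}; reject normality at 5%: {jb > CRITICAL_5_PERCENT}")
```

The statistic combines skewness (lopsidedness) and kurtosis (tail heaviness); the chi-square comparison is only approximate for small samples, so treat it as a rough screen.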
Fortunately, there are several methods of “robust” correlation. I ended up using a method called “Orthogonalized Gnanadesikan-Kettenring”. For most people, that will mean nothing, of course, but anyone who wants to check my math will need to know that. I provide the correlation matrix if you want to look it over.
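For the curious, the core idea behind the Gnanadesikan-Kettenring approach can be sketched in a few lines of Python. This shows only the basic pairwise estimator; the orthogonalization step that gives the full method its name (and keeps the whole matrix well behaved) is left out. The choice of the median absolute deviation as the robust scale is an assumption of this sketch, and the data are invented.

```python
import statistics

def mad_scale(values):
    """Median absolute deviation, rescaled (1.4826) to match the standard
    deviation for normally distributed data."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])

def gk_correlation(x, y):
    """Basic pairwise Gnanadesikan-Kettenring robust correlation.

    Fails with a division by zero if more than half the values in a
    column are identical (the MAD is then zero).
    """
    sx, sy = mad_scale(x), mad_scale(y)
    u = [v / sx for v in x]
    w = [v / sy for v in y]
    plus = mad_scale([a + b for a, b in zip(u, w)]) ** 2
    minus = mad_scale([a - b for a, b in zip(u, w)]) ** 2
    return (plus - minus) / (plus + minus)

def pearson(x, y):
    """Ordinary (non-robust) correlation, for comparison."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Invented data: y rises steadily with x, except for one wild value at the end.
x = [float(v) for v in range(1, 21)]
y = [2 * v + (1 if i % 2 else -1) for i, v in enumerate(x)]
y[-1] = -100.0

print(f"ordinary correlation: {pearson(x, y):.2f}")
print(f"robust (GK) estimate: {gk_correlation(x, y):.2f}")
```

One bad point is enough to wipe out the ordinary correlation here, while the robust estimate still reports the strong relationship that the other 19 points show.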
One of the quirks about EFA and related methods is that they don't automatically tell you how many factors best describe your data. You have to tell the method how many to use. There has been a lot of discussion over the decades about how to figure out how many factors to use in EFA. I chose what is called parallel analysis. (Warning: the link goes to a very technical description.) Roughly put, parallel analysis compares principal components or factor analysis outcomes for multiple factors against randomly generated data sets of the same size. The largest number of factors that still does better than the all-random comparison is considered the best choice. My initial parallel analysis suggested 4 factors. However, the outcome of EFA produced one strange factor that consisted only of the Population and Jobs variables. I dropped these two variables and repeated the parallel analysis and EFA. The same remaining factors appeared. I decided that raw population and counts of jobs (not employment, just number of jobs) would not add much meaning to the analysis and went on without the two variables.
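Here is a rough Python sketch of the parallel analysis idea, using the principal components flavor (eigenvalues of the correlation matrix) and ordinary correlation for simplicity. It is an illustration under those assumptions, not the code behind my results. The data are simulated: six variables driven by two hidden factors. All function names are made up for this example.

```python
import math
import random
import statistics

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    unit = []
    for col in columns:
        mean = statistics.fmean(col)
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    p = len(unit)
    return [[sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(p)]
            for i in range(p)]

def eigenvalues(matrix, sweeps=10):
    """Eigenvalues of a symmetric matrix (largest first), by Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-12:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (
                    abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def parallel_analysis(columns, simulations=30, seed=0):
    """Count leading eigenvalues that beat the average from random data."""
    rng = random.Random(seed)
    n, p = len(columns[0]), len(columns)
    observed = eigenvalues(correlation_matrix(columns))
    totals = [0.0] * p
    for _ in range(simulations):
        noise = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
        for k, value in enumerate(eigenvalues(correlation_matrix(noise))):
            totals[k] += value
    random_mean = [t / simulations for t in totals]
    keep = 0
    for obs, rand in zip(observed, random_mean):
        if obs <= rand:
            break
        keep += 1
    return keep, observed, random_mean

# Simulated data: two hidden factors, each driving three noisy variables.
data_rng = random.Random(42)
n_rows = 120
f1 = [data_rng.gauss(0, 1) for _ in range(n_rows)]
f2 = [data_rng.gauss(0, 1) for _ in range(n_rows)]

def noisy(factor):
    return [v + data_rng.gauss(0, 0.5) for v in factor]

data = [noisy(f1), noisy(f1), noisy(f1), noisy(f2), noisy(f2), noisy(f2)]
n_factors, observed, random_mean = parallel_analysis(data)
print("suggested number of factors:", n_factors)
```

The comparison against the average random eigenvalue is one common convention; some implementations compare against a high percentile instead, which makes the test a little stricter.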
I then did my final EFA, producing three factors. The table above and to the right describes how strongly the factors relate to the variables. The numbers are called “loadings”. To make the presentation clearer, I don't report loadings that are less than 0.40, which is a commonly used cut-off. I used these loadings to guide how I named the factors. The first factor was a combination of lacking higher adult education, low median residence assessed value, higher disability rates, higher poverty rate, less tree cover, lower household income, greater unemployment, greater violent crime, etc. In short, it made sense to call this factor Discomfort, since places with such features are probably less comfortable places to live. The second factor combined a good walk score with lots of pavement, close grocery stores, dense housing, and higher rates of non-car commuters. It made sense to call this Density. The third factor was difficult to describe, since it had several different types of variables in it. I finally chose People of Color because that variable had a much stronger influence than did the others in the factor. Once I had the EFA result, I was able to use it along with my trimmed data set to produce factor scores.
Once I had the factor scores, I used k-medoids clustering to create clusters. But, first, back up. It is common for factors in EFA to be correlated with each other. This is not a flaw in the EFA result, because in the real world, such correlations are common. However, to get the clustering, I still had to combine the factors. For this, I calculated pairwise “Mahalanobis distances”. While somewhat tricky to calculate, Mahalanobis distances take these correlations into account to produce a more realistic description of the data. Then I did the cluster analysis on these distances. I used a utility called pamk to discover the optimum number of k-medoids clusters for the data. This came to, as I already stated, six clusters. The chart to the right illustrates the factor scores vs. specific clusters. X = Discomfort, Y = Density, Z = People of Color. The chart can be rotated by clicking and dragging on it.
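To show what a Mahalanobis distance does differently from a ruler, here is a minimal Python sketch. It uses only two hypothetical, positively correlated factors so that the matrix inverse can be written out by hand; three factors, as in my analysis, would need a general matrix inverse. The scores and function names are invented for the example.

```python
import math
import statistics

def covariance_2d(points):
    """Sample variances and covariance of two-column data."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, syy, sxy

def mahalanobis(p, q, cov):
    """Mahalanobis distance between p and q for a 2x2 covariance matrix."""
    sxx, syy, sxy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - q[0], p[1] - q[1]
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse written out.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)

# Hypothetical scores on two correlated factors (both tend to rise together).
points = [(-2.0, -1.8), (-1.0, -1.2), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
cov = covariance_2d(points)

origin = (0.0, 0.0)
with_trend = (1.0, 1.0)      # moves the way the data usually moves
against_trend = (1.0, -1.0)  # same ruler distance, but an unusual direction

print("Euclidean:  ", math.dist(origin, with_trend), math.dist(origin, against_trend))
print("Mahalanobis:", mahalanobis(origin, with_trend, cov),
      mahalanobis(origin, against_trend, cov))

# Pairwise distance matrix, the kind of input a k-medoids routine can take.
pairwise = [[mahalanobis(p, q, cov) for q in points] for p in points]
```

Both test points are the same ruler distance from the origin, but the one that runs against the usual trend comes out much farther away in Mahalanobis terms, which is the correlation-aware behavior described above.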