October 10, 2017

Comparing Indianapolis Neighborhoods

UPDATE: As of November 21, 2019, the underlying data has not been updated, so this analysis is still as current as it can be.

(Note: There is a map and a chart that may take a little time to load. If you see blank spots, don't panic, just wait. If they don't fill in, try reloading the page)

A few months ago, I wrote about SAVI, the data resource available from the Polis Center at IUPUI. The SAVI data is the input for Indy Vitals. You can get a lot of information from Indy Vitals to compare the neighborhoods. You can get so much that it's probably impossible to actually conclude anything; it becomes a blur. That's what happened to me. Just to make sense of it, I did an analysis of a part of the IndyVitals data. Since then, IndyVitals has been updated, so I decided to take a look at the newer data, and a larger chunk of it. Like before, I cleaned, analyzed, and clustered the data into "similar" neighborhoods.

Clustering Results of Indy Vitals Data
[Interactive map legend, one color per cluster: IM, MB, CH, KC, SD, NC, BR]

Above is the neighborhood map, color-coded by cluster. You should be able to click the map to get more specific information about a neighborhood. Warning: the slideover will cover most of the map; I'm not a master of iframes. I did not outline the neighborhood borders, because I wanted to emphasize the great "sea of nothing much" in which various islands of more extreme situations float. Several neighborhoods are labeled, and more labels appear as you zoom in. Most of Marion County is "middling", which is what we would expect. A partial ring of unpleasantness surrounds the Downtown neighborhood, and the most comfortable neighborhoods are in the north. None of this is a surprise to anyone familiar with Marion County. Two big islands associated with generally more comfortable conditions sit in the southwest and southeast corners, and there are two very unfortunate-looking neighborhoods.

If you click any of the neighborhoods to get the pop-up, there are several numbers associated with it. These are what I used to create the clusters. I will get into how I created those numbers in a moment. Briefly, though, the numbers were compared as groups per neighborhood to generate "distances" among every neighborhood, and those distances were examined for where neighborhoods would "bunch up" as "clusters". My clustering method (more below) also calculated the "prototypes" of each cluster. A prototype is the neighborhood that most resembles the cluster as a whole. I named each cluster after its prototype, using initials to keep the names short. One more thing: the clusters are not the same size. The clusters at the "ends" are smaller than those in the middle. This is what you should expect from an "honest" clustering method applied to any complex data: extreme values should be rare. Never, never, never trust any kind of culture- or social-related analysis that splits cities, states, neighborhoods, or anything else up into nicely-spaced, evenly-sized groups.

[Tree diagram showing how the clusters IM, MB, CH, KC, SD, NC, and BR relate to each other]

I came up with seven total clusters that could be arranged on a gradient of "pleasantness". These clusters could be related to each other; the tree diagram on the right illustrates this. The worst cluster (IM) is off by itself, away from the rest of the county. The two most favored clusters (BR and NC) are also pretty isolated from the rest of the county. Remember, when I say "isolated", I don't mean physically distant; I mean that the neighborhood traits are pretty far from the rest of the county. The rest of Marion County clumps together, but there are still enough differences to split it into four more clusters. If you're interested in how I got my results and how individual neighborhoods might "stack up" without having to go through every one on the map, read on.

The Data

This time around, I got the data for 45 different traits for 99 different neighborhoods in Marion County. I looked at it in one table, and it still meant nothing at all to me. That's 4,455 points of data all at once, which is usually not very meaningful. There were a lot of things I could have done. The most popular approach (unfortunately) when one compares cities or neighborhoods is to turn each data variable into a "rank", then add them to produce a final "score" or "rank". I get why people do it. It's simple. It looks like you are "analyzing" the data. But when you do that, you are not analyzing the data. You're just smashing it together. The "rank and add" method ignores a lot of important things.

To interrupt myself, many data sets also need to be "cleaned" before being analyzed. This sometimes means you drop an entry because it is too incomplete. I dropped the "Airport" and "Park 100" neighborhoods from the analysis because of missing data. Sometimes, you might be able to get a substitute for missing data from another source. I used a real estate site to get property and violent crime per 1000 residents for the two Lawrence neighborhoods, Speedway, and Beech Grove. I give the specific site further down, if you want to check for yourself.

Anyway, back to what is wrong with rank-and-add. First, it ignores that data can be uneven. When you compare ranks, you throw away how far apart the things you're comparing actually are. For example, suppose you have "average niceness". Seven towns have "average niceness" values of 1, 4, 17, 92, 93, 94, and 1001. If you rank them, lowest to highest, you get 1, 2, 3, 4, 5, 6, 7. You don't have to be a math genius to see how that's a problem. There is no way that the distance between 93 and 94 is the same as the distance between 1 and 4 or between 94 and 1001. But when you use ranks, that's what happens.
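The "niceness" example above is easy to check by hand. Here is a small sketch (not from the original analysis) showing how ranking flattens the real gaps between values:

```python
import numpy as np
from scipy.stats import rankdata

# the seven made-up "average niceness" values from the text
niceness = np.array([1, 4, 17, 92, 93, 94, 1001])
ranks = rankdata(niceness)

print(np.diff(niceness))  # real gaps between neighbors: [3 13 75 1 1 907]
print(np.diff(ranks))     # gaps after ranking: all exactly 1
```

The 907-point jump from 94 to 1001 becomes indistinguishable from the 1-point step from 93 to 94 once everything is a rank.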

Second, it ignores that data can be redundant. The fact that you can count or measure five different things doesn't mean that each of those five different measurements make an equal contribution to an accurate overall picture. Some measurements will closely track others, because they both reflect a deeper underlying connection. In effect, if you just add the contributions of very closely-tracking variables, you're actually "double-counting" the single underlying effect.
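The double-counting problem is easy to demonstrate with simulated data. In this sketch (the variable names and numbers are made up, not from the IndyVitals data), one variable closely tracks another, so summing both effectively counts the same underlying effect twice:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate 200 "neighborhoods" where home value mostly just tracks income
income = rng.normal(50, 10, 200)
home_value = 3 * income + rng.normal(0, 2, 200)

# correlation near 1.0 means the two variables carry nearly the same information
r = np.corrcoef(income, home_value)[0, 1]
print(round(r, 3))
```

Adding ranks for both `income` and `home_value` to a composite score would give that one underlying trait double the weight of any genuinely independent variable.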

The technique of “exploratory factor analysis” (EFA) can handle these issues, if used correctly. What EFA does is look for how parts of a data set are related to each other and groups those parts together. This can be important because as a set of data gets larger, it is more likely that more and more categories will relate together or "co-relate". Oddly enough, when data co-relates, it's called "correlation"--really, that's all it means. This could be because there is some hidden “factor” that these data points describe. EFA allows for these factors to be guessed at in a reasonable fashion.
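To make the idea concrete, here is a toy sketch of factor recovery. This is not the author's pipeline (which used a robust correlation method, discussed below); it uses scikit-learn's plain maximum-likelihood `FactorAnalysis` on simulated data where two hidden factors drive six observed variables:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)

# two hidden factors generate six observed, correlated variables
factors = rng.normal(size=(300, 2))
true_loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                          [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
X = factors @ true_loadings.T + rng.normal(scale=0.3, size=(300, 6))

# EFA groups the co-relating variables back into two factors
fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(np.round(fa.components_, 2))  # each row is one recovered factor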

But, just to keep myself honest, correlation could mean nothing at all! How? The distance between North America and Europe and my waistline track each other very closely. They both increase by a small amount every year. That doesn't mean that one causes the other, or that either is caused by some underlying factor.

Clusters vs. Factors

This time, I ended up with five factors instead of three. I clustered the neighborhoods on their factor scores, using a different method that is less prone to making artificially even-sized clusters. When I looked at the clusters vs. the factors, I noticed that they actually allow the neighborhoods to be ranked into seven categories of what most people would consider desirability. The chart summarizes how the clusters relate to the factors. You will see exactly how I named the factors if you keep reading, but in a nutshell: Comfort is how comfortable a neighborhood appears; Difficulty is how common certain difficult or unpleasant individual life conditions are in that neighborhood; Deterioration is the physical state of the neighborhood's buildings; Crime is crime reports per population in that neighborhood; and Density is the population and building density, plus some measure of how convenient daily necessities are.

As you can see for yourself, Crime is the standout factor for cluster IM. Cluster MB has high Difficulty and very high Deterioration. Cluster CH has less Deterioration but nearly as much Difficulty as cluster MB. Cluster KC is middling; it doesn't have much of any of the factors. Cluster SD is somewhat improved on KC. It's not particularly comfortable, but at least it has lower Difficulty and less Deterioration and Crime. It's also the least dense of the clusters. Cluster NC is very comfortable. However, it is still edged out by cluster BR, primarily because cluster BR also has the lowest Difficulty. It also has the highest Density, which includes nearby availability of foodstuffs. Where did the clusters come from, and why seven? That is explained in the next section.

The Method, the Madness

This is where I explain how I got my numbers. The first thing that must be said is that these numbers only matter within Marion County. They were generated only using the IndyVitals data, so they can't be used to compare the neighborhoods to anything in Hamilton County, for example. I would have to find a comparable Hamilton County dataset and repeat the analysis with both datasets combined to create a two-county model.

I downloaded the data from IndyVitals. There was a lot more this time than before. A few categories had missing values that could be reasonably imputed or otherwise accounted for. By "otherwise accounted for", I mean deleting the entries for Airport and Park 100. I consider this acceptable for my purpose because those "neighborhoods" are more industrial districts than neighborhoods. After that, only two categories had missing values: "Violent Crime per 1000" and "Property Crime per 1000", which were missing for Speedway, Lawrence, and Lawrence-Fort Ben-Oaklandon. I took values from the Area Vibes web site. They are probably not as reliable as the IMPD numbers behind the rest of the neighborhoods, but probably not too far off the mark. That source only had one number covering both of the Lawrence-based neighborhoods, so I repeated it for each. This left me with a working data set of 97 neighborhoods and 45 variables (4,365 data points).

Some of the variables were problematic. First, two variables were identical: Tax Delinquent Properties and Tax Sale Properties. Every single point matched, perfectly. I took this to mean that they were actually the same variable, so I deleted one of the two. Second, two variables had a lot of zero values: Parcels with Greenway Access (54 out of 97) and Demolition Orders (72 out of 97). I could have deleted these, but there are ways to handle variables with lots of zeroes.
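The cleaning steps described above (dropping incomplete entries, substituting outside crime figures, and detecting the duplicated tax variable) can be sketched in pandas. This is an illustrative toy, not the author's actual code; the frame below stands in for the real IndyVitals download, and the numbers are made up:

```python
import numpy as np
import pandas as pd

# toy stand-in for the IndyVitals table (values are illustrative only)
df = pd.DataFrame({
    'Neighborhood': ['Airport', 'Speedway', 'Broad Ripple'],
    'Tax Delinquent Properties': [np.nan, 12, 5],
    'Tax Sale Properties': [np.nan, 12, 5],
    'Violent Crime per 1000': [np.nan, np.nan, 2.1],
})

# drop an entry that is too incomplete to keep
df = df[df['Neighborhood'] != 'Airport'].copy()

# fill a missing crime figure from an outside source
df.loc[df['Neighborhood'] == 'Speedway', 'Violent Crime per 1000'] = 3.4

# check whether two columns are perfect duplicates before deleting one
dup = df['Tax Delinquent Properties'].equals(df['Tax Sale Properties'])
print(dup)  # True -> the two columns carry the same variable
```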

My starting data set was 44 variables for 97 neighborhoods, with two variables needing special treatment. This special treatment was "jittering", where a very small value is added or subtracted at random from each value in a variable. This usually does not change the behavior of the variable but makes it possible to analyze by methods that can't handle large numbers of zeroes.
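Jittering is simple to implement. Here is a minimal sketch (the counts are made up, and the noise scale is an assumption; the point is only that the noise is far smaller than any real difference in the data):

```python
import numpy as np

rng = np.random.default_rng(42)

def jitter(x, scale=1e-6):
    """Add a tiny random value to each entry so zero-heavy variables
    can be used by methods that choke on large numbers of exact zeroes."""
    return x + rng.uniform(-scale, scale, size=len(x))

# made-up counts standing in for a variable like Demolition Orders
demolition_orders = np.array([0, 0, 0, 3, 0, 7, 0])
print(jitter(demolition_orders))
```

The jittered values are within one millionth of the originals, so the variable's behavior is essentially unchanged, but no two entries are exactly equal and none are exactly zero.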

As before, I used exploratory factor analysis (EFA) to try to make sense of the data. EFA is based on correlation, and a major assumption behind ordinary correlation is that the data is "normally distributed". I checked the data with a normality test; it was not normally distributed, so ordinary correlation would not give a realistic basis for analysis. As before, I ended up building the correlation matrix with a robust method called "Orthogonalized Gnanadesikan-Kettenring". For most people, that will mean nothing, of course, but anyone who wants to check my work would want to know it.
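The normality check can be sketched with a Shapiro-Wilk test. The data below is simulated (a skewed, crime-rate-like variable for 97 fake neighborhoods), not the real IndyVitals data; note also that the Orthogonalized Gnanadesikan-Kettenring estimator itself is most readily available in R (the robustbase package), not in the Python scientific stack:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(7)

# a skewed stand-in variable for 97 neighborhoods (crime rates are rarely normal)
skewed = rng.exponential(scale=5.0, size=97)

stat, p = shapiro(skewed)
print(p < 0.05)  # True -> reject normality; plain Pearson correlation is suspect
```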

Input                                       Comfort  Difficulty  Deterioration  Crime Risk  Density
Per Capita Income                             0.901
Median Age                                    0.854
Associate or Higher Degree                    0.704
Median Assessed Value                         0.684
Tree Cover                                    0.652
Employment Density                            0.608
Without Health Insurance                     -0.550
Median Household Income                       0.519
Poverty Rate                                 -0.482
Births with First Trimester Prenatal Care     0.446
Labor Force Involvement                               -0.832
Population with Disability                             0.642
Housing Cost Burden                                    0.516
Mowing Orders                                                      0.914
Boarding Orders                                                    0.910
Tax Delinquent Properties                                          0.857
Trash Orders                                                       0.849
Surplus Properties                                                 0.772
Unemployment                                                       0.535
Property Crimes per 1000                                                          0.911
Violent Crimes per 1000                                                           0.784
Resident Employment in Neighborhood                                              -0.674
Housing Density                                                                               0.947
Income Density                                                                                0.891
Pop Density                                                                                   0.787
Land Value Density                                                                            0.697
Food Access                                                                                   0.653
Permeable Surface Area                                                                       -0.635
Walk Score                                                                                    0.562

Parallel analysis initially suggested 7 factors. When I looked at the factors, I noticed that some of the input variables had very low "loadings". A loading is a measure of how much a variable contributes to a factor. A single low loading is not a problem by itself, but if a variable has low loadings on all the factors, its influence is smeared thinly across them and it does not make a good contribution to the analysis. A common cut-off is an absolute loading of 0.4. Therefore, if a variable had no loading with an absolute value of at least 0.4 and its "communality" was below 0.6, I deleted it from the EFA and repeated the process, starting from re-calculating the correlation matrix. I repeated this until every remaining variable had at least one loading with an absolute value of 0.4 or more, or a communality of at least 0.6. This produced an EFA outcome with five factors (the table).
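The deletion criterion can be sketched as a small filter over a loadings matrix. The numbers below are illustrative toys, not the article's loadings, and the communalities are derived from the toy loadings just for the demonstration:

```python
import numpy as np

def weak_variables(loadings, communalities, load_cut=0.4, comm_cut=0.6):
    """Flag variables with no loading of at least |0.4| AND a
    communality below 0.6 -- the deletion rule described in the text."""
    max_abs = np.max(np.abs(loadings), axis=1)
    return (max_abs < load_cut) & (communalities < comm_cut)

# toy loadings: 3 variables x 2 factors (illustrative numbers only)
L = np.array([[0.85, 0.05],
              [0.30, 0.25],   # weak on every factor
              [0.10, 0.72]])
h2 = np.sum(L**2, axis=1)     # communality = sum of squared loadings

print(weak_variables(L, h2))  # only the middle variable gets flagged
```

In the real pipeline, each flagged variable is dropped and the whole EFA is re-run from the correlation matrix, repeating until nothing is flagged.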

The table describes how strongly the factors relate to the variables. The numbers in the tables are the “loadings”. I used these loadings to guide how I named the factors. The first factor was a combination of better income and education, more trees, lower poverty, better prenatal care, etc. It made sense to call this factor Comfort, since places with such features are probably more comfortable places to live. The second factor combined a high proportion of handicapped residents and housing cost burden with low labor force participation. It made sense to call this Difficulty, since people with those traits probably have a difficult time getting by. The third factor was all negative property-related variables, plus unemployment. Since it was mostly property traits that people wouldn't want in their neighborhoods, I called it "Deterioration". The "Crime Risk" factor corresponded to higher rates of property and violent crimes, along with high unemployment. "Crime Risk" was a good name. The final factor amounted to overall "Density", since it was four "Density" factors along with two measures that amounted to "lots of stores nearby".

I used the factors to produce factor scores and the factor scores to produce clusters. This time around, I used "minimax hierarchical clustering". It is not uncommon for factors in EFA to be correlated with each other; this is not a flaw in the EFA result, but it does matter when estimating how "distant" each neighborhood is from every other in terms of factor scores. For this, I calculated pairwise "Mahalanobis distances". While somewhat tricky to calculate, Mahalanobis distances take these correlations into account to produce a more realistic description of the data. I then ran minimax hierarchical clustering on these distances. Like all hierarchical clustering methods, it creates clusters but doesn't tell you the optimal number of them. This time, I computed clustering sums of squares for successive numbers of clusters and used the number that produced an "elbow". That turned out to be seven clusters. How do the clusters relate to factor scores? Since I had five final factors, I couldn't chart them all at once. However, when I looked at the factor scores vs. cluster assignments, three of the five appeared to contribute more to the clustering than the other two. I built a rotatable chart that plots those three factors by cluster: "x" is Comfort, "y" is Difficulty, and "z" is Deterioration. If you click and drag on it, you can rotate the chart. Points are colored by cluster, using colors similar to those on the map. You will notice that IM is not nicely separated in the 3D chart. That is because it is set apart by Crime, the fourth factor.
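The distance-then-cluster pipeline can be sketched in Python. Two caveats: the factor scores below are random stand-ins, not the real ones, and minimax linkage itself is not in scipy (it lives in R's protoclust package), so this sketch substitutes complete linkage to show the overall shape of the computation:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)

# stand-in for 97 neighborhoods x 5 factor scores
scores = rng.normal(size=(97, 5))

# pairwise Mahalanobis distances; scipy infers the inverse covariance
# from the data, so correlated factors are accounted for
D = pdist(scores, metric='mahalanobis')

# hierarchical clustering on those distances, then cut the tree at 7 clusters
# (complete linkage here as a stand-in for minimax linkage)
Z = linkage(D, method='complete')
labels = fcluster(Z, t=7, criterion='maxclust')
print(len(np.unique(labels)))  # 7
```

In the real analysis, the cut at seven came from an elbow in the clustering sums of squares rather than being chosen up front.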

But how do the neighborhoods RANK?

I am sure that some people have come all the way to the bottom to find the "ranks" of each individual neighborhood. This is flatly wrong-headed, and I already explained why. That being said, if you like, you can download a table of the neighborhoods that shows cluster and factor scores and create your own ranks.