March 4, 2019

Making America Tax Greatly Again?

Taxes and Greatness

So, when was America great before? That's been cleared up. It was the late 1940s and the 1950s. Whether or not one agrees with that assessment, at least that's the parameter that is proposed. Let's go with it. So, if we are to talk about "greatness" for America, we'd also need to consider how America's government ran things, and one thing that was very different from the current day is the structure of income taxes.

It just so happens that the people at the Tax Foundation have published all US tax brackets from 1913 to 2013, and those from 2014 to 2018 can be looked up from multiple sources. We can use this along with consumer price index information from the same years to literally draw a picture of total federal income tax load by income for the years 1913-2018.

What I did was to back-adjust incomes of $25k, $50k, $100k, $250k, $1M, $5M, $10M, and $50M to the equivalent values of a prior year. Then I calculated total federal income taxes to be paid (as a percentage of income) for that year and adjusted income, including the standardized exemption. I used the rates and exemptions for married couples filing jointly, no children. If someone wants to pay me my consulting rate, I'll be happy to make it more complicated.
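The calculation described above can be sketched roughly as follows. This is a minimal illustration, not the spreadsheet actually used: the bracket schedule and exemption are made-up placeholders (not a real year from the Tax Foundation tables), and the only real numbers are the post's own 1955 figures ($100,000 in 2018 dollars against $10,665 in 1955 dollars).

```python
def back_adjust(income_2018, cpi_ratio):
    """Convert a 2018-dollar income into the nominal dollars of an earlier year."""
    return income_2018 * cpi_ratio


def effective_rate(income, exemption, brackets):
    """Total tax as a fraction of income under a progressive schedule.

    brackets: list of (lower_bound, marginal_rate) on taxable income,
    sorted by lower_bound. Each rate applies only to the slice of income
    that falls inside its bracket.
    """
    taxable = max(0.0, income - exemption)
    tax = 0.0
    for i, (lower, rate) in enumerate(brackets):
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else float("inf")
        if taxable > lower:
            tax += (min(taxable, upper) - lower) * rate
    return tax / income if income > 0 else 0.0


# HYPOTHETICAL schedule and exemption, for illustration only.
toy_brackets = [(0, 0.10), (4_000, 0.20), (8_000, 0.30)]
toy_exemption = 1_000

# Ratio implied by the post's figures: $100,000 (2018) ~ $10,665 (1955).
nominal_1955 = back_adjust(100_000, 10_665 / 100_000)
rate = effective_rate(nominal_1955, toy_exemption, toy_brackets)
print(f"nominal income: {nominal_1955:.0f}, effective rate: {rate:.1%}")
```

Note that the effective rate comes out well below the top marginal rate in the toy schedule, which is the distinction between "total final taxes" and "highest marginal rate".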

Federal Income Tax as Percent of Total Income

First, remember that this is on CPI-adjusted income (that is, "2018 dollars"), not nominal income for a year. For example, what would be calculated on $100,000 for 2018 was calculated on $10,665 for 1955 tax percentages. Also, this is total final taxes, not just repeating the highest marginal rate for that income. So, what you will notice is that taxes on the wealthy were far higher during the era that America was "great" than they are in the present day. Indeed, if one takes the timeline of America's "decline" to begin in the 1960s and get ever worse as time goes on, then as taxes on the wealthiest declined, so did America.

What is the take-home on this? I'm not claiming that high taxes on the wealthiest make everything "great". I'm also not saying that low tax rates on the wealthiest lead to "decline". If I have a take-home, it's that "greatness" as a package needs to look at all aspects of that package, and if one wants to replicate a long-lost supposed "golden age", one has to look at all aspects of that era and then decide if the costs are worth the alleged benefits.

March 5, 2018

Sample Maps for Internet Privacy Index

Internet Index Internet Grade
Government Index Government Grade
Combined Clustering

October 10, 2017

Comparing Indianapolis Neighborhoods

UPDATE: As of April 2, 2019, the underlying data has not been updated for this post, so it's as current as it can be.

(There is a map and a chart that may take a little time to load. If you see blank spots, don't panic, just wait. If they don't fill in, try reloading the page.)

A few months ago, I wrote about the data available from the Polis Center at IUPUI, SAVI. The SAVI data is the input for Indy Vitals. You can get a lot of information from Indy Vitals to compare the neighborhoods. You can get so much that it's probably impossible to actually conclude anything. It becomes a blur. That's what happened to me. Just to make sense of it I did an analysis of a part of the IndyVitals data. Since then, IndyVitals has been updated, so I decided to take a look at the newer data and a larger chunk of it. Like before, I cleaned, analyzed, and clustered the data into similar neighborhoods.

Clustering Results of Indy Vitals Data

On the left is the neighborhood map, color coded to cluster. You should be able to click a neighborhood on the map to make more specific information about it slide over. Warning, the slideover will cover most of the map. I'm not a master of iframes. I do not have the neighborhood borders outlined, because I wanted to emphasize the great "sea of nothing much" in which various islands of more extreme situations float. Several neighborhoods are labeled. If you zoom in, more labels will appear. Most of Marion County is "middling", which is what we would expect. A partial ring of unpleasantness surrounds the Downtown neighborhood, and the most comfortable neighborhoods are in the north. None of this is a surprise to anyone who is familiar with Marion County. Two big islands associated with more generally comfortable conditions are in the southwest and southeast corners, and there are two very unfortunate-looking neighborhoods.

If you click any of the neighborhoods to get the pop-up, there are several numbers associated with it. These are what I used to create the clusters. I will get into how I created those numbers in a moment. Briefly, though, the numbers were compared as groups per neighborhood to generate "distances" among every neighborhood, and those distances were examined for where neighborhoods would "bunch up" as "clusters". My clustering method (more below) also calculated the "prototypes" of each cluster. A prototype would be the neighborhood that most resembled the cluster as a whole. I named each cluster for the prototypes. I used initials to keep the names shorter. Also, one more thing is that the clusters are not the same size. The clusters at the "ends" are smaller than those in the middle. This is what you should expect from an "honest" clustering method of any complex data. Extreme values should be rare. Never, never, never trust any kind of culture- or social-related thing that splits cities, states, neighborhoods, or anything up into nicely-spaced, evenly-sized groups.
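To make the idea of a "prototype" concrete, here is a minimal sketch. It assumes the usual minimax-linkage definition, where the prototype is the member whose farthest fellow member is as close as possible. The three neighborhoods and their distances are hypothetical, not taken from the IndyVitals analysis.

```python
def minimax_prototype(members, dist):
    """Return the member with the smallest maximum distance to the other members."""
    return min(members, key=lambda c: max(dist[c][o] for o in members))


# HYPOTHETICAL pairwise distances among three made-up neighborhoods.
dist = {
    "A": {"A": 0.0, "B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "B": 0.0, "C": 2.0},
    "C": {"A": 4.0, "B": 2.0, "C": 0.0},
}

prototype = minimax_prototype(["A", "B", "C"], dist)
print(prototype)
```

Here "B" wins because nothing in the cluster is more than 2.0 away from it, while "A" and "C" are each 4.0 away from something.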

[Tree diagram of the seven clusters: IM, MB, CH, KC, SD, NC, BR]

I came up with seven total clusters that could be arranged on a gradient of "pleasantness". These clusters could be related to each other. The tree diagram on the right illustrates this. The worst cluster (IM) is off by itself, away from the rest of the county. The two most favored clusters (BR and NC) are also pretty isolated from the rest of the county. Remember, when I say "isolated", I don't mean physically distant, I mean that the neighborhood traits are pretty far from the rest of the county, not the location. The rest of Marion County clumps together, but there are still enough differences to split into four more clusters. If you're interested in how I got my results and how individual neighborhoods might "stack up" without having to go through every one on the map, read on.

The Data

This time around, I got the data for 45 different traits for 99 different neighborhoods in Marion County. I looked at it in one table, and it still meant nothing at all to me. It was 4455 points of data, all at once, and that much at a glance is usually not very meaningful. There were a lot of things I could have done. The most popular (unfortunately) when one compares cities or neighborhoods is to turn each data variable into a "rank", then add them to produce a final "score" or "rank". I get why people do it. It's simple. It looks like you are "analyzing" the data. But when you do that, you are not analyzing the data. You're just smashing it together. The "rank and add" method ignores a lot of important things.

To interrupt myself, many data sets also need to be "cleaned" before being analyzed. This sometimes means you drop an entry because it is too incomplete. I dropped the "Airport" and "Park 100" neighborhoods from the analysis because of missing data. Sometimes, you might be able to get a substitute for missing data from another source. I used a real estate site to get property and violent crime per 1000 residents for the two Lawrence neighborhoods, Speedway, and Beech Grove. I do give the specific site further down, if you want to check for yourself.

Anyway, back to what is wrong with rank-and-add. First, it ignores that data can be uneven. When you compare ranks, you have taken out how far apart each thing you're comparing is. For example, suppose you have "average niceness". Seven towns have "average niceness" values of 1, 4, 17, 92, 93, 94, and 1001. If you rank them, lowest to highest, you get 1, 2, 3, 4, 5, 6, 7. You don't have to be a math genius to see how that's a problem. There is no way that the distance between 93 and 94 is the same as the distance between 17 and 92 or 94 and 1001. But when you use ranks, that's what happens.
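The "average niceness" example can be checked in a few lines. This sketch uses only the seven values from the paragraph above and compares the gaps between neighboring towns before and after ranking.

```python
values = [1, 4, 17, 92, 93, 94, 1001]

# Rank from lowest (1) to highest (7); the values have no ties.
ordered = sorted(values)
ranks = [ordered.index(v) + 1 for v in values]

# Gaps between each town and the next one up the list.
value_gaps = [b - a for a, b in zip(values, values[1:])]
rank_gaps = [b - a for a, b in zip(ranks, ranks[1:])]

print(value_gaps)  # [3, 13, 75, 1, 1, 907]
print(rank_gaps)   # [1, 1, 1, 1, 1, 1]
```

Every gap becomes exactly 1 once the values are ranked, so a jump of 907 and a jump of 1 count the same in any score built from the ranks.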

Second, it ignores that data can be redundant. The fact that you can count or measure five different things doesn't mean that each of those five different measurements makes an equal contribution to an accurate overall picture. Some measurements will closely track others, because they both reflect a deeper underlying connection. In effect, if you just add the contributions of very closely-tracking variables, you're actually "double-counting" the single underlying effect.

The technique of “exploratory factor analysis” (EFA) can handle these issues, if used correctly. What EFA does is look for how parts of a data set are related to each other and groups those parts together. This can be important because as a set of data gets larger, it is more likely that more and more categories will relate together or "co-relate". Oddly enough, when data co-relates, it's called "correlation"--really, that's all it means. This could be because there is some hidden “factor” that these data points describe. EFA allows for these factors to be guessed at in a reasonable fashion.

But, just to keep myself honest, correlation could mean nothing at all! How? The distance between North America and Europe and my waistline track each other very closely. They both increase by a small amount every year. That doesn't mean that one causes the other or that either is caused by some underlying factor.
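A quick sketch of why two unrelated things that both drift upward will correlate strongly. The two series below are invented purely for illustration (they are not measurements of an ocean or of anyone's waistline); the only point is that a shared trend over time is enough to produce a correlation close to 1.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


years = range(10)

# HYPOTHETICAL series: both creep up a little each year, nothing links them.
series_a = [2.5 * t for t in years]
wiggle = [0.10, -0.10, 0.05, -0.05, 0.00, 0.10, -0.10, 0.05, -0.05, 0.00]
series_b = [34.0 + 0.3 * t + w for t, w in zip(years, wiggle)]

r = pearson(series_a, series_b)
print(round(r, 3))
```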

Clusters vs. Factors

This time, I ended up with five factors instead of three. I clustered the factors, but I used a different method that is less prone to making artificially even-sized clusters. When I looked at the clusters vs. the factors, I noticed that they actually could allow for the neighborhoods to be ranked into seven categories of what most people would consider desirability. The chart summarizes how the clusters relate to the factors. You will understand exactly how I named the factors if you keep reading, but in a nutshell, Comfort is how comfortable a neighborhood appears. Difficulty is how common certain other difficult or unpleasant individual life conditions are in that neighborhood. Deterioration is the physical state of the neighborhood's buildings. Crime is crime reports per population in that neighborhood. Density is the population and building density and some measure of how convenient daily necessities are.

As you can see for yourself, Crime is the standout factor for cluster IM. Cluster MB has high Difficulty and very high Deterioration. Cluster CH has less Deterioration but nearly as much Difficulty as cluster MB. Cluster KC is middling. It doesn't have much of any of the factors. Cluster SD is somewhat improved on KC. It's not particularly comfortable, but at least it has lower Difficulty and less Deterioration and Crime. It's also the least dense of the clusters. Cluster NC is very comfortable. However, it is still beaten out by cluster BR, primarily because cluster BR also has the lowest Difficulty. It also has the highest Density, which includes nearby availability of foodstuffs. Where did the clusters come from, and why seven? That is explained in the next section.

The Method, the Madness

This is where I explain how I got my numbers. The first thing that must be said is that these numbers only matter within Marion County. They were generated only using the IndyVitals data, so they can't be used to compare the neighborhoods to anything in Hamilton County, for example. I would have to find a comparable Hamilton County dataset and repeat the analysis with both datasets combined to create a two-county model.

I downloaded the data from IndyVitals. There was a lot more this time than before. A few categories had missing values that could be reasonably imputed or otherwise accounted for. By “otherwise accounted for”, I mean deleting the entries for Airport and Park 100. I consider this acceptable for my purpose because those “neighborhoods” are far more industrial districts than neighborhoods. After I did this, only two categories had missing values: “Violent Crime per 1000” and “Property Crime per 1000”, which were missing for Speedway, Lawrence, and Lawrence-Fort Ben-Oaklandon. I took values from the Area Vibes web site. They are probably not as reliable as those from IMPD for the rest of the neighborhoods but probably not too far off the mark. That source only had one number for both of the Lawrence-based neighborhoods, so I repeated it for them. This left me with a working data set of 97 neighborhoods and 45 variables (4,365 data points).

Some of the variables were problematic. First, there were two variables that were identical. These were Tax Delinquent Properties and Tax Sale Properties. Every single point matched, perfectly. I took this to mean that they were actually the same variable, so I deleted one of the two. Second, two variables had a lot of zero values. These were Parcels with Greenway Access (54 out of 97) and Demolition Orders (72 out of 97). I could have deleted these, but there are ways to handle variables with lots of zeroes.
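The two checks described above (duplicate variables and variables dominated by zeroes) are easy to sketch. The tiny table below is a made-up stand-in with four rows, not the real IndyVitals download, and the helper names are mine.

```python
def identical_pairs(table):
    """Return pairs of column names whose values match at every row."""
    names = list(table)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if table[a] == table[b]
    ]


def zero_counts(table):
    """Count how many entries in each column are exactly zero."""
    return {name: sum(1 for v in col if v == 0) for name, col in table.items()}


# HYPOTHETICAL values; only the column names come from the post.
toy = {
    "Tax Delinquent Properties": [3, 0, 7, 2],
    "Tax Sale Properties": [3, 0, 7, 2],
    "Demolition Orders": [0, 0, 1, 0],
}

duplicates = identical_pairs(toy)
zeros = zero_counts(toy)
print(duplicates)
print(zeros)
```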

My starting data set was 44 variables for 97 neighborhoods, with two variables needing special treatment. This special treatment was "jittering", where a very small value is added or subtracted at random from each value in a variable. This usually does not change the behavior of the variable but makes it possible to analyze by methods that can't handle large numbers of zeroes.
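A minimal sketch of jittering, assuming uniform noise with a small fixed width. The post does not say which noise distribution or scale was used, so both the `scale` default and the uniform draw are assumptions made for illustration.

```python
import random


def jitter(values, scale=1e-6, seed=0):
    """Add a tiny random offset in [-scale, +scale] to every value."""
    rng = random.Random(seed)  # seeded so the result is repeatable
    return [v + rng.uniform(-scale, scale) for v in values]


# HYPOTHETICAL column with many zeroes.
demolition_orders = [0, 0, 0, 5, 0, 2]
jittered = jitter(demolition_orders)
print(jittered)
```

The zeroes are no longer exactly tied, but every value is still within a millionth of where it started, so the overall behavior of the variable is essentially unchanged.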

As before, I used exploratory factor analysis (EFA) to try to make sense of the data. It is based on correlation. A major assumption that correlation makes is that the data is “normally distributed”. I checked this data with a utility to test this. It was not normally distributed. Ordinary correlation would not give a realistic basis for analysis. So, as before, I ended up using a method called “Orthogonalized Gnanadesikan-Kettenring”. For most people, that will mean nothing, of course, but anyone who wants to check my work would want to know it.

Input                                       Comfort   Difficulty   Deterioration   Crime Risk   Density
Per Capita Income                           0.901
Median Age                                  0.854
Associate or Higher Degree                  0.704
Median Assessed Value                       0.684
Tree Cover                                  0.652
Employment Density                          0.608
Without Health Insurance                   -0.550
Median Household Income                     0.519
Poverty Rate                               -0.482
Births with First Trimester Prenatal Care   0.446
Labor Force Involvement                              -0.832
Population with Disability                            0.642
Housing Cost Burden                                   0.516
Mowing Orders                                                      0.914
Boarding Orders                                                    0.910
Tax Delinquent Properties                                          0.857
Trash Orders                                                       0.849
Surplus Properties                                                 0.772
Property Crimes per 1000                                                           0.911
Violent Crimes per 1000                                                            0.784
Resident Employment in Neighborhood                                               -0.674
Housing Density                                                                                 0.947
Income Density                                                                                  0.891
Pop Density                                                                                     0.787
Land Value Density                                                                              0.697
Food Access                                                                                     0.653
Permeable Surface Area                                                                         -0.635
Walk Score                                                                                      0.562

Parallel analysis suggested 7 factors. When I looked at the factors, I noticed that some of the input variables had very low "loadings". A loading is a measure of how much a variable contributes to a factor. By itself a single low loading is not a problem, but if a variable has low loadings on all the factors, that means that its influence is very mixed among the factors and it does not make a good contribution to the analysis. A common cut-off is an absolute value of 0.4. Therefore, if a variable had no loading with an absolute value of at least 0.4 and also had a "communality" of less than 0.6, I deleted it from the EFA and repeated the process, starting from re-calculating a correlation matrix. I repeated this until every remaining variable had at least one loading with an absolute value of 0.4 or more, or a communality of at least 0.6. This produced an EFA outcome with five factors (the table).
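The screening rule can be sketched as a small function. Two assumptions are baked in: communality is computed as the sum of squared loadings (which holds for uncorrelated factors and is only an approximation otherwise), and the loadings below are invented examples rather than values from the real analysis.

```python
def communality(loadings):
    """Sum of squared loadings (exact only for uncorrelated factors)."""
    return sum(value * value for value in loadings)


def should_drop(loadings, loading_cut=0.4, communality_cut=0.6):
    """Drop a variable only if it fails BOTH the loading and communality tests."""
    no_strong_loading = all(abs(value) < loading_cut for value in loadings)
    return no_strong_loading and communality(loadings) < communality_cut


# HYPOTHETICAL loadings on five factors.
candidates = {
    "strong_on_one": [0.90, 0.10, 0.05, 0.00, 0.10],
    "weak_everywhere": [0.30, 0.25, 0.20, 0.10, 0.15],
    "spread_but_explained": [0.39, 0.39, 0.39, 0.39, 0.10],
}

dropped = [name for name, row in candidates.items() if should_drop(row)]
print(dropped)
```

The third variable shows why both tests matter: it has no single loading of 0.4, but the factors together still account for enough of it that it stays in.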

The table describes how strongly the factors relate to the variables. The numbers in the table are the “loadings”. I used these loadings to guide how I named the factors. The first factor was a combination of better income and education, more trees, lower poverty, better prenatal care, etc. It made sense to call this factor Comfort, since places with such features are probably more comfortable places to live. The second factor combined a high proportion of handicapped residents and housing cost burden with low labor force participation. It made sense to call this Difficulty, since people with those traits probably have a difficult time getting by. The third factor was all negative property-related variables, plus unemployment. Since it was mostly property traits that people wouldn't want in their neighborhoods, I called it "Deterioration". The "Crime Risk" factor corresponded to higher rates of property and violent crimes, along with high unemployment. "Crime Risk" was a good name. The final factor amounted to overall "Density", since it was four "Density" variables along with two measures that amounted to "lots of stores nearby".

I used the factors to produce factor scores and the factor scores to produce clusters. This time around, I used "minimax hierarchical clustering". To get the clustering, I first had to estimate how "distant" each neighborhood was from every other in terms of factor scores. It is not uncommon for factors in EFA to be correlated with each other. This is not a flaw in the EFA result, but the distance measure has to account for it. For this, I calculated pairwise “Mahalanobis distances”. While somewhat tricky to calculate, Mahalanobis distances take these correlations into account to produce a more realistic description of the data. Then I did the cluster analysis on these distances.

Like all clustering methods, minimax hierarchical clustering creates clusters but doesn't tell you the optimal number of them. This time, I computed clustering sums of squares for successive numbers of clusters and used the number that produced an "elbow". This turned out to be seven clusters.

How do the clusters relate to factor scores? Since I had five final factors, I couldn't really chart them. However, if I looked at the factor scores vs. cluster assignments, it appeared that three of the five had a larger contribution, overall, to the clustering, than the other two. I built a rotatable chart that plotted these three factors vs. cluster. "x" is Comfort; "y" is Difficulty, and "z" is Deterioration. If you click and drag on it, you can rotate the chart. Points are colored by cluster, using colors similar to those for the map. You will notice that IM is not nicely separated in the 3D chart. This is because it is set apart by Crime levels, which are the fourth factor.
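To show what "taking correlations into account" means for a distance, here is a two-dimensional sketch of the Mahalanobis distance. The real analysis used five factor scores and an estimated covariance matrix; the 2x2 covariance and the points below are hypothetical and chosen only to make the effect visible.

```python
import math


def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between two 2-D points for a given 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out by hand.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = p[0] - q[0], p[1] - q[1]
    squared = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(squared)


# HYPOTHETICAL covariance: two factor scores correlated at 0.8.
cov = [[1.0, 0.8], [0.8, 1.0]]
origin = (0.0, 0.0)

along = mahalanobis_2d(origin, (1.0, 1.0), cov)    # moves with the correlation
across = mahalanobis_2d(origin, (1.0, -1.0), cov)  # moves against it
print(round(along, 3), round(across, 3))
```

Both points are the same ordinary (Euclidean) distance from the origin, but the one that runs against the correlation is treated as much farther away, because that combination of scores is much more unusual.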

But how do the neighborhoods RANK?

I am sure that some people have come all the way to the bottom to find the "ranks" of each individual neighborhood. This is flatly wrong-headed, and I already explained why. That being said, if you like, you can download a table of the neighborhoods that shows cluster and factor scores and create your own ranks.

August 30, 2017

Brief analysis of effect of right-to-work laws on per-person real income availability

This is mostly data and accompanying analysis. If you are a casual reader, I apologize for the complete lack of background or any real readability. It may be of use to people who are already familiar with the issue. I asked the question "What is the effect of having right-to-work laws on a 'meaningful measure' of income?" First, a look at those states with right-to-work laws as of 2015:

Incidence of Right to Work laws, 2015
Blue: No RTW; Red: RTW

The first question unanswered is what would constitute a "meaningful measure". I began with by-state median income. I chose median instead of mean because median is a far more robust estimator of the central tendency than is mean. Unfortunately, my limited access to data (American FactFinder) meant that I could only get state median incomes for "households" or "families". Since many households are non-family households, I chose median by household. I also downloaded average sizes of household by state. I then obtained the "Implicit Regional Price Deflator" (by state), or IRPD, from the BEA. This combines differences in cost of living by state with inflation per year. It can be used to give a state-adjusted, real-dollar estimate of income. This was only available for the years 2008-2015, which limited my analysis to those years. I finally downloaded total civilian full-time employed and total military employed, both by state. I divided this by the total population of a state for a year. I did not restrict this to "workforce", since children have to be supported, too, even if they are not in the workforce. Each state's status as right-to-work or not was coded as an ordinal variable by year. The basic data set is available for you to check yourself.

Right-to-Work model coefficients
Right to Work†: -0.1504 (+0.1052/-0.0570)*
Right to Work × Year: 0.0002 (+0.0166/-0.0103)
* Factor is significant at p ≤ 0.05 by nonparametric bootstrap.
† Estimate corresponds to state having right-to-work law.

From these numbers, I created my "metric": (((Median Income)/(IRPD/100))/Average Household Size)*(Employment Percent). I call it "Effective Income per Person". I modeled this Metric using generalized linear mixed models. State was the grouping factor for random effects. Sum contrasts were used. The fixed portion was "Metric ~ RTW + Year + RTW*Year". For calculation purposes, year was divided by the standard deviation of all years in the data set. Different error structures were compared by second-order Akaike Information Criterion (AICc). The compared models used Gaussian, gamma, and inverse Gaussian distributions, with identity, inverse, and log link functions. Of these, many did not converge. Of those that converged, the lowest AICc belonged to the model with a gamma distribution and log link. The next-nearest model had a gamma distribution and identity link. Δ AICc was greater than 6.9, indicating very strong evidence to favor the first model over all other models that converged. The model was evaluated by stratified non-parametric bootstrap, with "state" as the stratifying feature.
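The metric itself is simple arithmetic, so a sketch may help. The function follows the formula exactly as written above; the input values are hypothetical round numbers, and I am assuming "Employment Percent" is the employed count divided by total population (a fraction between 0 and 1), as the earlier description suggests.

```python
def effective_income_per_person(median_income, irpd, avg_household_size, employment_fraction):
    """((Median Income / (IRPD / 100)) / Average Household Size) * Employment Percent."""
    real_income = median_income / (irpd / 100.0)   # adjust for state prices and inflation
    per_person = real_income / avg_household_size  # spread over household members
    return per_person * employment_fraction        # scale by share of population employed


# HYPOTHETICAL state-year: $50,000 median household income, IRPD of 100,
# 2.5 people per household, 40% of the total population employed.
example = effective_income_per_person(50_000, 100.0, 2.5, 0.40)
print(round(example, 2))
```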

Difference between RTW and non-RTW states

Since this had a log link, the estimate for "Right to Work" means that, on average, a right-to-work state could be expected to have a 15% lower effective income per person. I bootstrapped the estimated average effective income for RTW and non-RTW states for each year and subtracted the RTW average from the non-RTW average. Adjusted for multiple comparisons, the 95% confidence intervals show that the difference was significant for all years examined, as the chart shows. In addition, overall effective income per person dropped by roughly 1% every six months, regardless of right-to-work status. There was no significant interaction between right-to-work and year, meaning the difference due to right-to-work remained constant.
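For readers who want to turn a log-link coefficient into a percentage themselves, here is a minimal sketch. It assumes the coefficient can be read as a plain log-ratio between RTW and non-RTW states; with sum contrasts the exact reading depends on how the levels were coded, which is not spelled out here, so treat the output as illustrative.

```python
import math

rtw_coefficient = -0.1504  # estimate from the coefficient table above

ratio = math.exp(rtw_coefficient)        # multiplicative effect on the metric
percent_change = (ratio - 1.0) * 100.0   # same effect expressed as a percentage

print(round(ratio, 4), round(percent_change, 2))
```

For small coefficients the log-scale value and the percent change are close, which is why a coefficient of about -0.15 is commonly read as "roughly 15% lower"; the exact conversion under this assumption comes out nearer 14%.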

I glossed over using a mixed (or multilevel) model to reach my results. I chose such a model for two reasons. First, this was repeated-measures data. The same states were "measured" each year. That means we can presume that the data within each state will be correlated with data for other years from the same state. Second, as has been noted in other analyses of RTW laws, individual state effects may play large roles that could mask overall RTW effects. The mixed model allows one to account for both within-state correlations and individual state effects. What it does not let us do, with the data on hand, is actually identify those individual state effects. That is, we can estimate how large the effects are but not what they are. It's like measuring a hole without knowing what actually made it. You don't need to know how a hole was made to measure how wide and deep it is. I will present those "random effects" in a later post.

An alternate model

After getting snark from someone who believes that a "differences in differences" model magically establishes "causation" better than does a mixed-level glm (Free clue: Neither type of model actually establishes causation.), I ran the magical DID on my data. My results:

DID model coefficients
Right to Work†: -0.956.61 (+412.31/-399.59)*
* Factor is significant at p ≤ 0.05 by nonparametric bootstrap.
† Estimate corresponds to state having right-to-work law.

Now, what does this mean? It will make more sense if you understand that "DID" is actually the same thing as interaction between Right-to-Work and year. The only difference is that "Year" has been coded as a 0/1 variable instead of specific years. The cutoff was 2012, which was the only year in which some states swapped from not having RTW to having RTW. While the values of the coefficients are different, the result is the same. DID analysis indicates that, overall, non-RTW states had a higher per-person adjusted income and that imposing RTW did not significantly alter this.
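The "difference in differences" idea reduces to four averages, which this sketch makes explicit. The numbers are hypothetical and were picked to mimic the pattern described above (a gap in levels, no change in the gap); they are not the actual group means from the data.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    return (treated_after - treated_before) - (control_after - control_before)


# HYPOTHETICAL group means of the metric, before and after the 2012 cutoff.
switched_before, switched_after = 100, 98   # states that adopted RTW
never_before, never_after = 110, 108        # states that never had RTW

did = diff_in_diff(switched_before, switched_after, never_before, never_after)
level_gap = switched_after - never_after
print(did, level_gap)
```

In a regression this same quantity is the coefficient on the interaction between the group indicator and the 0/1 before/after indicator. Here it is zero: both groups fell by the same amount, so the gap in levels stays where it was.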

So, what does that mean? It means that, using this metric, there is no net benefit to most people in a state from imposing RTW over not having it. Now, if one believes "the more regulation the better", then one would say "Okay, so impose RTW everywhere, since it doesn't make a difference." However, if one believes that more laws are not good in and of themselves, and that government interference in business practices (interfering in permitted terms of contracts is government interference) should only be done if there is a compelling benefit, then RTW fails to actually grant sufficient benefit.