August 21, 2014

Mass Killings and the Good Old Days

On some message board or another, I saw somebody posting a link to an interactive map from USA Today. It quickly degenerated into a dogmatic flame war about how horrible guns are or aren't, with various racist jabs based on locations. I, being a nerd, took one look at the map and made a quip that killed the whole thread: "Look! Somebody has re-invented the population density map!" (I say things like that.) To my eye, all it really showed is how dense the US population might be. Then somebody finally threw out a comment that got it all roiling again. To wit, that it's all due to the breakdown of American society since the "good old days".

That got me to thinking. My off-the-cuff remark about population maps might or might not be quite on the button, but I really knew nothing about the frequency of mass murder in the USA over time, certainly not since the "good old days". My dachshund-like instincts triggered, I had to dig to the bottom of this. I gleaned the "rampage killing" events that occurred in the USA from the appropriate Wikipedia articles. Yes, I know, but it's a free source, and I did spot checking for several of the entries.

I pulled out information on mass killings that fell into the categories of "school massacres", "workplace killings" (with a subcategory for military), "religious, political, or racial", "familicides", "home intruders", and "rampage killers". "Rampage Killers" was further subdivided into "Vehicular" and "Other Methods". I found that bizarre, since there were examples classified as "school massacres" that used only a vehicle as the weapon, but they were not also classified as "vehicular". So, I collapsed the "vehicular" and "other methods" in with generic "rampage". I also combined non-military and military workplace killings. To make things crazier, Wikipedia has various pages outlining what it calls "terrorism" in the USA, but they disagree over what constitutes a terrorist act, and some even classify killings done for insurance fraud as "terrorism". I gleaned the mass killings that I could find under "terrorism" and classified them either as "Religious/Racial" (for race riots or killings based entirely on the victims' religion without specific political context) or as "Terrorism", defined as an intentional attack upon civilians for a specific political purpose. This left me with the categories of "Workplace", "Terrorism", "School Attacks", "Religious/Racial", "Rampage", "Home Intrusion", and "Familicide".

But how to compare those with the "good old days"? There are a lot of ways to slice up time. Individual years are too fine for this purpose; decades seem too arbitrary. "Generations" seem to be a good place to at least start.

Is there a difference in mass killings by generation? There are a lot of ways that cake could be sliced. I picked Strauss and Howe's Generations: The History of America's Future (1992, Harper Collins). Most of us are roughly familiar with it, even if we don't know it. It does have its flaws, and the authors get a bit kooky, but it has seemed to be roughly useful. The relevant Strauss & Howe generations are thus:

Generation	Birth Years	Dominant Years
Missionary	1860–1882	1901–1924
Lost	1883–1900	1925–1942
GI	1901–1924	1943–1960
Silent	1925–1942	1961–1981
Baby Boom	1943–1960	1982–2004
X	1961–1981	2005–2013

"Birth Years" are the years that members of a given generation were born in. While useful for saying when a generation begins, it's a lousy way to describe when a generation was influential. Taking the name "GI Generation" as a guide, I decided that a generation's major "Dominant Years" are the "Birth Years" of the generation two steps after their own. Thus, the GI Generation was the major cultural force in the USA during the time that the Baby Boomers were born. Finally, I looked up population by state and year from the US Census Bureau.

And what do I do with all that? Raw event counts are often useless. If there are more people, anybody reasonable would expect more of any kind of event to happen. So, the thing to do is to adjust events for people. This makes a very big difference when you're talking about risk. I'll put it another way. In 2000, there were three mass killings in the USA. In 1902, there were three mass killings in the USA. Does that mean the risk of a mass murder occurring in the USA was equal in 1902 and 2000? No, because risk takes into account all the other times an event might have occurred but didn't. If an intersection has 10 collisions per year, it's a much bigger deal if the intersection only has 20 cars using it in that year than if it has 2000 cars using it. The same is true for these mass killings. Taking population into account, 1902 was about three and a half times as dangerous for mass killing events as 2000 was. Well, then, divide the number of mass killings by population, group by generation, and voila! (Pardon my French.)
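The 1902 versus 2000 comparison above can be worked through in a few lines of Python. This is only a sketch: the two population figures are rounded values assumed for illustration (roughly 79 million and 281 million), not the Census numbers behind the figure, and the function name is made up. Rates are scaled to events per billion people so they are readable.

```python
def events_per_billion(events, population):
    """Population-adjusted event rate, scaled to events per billion people."""
    return events / population * 1_000_000_000

# Assumed, rounded US populations (illustrative only, not the post's data).
POP_1902 = 79_000_000
POP_2000 = 281_000_000

# Three mass killings in each year, as stated in the text.
rate_1902 = events_per_billion(3, POP_1902)
rate_2000 = events_per_billion(3, POP_2000)

# How many times higher the population-adjusted rate was in 1902.
ratio = rate_1902 / rate_2000

print(f"1902: {rate_1902:.1f} events per billion people")
print(f"2000: {rate_2000:.1f} events per billion people")
print(f"1902 rate is {ratio:.2f} times the 2000 rate")
```

With these rounded populations the ratio comes out near 3.6, in line with the "three and a half times" figure. Because the event counts are equal, the ratio is simply the ratio of the two populations.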

Population-Adjusted Mass Killings by Generation and Type

Just take a minute to take this figure in. It's a lot if you're not used to fancy graphics or scientific papers, but it actually is sensible if you take it in stages. First, the height of each column (or stack) is the population-adjusted average of mass killings per year, by generation. I multiplied these numbers by a billion. Yes, a billion. Why a billion? Because, even if they are plastered all over the news, mass killings are a very rare event in terms of total US population. I multiplied by a billion just so the numbers would be readable.

You'll notice the columns are actually stacks of different colors. This is so you can see how the total mass killings were split among different types. If you roll your mouse over each stack, you'll get its summary. The letters above each stack are from what is called a "multiple range test", which is used to divide up samples that have more than one category. If a stack shares a letter with another stack, it's not considered statistically different (in total height) from that other stack. If you want the details on all of this, I've got a nerds-only section at the end where I go through the nuts and bolts of the modeling I did.

But what does all this mean? First, look at the total stack heights. Most of them are pretty close to each other. The error bars are from "standard errors" for the model. Consider them the "reasonable wiggle room" that represents variability in each total stack. The letters are the real take-home message. According to the "model" I made of the data, for most of the 20th century and up into the 21st until the end of 2013, if we go by generation, overall mass killings were no less common "back then" than they are now. Let that sink in. No less common in the "good old days".

Yes, I do know you're making "Ooh!" noises and pointing at that short little stack for the GI Generation. I'm not ignoring it. I'm just waiting for everyone to notice it. Now that it's been noticed, "we're going to do some science", as my old ecology professor used to say. What Brent (at my alma mater, we were to call professors by their first names) meant is that numbers and charts are summaries. Science is what happens when you try to pull a little meaning out of it. So, what's the meaning we can pull? Short version: The GI Generation was not normal. Long version: We look back to the WWII and post-war years as the defining times of "America". Every aspect of American life is still defined in terms of what the GI Generation did, had, or wanted. How did this happen? It's a collision of several forces. Television. While the Baby Boomers grew up on TV, it was the GI Generation that made the shows they watched and thus defined what the Boomers considered "normal life". The Boomers, themselves, contribute to America's unquestioning acceptance of the GI Generation's dominant period as the standard. It was when they grew up. Thus, the conditions of that era are remembered as "the way things work" by what is still the largest generational group in America. It doesn't matter that "the way things worked" might have been very different before their own childhoods. What we grew up with is usually what we decide is "normal" for the rest of our lives.

But what about the categories? Why did I bother classifying them? And what about maps? This post is already long enough, so I'll get to those next time.

Nerd Postlude

Warning: If you have not yet earned your Gold Pocket Protector with 20-Sided Dice Clusters, the following may cause your brain to ooze out your ears like badly-made guacamole, your ears to then slide down to your chin, and your vocabulary to be reduced to repeating "Uhhhhhhhhhh" for an indefinite period of time. If you notice these symptoms, immediately apply an appropriate antidote, including but not limited to funny kitten videos, babes/hunks in bikinis/speedos, cookie recipes, or other uplifting but not painfully technical uses of the World Wide Web. This part is fairly hardcore nerdery, with stuff that would take an enormous amount of space to explain. If it makes no sense to you, it's okay, it doesn't mean you're stupid.

	With Intercept				Without Intercept
Factor	Estimate	Std. Error	z value	p	Estimate	Std. Error
(Intercept)	-11.213	0.067	-166.888	<0.001	NA	NA
Miss. Gen.	0.254	0.146	1.748	0.081	-10.958	0.158
Lost Gen.	0.015	0.161	0.093	0.926	-11.198	0.180
GI Gen.	-0.798	0.209	-3.815	<0.001	-12.011	0.243
Sil. Gen.	0.120	0.121	0.994	0.320	-11.093	0.123
Boomers	0.133	0.108	1.230	0.219	-11.079	0.104
Gen. X	0	NA	NA	NA	-10.938	0.143

As promised, I lift the hood on my analysis. Some of you have, no doubt, already noticed that this entire blog entry was just to present a fairly simple linear model. But preliminaries like identifying my data sources (with all their flaws), introducing my factor definition, etc. can't be skipped. The model I presented is a simple one-factor linear model, specifically "Events_Population ~ Generation". That is, Events (mass killings), adjusted for Population, grouped by Generation. The question I asked was "Does grouping by generation actually mean anything?" A naive approach to this would have been to do an ANOVA. However, my data is actually counts offset by population. Count data is very often better modeled with a Poisson error distribution. I began with a simple generalized linear model (glm). However, testing dispersion revealed that it was underdispersed. Among the many alternatives to deal with this, I chose to use a mixed model (glmm), which sort of "shoves" the dispersion issue onto a random variable. I used the year for this.
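For anyone who wants to see what "testing dispersion" means in practice, here is a minimal Python sketch of the plain Poisson step only, not the mixed model and not the R code actually used. It relies on a known shortcut: for a one-factor Poisson model with a log link and a log(population) offset, the maximum-likelihood rate for each group is just total events divided by total population. The rows, group labels, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy rows: (group, events that year, population in thousands).
ROWS = [
    ("early", 3, 79_000),
    ("early", 2, 81_000),
    ("early", 4, 83_000),
    ("late", 3, 281_000),
    ("late", 5, 285_000),
    ("late", 2, 288_000),
]

def fit_group_rates(rows):
    """Per-group Poisson rate with a population offset: sum(events) / sum(population)."""
    events = defaultdict(int)
    exposure = defaultdict(int)
    for group, count, population in rows:
        events[group] += count
        exposure[group] += population
    return {group: events[group] / exposure[group] for group in events}

def pearson_dispersion(rows, rates):
    """Pearson chi-square divided by residual degrees of freedom.

    Values near 1 are consistent with Poisson variation; well below 1
    suggests underdispersion, well above 1 suggests overdispersion.
    """
    chi_square = 0.0
    for group, count, population in rows:
        expected = rates[group] * population
        chi_square += (count - expected) ** 2 / expected
    degrees_of_freedom = len(rows) - len(rates)
    return chi_square / degrees_of_freedom

rates = fit_group_rates(ROWS)
dispersion = pearson_dispersion(ROWS, rates)
print({group: f"{rate:.3e}" for group, rate in rates.items()})
print(f"Pearson dispersion: {dispersion:.3f}")
```

A dedicated package would also give standard errors and a formal test. The point here is only the shape of the check: fit the rates, compare observed with expected counts, and see whether the spread is larger or smaller than a Poisson model predicts.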

I set up orthogonal contrasts to compare each generation against Generation X, then ran the model "Event ~ Generation + offset(log(Population)) + (1|Year)" in the R environment, using the lme4 package. I ran the model with and without an intercept. The no-intercept model was used to generate coefficients and standard errors for the figure. Analysis was done on the with-intercept model. The glmm showed that GI Generation was distinct from Generation X, but I was interested in simultaneous pairwise comparison. For this I used the multcomp package, with simultaneous Tukey contrasts. They are summarized in the figure.
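Because the model uses a log link, the estimates in the table are on the log scale. Exponentiating an estimate turns it into a rate ratio against whatever baseline the contrast coding defines (described above as Generation X). A short Python sketch of that back-transformation follows, using the "With Intercept" estimates copied from the table; the helper name is made up.

```python
import math

# Log-scale estimates copied from the "With Intercept" columns of the table.
WITH_INTERCEPT = {
    "Miss. Gen.": 0.254,
    "Lost Gen.": 0.015,
    "GI Gen.": -0.798,
    "Sil. Gen.": 0.120,
    "Boomers": 0.133,
}

def rate_ratio(log_estimate):
    """Back-transform a log-link coefficient into a multiplicative rate ratio."""
    return math.exp(log_estimate)

ratios = {name: rate_ratio(estimate) for name, estimate in WITH_INTERCEPT.items()}

for name, ratio in ratios.items():
    print(f"{name:10s} rate ratio vs. baseline: {ratio:.2f}")
```

Read this way, the GI Generation estimate of -0.798 corresponds to a population-adjusted rate of roughly 0.45 times the baseline, while the other generations sit within about 30% of it.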

August 12, 2014

Tax Migration--Get to the Point, Already!

For the three people who have managed to get this far, I really do have a point to these entries. This is a lot of stuff for people to get through, so I broke it up in the hope that it would be more digestible. The first entry of this series of three pointed out that so-called "income migration" really mostly reflects people migration. The money isn't magically moving around all on its own. It follows people. My followup, which was longer and more boring, pointed out how closely money and people move together (almost 95% of the change in aggregate income went along with change in households, as estimated by tax returns). Then I went on to show a harder to see set of changes, the way that migration altered the average household income of a state. If you look at the states by changes per households that migrated, some of the "big winners" and "big losers" weren't so big. I also pointed out a few states that "lost while winning" or "won while losing". Their total income change went in the opposite direction to the income change per household. Some states got more money overall but gained poorer people while other states lost overall income but gained richer people. Is that good? Is it bad? I got no clue, but it is something to consider if you play the "let's compare the states" game.

And so what? I'm getting to so what, but to do that, I need to introduce a new idea, with new calculations and maps. Have I expressed my sheer joy that people cannot throw rotten vegetables at me through the internet? I do apologize, but this needs to be done to make the point. While it might be useful to look at absolute changes in things like aggregate income, population, or income per household, the real impact of such changes depends upon how big (in number of people) the state that undergoes these changes already is. It's one thing for a state to have a net loss of $5,000 income per household of people who moved into or out of the state if the number of people who moved in or out makes up 1% of the whole state's population. It's a much bigger deal if those people end up being 10% (ten times the other state) of the state's population.

This very important difference is hidden when you report absolute changes. The actual impact on the state will disappear because you simply subtract out the people who stayed put! If you do the math to measure changes as a percent of the state's total population, income, etc., you will factor in a sometimes very large number of people and their income. The picture can change, and it can explain how states act much better than can absolute difference maps. Even an absolute change in a ratio (like income per return) can come out differently depending on whether you take into account what changed for the people who stayed put.
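The 1% versus 10% comparison can be made concrete with a little arithmetic. The Python sketch below uses two hypothetical states of one million households each and assumes, purely to keep things simple, that nothing changed for the households that stayed put. The function name and all figures are invented.

```python
def statewide_effect(change_per_migrant_household, migrant_households, total_households):
    """Spread a per-migrant-household income change across every household in the state.

    Returns (migrant share of households, change per household statewide).
    Assumes income of non-migrant households is unchanged.
    """
    total_change = change_per_migrant_household * migrant_households
    share = migrant_households / total_households
    per_household = total_change / total_households
    return share, per_household

# Same $5,000 net loss per migrating household, different migrant shares.
share_a, effect_a = statewide_effect(-5_000, 10_000, 1_000_000)    # migrants are 1%
share_b, effect_b = statewide_effect(-5_000, 100_000, 1_000_000)   # migrants are 10%

print(f"State A: migrants {share_a:.0%}, statewide change ${effect_a:,.0f} per household")
print(f"State B: migrants {share_b:.0%}, statewide change ${effect_b:,.0f} per household")
```

The per-migrant number is identical in both states, but spread over everyone it is a $50 loss per household in one and a $500 loss in the other.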

What I mean is that my last two posts were entirely restricted to the differences of migration and ignored the people who never moved at all. If you think about it, it doesn't make sense to stop there. I'll just cut to the chase on this: If you look at states, proper, the biggest change in number of households amounted to less than one percent of the total population. That means over 99% of the population stayed put. The overall effect on a state due to migration could really amount to no more than a teaspoon in an ocean. So what happens when you do take into account the whole population of a state? We'll first compare aggregate gross income:

Change in Aggregate Gross Income from Migrants Only	Percent Change in Aggregate Gross Income Including non-Migrants

As always, you can roll your mouse over a state to get a little more detail. I use percent on the right because that better reflects the idea of "relative change". If a jump of one million is a hundredth of one state's economy but is only one tenth of that proportion of another state's, then the "felt effect" of the jump would be bigger in the first state, even though the same amount of income might have moved. So, what does this say? In terms of relative change in aggregate income in a state, Florida still gets to give the finger to everyone else. In absolute and relative terms, Florida is bringing it in. Texas is still doing pretty darn well in terms of aggregate income change, but it's not all that special. It's really in a pack with the Rocky Mountain states and some of the southwest. The Carolinas and Tennessee, and New England, also turn out better when looked at this way. When it comes to losers, California is not doing so bad, overall. It helps to have a giant population that can absorb losses, but who knows how long that will work? New York also doesn't look quite as bad, but it still looks pretty bad. The Midwest, on the other hand, is hurting more in terms of aggregate impact than in absolute losses. It loses and has less of a reservoir to draw on. The big contrast, though, is Alaska. In terms of crude aggregate income, it barely lost anything. When the state as a whole is looked at, Alaska lost pretty badly.

Of course, this only looks at one element, overall change in income. Since I already presented change per household, what does that look like when we consider the state as a whole and include people who didn't move?

Change in Income per Household from Migrants Only	Percent Change in Income per Household Including non-Migrants

This is where things get interesting, really interesting. Regardless of "reality", the map on the right is probably the best visual summary of how the people of a state are likely to feel about their economic well-being in 2011 vs. 2010. It doesn't matter how much a state might lose or gain overall if the difference per household doesn't match. Likewise, it doesn't matter how much each household coming in or leaving might have if the state as a whole completely swamps their effect. Another way of putting it is that if over 99% of the state's households have a net gain in income, the state as a policy unit simply will not care if that less than 1% who moved in has a much lower income--or vice-versa.

What does the new map (on the right) say, then? First, guess what, Florida, in terms of overall personal prosperity, 2011 may have been a good year for you, but you're not the king. The big winner is Colorado. The Rocky Mountain states and some of the Southwest just blow the rest of the country away in terms of household income change--if you take the people who stayed put into account. Sorry, Texas, but however much total money and people you may have pulled in, per household improvement is really weak if you take the Texans who stayed in Texas into account. New England is even better than it previously looked, particularly since Massachusetts's crappy situation is balanced by a dense population that overall didn't do so badly. New York can't brag, but they aren't nearly as stinky when looked at per household, including New Yorkers who stayed put. California can pretty much just shrug. It's got such a large population that the per household effect of 2011 almost gets lost. The big loser on a per household basis is Alaska. Not only did migration hurt, but the state as a whole lost a big percent of household income between 2010 and 2011. Some interesting tidbits pop up in the Midwest. Indiana, for example, is a loser in aggregate income and a (very) mild loser in income per household, unless you take the Hoosiers who stayed put into account. At that point, while its aggregate income loss didn't change, it actually noticeably gained in terms of income per household. There are some other states that flipped like this, in either direction, depending on whether or not you took total population into account.
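How can a state look like a loser on the migrants-only map and a gainer once the stayers are counted? Here is a toy Python example of the kind of flip described for Indiana. Every number is invented, and measuring the migrants-only change as "average income of arrivals minus average income of leavers" is an assumption for illustration, not necessarily the calculation behind the maps.

```python
# Hypothetical toy state; all figures are invented for illustration.
STAYER_HOUSEHOLDS = 990_000
STAYER_AVG_2010 = 50_000   # average income of stayers in 2010
STAYER_AVG_2011 = 51_000   # average income of the same stayers in 2011
OUT_HOUSEHOLDS = 10_000    # households that left after 2010
OUT_AVG = 60_000           # their average income
IN_HOUSEHOLDS = 9_000      # households that arrived for 2011
IN_AVG = 45_000            # their average income

def migrants_only():
    """Net aggregate income from migration, and the arrival-minus-leaver income gap."""
    net_income = IN_HOUSEHOLDS * IN_AVG - OUT_HOUSEHOLDS * OUT_AVG
    per_household_gap = IN_AVG - OUT_AVG
    return net_income, per_household_gap

def whole_state_percent_change():
    """Percent change in income per household, counting stayers and migrants."""
    income_2010 = STAYER_HOUSEHOLDS * STAYER_AVG_2010 + OUT_HOUSEHOLDS * OUT_AVG
    households_2010 = STAYER_HOUSEHOLDS + OUT_HOUSEHOLDS
    income_2011 = STAYER_HOUSEHOLDS * STAYER_AVG_2011 + IN_HOUSEHOLDS * IN_AVG
    households_2011 = STAYER_HOUSEHOLDS + IN_HOUSEHOLDS
    avg_2010 = income_2010 / households_2010
    avg_2011 = income_2011 / households_2011
    return (avg_2011 - avg_2010) / avg_2010 * 100

net_income, per_household_gap = migrants_only()
percent_change = whole_state_percent_change()

print(f"Migrants only: net income ${net_income:,}, gap ${per_household_gap:,} per household")
print(f"Whole state: income per household changed by {percent_change:.2f}%")
```

In this toy state migration is a loss on both migrants-only measures, yet income per household for the state as a whole goes up, because a modest raise for the 99% who stayed swamps everything the migrants did.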

So, the point? That Californians are great and Texans suck? That Florida isn't all that? That Colorado is the bee's knees? None of the above. The point of all these maps and all these words is to try to convince somebody to not trust these maps and their accompanying words. We have become (mostly) savvy to spin. We understand that nearly anything can be explained away. We don't trust the "why". But we still get so easily suckered in by a flashy "what". Produce a spiffy colored map with popups and a citation to data, and it will be touted by one "side" and ignored by the other of some political debate, you can count on it. Most people just don't have the background to ask "Is this really the best way to make the conclusion that's being pushed?" That's why I presented the exact same source data just turned around slightly differently. Each little turn produced a different "result". I didn't alter the data. I just presented different bits of it at different times. In some cases, whether or not a state "gained" or "lost" depended entirely on how I chose to slice the data.

That's my agenda. I just wanted to give vivid, colorful examples of how you can use the same data set to say very different things. I expect to be ignored, not passed along, never linked to, and simply forgotten. After all, these three posts would be difficult to use by either of the two pseudo-sides of the American political non-debate. But it's something I wanted to get out there, hoping that somebody might see it, make the effort to read it, and get this point.

August 8, 2014

Tax Migration, again? That trick never works!

And now for "the next post". If you read the previous post, you'll be familiar with these two maps:

Change in Aggregate Gross Income (AGI)	Change in Returns Filed

Left map is net aggregate gross income that has entered (blue through purple) or left (orange through red) a state in 2011. That means that, if a state is blue through purple it gained more aggregate gross income, according to the IRS, than it lost. Right map is same concept, except in number of returns filed. So, the left map presents money and the right map presents people. As you can see, they're very similar. The movement of income is a close (but not perfect) match to the movement of people. (Holding your mouse over the map will pop up the specific state's change in income or returns.) Data for these maps and all other data for this post still comes from the IRS migration data.

How close is this match? For those of you who aren't nerds, there is a number called a correlation coefficient. It can be calculated for any two linked sets of numbers. It's a rough measure of how close one set follows along with the other. If you square it, you get what's called a coefficient of determination. For no good reason, this is abbreviated as R². This final number is a measure of how much one set of numbers "explains" the other. You don't need to worry about the details. All you need to know is that if the R² is zero, then the two sets of numbers are unrelated. If the R² is one, then when one set of numbers goes up, the other set will also always go up (or always go down), and this will happen at the same rate. Another way of looking at it is that if the R² is one, you could draw a perfectly straight line through the data if they were plotted against each other.
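For the curious, the correlation coefficient and R² are easy to compute by hand. The Python sketch below uses the standard Pearson formula on small made-up lists; the values are not the IRS data, and the variable names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

def r_squared(xs, ys):
    """Coefficient of determination: the squared correlation coefficient."""
    return pearson_r(xs, ys) ** 2

# Made-up numbers standing in for net returns and net income by state.
net_returns = [-3.0, -1.0, 0.0, 2.0, 5.0]
perfect_income = [2 * x + 1 for x in net_returns]   # falls exactly on a straight line
noisy_income = [-5.0, -2.5, 2.0, 4.0, 12.0]         # close to a line, but not exact

print(f"Perfect line: R² = {r_squared(net_returns, perfect_income):.3f}")
print(f"Noisy data:   R² = {r_squared(net_returns, noisy_income):.3f}")
```

The first pair lands on an R² of one because every point sits on the same straight line. The second pair is close to a line but not on it, so its R² falls a little short of one.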

What does this particular R² turn out to be, anyway? For net returns vs. net aggregate gross income, the R² is 0.948, which means that nearly 95% of the gain or loss in aggregate gross income in a state goes along with gain or loss in number of households in a state. This is important, since we presume that income follows people. In this case, the presumption holds. Basically, when we talk about income moving, we're still talking (almost completely) about people moving.

However, there's a lot more to this situation than just income and people. After all, the two maps are only very similar; they're not identical. The rate of people movement isn't the same as the rate of income movement. Fortunately, there is a way to lay this out plainly. We can look at change of aggregate gross income per return. Now, to make the following map, I had to weight the calculations by state.

Change in Aggregate Gross Income per Return

What this map shows is the net change in terms of dollars per return for each state. A state that is blue through purple has probably attracted people with higher incomes (overall) than those it lost. A state that is orange through red has ended up (overall) getting people with lower incomes than moved out. This map should have some surprises for you unless you are amazing at doing math in your head by guessing at numbers that correspond to colors. California is a big loser when it comes to aggregate income and people, but when we look at the change in income per return, it's not such a heavy loss. New York and Illinois are big losers, either way. Alaska really leaps out. They had a slight loss of income, an overall gain of people, and a very visible drop in income per return. I'll leave it to the Alaska legislature to work out the implications of that one.
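Here is a rough Python illustration of how a state can lose households and still come out ahead in income per return. The exact formula and weighting behind the map are not spelled out here, so the measure below (average income of arriving returns minus average income of departing returns) is an assumption, and every number is invented.

```python
def income_per_return_change(in_returns, in_income, out_returns, out_income):
    """Compare arriving and departing returns for one state.

    Returns (net returns, net income, average-income gap), where the gap is
    the arrivals' average income minus the leavers' average income.
    """
    net_returns = in_returns - out_returns
    net_income = in_income - out_income
    average_in = in_income / in_returns
    average_out = out_income / out_returns
    return net_returns, net_income, average_in - average_out

# Hypothetical state: fewer returns arrive than leave, but arrivals earn more.
net_returns, net_income, gap = income_per_return_change(
    in_returns=9_000, in_income=540_000_000,
    out_returns=10_000, out_income=500_000_000,
)

print(f"Net returns: {net_returns:,}")
print(f"Net income:  ${net_income:,}")
print(f"Income per return, arrivals minus leavers: ${gap:,.0f}")

# Dividing net income by net returns gives a misleading negative number here,
# which is one reason a plain ratio of the two net figures is not a safe shortcut.
naive_ratio = net_income / net_returns
print(f"Naive net income / net returns: ${naive_ratio:,.0f}")
```

This toy state loses a thousand returns, gains a little aggregate income, and gains $10,000 in income per return, the same general pattern described for Utah.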

When we look at winners, there are more surprises. First, the Rocky Mountain states really stand out as a group. States that had net gains in population or income have huge net gains in income per household, and even states with net losses in population (Utah, but a mild gain in aggregate income) actually gain in terms of income per return. On the other hand, Texas, which is so often touted as some economic New Eden because of these income migration maps, only sustains moderate gain in terms of overall income per household. Florida, though, just blows the doors off in all measures, and South Carolina's weaker population gain is so heavily offset by its gain in income per return that it visibly alters the aggregate income gain. But there is another interesting little detail from comparing these two maps, one that might not leap out at you. What is it? Forgive me, but I'm going to make that wait until the next post.

August 6, 2014

Tax Migration?

A few maps have made the rounds lately. Most of them come from this page, that page, or another page. All of them show something that gets a little feather ruffling but not a great deal of attention outside a very small circle of people. I refer to "income migration". Basically, it's a hobby for some people to point out how horribly some states are losing "wealth" and how wonderfully other states are gaining "wealth". To a point, they have a point. But I think they may overstate their point and potentially be looking in the wrong direction.

Fortunately, all the maps I pointed out (and many others) come from the Tax Foundation, and whatever agenda may drive them, they are pretty good about citing their sources. In this case, their source was the IRS's "SOI Tax Stats--Migration Data". With that in hand, I decided to look at what they said. I took the data for 2011 (the most recent year currently available). A little computer wizardry, and I had my own little map, to wit:

Net Change in Aggregate Gross Income due to Migration, 2011

(Hold mouse over state to pop up a window with state total in millions of dollars.)

States colored blue through purple have improving levels of "net gain" in taxed income due to migration into a state from other states and foreign countries. Orange through red is worsening levels of "net loss" in taxed income due to migration out of a state to other states and foreign countries. It pretty much looks like most of the maps that are circulated, especially by people and groups with an axe to grind. It's easy to look at this map and say "California, New York, and Illinois are doomed." But if they're that severely doomed, why haven't they already completely collapsed? If all the money is running away from them, why isn't it all Mad Max there? Things might be unpleasant, but they're not Somalia levels of collapse.

This is where it becomes important to consider how large the states in question are, not in territory, but in people. Why in people? Let's compare the first map to a different map, side-by-side (with popup information if you hold your mouse over a state):

Aggregate Gross Income (AGI)	Returns Filed

The map on the right is the net gain or loss in returns filed in 2011--essentially the migration of households into or out of a state. More detail on this in the next post.