March 3, 2017

When Presentation Fights Against Information

I recently came across yet another choropleth (that's "color coded" for those of you who don't speak geek) map, this time from the League of Conservation Voters (LCV). It was part of an evaluation of the 2016 Congressional voting records on environmental issues. It presented both the House and the Senate. Since I'm talking about it on my grouchy blog, you've already guessed I have an argument with it. I'm not going to argue about whether or not their basic premise is true. A given congresscritter did vote for or against a given bill, and we can all look that up for ourselves. My quibble is with their presentation of the data. It falls short in two ways. First, they made some choices that were a bit misleading. Second, their presentation falls short in accessibility and ease of comprehension. The map on the left, below, takes their data for the House of Representatives and presents it in approximately the same colors that the League used. (If you mouseover the map, you'll get more information by state.)

House Votes by State, Colored by Fifths
House Votes by State, Colored by Percent

The first map is an approximation of the LCV's own map. The LCV chose to arbitrarily divide the state average scores into even fifths (aka "quintiles"). This was not necessary. The second map shows what happens when, instead of arbitrarily cutting the data into fifths, you represent the actual percentage scores on the same color scale. If you notice that the two maps resemble each other, you are not hallucinating. The point of my comparison is not that a direct representation of the scale gives a drastically different result; the point is that arbitrary categorization oversimplifies the results and eliminates useful nuances.
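For the curious, here is a minimal R sketch of the two colorings. The vector holds only three of the state scores discussed below, and the color choices are just for illustration, not the LCV's exact palette.

    # Hypothetical named vector of House scores, on a 0-100 scale.
    house_scores <- c(Louisiana = 19, Indiana = 24, `South Dakota` = 0)

    # LCV-style: chop the 0-100 range into five fixed bins, one color per bin.
    bins <- cut(house_scores, breaks = seq(0, 100, by = 20),
                include.lowest = TRUE, labels = FALSE)
    binned_colors <- colorRampPalette(c("red", "orange", "green"))(5)[bins]

    # Continuous alternative: interpolate the same color ramp over the actual scores.
    ramp <- colorRamp(c("red", "orange", "green"))
    m <- ramp(house_scores / 100)
    continuous_colors <- rgb(m[, 1], m[, 2], m[, 3], maxColorValue = 255)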

Compare the two maps to each other. In the first, everything is rigidly defined and perfectly clear-cut. The "green" states are all nicely and strictly "green" (in their particular shades), while the "red" states are nicely and strictly "red", with no variation in their "redness". Finally, the "orange" states are all staunchly in the middle. The reality, if you look at the second map, is that none of this is true. Let's look at Washington, Maine, Illinois, and Wisconsin. According to the first map, Washington and Maine are meaningfully different from Illinois, while Illinois and Wisconsin are identical to each other. Why? Because the entirely arbitrary cutoffs used by the LCV put them into those categories. In the second map, Illinois, Washington, and Maine actually are all closer to each other than Illinois is to Wisconsin, or than Washington and Maine are to California. That's how the actual percentage ratings come out. The first map misrepresents these differences and similarities.

Let's look at Wisconsin vs. Michigan. In the first map, Wisconsin is a middling "orange" state while Michigan is in a "red" category. In the second map, they are very difficult to tell apart. I will not go so far as to claim that the intent of the LCV is to mislead people, but the unfortunate outcome of their poor data presentation choices is a misleading graphic. The worst part is that it is immediately obvious that there was no need at all to use arbitrary categories. The data they wish to present is easily comprehended on the second map. The simplification into fifths manages to be both unnecessary and misleading. To put it another way: is 19% (Louisiana) closer to 24% (Indiana) or to 0% (South Dakota)? The LCV method portrays 19% as not meaningfully different from 0% but in an entirely different category than 24%. Does that even make sense?

In addition to the unnecessary and misleading use of categories, the LCV made two other fundamental and very common errors in data presentation. First, their choice of colors is almost tailor-made to ensure that people with the most common forms of color-blindness will have difficulty interpreting the graphic. For people with full color vision, red and green have excellent mutual discernibility. Unfortunately, red-green colorblindness is the most common form of colorblindness. Fortunately, there has been a good deal of research on color schemes that can be interpreted by people with multiple types of colorblindness. Much of that research on useful palettes is even available online.

The other problem is that the LCV chose the wrong kind of color scheme altogether. The data in their scorecard is "sequential". Sequential data has "low" and "high" values with no "natural" center. They chose a color scheme with two extremes (green vs. red) and imposed an arbitrary center (orange). Such a color scheme is appropriate for "divergent" data, when you actually want to emphasize how the data deviates from a specific central value. The LCV certainly does not believe that "roughly 50%" environmental friendliness is a natural center or ideal from which everything deviates, but their choice of color gives that impression. The middle of a data set is not always a "natural" center. If what you want to emphasize is that there are "bad" situations and "better" situations, and the "ideal" is the "best", then you have sequential data. For sequential data, you want to use a sequential color scheme, where greater intensity means "more". What would the LCV map look like with a proper sequential color scheme?
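Before looking: picking a better palette is nearly a one-liner in R. This is only a sketch using the RColorBrewer package; the specific palette names are my choices, not anything the LCV used.

    library(RColorBrewer)

    # A sequential scheme: a single hue where darker simply means "more".
    sequential <- brewer.pal(5, "Greens")

    # A diverging scheme of the kind the LCV effectively used; it implies a
    # meaningful midpoint that this data does not have.
    diverging <- brewer.pal(5, "RdYlGn")

    # Show the palettes flagged as friendly to common forms of colorblindness.
    display.brewer.all(colorblindFriendly = TRUE)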

House Votes by State, Colored by Fifths with a Sequential Color Scheme
House Votes by State, Colored by Percent with a Sequential Color Scheme

The map on the left is the same set of "fifths" that the LCV used, but colored according to its appropriate data type. You can immediately see at a glance which states had higher environmental voting scores for their House of Representatives members. You do not need to look at a "key" or muddle through attempting to interpret color choices with no inherent link to the rankings. This is immediately more informative than the original LCV map. However, it is still as misleading as the original LCV map.

This is corrected by the map on the right, which presents the actual percentage scores by color intensity. Again, at a glance, you can see not only which states are stronger in environmental voting record but also just how much stronger or weaker they are than other states. The false impressions given for pairs like Wisconsin vs. Michigan or Washington vs. Illinois are gone. The relationships are accurately portrayed.

I don't know why the LCV presented their data as they did. I don't think they intended to mislead, but I can speculate a bit. First, very few people are given adequate training in proper data presentation. Use of proper color schemes for different data types is rarely taught in formal coursework. Sensitivity to colorblindness is also rarely covered in the incidental information given on data presentation. Second, arbitrary and rigid categories are comforting and comfortable. We like to think that the world is nicely split up into "our side" and "the enemy". Admitting that situations can be more gradual, and that there might not be simplistic steps between "saved" and "damned", makes us uncomfortable. The politically dedicated are very often the most prone to such thinking.

In the end, I can't claim to actually know the motivations behind bad data presentation choices, but I can offer suggestions on how to present the data more accurately and more effectively.

February 20, 2017

How do the Marion County neighborhoods compare to each other?

(There are several maps and a couple of charts that may take a little time to load. If you see blank spots, don't panic, just wait. If they don't fill in, try reloading the page.)

Data can be a very good thing. Indeed, without information, we're flailing around blind. For example, we might get a feeling for how Indianapolis area neighborhoods are faring, but to get more than a feeling, we need hard data. It just so happens that the Polis Center at IUPUI has a project called SAVI. SAVI has helped a group called Indy 2020 to set up a web site called Indy Vitals. You can get a lot of information from Indy Vitals. You can get so much information that it's probably difficult to actually conclude anything from that information. Raw data can be useful, but when enough is piled up, it ends up providing confusion rather than solutions. This doesn't mean that data is bad or that Indy Vitals is bad. Neither is true. Indeed, if all you wanted to do was compare two Marion County neighborhoods, bit by bit, over multiple data points, Indy Vitals would serve you well. But what if you were interested in a bigger picture? What if you wondered how the neighborhoods of Marion County compared as groups? What are the larger similarities and differences? I'm hoping this post will help people see how the data available at Indy Vitals can be used to look at such questions.

What I Mean

I was wondering how, or even if, the neighborhoods in Indy Vitals “clustered” in any way. That is, were there actual, meaningful groups of similar neighborhoods? To explore this, I first looked at a single neighborhood, let's say “Eagledale”. I discovered that, if you click a plus sign by the list of data for Eagledale, you can get more information. Part of that information is a "Data Table" tab. If you click that tab, you end up with a table of--you guessed it--data, but not just for Eagledale. You get every neighborhood's population, or unemployment rate, or walk score, or whichever specific data item you originally clicked. It was a bit of work, but I used this to get the data for 26 different traits for 99 different neighborhoods in Marion County. I looked at it in one table, and it meant nothing at all to me. After all, it's potentially 2,574 data points.

But that's not a problem. There are a lot of things you can do to make sense out of overwhelming amounts of data. If you're a data professional, a mere 2,574 points isn't much, but the human brain isn't designed to look at that many distinct numbers and immediately see any patterns. Fortunately, there are ways to find hidden patterns. One of these ways is called “exploratory factor analysis” (EFA). What EFA does is look for how parts of a data set may be related to each other and groups those parts together. This can be important because as a set of data gets larger, it is more likely that more and more categories will correlate. That is, as a value in one category goes up or down, values in other categories tend to go up or down alongside (or in opposition). This could be because there is some hidden “factor” that these data points describe. EFA allows for these factors to be guessed at in a reasonable fashion.

I will get into the nuts and bolts of the EFA after presenting results. That way, if you only want the results, you won't have to wade through a bunch of statistical chatter. The EFA identified three factors, which I titled “Discomfort”, “Density”, and “People of Color”. The first two factors are fairly well defined, but the third one included other elements as well. Again, the details come after the data. Note that I left the Airport and Park 100 regions out of the analysis. How do the neighborhoods stack up regarding these factors?

Marion County Neighborhood Factors
(More blue = lower than county average.)
(More red = higher than county average)
(Ivory = near county average)
Maps: Discomfort, Density, People of Color, “Stacked” Factors

Mapping the factors does show us that there could be patterns and clusters, but how do we find them? How do we find patterns for even one factor? It is popular among pollsters to use ranks or arbitrary groups, such as “quintiles”. A quintile is easy to create. It's just one fifth of a data set: the lowest fifth, the next, and so on up to the highest. Quintiles can be useful, but much of the time they are misleading. If your factor scores are spread out very gradually, then the gap across the boundary between one quintile and the next can be much smaller than the spread within a single quintile. That is, the highest score in the bottom quintile can be closer to the next quintile than it is to the lowest score in its own quintile! So, presenting information, especially social information, in quintiles can be very misleading.
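A tiny R sketch, with made-up but perfectly gradual scores, shows the problem:

    scores <- seq(0, 1, length.out = 100)                 # evenly spread, no gaps
    breaks <- quantile(scores, probs = seq(0, 1, 0.2))    # quintile cut points
    q <- cut(scores, breaks, include.lowest = TRUE, labels = FALSE)

    min(scores[q == 2]) - max(scores[q == 1])  # tiny gap across the quintile boundary
    max(scores[q == 1]) - min(scores[q == 1])  # much larger spread inside the same quintile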

And how do we handle combining factors? One simple way might be to just “stack” the factor scores. A lot of pollsters (like Gallup) do this by adding. It is, after all, very simple, but it has problems. The fourth map is just such a mathematical "stacking" of the first three, where I averaged the color values. But does it mean anything? How do we evaluate a single scale that combines “discomfort”, “density”, and “people of color”? I'll make it easy: We can't. It makes no sense, at all, to just “add” those three factors like that. Yes, you can get a single number, but it means nothing. Each factor in an EFA is supposed to measure a different thing, so the factors don't add up in any meaningful fashion.

This doesn't mean we can't combine factors in a meaningful fashion, it just means we have to use a different approach than simple addition or averaging. This is where “clustering” comes in. There are a lot of ways to cluster data. For all kinds of technical reasons, I chose “k-medoids clustering”. I first tested it on each factor. What did I find? No clusters! That's right, each individual factor was so gradually spread out from lowest to highest scores that there were no meaningful clumpings or places to break it up! So, while the lowest and highest were certainly different from each other, there was no way to draw a line and say that one group of neighborhoods was different from another group (on a single factor) without being entirely arbitrary. I could just clump the neighborhoods into five roughly equally sized groups and call it a day, but there is no rational basis for doing that. Re-read that. Remember it the next time you see some other web page or article throwing "quintiles" or other arbitrary groups at you.

The Clustering

My clustering analysis of the EFA results produced a “best” solution at six clusters. I named the clusters for the cluster “medoids”. A medoid is whichever real data point comes closest to the “center” of a cluster. The medoids were North Perry, Forest Manor, Five Points, Eagle Creek, Canterbury-Chatard, and Delaware Trails.

Clustering Results of Indy Vitals Data
Clusters: North Perry, Forest Manor, Five Points, Eagle Creek, Canterbury-Chatard, Delaware Trails

On the left, above, is a map of the clustered neighborhoods, color coded to cluster names. On the right is a summary of how the clusters relate to the factors. As you can see, each factor played a unique role in the clustering. Two clusters had particularly high Discomfort ranges: North Perry and Forest Manor. They appear to primarily differ from each other by People of Color. Three clusters appear to have fairly low average Discomfort: Eagle Creek, Canterbury-Chatard, and Delaware Trails. They appear to differ among each other mostly by density. The last cluster, Five Points, has an average Discomfort close to the county average, but with lower Density and People of Color.

Is there any meaning?

So, I created clusters and can describe them in terms of factors, but what does any of that mean? We have to open the lid on my analysis. This is where things start to get technical. The first step of exploratory factor analysis (EFA) is to find how the data elements relate to each other. This is usually measured by some sort of correlation coefficient. The thing about ordinary correlation is that it assumes a lot about the data. First, it assumes there are no missing values: every possible point needs a value. If any is missing, you have to leave that data out or “impute” (guess) a value for that point. The data I downloaded had several missing values. In many cases, these could not be imputed with any confidence, given how the data category was defined. For example, “High School Graduation Rate” was only calculated for neighborhoods that had high schools within their borders, even though quite a few kids in Marion County attend a high school outside their neighborhood borders. I deleted most of these categories.

Why most and not all? A few categories had missing values that could be reasonably imputed or otherwise accounted for. By “otherwise accounted for”, I mean deleting the entries for Airport and Park 100. I consider this acceptable for my purpose because those “neighborhoods” are far more industrial districts than neighborhoods. After I did this, only one category had missing values: “Violent Crime per 1000”, which was missing for Speedway, Lawrence, and Lawrence-Fort Ben-Oaklandon. I took values from the Neighborhood Scout web site. They are probably not as reliable as the IMPD numbers behind the rest of the neighborhoods, but probably not too far off the mark. That source only had one number covering both of the Lawrence-based neighborhoods, so I repeated it for them. This left me with a working data set of 97 neighborhoods and 22 variables (2,134 data points).

Input                          Discomfort   Density   Factor 3
Associates or Higher Degree      -1.01
Median Assessed Value            -0.97
Disability                        0.70
Poverty Rate                      0.69
Tree Cover                       -0.66
Median Household Income          -0.62
Unemployment Rate                 0.60
Tax Delinquent Properties         0.60
Violent Crime per 1000            0.59
Vacancy Rate                      0.47
Median Age                       -0.44
Walk Score                                     0.97
Permeable Surface                             -0.85
Food Access                                    0.85
Housing Density                                0.68
Non Car Commuters                              0.44
People of Color                                           0.95
Access to Quality Pre K                                   0.58
Housing Cost Burdened                                     0.52
Births with Low weight                                    0.43

The other assumption that correlation makes is that the data is “normally distributed”. I checked my data with a utility to test this. The data failed. It was not normally distributed. Ordinary correlation would not give a realistic estimate of how the data categories were interrelated.

Fortunately, there are several methods of “robust” correlation. I ended up using a method called “Orthogonalized Gnanadesikan-Kettenring”. For most people, that will mean nothing, of course, but anyone who wants to check my math will need to know that. I provide the correlation matrix if you want to look it over.
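For anyone checking the math: one implementation of Orthogonalized Gnanadesikan-Kettenring that I know of is covOGK in the robustbase package. This is only a sketch; "neigh" stands in for the cleaned 97-by-22 data set, and the scale estimator is my choice, not necessarily the one used for the published matrix.

    library(robustbase)

    ogk <- covOGK(as.matrix(neigh), sigmamu = scaleTau2)  # robust covariance
    robust_cor <- cov2cor(ogk$cov)                        # convert to a correlation matrix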

One of the quirks of EFA and related methods is that they don't automatically tell you how many factors best describe your data. You have to tell the method how many to use. There has been a lot of discussion over the decades about how to figure out how many factors to use in EFA. I chose what is called parallel analysis. (Warning: the link goes to a very technical description.) Roughly put, parallel analysis compares principal components or factor analysis outcomes for multiple factors against randomly-generated data sets of the same size. The largest number of factors that still does better than the all-random comparison is considered the best choice. My initial parallel analysis suggested 4 factors. However, the EFA produced one strange factor that consisted only of the Population and Jobs variables. I dropped these two variables and repeated the parallel analysis and EFA. The same remaining factors appeared. I decided that raw population and counts of jobs (not employment, just number of jobs) would not add much meaning to the analysis and went on without the two variables.
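A sketch of the parallel analysis step with the psych package, continuing from the correlation sketch above (the factoring method is my assumption):

    library(psych)

    # Parallel analysis on the robust correlation matrix; n.obs is needed when
    # you hand over a correlation matrix instead of raw data.
    fa.parallel(robust_cor, n.obs = 97, fm = "minres", fa = "fa")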

I then did my final EFA, producing three factors. The table above describes how strongly the factors relate to the variables. The numbers are called “loadings”. To keep the presentation clear, I don't report loadings that are less than 0.40, which is a commonly used cut-off. I used these loadings to guide how I named the factors. The first factor was a combination of lacking higher adult education, low median residence assessed value, higher disability rates, higher poverty rate, less tree cover, lower household income, greater unemployment, greater violent crime, etc. In short, it made sense to call this factor Discomfort, since places with such features are probably less comfortable places to live. The second factor combined a good walk score with lots of pavement, close grocery stores, dense housing, and higher rates of non-car commuters. It made sense to call this Density. The third factor was difficult to describe, since it had several different types of variables in it. I finally chose People of Color because that variable had a much stronger influence than did the others in the factor. Once I had the EFA result, I was able to use it along with my trimmed data set to produce factor scores.
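Continuing the sketch with the psych package (the rotation and factoring method are my assumptions, and oblimin rotation needs the GPArotation package installed):

    efa <- fa(robust_cor, nfactors = 3, n.obs = 97, rotate = "oblimin", fm = "minres")
    print(efa$loadings, cutoff = 0.40)     # hide loadings below the 0.40 cut-off

    # Factor scores from the trimmed raw data plus the EFA solution.
    scores <- factor.scores(as.matrix(neigh), efa)$scores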

Once I had the factor scores, I used k-medoids clustering to create clusters. But first, back up. It is common for factors in EFA to be correlated with each other. This is not a flaw in the EFA result, because in the real world such correlations are common. However, to do the clustering, I still had to combine the factors. For this, I calculated pairwise “Mahalanobis distances”. While somewhat tricky to calculate, Mahalanobis distances take these correlations into account to produce a more realistic description of the data. Then I did the cluster analysis on these distances. I used a utility called pamk to discover the optimum number of k-medoids clusters for the data. This came to, as I already stated, six clusters. The chart illustrates the factor scores vs. specific clusters: X = Discomfort, Y = Density, Z = People of Color. The chart can be rotated by clicking and dragging on it.
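A sketch of those last two steps. The Mahalanobis part works by "whitening" the factor scores so that ordinary Euclidean distance on the transformed scores equals Mahalanobis distance on the originals; the pamk settings are my assumptions, not necessarily the ones used here.

    library(fpc)

    S <- cov(scores)
    whitened <- scores %*% solve(chol(S))   # Euclidean distance here = Mahalanobis distance
    mahal_dist <- dist(whitened)

    # k-medoids, letting pamk pick the number of clusters (six, in this analysis).
    clusters <- pamk(mahal_dist, krange = 2:10, diss = TRUE)
    clusters$nc                    # chosen number of clusters
    clusters$pamobject$id.med      # row indices of the medoid neighborhoods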

So, now what?

I'm not a social scientist. I am just a person with curiosity. I wanted to see what, if any, associations could be made with the data in the IndyVitals web site. Where you go with those associations probably depends on your outlook and ideas about the city. If anything, I hope that someone can find something more concretely useful to do with this work.

February 3, 2017

How to mislead with maps: The Gallup State Well-Being Rankings for 2016

Gallup has recently released another population survey, this time the 2016 State Well-Being Rankings. Gallup's accompanying map (last page of the rankings) is, as you can see, split into quintiles. If you want, you can go over there and look at their map, or look at the first map below, which represents their cut-offs in approximately the same colors. (If you mouseover the map, you'll get more information by state.)

State Health Ratings, Colored by Quintile
State Health Ratings, "Squashed" Range
State Health Ratings, Full Range

This map is an excellent example of how data presentation choices mislead. People are supposed to use quintiles, quartiles, percentiles, and other such non-parametric numbers either to represent data that has a long, uneven, strung-out range (like achievement test scores), or to group a different set of data to show how it is distributed (like wealth per quintile). It just so happens that you can look at the well-being scores for yourself in the linked report. Notice that the data is not strung-out and scattered. In fact, it is very densely packed. It also is not explicitly linked to some other unevenly-distributed data.

The actual range goes from 58.9 to 65.2. Is a difference of about 6.3 score points worth that much of a visual difference?

How else could we represent the difference so people can get an idea of reality instead of a visual trick? The second, or "squashed scale", map does that. The "worst color" (light gray-green) is matched to a score of 58.9. The "best color" is matched to 65.2. The range between is then evenly filled in among the five color points. Look different? It does. There is some rough correspondence between the misleading map that comes from Gallup and the (somewhat) more truthful map I created, but you can now immediately see that the country is not divided into stark and extreme categories. You can also immediately see that the distances between categories are not sharply defined.

But I'm not finished. You see a third map. This is a map where the "best color" corresponds to a score of 100 (the maximum theoretical possible score) and the "worst color" corresponds to 0 (the minimum theoretical possible score). Changes in color now correspond to linear differences along the full possible range. Having a hard time telling the states apart? That is because the differences among them in this index really are tiny. This map shows you what the actual differences look like in the context of the full scale.
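If you want to reproduce the rescaling, it is a couple of lines in R. Here "wellbeing" stands in for a named vector of the published state scores.

    # "Squashed" range: the observed extremes (58.9 and 65.2) become 0 and 1.
    squashed <- (wellbeing - 58.9) / (65.2 - 58.9)

    # Full range: the theoretical extremes (0 and 100) become 0 and 1. The whole
    # data set then collapses into a narrow band around 0.6, which is why the
    # third map looks nearly uniform.
    full <- wellbeing / 100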

So, why does Gallup do this? Why do people eagerly swallow such representation of data? First, explaining Gallup. I don't work there, so this is speculation, but Gallup makes its money off controversy. Anything they publish that will stir the pot will inspire more surveys that they can sell. Likewise, presenting things in extreme ways ensures that there will be more arguments, leading to more survey commissions, leading to similar data presentation, leading to more arguments. It's a lucrative circle for Gallup.

But why do people so eagerly devour this quasi-information? First, it's simple. People like very stark, very simple things to natter on about with each other. People do not like complex and shaded descriptions. They want things to be very neatly pigeonholed, and this comforts them. In addition, people with agendas want things presented as rigidly and extremely as possible to the public, all the better to sound the panic alarm. Finally, we are often taught by society that only rigid and extreme answers can be "true". We are indoctrinated to see the world as "good" and "evil" with nothing in between. We are taught that someone who is able to see gradual differences is a "fence-sitter" or "spineless". We are told that only extremism is good--although it's only actually extremism when it's someone you don't like doing it.

I don't know if this changed the way you see the world, but I hope it helped you understand and be more critical of the "studies", "surveys" and "polls" that we are now flooded with.

May 20, 2016

Right to Work, more recent data

So, what's the argument over?

Do Right-to-Work laws fulfill their claimed benefits to workers? (Executive summary: They don't.) Arguments in favor of right-to-work (RTW) boil down to claiming a better overall life for ordinary workers in a state. I'm going to explain my personal bias: No law should ever be made without compelling need. Thus, the burden of proof is not that a law will not make things any worse; it is that a law must make things better. This is why I'm not explicitly testing anti-right-to-work claims. The anti-any-law position is automatically favored as the "null hypothesis". Of course, some laws are trivial to justify. The damage done to people and society by practices such as child prostitution is so enormous, and the moral issue so clear-cut, that it is trivial to show an overriding social need for a law against such practices. When it comes to labor law, things become less immediately clear-cut.

Let's Talk Money

All State Differences
Median Income
Orange: Right-to-work state does better.
Blue: Non-RTW state does better.

One conventional measurement that dominates the argument about right-to-work is income. If you know me, though, you'll already know that I will not look at it in conventional ways. Attempts to attack or defend RTW based on "average" incomes are just plain silly. It's easy to see why. The "average" is only a good representation if income is evenly distributed, which it isn't. A very small portion of Americans in any state have much higher incomes than most of that state does. Instead of the average, I will use the median. The median is literally the number in the middle: half of the households in a state make no more than the median, and half make no less. Thus, it gives a truer picture when the data is highly skewed.

Okay, so now that I've chosen median income (for 2014, since all data for 2015 isn't finalized yet at the US Census) as the basis of comparison (conveniently available from the Bureau of Labor Statistics), how do we compare? One way is to aggregate the two groups of states (RTW vs. non-RTW), subtract one aggregation from the other, et voila! But "simple" isn't always so simple. If there is a difference between the two, is that difference meaningful? The data covers all the possible comparisons (all 50 states for that year). What aggregation should be used?

There are only 50 states. Of these, 24 were RTW in 2014 and 26 were not. That's not much. That's only 624 pairwise comparisons of states. We have spreadsheets in the modern day, so 624 subtractions are nothing. Okay, so I can do 624 subtractions of one state's median income from another's, then what? Aggregate the subtractions and present column charts with error bars and all kinds of statistical gobbie-goo?

I could, but it would hide more than it would reveal. After all, when I've got that few points of data, why not just present them all and let the reader see directly? That's what I did. The first figure displays every single comparison, grouped into "income difference" brackets. Orange columns are where an RTW state had a higher median income than a non-RTW state. Blue columns are the other way around. Orange = RTW better. Blue = RTW worse. If you mouse over, you'll see the limits of each bracket and the actual number of comparisons that fell into that bracket. Overall, an RTW state was better in 145 comparisons. A non-RTW state was better in 479 comparisons. So a non-RTW state came out ahead in about 77% of the comparisons. Does that mean anything?
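If you want to reproduce the comparisons, the whole exercise is a couple of lines in R. The vector names here are hypothetical; each is assumed to hold the 2014 median household incomes for one group of states.

    # 24 RTW states x 26 non-RTW states = 624 pairwise differences.
    diffs <- outer(rtw_median, non_rtw_median, "-")

    sum(diffs > 0)    # comparisons where the RTW state had the higher median income
    sum(diffs < 0)    # comparisons where the non-RTW state did
    hist(as.vector(diffs), breaks = 20)   # roughly the "income difference" brackets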

Let's look at it another way: what is the probability that this difference could occur by random chance? If it is likely to have just been random chance, then we shouldn't let the difference lead us to any conclusions. I used a method called "bootstrapping" to estimate the probability that this outcome was random chance. You can look up bootstrapping in any statistics textbook if you are really into the nuts and bolts. To make a long story shorter, it turns out that the probability of this outcome just being random chance is roughly 0.0004. Statistical "significance" begins when that probability is equal to or less than 0.05. We can safely exclude random chance as an explanation for this outcome.
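The exact resampling scheme isn't spelled out here, but a label-shuffling sketch along these lines (continuing with the hypothetical rtw_median and non_rtw_median vectors above) gives the flavor of the test:

    set.seed(1)
    all_medians <- c(rtw_median, non_rtw_median)
    n_rtw <- length(rtw_median)

    # Observed share of comparisons won by an RTW state.
    observed <- mean(outer(rtw_median, non_rtw_median, "-") > 0)

    # Shuffle the RTW labels many times and recompute that share each time.
    null_shares <- replicate(10000, {
      idx <- sample(seq_along(all_medians), n_rtw)
      mean(outer(all_medians[idx], all_medians[-idx], "-") > 0)
    })

    # How often does a random labeling look at least this bad for RTW?
    mean(null_shares <= observed)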

All State Differences
Cost-of-Living Adjusted Median Income
Orange: Right-to-work state does better.
Blue: Non-RTW state does better.

But money goes out, too.

However, as Mark Twain long ago tried to point out in A Connecticut Yankee, income is only half the story of individual prosperity. If you make twice as much money as the cobbler in the next village but have to pay twice as much for everything, you're no better off than he is, no matter how big your income might look before you pay your bills. A more accurate picture of the effect of RTW comes from factoring in cost of living as a relative state ratio. Thus, a more expensive state (New York) had a 2014 cost of living of 1.316, while a cheaper state (Oklahoma) had one of 0.921. What this means is that an income of $50,000 in New York would be roughly equivalent to an income of $35,000 in Oklahoma. The New Yorker might make more money on paper, but he'd be no better off than a Sooner making $15,000 less a year! That's a pretty important factor. I do have to caution that statewide costs of living are very approximate, and it is easy to find exceptions. Manhattan would be even worse, for example.
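The arithmetic, for anyone who wants to check it:

    50000 / 1.316   # New York income in cost-adjusted terms:  ~38,000
    35000 / 0.921   # Oklahoma income in cost-adjusted terms:  ~38,000 -- roughly equal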

When we factor cost of living into median income, the picture changes. RTW states now have a slight advantage. Cost-of-living adjusted median wages are better in 55% of the comparisons (554 out of 612). However, bootstrapping showed a 45% probability that this was just due to random chance. In other words, the effect of RTW on cost-of-living adjusted wages is a toss-up, almost 50-50. So, it seems that the RTW advocates may be right on one thing: states without RTW tend to have a higher cost of living. The opponents of RTW are also correct: RTW goes hand-in-hand with lower wages. In short, it's a wash. In terms of income after expenses, RTW has no net effect.

All State Differences
Median Wages Adjusted by
Employment, Population, and Cost of Living
Orange: Right-to-work state does better.
Blue: Non-RTW state does better.

Let's Talk Money and Jobs

The tale is not fully told, though. After all, what if RTW states (like Texas) happen to be very populous states and non-RTW states (like Alaska) are sparsely populated? Then, even though RTW might not do well on a pure state-by-state basis, in terms of the overall prosperity of human beings it might shine! But how to measure that? If we are thinking primarily of ordinary people (and we should, since the arguments about RTW always come down to whether it helps ordinary people), we can start with median income again. Imagine that every employed person in a state made the median income. Multiplying the median income by the number of employed people in a state then gives an aggregate income estimate. We also have to take into account that a state may have a lot more people to support on top of those who are working, so divide the aggregate income by the state's total population. This "population-adjusted income" can give us an idea of how well each state does vs. another in terms of the comfort of its mass of people.
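As a sketch, with hypothetical per-state vectors (median_income, employed, population, and col for the relative cost-of-living ratio), the adjustment looks like this:

    aggregate_income    <- median_income * employed       # as if every worker earned the median
    population_adjusted <- aggregate_income / population  # spread over everyone in the state
    fully_adjusted      <- population_adjusted / col      # cost-of-living step, described next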

Let us not forget cost of living, since higher wages are passed on to consumers by the businesses paying them. What does that give us? You've probably already been looking over the last graph. As you can see, RTW does slightly worse than non-RTW in terms of income, adjusted by employment, population, and cost of living. In 292 comparisons, an RTW state did better than a non-RTW state, but in 332 comparisons, a non-RTW state did better than an RTW state. Bootstrapping tells us, though, that this is probably (70%) just random chance. That is, when you get down to the wire, RTW makes no overall difference.

What does this ultimately mean? The claim I tested is "Right to work improves the lot of the worker". In the end, taking into account cost of living, or combined cost of living, employment and population, it's a toss-up. While Right to work might not guarantee misery, it also does nothing to improve the overall condition of the vast majority of Americans. What it tells me, personally, is that Right to Work is a failure. It does not benefit ordinary workers in any way that cannot be just as easily explained by random chance.

What is the take-home? Given the data at hand, a compelling worker-benefit based argument in favor of enacting or maintaining RTW cannot be made. By and large, RTW is not a policy that produces enough benefit to an ordinary worker to be worthy of being kept as law, not even when macro-economic factors such as overall employment are taken into account.

January 12, 2016

Presidential Candidates, Politifact, and Who is Close to Whom

Tree of Candidates, 12 January
(Tree diagram of candidates: Pelosi, O’Malley, Johnson, Carson, Cruz, Fiorina, Huckabee, Clinton, Obama, Sanders, Bush, Christie, Kasich, Paul, Rubio, Santorum, and Trump, with four clusters marked.)

A few months ago, I plotted out the Presidential candidates from the two major parties in terms of their truthfulness. I did this with a "tree" (more of a "bush") diagram based on the Politifact Truth-o-Meter. As I mentioned before, Politifact does provide individual summary charts for each person and a description of the various statements used to create the charts. Unfortunately, comparing profiles isn't straightforward, especially if you want to look at several of them at once. That's where nerdistry comes in.

In addition, a new candidate has formally entered the race since my last attempt. Thus, I again went to the data on each politician's page, ran it through some nerd magic, and came up with a new tree. I restricted myself to formally filed candidates who have more than 4 rulings on Politifact. I also have Barack Obama and Nancy Pelosi, for reference. I color-coded the names by party. You can click on any name to lead you to that person's Politifact page. Four "meaningful" clusters (see below for what "meaningful" means) appeared in the data and have curves drawn around them. The differences among politicians inside the same "meaningful" cluster are not worth noting. Yes, this means that, when it comes to truthfulness, as measured by Politifact, Santorum, Fiorina, Cruz, and Huckabee lump in with Pelosi. Clinton and Sanders (and Obama) are pretty much the same as Bush, Christie, Kasich, Paul, and Rubio.

As last time, the clusters roughly summarize what end of the "True" vs. "Pants-on-Fire" profile a candidate sits on. The top left cluster (Let's call it Clinton-Bush) leans more to "True" and "Mostly True". The cluster on the bottom right (we can call it the Pelosi Cartel, just for giggles) tends to prefer "half true" and "mostly false". The bottom left cluster is heavily dominated by "False", with a dash of "Pants on Fire". O'Malley and Johnson (a Libertarian candidate) are in their own outlying cluster that is more "middling" between true/false. However, both these candidates have relatively few statements in their files.

And the take-home message? Two messages: First, if you agree with Politifact, it's a rough indication of who is more trustworthy. If you reject Politifact's conclusions, just invert the true/false interpretations. Second, you can see who resembles each other in terms of trustworthiness and that this hasn't changed much since August. Agree with or reject Politifact, this part is consistent. Politicians in the same cluster seem to have the same basic character as each other when it comes to honesty or its lack. Like I said, if you dislike Politifact, just flip the interpretation of truth.

Nerd Section

This is a repeat of August's methods. I used the "R" statistical language and the "cluster", "gclus", "ape", "clue", "protoclust", "multinomialCI" and "GMD" packages. Then I gathered up the names of declared candidates for US President. I did not intend to limit this to only Republicans or Democrats. Unfortunately, when I looked people up on Politifact, only Republicans and Democrats had more than 4 rulings. Why more than 4? A rough estimate of the "standard error" of count data is the square root of the total. The square root of 4 is 2, which means that if a candidate had only 4 rulings, the accuracy was plus or minus 2. Such a large wobble was too much for my taste. This time, I ended up with 17 candidates.

Comparing them required a distance metric. I could have assigned scores to each ruling level and then calculated an average total per ruling. While this might be tempting, it is also wrong. Why is it wrong? Because that method would make a loose cannon the same as a muddled fence-sitter. Imagine a candidate who only tells the complete truth or complete whoppers. If you assign scores and average, this will come out being the same as a candidate who never commits but only makes halfway statements. Such people should show up as distinct in any valid comparison.

Fortunately, there are other ways to handle this question. I decided to use a metric based on the chi distance. Chi distance is based on the square of the difference between two counts divided by the expected value. It's used for comparing pictures, among other uses. However, a raw chi distance depends very much upon the total, and the totals were very different among candidates. The solution to this was easy, of course. I just took the relative counts (count divided by total) for each candidate.
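As a sketch, the distance looks something like this in R. The exact variant is my assumption; this is one common form of the chi distance used for comparing histograms, with the average of the two profiles as the "expected" value.

    chi_distance <- function(counts_a, counts_b) {
      p <- counts_a / sum(counts_a)   # relative counts remove the effect of
      q <- counts_b / sum(counts_b)   # very different totals per candidate
      expected <- (p + q) / 2
      keep <- expected > 0            # skip categories empty for both candidates
      sqrt(sum(((p - q)^2 / expected)[keep]))
    }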

I needed one more element for my metric. Politifact does not rate every single statement someone makes. They pick and choose. Eventually, if they get enough statements, their profiles probably present an accurate picture, but until they get a very large number of statements, there is always some uncertainty. Fortunately, multinomialCI estimates that uncertainty. I ran the counts through multinomialCI and got a set of "errors" for each candidate. I combined these with the chi distances to obtain "uncertainty-corrected distance" between each candidate. Long story short, this was done by dividing the chi distance by the square root of the sums of the squares of the errors. What that meant is that a candidate with a large error (few rulings) was automatically "closer" to every other candidate due to the uncertainty of that candidate's actual position.
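A sketch of that correction, building on the chi_distance sketch above and using the MultinomialCI package. How the interval widths get collapsed into a single error term is my guess at a reasonable implementation, not necessarily the original one.

    library(MultinomialCI)

    # Rough per-candidate "error": half-widths of the simultaneous confidence
    # intervals for the ruling proportions.
    ruling_error <- function(counts) {
      ci <- multinomialCI(counts, alpha = 0.05)   # one row per ruling: lower, upper
      (ci[, 2] - ci[, 1]) / 2
    }

    # Uncertainty-corrected distance: candidates with few rulings (big errors)
    # end up "closer" to everyone else.
    corrected_distance <- function(counts_a, counts_b) {
      err <- c(ruling_error(counts_a), ruling_error(counts_b))
      chi_distance(counts_a, counts_b) / sqrt(sum(err^2))
    }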

I then created a series of hierarchical clustering trees from this set of distances. There is a good deal of argument over which tree creation method is best. I decided to combine multiple methods. I created trees using "nearest neighbor", "complete linkage", "UPGMA", "WPGMA", "Ward's", "Protoclust", and "Flexible Beta" methods. The "clue" package was designed to combine such trees in a rational fashion. Feel free to look it up if you want to follow all the math. I used clue to create the "consensus tree", which is the structure I posted on my blog. But clue doesn't tell you how to "cut" the clusters. For that, I turned to the "elbow method".
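A sketch of the tree-building and combining steps, assuming the uncertainty-corrected distances live in a dist object called d. I've left the Protoclust tree out to keep the sketch short, and the flexible-beta parameter is my assumption.

    library(cluster)   # agnes, for the flexible-beta tree
    library(clue)      # cl_ensemble / cl_consensus

    trees <- list(
      hclust(d, method = "single"),     # nearest neighbor
      hclust(d, method = "complete"),   # complete linkage
      hclust(d, method = "average"),    # UPGMA
      hclust(d, method = "mcquitty"),   # WPGMA
      hclust(d, method = "ward.D2"),    # Ward's
      as.hclust(agnes(d, method = "flexible", par.method = 0.625))  # flexible beta
    )

    consensus <- cl_consensus(cl_ensemble(list = trees))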

The elbow method is an old statistical rule of thumb. Basically, any clustering tree can be sliced in multiple ways to say "these things fall into those groups and smaller groups don't really matter". The elbow method compares the possible cuts by asking how much of the "variance" in the data each number of clusters explains. You then plot the unexplained (within-cluster) variance against the number of clusters. That line will always descend as you add clusters. What you look for is a "scree" or an "elbow": a point where there is a sharp bend in the line, after which more clusters won't add enough additional explanation to be worth the cut. In this case, my elbow was at four clusters.
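A sketch of the elbow computation. Here cut_k(k) stands for however you extract a k-cluster membership vector from the consensus tree (for a plain hclust tree it would just be cutree(tree, k)), and d is the distance matrix from before.

    within_var <- sapply(2:10, function(k) {
      groups <- cut_k(k)
      sum(sapply(split(seq_along(groups), groups), function(idx) {
        dm <- as.matrix(d)[idx, idx, drop = FALSE]
        sum(dm^2) / (2 * length(idx))   # within-cluster sum of squared distances
      }))
    })

    plot(2:10, within_var, type = "b",
         xlab = "Number of clusters", ylab = "Unexplained (within-cluster) variance")
    # Look for the sharp bend; in this case it fell at four clusters.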