Faster computers. Bigger and better databases. Quicker and more accurate data collection. Robust modeling software. Easy presentation graphics. These are all good things, right?
Mark Twain popularized the quip “There are three kinds of lies: lies, damned lies, and statistics” (he attributed it to Benjamin Disraeli). But in Twain’s day statistics were difficult to come by: all data collection, tabulation, and calculation were done by hand, with the attendant errors. Now an intern with no mathematical or statistical training beyond high school can produce all sorts of charts, ratios, and cross-tabulations from freely available datasets and data models with which they may have no real familiarity.
The world is now awash in data, data science, and data-driven analysis. By appealing to data, an argument seems to gain the stamp of approval of science. But in some cases where people avail themselves of data science, they are either ignorant of or deliberately set aside data science’s foundations in mathematics and statistics, and in particular the limits within which the underlying techniques can be applied. In short, there is a disturbing trend lately for people to use data as a form of “language intimidation” (a phrase I borrow from Jon Shepherd at Camden Depot) instead of as a tool to pursue truth. Let us call this phenomenon “data intimidation.”
Unfortunately the recipients of data intimidation seldom distinguish themselves, with the typical response falling into one of three categories:
a.) Anecdotal – “Your statistics don’t describe my experience”
b.) Equally disingenuous – “I read that data differently”
c.) Simply offensive – “Shut up, smarty-pants”
But I’m here to help. I humbly present to you some simple rules for recognizing and defending yourself from data intimidation.
- Learn the difference between descriptive statistics and predictive statistics. Descriptive statistics attempt to explain an existing dataset in a short form. For example, the statistic “In 2010, gun deaths were the third leading cause of death in US children ages 5-14” is descriptive, summarizing a large dataset in a single phrase. Predictive statistics attempt to predict future behavior based on an existing dataset. “Based on historical norms, we expect 13 hurricanes this season” is a predictive statistic. One often hears the term “model” in this context.
- Carefully examine the choice of descriptive statistics. When people engage in data intimidation, it is common to see statistics that are neither raw numbers nor percentages. Ordinal statistics are typical offenders, e.g., “second most”, “third lowest”. Try to imagine a more useless statistic than “Gary Johnson was the third leading vote-getter in the 2012 US presidential election.” If the choice of statistic leaves you with more questions than answers, the provider may be hiding something. Call them on it, and ask for the raw numbers. Citing a source is not sufficient.
- Beware restrictions of data to subpopulations. It is not always disingenuous: for example, our “Gun deaths were the third leading cause of death in US children ages 5-14” example is restricted to the subpopulation of children and not the US population at large. If the conversation is about gun deaths in children, this is a natural, logical restriction. But if the statistic were presented as “Gun deaths are the leading cause of death in children ages 5-14 in the southern US”, you should wonder why the writer shifted to the “southern US” subpopulation. The answer is probably to get a more dramatic talking point. You should have no qualms asking the writer why he or she chose to restrict the population unnecessarily.
- Question binning of data. Let’s look at “gun deaths”; what does that mean? Does it mean gun violence (e.g., homicide)? Does it mean any death where a gun is involved, including accidents and suicides? Data intimidators often coin catch-all terms such as “gun deaths” when they want the statistic to convey “gun violence” but still get to include accidents and suicides in their counts. Lest you think I am paranoid or heartless in this example, this is exactly what the CDC does in calculating its “injury by firearms” statistic. Always ask for the exact definition of terms, and make sure that the writer’s language matches their statistics.
- Models cannot be proven incorrect. Don’t try. Given enough observations, every model produces outliers; no finite amount of data can conclusively prove a predictive model is incorrect. Do not think that 2014’s polar vortices prove that global warming models are a crock; they don’t.
- Models cannot be proven correct. Don’t let anybody else try. Respectfully, the phrase “the science is settled” is a pantload. It is a clear attempt to use data intimidation to squelch debate. To their credit, those who say it are often sick of engaging in ignorant debate, e.g., “Shut up, smarty-pants.” But the blanket statement “the science is settled” suggests a belief that a model has been proven correct. That is impossible, and should be anathema to anyone who really believes in science.
- You are very unlikely to shake confidence in a model. Unless the data scientist who built a model is grossly incompetent or dishonest (e.g., Lysenko’s theory of genetics), the data scientist has most likely identified the most important factors in creating a prediction across the general population. Moreover, there is confirmation bias towards the model. The data scientist has a model and you don’t. There is a high likelihood that the data scientist is more familiar with the subject at large than you, and a very high likelihood that the data scientist is more familiar with the specific data in the model than you. Respect the work the data scientist has done, and know your stuff if you are going to engage.
- Defend the neutral ground. It is simply impossible to tell whether any one event that deviates from a model’s prediction is a genuine outlier, i.e., random variation that the model allows, or evidence against the model. But if enough deviations occur, you can perform your own analysis of them. If you can spot some sort of correlation in those deviations, that may be evidence of a subpopulation where the model predicts poorly. It may also be a set of well-correlated outliers. Ultimately, the accuracy of a model is a belief. Statistics can compare two models and suggest one is better, but there is no magic bullet to say one model is the best possible model, bar none. If you keep an open mind and present your evidence in the right manner, you may change some minds.
- Always, always challenge the fallacy that any inconsistent result is necessarily an outlier. Never let anyone, even the super smart bubba who designed a model, say “this is an outlier”. This is a symptom of intellectual laziness on the part of the data intimidator, because it is a small step from “this is an outlier” to “all anomalies are outliers” to “the science is settled”. An honest data scientist may say “I believe this is an outlier”, or “I’m not sure if it’s an outlier, but it’s not that unusual and I have better things to do.” You should never let the bald condemnation of “Outlier!” pass unchallenged.
While the tone of this post is kinda negative, I’m not really all that pessimistic about data science. I generally live by not attributing to malice what can more readily be explained by laziness, ignorance or greed. But in case you come across “that guy,” I hope I’ve armed you with some intelligent ways to disagree in a civil manner.