“How are courts doing when it comes to interpreting the statistical data that goes into their decision-making?” That was a question posed by someone in the audience at a presentation I gave recently. I was discussing, among other things related to the perils of litigating statistical inferences, the recent paper “Robust Misinterpretation of Confidence Intervals.” It reports on the results of a study designed to determine how well researchers and students in a field that relies heavily on statistical inference actually understand their statistical tools. What it found was a widespread “gross misunderstanding” of those tools among both students and researchers. “[E]ven more surprisingly, researchers hardly outperformed the students, even though the students had not received any education on statistical inference whatsoever.” So, returning to the very good question, how are our courts doing?

To find out, I ran the simple search “confidence interval” across Google Scholar’s Case Law database with the date range set to “Since 2014”. The query returned 56 hits. Below are eight representative quotes, (A) through (H), taken from those orders, reports and opinions. Can you tell which ones are correct and which constitute a “gross misunderstanding”?

  A. “The school psychologist noted that there was a 95% confidence interval that plaintiff’s full scale IQ fell between 62 and 70 based on this testing.” And later: “A 90% confidence interval means that the investigator is 90% confident that the true estimate lies within the confidence interval.”
  B. “A 95% confidence interval means that there is a 95% chance that the ‘true’ ratio value falls within the confidence interval range.”
  C. “Once we know the SEM (standard error of measurement) for a particular test and a particular test-taker, adding one SEM to and subtracting one SEM from the obtained score establishes an interval of scores known as the 66% confidence interval. See AAMR 10th ed. 57. That interval represents the range of scores within which ‘we are [66%] sure’ that the ‘true’ IQ falls. See Oxford Handbook of Child Psychological Assessment 291 (D. Saklofske, C. Reynolds, & V. Schwean eds. 2013).”
  D. “Dr. Baker applied his methodology to the available academic research and came up with a confidence interval based on that research. The fact that the confidence interval is high may be a reason for the jury to disagree with his approach, but it is not an indication that Dr. Baker did not apply his method reliably.”
  E. “A 95 percent confidence interval indicates that there is a 95 percent certainty that the true population mean is within the interval.”
  F. “Statisticians typically calculate margin of error using a 95 percent confidence interval, which is the interval of values above and below the estimate within which one can be 95 percent certain of capturing the ‘true’ result.”
  G. “Two fundamental concepts used by epidemiologists and statisticians to maximize the likelihood that results are trustworthy are p-values, the mechanism for determining ‘statistical significance,’ and confidence intervals; each of these mechanisms measures a different aspect of the trustworthiness of a statistical analysis. There is some controversy among epidemiologists and biostatisticians as to the relative usefulness of these two measures of trustworthiness, and disputes exist as to whether to trust p-values as much as one would value confidence interval calculations.”
  H. “The significance of this data (referring to calculated confidence intervals) is that we can be confident, to a 95% degree of certainty, that the Latino candidate received at least three-quarters of the votes cast by Latino voters when the City Council seat was on the line in the general election.”

Before I give you the answers (and, after that, some hopefully helpful insights into confidence intervals), here is the questionnaire given to the students and researchers in the study referenced above, along with the correct answers. Thus armed, you’ll be able to judge for yourself how our courts are doing.

Professor Bumbledorf conducts an experiment, analyzes the data, and reports:

The 95% confidence interval for the mean ranges from 0.1 to 0.4.

Please mark each of the statements below as “true” or “false”. False means that the statement does not follow logically from Bumbledorf’s result.

(1) The probability that the true mean is greater than 0 is at least 95%.

Correct Answer: False

(2) The probability that the true mean equals 0 is smaller than 5%.

Correct Answer: False

(3) The “null hypothesis” that the true mean equals zero is likely to be incorrect.

Correct Answer: False

(4) There is a 95% probability that the true mean lies between 0.1 and 0.4.

Correct Answer: False

(5) We can be 95% confident that the true mean lies between 0.1 and 0.4.

Correct Answer: False

(6) If we were to repeat the experiment over and over, then 95% of the time the true mean would fall between 0.1 and 0.4.

Correct Answer: False

Knowing that these statements are all false, it’s easy to see that statements (A), (B), (C), (E), (F), and (H) found in the various orders, reports and opinions are equally false. I included (D) and (G) as examples typical of those courts that were sharp enough to be wary about saying too much about what confidence intervals might be but which fell into the same trap nonetheless. And that trap is believing that confidence intervals have anything to say about whether the parameter (again, typically the mean/average – something like the average age of recently laid off employees) that has been estimated is true or even likely to be true. (G), by the way, manages to get things doubly wrong. Not only does it repeat the false claim that estimates falling within the confidence interval are “trustworthy”, it also repeats the widely held but silly claim that confidence intervals are perhaps more reliable than p-values. Confidence intervals, you see, are made out of p-values (see “Problems in Common Interpretations of Statistics in Scientific Articles, Expert Reports, and Testimony” by Greenland and Poole if you don’t believe me), so the argument (albeit unintentionally) being made in (G) is that p-values are more reliable than p-values. Perhaps unsurprisingly, of the 56 hits I found only two instances of courts not getting confidence intervals wrong, and in both cases they avoided any discussion of confidence intervals and instead merely referenced the section on the topic in the Reference Manual on Scientific Evidence, Third Edition.
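If you’d like to see what “made out of p-values” means concretely, here’s a minimal sketch (my own illustration with made-up data – not anything from Greenland and Poole or the opinions quoted above): the usual 95% t-interval for a mean is simply the set of hypothesized means that a two-sided t-test would fail to reject at the 0.05 level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.25, scale=0.4, size=30)  # hypothetical data

# The textbook 95% t-interval for the mean
lo, hi = stats.t.interval(0.95, len(sample) - 1,
                          loc=sample.mean(), scale=stats.sem(sample))

# The same interval recovered by inverting the t-test: keep every hypothesized
# mean whose two-sided p-value is at least 0.05.
grid = np.linspace(lo - 0.2, hi + 0.2, 2001)
kept = [m for m in grid if stats.ttest_1samp(sample, popmean=m).pvalue >= 0.05]

print(f"t-interval:       ({lo:.3f}, {hi:.3f})")
print(f"inverted t-tests: ({min(kept):.3f}, {max(kept):.3f})")  # essentially identical
```

In other words, an interval built from p-values can’t be more trustworthy than the p-values it’s built from.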

Why do courts (and students and researchers) have such a hard time with confidence intervals? Here’s a guess. I suspect that most people have a pretty profound respect for science. As a result, when they first encounter “the null hypothesis racket” (please see Revenge of the Shoe Salesmen for details) they simply refuse to believe that a finding published in a major peer-reviewed scientific journal could possibly be the result of an inferential method that would shock a Tarot card reader. People are only just coming to realize what the editor of The Lancet wrote this spring: scientific journals are awash in “statistical fairy tales”.

Now there’s nothing inherently suspect about confidence intervals – trouble only arises when they’re put to purposes for which they were never intended and thereafter “grossly misunderstood”. To understand what a confidence interval is, you need to know the two basic assumptions on which it rests. The first is that you know something about how the world works. So, for example, if you’re trying to estimate the true ratio of black marbles to white marbles in a railcar full of black and white marbles, you know that everything in the railcar is a marble and that each is either black or white. Therefore, whenever you take a sample of marbles you can be certain that what you’re looking at is reliably distinguishable and countable. The second is that you can sample your little corner of nature over and over again, forever, without altering it and without it changing on its own.

Without getting too deep into the weeds, those two assumptions alone ought to be enough to make you skeptical of claims about largely unexplained, extraordinarily complex processes like cancer or IQ or market fluctuations estimated from a single sample; especially when reliance on that estimate is urged “because it fell within the 95% confidence interval”. And here’s the kicker. Even in the black and white world of hypothetical marbles the confidence interval says nothing about whether the sample of marbles you took is representative of, or even likely to be representative of, the true ratio of black to white marbles. All it says is that if your assumptions are correct, and given the sample size you selected, then over the course of a vast number of samples your process for capturing the true ratio (using very big nets, two standard deviations on either side of each estimate) will catch it 95% (or 99%, or whatever percent you’d like) of the time. There is no way to know (especially after only one sample) whether or not the sample you just took captured the true ratio – it either did or it didn’t. Thus the confidence interval says nothing about whether you caught what you were after but rather speaks to the size of the net you were using to try to catch it.
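If you want to watch the “net” do its work, here’s a minimal simulation sketch (the true ratio, sample size, and number of repetitions are all numbers I made up for illustration): across thousands of repeated samples roughly 95% of the intervals catch the true ratio, yet any single interval either caught it or it didn’t.

```python
import numpy as np

rng = np.random.default_rng(1)
true_ratio = 0.30        # assumed true proportion of black marbles (made up)
n_marbles = 200          # marbles drawn per sample (made up)
n_repeats = 10_000

caught = 0
for _ in range(n_repeats):
    p_hat = rng.binomial(n_marbles, true_ratio) / n_marbles    # observed proportion
    net = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n_marbles)      # ~2 standard errors wide
    if p_hat - net <= true_ratio <= p_hat + net:
        caught += 1

print(f"Intervals that caught the true ratio: {caught / n_repeats:.1%}")
# Roughly 95% over the whole run -- but each individual interval either did or didn't.
```

The 95% belongs to the procedure repeated over and over, not to any one interval a court happens to be looking at.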

Thus my take on all the judicial confusion surrounding confidence intervals: they’re like judges at a fishing competition where contestants keep showing up with nets but no fish and demanding that their catches be weighed “scientifically” according to the unique characteristics of their nets and their personal beliefs about fish. Who wouldn’t be confused?