Friday, July 29, 2016

Stop saying confidence intervals are "better" than p values

One of the common tropes one hears from advocates of confidence intervals is that they are superior, or should be preferred, to p values. In our paper "The Fallacy of Placing Confidence in Confidence Intervals", we outlined a number of interpretation problems in confidence interval theory. We did this from a mostly Bayesian perspective, but in the second section was an example that showed why, from a frequentist perspective, confidence intervals can fail. However, many people missed this because they assumed that the paper was all Bayesian advocacy. The purpose of this blog post is to expand on the frequentist example that many people missed; one doesn't have to be a Bayesian to see that confidence intervals can be less interpretable than the p values they are supposed to replace. Andrew Gelman briefly made this point previously, but I want to expand on it so that people (hopefully) more clearly understand the point.

Understanding the argument I'm going to lay out here is critical to understanding both p values and confidence intervals. As we'll see, fallacies about one or the other are what lead advocates of confidence intervals to falsely believe that CIs are "better".

p values and "surprise"

First, we must define a p value properly and understand its role in frequentist inference. The p value is the probability of obtaining a result at least as extreme as the one we observed, under some assumption about the true distribution of the data. A low p value is taken as indicating that the result observed was very extreme under the assumptions, and hence calls the assumptions into doubt. One might say that a low p value is "surprising" under the assumptions. I will not question this mode of inference here.

It is critical to keep in mind that a low p value can call an assumption into doubt, but a high p value does not "confirm" anything. This is consistent with falsificationist logic. We often see p values used in the context of null hypothesis significance testing (NHST), where a single p value is computed that indicates how extreme the data are under the assumption of a null hypothesis; however, we can compute p values for any hypothesis we like. As an example, suppose we are interested in whether reading comprehension scores are affected by caffeine. We apply three different doses to N=10 people in each group in a between-subjects design, and test their reading comprehension. For the sake of the example, we assume normality, homogeneity of variance, etc. We apply a one-way ANOVA to the reading comprehension scores and obtain an F statistic of F(2,27)=8.

If we were to assume that there was no relationship between the reading scores and caffeine dose, then the resulting p value for this F statistic is p=0.002. This indicates that we would only expect F statistics as extreme as this one 0.2% of the time, if there were no true relationship.

The curve shows the distribution of F(2,27) statistics when the null hypothesis is true. The area under the curve to the right of the observed F statistic is the p value.
This low p value would typically be regarded as strong evidence against the null hypothesis, because -- as the graph above shows -- an F statistic as extreme as the observed one would be quite rare, if indeed there were no relationship between reading scores and caffeine.
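
As a quick check of that number, here is a minimal sketch in Python, assuming SciPy is installed. The variable names are illustrative and not from the original analysis. It computes the right-tailed p value of the observed F statistic under the null hypothesis, which should round to about 0.002.

```python
from scipy import stats

# Design from the example: three dose groups with N = 10 people each.
n_groups, n_per_group = 3, 10
df1 = n_groups - 1                   # numerator degrees of freedom (2)
df2 = n_groups * (n_per_group - 1)   # denominator degrees of freedom (27)
f_obs = 8.0

# Right-tailed p value: the probability of an F statistic at least this
# large if there were no relationship between dose and reading score.
p_null = stats.f.sf(f_obs, df1, df2)
print(f"F({df1},{df2}) = {f_obs}: right-tailed p = {p_null:.4f}")
```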

So far, this is all first-year statistics (though it is often misunderstood). Although we typically see p values computed for a single hypothesis, there is nothing stopping us from computing it for multiple hypotheses. Suppose we are interested in the true size of the effect between reading scores and caffeine dosage. One statistic that quantifies this relationship is ω², the proportion of the total variance in the reading scores that is "accounted for" by caffeine (see Steiger, 2004 for details). We won't get into the details of how this is computed; we need only know that:

  • When ω²=0, there is no relationship between caffeine and reading scores. All variance is error; that is, knowing someone's reading score does not give any information about which dose group they were in.
  • When ω²=1, there is the strongest possible relationship between caffeine and reading scores. No variance is error; that is, by knowing someone's reading score one can know with certainty which dose group they were in.
  • As ω² gets larger, larger and larger F statistics are predicted.

We have computed the p value under the assumption that ω²=0, but what about all other ω² values? Try this shiny app to find the predicted distribution of F statistics, and hence p values, for other values of ω². Try to find the value of ω² that would yield a p value of exactly 0.05; it should be about ω²=0.108.

A Shiny app for finding p values in a one-way ANOVA with three groups.

All values of ω² less than 0.108 yield p values of less than 0.05. If we designate p<0.05 as "surprising" p values, then F=8 would be surprising under the assumption of any value of ω² between 0 and 0.108.

Using the Shiny app, we can see that an F=8 yields a right-tailed p value of about 0.05 when ω² is approximately 0.108.
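
For readers without access to the app, the same search can be sketched in Python with SciPy. This is an illustration under a stated assumption, not the app's actual code: the post does not say how ω² maps onto the noncentral F distribution, so the sketch assumes the noncentrality parameter is N·ω²/(1−ω²), with N=30 the total sample size. The function name is hypothetical.

```python
from scipy import optimize, stats

DF1, DF2, N_TOTAL = 2, 27, 30   # three groups of 10, as in the example


def right_tailed_p(f_obs, omega2):
    """P(F >= f_obs) if the true effect size were omega2 (0 <= omega2 < 1)."""
    if omega2 == 0:
        return stats.f.sf(f_obs, DF1, DF2)        # central F under the null
    # ASSUMPTION: noncentrality = N * omega2 / (1 - omega2); not stated in the post.
    ncp = N_TOTAL * omega2 / (1.0 - omega2)
    return stats.ncf.sf(f_obs, DF1, DF2, ncp)


f_obs = 8.0
# Find the omega2 at which the right-tailed p value equals 0.05.
# brentq needs a bracket where the function changes sign; [0, 0.9] works here.
omega2_lower = optimize.brentq(
    lambda w: right_tailed_p(f_obs, w) - 0.05, 0.0, 0.9
)
print(f"right-tailed p = 0.05 at omega^2 of about {omega2_lower:.3f}")
```

Under that assumed conversion the root lands close to the 0.108 quoted above.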

Notice that the p values we've computed thus far are "right-tailed" p values; that is, "extreme" is defined as "too big". We can also ask whether the F statistic we've found is extreme in the other direction: that is, is it "too small"? A p value used to indicate whether the F value is too small is called a "left-tailed" p value. Using the Shiny app, one can work out the value of ω² such that F=8 would be "surprisingly" small at the p=0.05 level; that value is ω²=0.523. Under any true value of ω² greater than 0.523, F=8 would be surprisingly small.

Using the Shiny app, we can see that an F=8 yields a left-tailed p value of about 0.05 when ω² is approximately 0.523.

  • If 0 ≤ ω² ≤ 0.108, the observed F statistic would be surprisingly large (that is, the right-tailed p ≤ 0.05)
  • If 0.523 ≤ ω² ≤ 1, the observed F statistic would be surprisingly small (that is, the left-tailed p ≤ 0.05)
  • If 0.108 ≤ ω² ≤ 0.523, the observed F statistic would not be surprisingly large or small.
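
The three cases can be sketched as a small classifier in Python with SciPy. The function names are hypothetical, and the conversion from ω² to a noncentrality parameter, N·ω²/(1−ω²) with N=30, is an assumption that the post does not state.

```python
from scipy import stats

DF1, DF2, N_TOTAL = 2, 27, 30   # three groups of 10, as in the example


def tail_p_values(f_obs, omega2):
    """Return (right-tailed p, left-tailed p) of f_obs if omega2 were true."""
    if omega2 == 0:
        dist = stats.f(DF1, DF2)                  # central F under the null
    else:
        # ASSUMPTION: noncentrality = N * omega2 / (1 - omega2).
        dist = stats.ncf(DF1, DF2, N_TOTAL * omega2 / (1.0 - omega2))
    return dist.sf(f_obs), dist.cdf(f_obs)


def verdict(f_obs, omega2, alpha=0.05):
    """Say whether f_obs would be surprising under a hypothesised omega2."""
    right_p, left_p = tail_p_values(f_obs, omega2)
    if right_p <= alpha:
        return "surprisingly large"
    if left_p <= alpha:
        return "surprisingly small"
    return "not surprising"


for omega2 in (0.05, 0.30, 0.70):
    print(f"omega^2 = {omega2:.2f}: F = 8 is {verdict(8.0, omega2)}")
```

Note that the function never returns anything like "plausible"; the most it can say about a value of ω² is that the data do not count against it.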

Critically, we've used p values to make all of these statements. The p values tell us whether values would be "surprisingly extreme", under particular assumptions; p values allow us, under frequentist logic, to rule out true values of ω², but not to rule them in.

p values and confidence intervals

Many people are aware of the relationship between p values and confidence intervals. A typical X% (two-tailed) confidence interval contains all parameter values such that neither one-sided p value is less than (1-X/100)/2. That sounds complicated, but it isn't; for a 90% confidence interval, we just need all the values for which the observed data would not be "too surprising" (where "too surprising" means p<0.05 for one of the two one-sided tests).

We've already computed the 90% confidence interval for ω² in our example; for all values in [0.108, 0.523], the p value for both one-sided tests is p>0.05. From each of the two one-sided tests we get an error rate of 0.05, and hence the confidence coefficient is 100 × [1 - (0.05 + 0.05)] = 90%.
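
The construction can be sketched in Python with SciPy as an inversion of the two one-sided tests. This is an illustration, not a general-purpose routine: the names are hypothetical, the conversion from ω² to a noncentrality parameter (N·ω²/(1−ω²), N=30) is an assumption not stated in the post, and the root search assumes each tail's p value actually crosses the threshold somewhere in [0, 0.9], which holds for F=8 but not for every F.

```python
from scipy import optimize, stats

DF1, DF2, N_TOTAL = 2, 27, 30   # three groups of 10, as in the example


def tail_p_values(f_obs, omega2):
    """Return (right-tailed p, left-tailed p) of f_obs if omega2 were true."""
    if omega2 == 0:
        dist = stats.f(DF1, DF2)
    else:
        # ASSUMPTION: noncentrality = N * omega2 / (1 - omega2).
        dist = stats.ncf(DF1, DF2, N_TOTAL * omega2 / (1.0 - omega2))
    return dist.sf(f_obs), dist.cdf(f_obs)


def confidence_interval(f_obs, level=0.90):
    """Bounds of the omega2 values not rejected by either one-sided test."""
    alpha = (1.0 - level) / 2.0   # error rate allotted to each one-sided test
    # brentq raises ValueError if a tail never crosses alpha on [0, 0.9].
    lower = optimize.brentq(
        lambda w: tail_p_values(f_obs, w)[0] - alpha, 0.0, 0.9)
    upper = optimize.brentq(
        lambda w: tail_p_values(f_obs, w)[1] - alpha, 0.0, 0.9)
    return lower, upper


lower, upper = confidence_interval(8.0, level=0.90)
print(f"90% CI for omega^2: [{lower:.3f}, {upper:.3f}]")
```

Nothing in this computation refers to plausibility or precision; the interval is just the set of values left standing after two significance tests.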

How can we interpret the confidence interval? Confidence interval advocates would have us believe that the interval [0.108, 0.523] gives "plausible" or "likely" values for the parameters, and that the width of this interval tells us the precision of our estimate. But remember how the CI was computed: using p values. We know that nonsignificant high p values do not rule in parameter values as plausible; rather, the values outside the interval have been ruled out, due to the fact that if those were the true values, the observed data would be surprising.

So rather than thinking of the CI as values that are "ruled in" as "plausible" or "likely" by the data, we should rather (from a frequentist perspective, at least) think of the confidence interval as values that have not yet been ruled out by a significance test.

Does this matter?

This distinction matters a great deal for understanding both p values and confidence intervals. In order to use p values in any way that approaches reasonability, we need to understand the "surprise" interpretation, and we need to realise that we can compute p values for many hypotheses, not just the null hypothesis. In order to interpret confidence intervals well, we need to understand the "fallacy of acceptance": Just because a value is in the CI, doesn't mean it is plausible; it only means that it has not yet been ruled out.

To see the real consequences of this fallacy, consider what we would infer if F(2,27)=0.001 (p=0.999). Any competent data analyst would notice that there is something wrong; the means are surprisingly similar. Under the null hypothesis, when all error is due to error within the groups, we expect the means to vary. This F statistic indicates that the means are so similar that even under the null hypothesis -- where the true means are exactly the same -- we would expect more similar observed means only one time in a thousand.

In fact, the F statistic is so small that under all values of ω², the left-tailed p value is at most 0.001. Why? Because ω² can't be any lower than 0, and this represents the null hypothesis. If we built a 90% confidence interval, it would be empty because there are no values of ω² that yield p>0.05. For all true values of ω², the observed data are "surprising". Now this presents no particular problem for an interpretation of confidence intervals that rests solely on their relationship with p values. But note that the very high p value tells us more than the confidence interval; the CI depends on the chosen confidence coefficient, and here is simply empty. The p value and the F statistic have the information we want; they tell us that the means are much more similar than we would typically expect under any hypothesis. A competent data analyst would, at this point, check the procedure or data for problems. The entire model is suspect.
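
A short Python sketch with SciPy illustrates why no value of ω² escapes rejection here. The function name is hypothetical, and the mapping from ω² to a noncentrality parameter, N·ω²/(1−ω²) with N=30, is an assumption not stated in the post. Because larger ω² predicts larger F statistics, the left-tailed p value is greatest at ω²=0 and only shrinks from there.

```python
from scipy import stats

DF1, DF2, N_TOTAL = 2, 27, 30   # three groups of 10, as in the example


def left_tailed_p(f_obs, omega2):
    """P(F <= f_obs) if the true effect size were omega2 (0 <= omega2 < 1)."""
    if omega2 == 0:
        return stats.f.cdf(f_obs, DF1, DF2)       # central F under the null
    # ASSUMPTION: noncentrality = N * omega2 / (1 - omega2).
    ncp = N_TOTAL * omega2 / (1.0 - omega2)
    return stats.ncf.cdf(f_obs, DF1, DF2, ncp)


f_obs = 0.001
p_at_null = left_tailed_p(f_obs, 0.0)
p_elsewhere = [left_tailed_p(f_obs, w) for w in (0.01, 0.1, 0.5)]

print(f"left-tailed p at omega^2 = 0: {p_at_null:.4f}")
for w, p in zip((0.01, 0.1, 0.5), p_elsewhere):
    print(f"left-tailed p at omega^2 = {w}: {p:.2e}")
```

Since even the largest of these p values is far below 0.05, the set of non-rejected values is empty at the 90% level.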

But what does this mean for a confidence interval advocate who is invested in the (incorrect) interpretation of the CI in terms of "plausible values" or "precision"? Consider Steiger (2004), who suggests replacing a missing bound with "0" in the CI for ω². This is an awful suggestion. In the example above with F=0.001, this would imply that the confidence interval includes a single value, 0. But the observed data F=0.001 would be very surprising if ω²=0. Under frequentist logic, the value -- and all other values -- should be ruled out. Moreover, a CI of (0) is infinitesimally thin. Steiger admits that this obviously does not imply infinite precision, but neither Steiger nor any other CI advocate gives a formal reason why CIs must, in general, have an interpretation in terms of precision. When the interpretation obviously fails, this should make us doubt whether the interpretation was correct in the first place. The p value tells the story much better than the CI, without encouraging us to fall into fallacies of acceptance or precision.

Where to go from here?

It is often claimed that confidence intervals are more informative than p values. This assertion is based on a flawed interpretation of confidence intervals, which we call the "likelihood" or "plausibility" fallacy, and is related to Mayo's "fallacy of acceptance". A proper interpretation of confidence intervals, in terms of the underlying significance tests, avoids this fallacy and prevents bad interpretations of the CIs, in particular when the model is suspect. The entire concept of the "confidence interval" encourages the fallacy of acceptance, and it would probably be best if CIs were abandoned altogether. If one does not want to be Bayesian, one option that is more useful than confidence intervals -- where all values are either rejected or not at a fixed level of significance -- is viewing curves of p values (for similar use of p value curves, see Mayo's work on "severity").
Curves of right- and left-tailed p values for the two F statistics mentioned in this post.
Consider the plot on the left above, which shows all right- and left-tailed p values for F=8. The horizontal line at p=0.05 allows us to find the 90% confidence interval. For any value of ω² such that either the blue or red line is lower than the horizontal line, the observed data would be "surprising". It is easy to see that for p=0.05, these values are [0.108, 0.523]. The plot easily shows the necessary information without encouraging the fallacy of acceptance.

Now, consider the plot on the right. For F=0.001, all values of ω² yield a left-tailed p value of less than 0.05, and hence F=0.001 would be "surprising" under every one of them. There are no values for which both the red and blue lines are above p=0.05. The plot does not encourage us to believe that ω² is small or 0, and it does not encourage any interpretation in terms of precision; instead, it shows that all values are suspect.
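
The curves themselves are easy to tabulate. The following Python sketch (NumPy and SciPy) evaluates both tails on a grid of ω² values for the two F statistics and reports which grid points survive both tests at p=0.05. It is not the code behind the figure: the names are hypothetical, the grid is an arbitrary choice, and the conversion from ω² to a noncentrality parameter (N·ω²/(1−ω²), N=30) is an assumption not stated in the post.

```python
import numpy as np
from scipy import stats

DF1, DF2, N_TOTAL = 2, 27, 30   # three groups of 10, as in the example


def p_value_curves(f_obs, omega2_grid):
    """Right- and left-tailed p values of f_obs at each omega2 on the grid."""
    right, left = [], []
    for w in omega2_grid:
        if w == 0:
            dist = stats.f(DF1, DF2)              # central F under the null
        else:
            # ASSUMPTION: noncentrality = N * omega2 / (1 - omega2).
            dist = stats.ncf(DF1, DF2, N_TOTAL * w / (1.0 - w))
        right.append(dist.sf(f_obs))
        left.append(dist.cdf(f_obs))
    return np.array(right), np.array(left)


def not_rejected(f_obs, omega2_grid, alpha=0.05):
    """Grid values of omega2 for which neither tail is 'surprising'."""
    right, left = p_value_curves(f_obs, omega2_grid)
    return omega2_grid[(right > alpha) & (left > alpha)]


grid = np.linspace(0.0, 0.9, 91)    # omega^2 from 0 to 0.9 in steps of 0.01
for f_obs in (8.0, 0.001):
    kept = not_rejected(f_obs, grid)
    if kept.size:
        print(f"F = {f_obs}: not rejected from {kept.min():.2f} to {kept.max():.2f}")
    else:
        print(f"F = {f_obs}: every omega^2 on the grid is rejected")
```

The arrays returned by `p_value_curves` could be handed to any plotting library to draw curves like those described above; moving the threshold `alpha` shows how the non-rejected set changes without committing to a single confidence coefficient.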

The answer to fallacious interpretations of p values is not to move to confidence intervals; confidence intervals only encourage related fallacies, which one can find in any confidence interval advocacy paper. If we wish to rid people of fallacies involving p values, more p values are needed, not fewer. Confidence intervals are not "better" than p values. The only way to interpret CIs reasonably is in terms of p values, and considering entire p value curves enables us to jettison the reliance on an arbitrary confidence coefficient, and helps us avoid fallacies.

