Friday, April 10, 2015

All about that "bias, bias, bias" (it's no trouble)


At some point, everyone who fiddles around with Bayes factors with point nulls notices something that, at first blush, seems strange: small effect sizes seem “biased” toward the null hypothesis. In null hypothesis significance testing, for a fixed nonzero true effect size, power simply increases with sample size. With Bayes factors, there is a non-monotonicity: increasing the sample size will first slightly increase the degree to which a small observed effect size favors the null, and only then does the small effect size become evidence for the alternative. I recall puzzling over this with Jeff Rouder years ago when drafting our 2009 paper on Bayesian t tests.
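The non-monotonicity is easy to reproduce in a toy model. The sketch below is my own illustration (not the Rouder et al. t-test Bayes factor): a z-test with known σ = 1, comparing the point null δ = 0 against δ ~ Normal(0, τ²). For a fixed small observed effect size, the log Bayes factor first drifts toward the null as the sample size grows, then turns around and favors the alternative.

```python
import math

def log_bf10(d, n, tau=1.0):
    """Log Bayes factor (alternative over null) for a toy z-test:
    observed standardized effect d from n observations with known
    sigma = 1, comparing delta = 0 against delta ~ Normal(0, tau^2)."""
    v0 = 1.0 / n             # sampling variance of d under the null
    v1 = tau ** 2 + 1.0 / n  # marginal variance of d under the alternative
    return 0.5 * math.log(v0 / v1) + (d ** 2 / 2.0) * (1.0 / v0 - 1.0 / v1)

# A small observed effect (d = .1) first favors the null *more strongly*
# as n grows from 10 to 100, then flips to favor the alternative by n = 1000.
for n in (10, 100, 1000):
    print(n, log_bf10(0.1, n))
```

The turn happens because the penalty for the alternative's spread grows only logarithmically in n, while the evidence carried by a fixed nonzero observed d grows linearly in n.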


Uri Simonsohn has a blog post critiquing default Bayes factors for their supposed “bias” toward the null hypothesis for small sample sizes. I have several brief responses:
  • We do not suggest a “default” prior; we suggest a family of default priors, with an adjustable prior parameter (see also this paper describing our view, which is conditionally accepted at Multivariate Behavioral Research). If you're looking for a small effect, adjust the prior.
  • The whole point of a Bayes factor analysis is that you do not know what the true effect size is (see Jeff Rouder and Joe Hilgard's response here). It is not useful to note that the Bayes factor will mislead when you know there's a small effect but use a prior that says the effect size is probably moderate to large. Bayes factors just do what you ask them to do!
  • More broadly, though, I think it is helpful to think about this supposed “bias”. Is it what we would expect of a reasonable method? Sometimes our intuitions fail us, and we end up thinking something is undesirable, when actually we should be worried if that thing didn't happen.
The third point is what this blog post is about. Here, I show that the “bias” toward the null for small effect sizes is exactly what must happen for any reasonable method that meets four simple desiderata.

We start with the idea of a measure of evidence comparing some composite alternative hypothesis to the null hypothesis. For our purposes here, it could be any measure of evidence; it does not have to be a Bayes factor. What we will do is set a number of reasonable desiderata on the properties of this evidence measure, and show that the so-called “bias” in favor of the null for small effect sizes must occur.

We assume that our data can be summarized in terms of an effective sample size and an (observed) effect size measure. This effect size should have a “nullest” member (for instance, d=0, or R2=0). For any given sample size, the evidence against the null will be an increasing function of this observed effect size. We also need the concept of “no”, or equivocal, evidence; that is, that the data do not favor either hypothesis. This defines a 0 point on the evidence scale, whatever it is.

The important concept for our demonstration is the idea of a bivariate space of sample size vs evidence. Sample size begins at 0 and increases along the x axis, and “no” evidence is marked on the y axis. We can think of sample size abstractly as indexing the amount of information in the data. We are going to imagine fixing an observed effect size and varying the sample size, which will trace a curve through this bivariate space:

A bivariate sample size / evidence space.


We can now give four desired properties that any reasonable evidence measure should have.

Desideratum 1: The evidence with no data is “equivocal”.


If we observe no data, the strength of the evidence does not favor either hypothesis. Whatever the “0 evidence” point in the evidence space, having no data must put you there.



[For a Bayes factor, this means that prior odds and the posterior odds are the same — with no data, they don't change — and the log Bayes factor is 0.]
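This limit can be checked numerically in a simple conjugate sketch of my own (a z-test with known σ = 1 and δ ~ Normal(0, 1) under the alternative, not any particular published default): as the effective sample size shrinks toward zero, the log Bayes factor approaches 0 for any observed effect size.

```python
import math

def log_bf10(d, n, tau=1.0):
    # Toy z-test log Bayes factor: point null vs. delta ~ Normal(0, tau^2)
    v0 = 1.0 / n             # sampling variance of d under the null
    v1 = tau ** 2 + 1.0 / n  # marginal variance of d under the alternative
    return 0.5 * math.log(v0 / v1) + (d ** 2 / 2.0) * (1.0 / v0 - 1.0 / v1)

# With (effectively) no data, the evidence is equivocal: log BF -> 0.
for n in (1.0, 0.01, 1e-6):
    print(n, log_bf10(0.5, n))
```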

Desideratum 2: The evidence for a “null” observed effect size is an increasing function of sample size, in favor of the null. However much evidence a “null” observed effect provides, no other observed effect size can exceed it.


For instance, if we observe d=0 with N=1000 participants, this is more convincing evidence in favor of the null than if we had observed d=0 with N=10. Obviously, this null observed effect should offer the most evidence possible in favor of the null, for a given sample size.
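Both parts of this desideratum hold in a toy conjugate sketch (my own illustration: z-test with known σ = 1, δ ~ Normal(0, 1) under the alternative): the log Bayes factor at d = 0 decreases without bound as n grows, and at any fixed n, no other observed effect size favors the null more than d = 0 does.

```python
import math

def log_bf10(d, n, tau=1.0):
    # Toy z-test log Bayes factor: point null vs. delta ~ Normal(0, tau^2)
    v0 = 1.0 / n             # sampling variance of d under the null
    v1 = tau ** 2 + 1.0 / n  # marginal variance of d under the alternative
    return 0.5 * math.log(v0 / v1) + (d ** 2 / 2.0) * (1.0 / v0 - 1.0 / v1)

# Evidence for the null at d = 0 strengthens (log BF decreases) with n,
# and d = 0 minimizes log BF (maximizes null evidence) at fixed n.
for n in (10, 100, 1000):
    print(n, log_bf10(0.0, n))
```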


Desideratum 3: A fixed non-null observed effect size must yield arbitrarily large amounts of evidence as sample size increases.


If we observe d=.3 with 10 participants, this isn't terribly convincing; but if we observe d=.3 with more and more participants, we become increasingly sure that the null hypothesis is false. In the bivariate space, this means that all non-null effect size curves eventually must end up either at -∞ or at an asymptote at some large value in favor of the alternative.
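In the same kind of toy conjugate sketch (my own illustration: z-test, known σ = 1, δ ~ Normal(0, 1) under the alternative), a fixed observed d = .3 eventually produces arbitrarily strong evidence for the alternative as n increases:

```python
import math

def log_bf10(d, n, tau=1.0):
    # Toy z-test log Bayes factor: point null vs. delta ~ Normal(0, tau^2)
    v0 = 1.0 / n             # sampling variance of d under the null
    v1 = tau ** 2 + 1.0 / n  # marginal variance of d under the alternative
    return 0.5 * math.log(v0 / v1) + (d ** 2 / 2.0) * (1.0 / v0 - 1.0 / v1)

# A fixed observed d = .3: mildly null-favoring at n = 10, then the
# evidence for the alternative grows without bound as n increases.
for n in (10, 100, 1000, 10000):
    print(n, log_bf10(0.3, n))
```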


Desideratum 4: The closer an observed effect size is to the null effect size, the more its curve “looks like” the null's.


This is just a smoothness assumption. The conclusions we obtain from observing d=0 should be very close to the ones we obtain from d=.001 and even closer to those we obtain from d=.0000001. Of course, this smoothness should also hold for all other observed effect sizes, not just the null, but for our purposes here the observed null is what is important.



For small sample sizes, this means that the curves for small effect sizes must lie near the null effect size's curve in the bivariate space. As we increase the sample size, of course, those curves must diverge downward.
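A toy conjugate sketch makes the point concrete (again my own illustration: z-test, known σ = 1, δ ~ Normal(0, 1) under the alternative). At small n, the curve for d = .05 is nearly indistinguishable from the d = 0 curve; at large n, the two curves have diverged to opposite sides of the “no evidence” line.

```python
import math

def log_bf10(d, n, tau=1.0):
    # Toy z-test log Bayes factor: point null vs. delta ~ Normal(0, tau^2)
    v0 = 1.0 / n             # sampling variance of d under the null
    v1 = tau ** 2 + 1.0 / n  # marginal variance of d under the alternative
    return 0.5 * math.log(v0 / v1) + (d ** 2 / 2.0) * (1.0 / v0 - 1.0 / v1)

# Small n: d = .05 "looks null" (curve hugs the d = 0 curve).
# Large n: the curves diverge, and d = .05 now favors the alternative.
for n in (10, 10000):
    print(n, log_bf10(0.0, n), log_bf10(0.05, n))
```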



The effect of these four desiderata is to ensure that small effect sizes “look” null. This is not a consequence of the Bayes factor, or the prior, but rather of very reasonable conditions that any evidence measure would fulfil. For a Bayes factor, of course, how these lines move through the bivariate space — and how small an effect size will need to be in order to “look” null — will be sensitive to the prior on the alternative, as it must be. But the behaviour described by Simonsohn is a natural consequence of very reasonable assumptions.

Although it is counterintuitive, we would be worried if it didn't happen for some measure of evidence.

6 comments:

  1. Great post. This info needs to get out for people to fully understand the appeal of BFs.

    That said, Uri's example is quite different. If you look carefully at his Figure 1 (http://datacolada.org/2015/04/09/35-the-default-bayesian-test-is-prejudiced-against-small-effects/), you see that the effect size is not constant. He is lowering it as sample size increases so that 50% power is preserved. So, he is choosing a different one of your blue lines for different sample sizes. I haven't worked out the limiting behavior, largely because I don't care about 50% power curves, but it could be to the red line or not. What Uri simulated is different from what is presented here, but what is presented here is the far more understandable case.

    ReplyDelete
  2. Dear Richard Morey,

    But do we need an alternative to test the hypothesis that the true effect size is close to zero?

    No. As sample size increases, the observed effect size will be increasingly closer to zero.

    Isn't it sufficient to show that d = .04 with a 95% CI of d = -.01 to d = .09?

    Sampling error still produces an observed effect size that is not 0, but the tight confidence interval includes 0 and the null-hypothesis cannot be rejected.

    If you want to be more certain that the null-hypothesis is true, you need to collect more data and examine whether 0 is still included in an even tighter confidence interval.

    What is gained by comparing the hypothesis d=0 against an alternative hypothesis with an unknown effect size that is simply a guess by a researcher? "the devil is in the selection of the prior"

    When people say that it is impossible to test the null-hypothesis, they really are saying it is impossible to test an exact point prediction, d = 1 / infinity. But this is also true for any other effect size, d = .3 is not equal to d = .300000000000000001.

    It is now clear to me when Bayesian statistics is useful. When you have two theories that make competing quantitative predictions (e.g., d = 0 vs. d = 3; d = .5 vs. d = 1, etc.), the BF is useful for assessing whether the evidence is more consistent with theory A or B.

    When the aim of a study is simply to demonstrate that an effect exists, a simple significance test is more useful. The current crisis of confidence is not a problem of p-values, but a problem of not reporting non-significant results. This is not a statistical problem, so we cannot fix it by changing statistics.

    Sincerely, Dr. R

    ReplyDelete
    Replies
    1. >> But do we need an alternative to test the hypothesis that the true effect size is close to zero?

      I think you do. Under NHST you cannot distinguish between false positive and true effect near zero.

      >> No. As sample size increases, the observed effect size will be increasingly closer to zero.

      Yes and the CI will also become narrower. This relationship is on average more or less constant (I briefly discuss that in my BSE manuscript - I had some data on it too but I felt that was a bit too much for it. Maybe I will post about it sometime). Therefore it follows the same principle as NHST in that false positive rate will remain constant regardless of sample size.

      One way that I think is appropriate though is to define a null range of irrelevant non-zero effects and to quantify how distinct the CI is from that range. This is basically what Zoltan Dienes describes here as an alternative to Bayes factors (he prefers the latter but seems to suggest that these are largely equivalent approaches):
      journal.frontiersin.org/Journal/10.3389/fpsyg.2014.00781/abstract

      Delete
    2. Dr. R.,

      Are you implying that P(effect exists) = 1 whenever one finds a single significant result? (regardless of the size of that effect?) It is hard to interpret your statement otherwise unless you provide clarification.

      "When the aim of a study is simply to demonstrate that an effect exists, a simple significance tests is more useful."

      Delete
  3. Dr. R,

    > "Sampling error still produces an observed effect size that is not 0, but the tight confidence interval includes 0 and the null-hypothesis cannot be rejected. If you want to be more certain that the null-hypothesis is true, you need to collect more data and examine whether 0 is still included in an even tighter confidence interval."

    No. This is a basic, longstanding fallacy about confidence intervals; see Neyman (1941, 1952).

    > 'What is gained by comparing the hypothesis d=0 against an alternative hypothesis with an unknown effect size that is simply a guess by a researcher? "the devil is in the selection of the prior" '

    That's a great question that has been answered at length in many papers about Bayesian statistics over the past 70 years.

    Jeff:

    Yes, I saw that he simulated things. The purpose of the presentation here is different, but addresses the same question: should small observed effect sizes yield evidence (even strong evidence!) for the null? The argument above shows that yes, they should. We can question how the curves should move through that bivariate space, but they must have certain properties, and one of them is that "small effect sizes look null", regardless of whether you're using a Bayes factor or not.

    ReplyDelete
  4. This makes sense to me and I'm trying to put some of this into a simple summary. A fundamental problem with assessing evidence for small effects is that 'small' is not a well-defined concept, but depends on context. Thus it is essential that any tool for distinguishing a negligible/zero effect from a small effect be calibrated for the context in question.

    ReplyDelete