Sunday, January 18, 2015

Multiple Comparisons with BayesFactor, Part 2 - order restrictions


In my previous post, I described how to do multiple comparisons using the BayesFactor package. Part 1 concentrated on testing equality constraints among effects: for instance, that the the effects of two factor levels are equal, while leaving the third free to be different. In this second part, I will describe how to test order restrictions on factor level effects. This post will be a little more involved than the previous one, because BayesFactor does not currently do order restrictions automatically.

Again, I will note that these methods are only meant to be used for pre-planned comparisons. They should not be used for post hoc comparisons.

Our Example

I will use the same example and data as I did in the previous post; if you have not read that post, I suggest you go back and read it before delving further here. As a reminder, our data consists of (hypothetical) “moral harshness” ratings of undocumented migrants from 150 participants in three conditions:
  • No odor during questionnaire
  • Pleasant, clean odor (lemon) during questionnaire
  • Disgusting odor (sulfur) during questionnaire
Under the idea that moral disgust and physical disgust are related physiologically (the so-called “embodied” viewpoint; Schnall et al, 2008; but see also Johnson et al., 2014 and Landy & Goodwin, in press) the prediction is that the odor will have an effect on the harshness ratings, as feelings of physical disgust are “transferred” to objects of moral judgment.

In the previous post, I showed the classical ANOVA results, which just failed to reach significance. I also showed how to do a basic test of the null hypothesis against the hypothesis that all three means are unequal using anovaBF. The Bayes factor was about 1/0.774 = 1.3, meaning that neither the null hypothesis nor the “full” model (that all three means are unequal) was favored:
library(BayesFactor)
bf1 = anovaBF(score ~ condition, data = disgust_data)
1 / bf1
## Bayes factor analysis
## --------------
## [1] Intercept only : 1.292 ±0.01%
## 
## Against denominator:
##   score ~ condition 
## ---
## Bayes factor type: BFlinearModel, JZS
More data is needed, to test these hypotheses against one another; but as we'll see, data that are uninformative for one comparison may be more informative for another.

Testing the “right” hypothesis

At this point it is important to note that neither the classical test (which supposedly tests the fitness of the null hypothesis) nor the Bayes factor test of the “full” model against the null hypothesis are the “right” tests for the hypothesis at hand. The null hypothesis may be false in ways that are not consistent with the research predictions. The right test in this case is to test the hypothesis that lemon < control < sulfur. The means fall in the predicted direction:



Out of necessity, however, a classical analysis normally ends with the rejection of the null hypothesis that all means are equal. There is no way in classical statistics to rigorously test order-restrictions; one can only point to the ordering of the means and note that they are in the predicted order. This, however, ignores the uncertainty inherent in the estimation of the means.

Occasionally, researchers perform post hoc tests on the individual means to “ensure” that they are really different, given that they are in the correct order, but this has the disadvantage of extremely low power, which means that this method is only deployed opportunistically. Failure of these post hoc tests might not be reported at all, and would certainly never be reported as a failure to achieve sufficient evidence in favor of the hypothesis (the excuse will always be low power) but rejection will always be trumpeted.

What is needed is a principled way of testing order restrictions. Luckily, this is possible — even straightforward — with Bayes factors.

A refresher on Bayes factor logic

One of the neat features of Bayes factors is their transitivity. If I know that Model A outperforms Model B by 3, and I know that Model B outperforms Model C by 4, then I know that Model A outperforms Model C by \(3 \times 4 = 12\). The reason for this is that Bayes factors are simply ratios (see this previous post). The Bayes factor is the ratio of the likelihood of the data under two hypotheses. Since
\[ \frac{p(y \mid {\cal M}_A)}{p(y \mid {\cal M}_B)} \times \frac{p(y \mid {\cal M}_B)}{p(y \mid {\cal M}_C)} = \frac{p(y \mid {\cal M}_A)}{p(y \mid {\cal M}_C)} \]
…we can use two Bayes factors to get a third. This suggests how we can compute a Bayes factor for an order restriction:
  1. Compare the unrestricted “full” model to the null (already done, with anovaBF)
  2. Compare the unrestricted “full” model to an order restriction
  3. Use the resulting two Bayes factors to compare the null to the order restriction.
This all hinges on Step 2, which we perform in the next section.


Order restrictions versus full models

For this section, we need to remember that the Bayes factor is the degree to which posterior odds change from prior odds. So, if we can compute the prior odds of a restriction against the full model, and compute the posterior odds of a restriction against the full model, then we can obtain the Bayes factor. For the models in the Bayes factor package, the prior odds are easy. All order restrictions have the same probability, so the odds of any single order restriction is against the full model are just
\[ \frac{1}{\mbox{Number of possible orderings}} \]
With three factor levels, there are 6 orderings, so the prior odds are 1/6.

To compute the posterior odds, we need an additional trick: we sample from the posterior, and work out the posterior probability that our prediction holds in the factor level effects. In this way, we account for the estimation uncertainty in these effects.

We use the posterior() function to sample from the posterior distribution, and for demonstration I've shown the first samples from the posterior:
## Sample from the posterior distribution of the full model
## (that is, the numerator of bf1)
samples = posterior(bf1, iterations = 10000)
head(samples)
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1 
## End = 7 
## Thinning interval = 1 
##         mu condition-control condition-lemon condition-sulfur  sig2
## [1,] 30.24          -0.13343          -1.551           1.6841 42.06
## [2,] 30.49          -0.43687          -1.125           1.5617 42.70
## [3,] 29.35           0.04984          -1.514           1.4644 42.59
## [4,] 29.55           0.38246          -1.233           0.8507 46.79
## [5,] 29.38          -0.21434          -1.071           1.2850 46.31
## [6,] 29.85          -0.85003          -1.043           1.8927 53.68
## [7,] 29.40          -0.79867          -1.003           1.8020 54.84
##      g_condition
## [1,]     0.10766
## [2,]     0.18695
## [3,]     0.03047
## [4,]     0.03034
## [5,]     0.59331
## [6,]     0.10248
## [7,]     0.04045

Notice that columns 1 through 4 contain the estimates of the effects of our factor levels. We need to estimate the probability that these order in the specified way. A simple estimate can be had by working out the proportion of samples in which the order constraint holds:
## Check order constraint
consistent = (samples[, "condition-control"] > samples[, "condition-lemon"]) &
  (samples[, "condition-sulfur"] > samples[, "condition-control"])
N_consistent = sum(consistent)

For each posterior sample, the variable consistent codes whether the sample was consistent with the order restriction; the variable N_consistent contains how many of these samples were consistent, which in this case is 7245. Our estimate of the posterior probability of the restriction is thus N_consistent/10000, because we drew 10000 samples from the posterior distribution. The posterior probability is about 0.7245.

As it turns out, the posterior odds of the restriction to the full model is just 0.7245/1, because every sample is consistent with the full model. We can now compute the Bayes factor of the restriction to the full model by just dividing the posterior odds by the prior odds:
bf_restriction_against_full = (N_consistent / 10000) / (1 / 6)
bf_restriction_against_full
## [1] 4.347
The Bayes factor is 4.347, which shows that the data change our opinion in favor of the restriction by a factor of about 4, against the full model.

Another way to think about the above calculation is that the prior odds index the “riskiness” of the order-restriction prediction (the lower the odds, the more risky the prediction is), and the posterior odds represent the probability that it worked out, given the data. Under this view, the Bayes factor of the restriction versus the full model is the “boost” we give to the evidence due to these two factors:
\[ \mbox{Evidential boost to order prediction} = \mbox{Probability prediction is true} \times \mbox{Riskiness of prediction} \]
To get a big boost, we thus need a risky prediction and a prediction that works out. Simply noting that the prediction worked out in the data is not enough.

Putting it all together

We can now compute the Bayes factor of our restriction against the null hypothesis through simple multiplication:
## Convert bf1 to a number so that we can multiply it
bf_full_against_null = as.vector(bf1)
## Use transitivity to compute desired Bayes factor
bf_restriction_against_null = bf_restriction_against_full * bf_full_against_null
bf_restriction_against_null
## condition 
##     3.364
This Bayes factor is still moderate, but substantially more respectable than the previous Bayes factor of 0.774.

What have we gained?

Here we see several interesting features of Bayes factors on display.
  1. If a hypothesis makes a prediction, we can test it. Classical testing has limits which arbitrarily limit our ability to test certain hypotheses (eg, order restrictions). Bayes factors are not so limited.
  2. Bayes factors are transitive. If we can test two models against the same third model, we can compare those two models against one another. Although this seems like a reasonable property, \(p\) values have no such property because they are not comparative.
  3. Making a specific prediction pays off. The “full” model was the wrong model to test, because it did not make properly constrained predictions. When the correct order restriction was tested, the evidence increased because the data were consistent with the restriction.
  4. The limit to increase in evidence that a specific prediction gives you is the “riskyness” of the prediction. Note in the above calculation that the limit to the boost that a Bayes factor can get from the order restriction is equal to the odds of that restriction. If we make a prediction that has low a priori odds, then when it works out in the data, the Bayes factor will reward it by that amount, weighted by the posterior probability that the restriction is true.
For more about testing order restrictions with Bayes factors, see Morey and Wagenmakers (2013), “Simple relation between Bayesian order-restricted and point-null hypothesis tests.”

3 comments:

  1. Hello Richard,

    thank you for these two very instructive posts. Can you point me to more information on why these comparisons are only valid if they were pre-planned? I'd be also interested to learn more about alternative approaches that are valid when done post-hoc/in an exploratory fashion. Maybe this is could be a topic for another blog post?

    Best Regards,
    Frederik

    ReplyDelete
  2. Very enjoyable and informative blog. Without wanting to diminish the many benefits of Bayes Factors, though, I would question your statement "There is no way in classical statistics to rigorously test order-restrictions; one can only point to the ordering of the means and note that they are in the predicted order." Surely one can test planned contrasts to achieve this? i.e. there is no hard and fast rule that you have to run the standard ANOVA in the classical approach (despite what some reviewers might think). If there's a more specific apriori hypothesis to test, then compute a directional contrast and test that with a single degree of freedom test. Or compare two such hypotheses by directly comparing their contrast scores.

    ReplyDelete
  3. Thanks for this informative pair of posts. I had a few questions.

    1) Is it possible to do a similar analysis using a repeated measured design?
    2) In that situation, how would I go about making a combined prediction: e.g., that 3 (specified) conditions will be equal, but 2 (specified) conditions would be higher than those other 3?

    ReplyDelete