BayesFactor: Software for Bayesian inference

Stop saying confidence intervals are "better" than p values

2016-07-29T12:18:00.002-07:00

One of the common tropes one hears from advocates of confidence intervals is that they are superior, or should be preferred, to p values. In our paper "The Fallacy of Placing Confidence in Confidence Intervals", we outlined a number of interpretation problems in confidence interval theory. We did this from a mostly Bayesian perspective, but in the second section was an example that showed why, from a frequentist perspective, confidence intervals can fail. However, many people missed this because they assumed that the paper was all Bayesian advocacy. The purpose of this blog post is to expand on the frequentist example that many people missed; one doesn't have to be a Bayesian to see that confidence intervals can be less interpretable than the p values they are supposed to replace. Andrew Gelman briefly made this point previously, but I want to expand on it so that people (hopefully) more clearly understand the point.

Understanding the argument I'm going to lay out here is critical to understanding both p values and confidence intervals. As we'll see, fallacies about one or the other are what lead advocates of confidence intervals to falsely believe that CIs are "better".

p values and "surprise"

First, we must define a p value properly and understand its role in frequentist inference. The p value is the probability of obtaining a result at least as extreme as the one we observed, under some assumption about the true distribution of the data. A low p value is taken as indicating that the result observed was very extreme under the assumptions, and hence calls the assumptions into doubt. One might say that a low p value is "surprising" under the assumptions. I will not question this mode of inference here.

It is critical to keep in mind that a low p value can call an assumption into doubt, but a high p value does not "confirm" anything. This is consistent with falsificationist logic. We often see p values used in the context of null hypothesis significance testing (NHST), where a single p value is computed that indicates how extreme the data are under the assumption of a null hypothesis; however, we can compute p values for any hypothesis we like. As an example, suppose we are interested in whether reading comprehension scores are affected by caffeine. We apply three different doses to N=10 people in each group in a between-subjects design, and test their reading comprehension. For the sake of the example, we assume normality, homogeneity of variance, etc. We apply a one-way ANOVA to the reading comprehension scores and obtain an F statistic of F(2,27)=8.

If we were to assume that there was no relationship between the reading scores and caffeine dose, then the resulting p value for this F statistic is p=0.002. This indicates that we would only expect F statistics as extreme as this one 0.2% of the time, if there were no true relationship.
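This p value is the upper tail area of the central F distribution, so it can be reproduced in one line. The snippet below is a minimal sketch using base R's pf function; the variable names are mine.

```r
# Right-tailed p value for F(2, 27) = 8 when there is no true relationship
F_obs = 8
p_null = pf(F_obs, df1 = 2, df2 = 27, lower.tail = FALSE)
round(p_null, 3)
```

The unrounded value is a little under 0.002.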

The curve shows the distribution of F(2,27) statistics when the null hypothesis is true. The area under the curve to the right of the observed F statistic is the p value.

This low p value would typically be regarded as strong evidence against the null hypothesis, because -- as the graph above shows -- an F statistic as extreme as the observed one would be quite rare, if indeed there were no relationship between reading scores and caffeine.

So far, this is all first-year statistics (though it is often misunderstood). Although we typically see p values computed for a single hypothesis, there is nothing stopping us from computing them for multiple hypotheses. Suppose we are interested in the true size of the effect between reading scores and caffeine dosage. One statistic that quantifies this relationship is ω², the proportion of the total variance in the reading scores that is "accounted for" by caffeine (see Steiger, 2004 for details). We won't get into the details of how this is computed; we need only know that:

When ω²=0, there is no relationship between caffeine and reading scores. All variance is error; that is, knowing someone's reading score does not give any information about which dose group they were in.
When ω²=1, there is the strongest possible relationship between caffeine and reading scores. No variance is error; that is, by knowing someone's reading score one can know with certainty which dose group they were in.
As ω² gets larger, larger and larger F statistics are predicted.

We have computed the p value under the assumption that ω²=0, but what about all other ω² values? Try this Shiny app to find the predicted distribution of F statistics, and hence p values, for other values of ω². Try to find the value of ω² that would yield a p value of exactly 0.05; it should be about ω²=0.108.

A Shiny app for finding p values in a one-way ANOVA with three groups.

All values of ω² less than 0.108 yield p values of less than 0.05. If we designate p<0.05 as "surprising" p values, then F=8 would be surprising under the assumption of any value of ω² between 0 and 0.108.
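The same p values can be computed without the Shiny app by using the noncentral F distribution. The sketch below assumes the noncentrality parameter is λ = Nω²/(1−ω²), where N=30 is the total sample size; that conversion and the helper name p_tails are assumptions of this example rather than something taken from the app's code.

```r
# Hypothetical helper: right- and left-tailed p values for an observed F
# statistic when the true effect size is omega2.
# Assumes ncp = N * omega2 / (1 - omega2), with N the total sample size.
p_tails = function(omega2, F_obs = 8, df1 = 2, df2 = 27, N = 30) {
  ncp = N * omega2 / (1 - omega2)
  c(right = pf(F_obs, df1, df2, ncp = ncp, lower.tail = FALSE),
    left  = pf(F_obs, df1, df2, ncp = ncp, lower.tail = TRUE))
}

p_tails(0)      # the right-tailed p is the usual null-hypothesis p value
p_tails(0.108)  # the right-tailed p is close to 0.05
```

Smaller values of ω² give smaller right-tailed p values, which is why everything below about 0.108 is "surprising" in this sense.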

Using the Shiny app, we can see that an F=8 yields a right-tailed p value of about 0.05 when ω² is approximately 0.108.

Notice that the p values we've computed thus far are "right-tailed" p values; that is, "extreme" is defined as "too big". We can also ask about whether the F statistic we've found is extreme in the other direction: that is, is it "too small". A p value used to indicate whether the F value is too small is called a "left-tailed" p value. Using the Shiny app, one can work out the value of ω² such that F=8 would be "surprisingly" small at the p=0.05 level; that value is ω²=0.523. Under any true value of ω² greater than 0.523, F=8 would be surprisingly small.

Using the Shiny app, we can see that an F=8 yields a left-tailed p value of about 0.05 when ω² is approximately 0.523.

If 0 ≤ ω² ≤ 0.108, the observed F statistic would be surprisingly large (that is, the right-tailed p ≤ 0.05).
If 0.523 ≤ ω² ≤ 1, the observed F statistic would be surprisingly small (that is, the left-tailed p ≤ 0.05).
If 0.108 ≤ ω² ≤ 0.523, the observed F statistic would not be surprisingly large or small.

Critically, we've used p values to make all of these statements. The p values tell us whether values would be "surprisingly extreme", under particular assumptions; p values allow us, under frequentist logic, to rule out true values of ω², but not to rule them in.
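The two boundary values can also be found numerically rather than by trial and error. This is a sketch under the same assumption as before, namely that the noncentrality parameter is λ = Nω²/(1−ω²) with N=30; the function names and search intervals are illustrative choices.

```r
N = 30; df1 = 2; df2 = 27; F_obs = 8

# Assumed conversion from omega^2 to the noncentrality parameter
ncp_from_omega2 = function(omega2) N * omega2 / (1 - omega2)

right_p = function(omega2) {
  pf(F_obs, df1, df2, ncp = ncp_from_omega2(omega2), lower.tail = FALSE)
}
left_p = function(omega2) {
  pf(F_obs, df1, df2, ncp = ncp_from_omega2(omega2), lower.tail = TRUE)
}

# Solve for the omega^2 values at which each one-sided p value equals 0.05
lower = uniroot(function(w) right_p(w) - 0.05, interval = c(0, 0.5), tol = 1e-8)$root
upper = uniroot(function(w) left_p(w) - 0.05, interval = c(0.1, 0.9), tol = 1e-8)$root
round(c(lower, upper), 3)
```

The two roots should land close to 0.108 and 0.523, the values read off the Shiny app.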

p values and confidence intervals

Many people are aware of the relationship between p values and confidence intervals. A typical X% (two-tailed) confidence interval contains all parameter values such that neither one-sided p value is less than (1-X/100)/2. That sounds complicated, but it isn't; for a 90% confidence interval, we just need all the values for which the observed data would not be "too surprising" (that is, values for which neither of the two one-sided tests yields p<0.05).

We've already computed the 90% confidence interval for ω² in our example; for all values in [0.108, 0.523], the p value for both one-sided tests is p>0.05. From each of the two one-sided tests we get an error rate of 0.05, and hence the confidence coefficient is 100 times 1 - (0.05 + 0.05), or 90%.

How can we interpret the confidence interval? Confidence interval advocates would have us believe that the interval [0.108, 0.523] gives "plausible" or "likely" values for the parameters, and that the width of this interval tells us the precision of our estimate. But remember how the CI was computed: using p values. We know that nonsignificant high p values do not rule in parameter values as plausible; rather, the values outside the interval have been ruled out, due to the fact that if those were the true values, the observed data would be surprising.

So rather than thinking of the CI as values that are "ruled in" as "plausible" or "likely" by the data, we should instead (from a frequentist perspective, at least) think of the confidence interval as values that have not yet been ruled out by a significance test.

Does this matter?

This distinction matters a great deal for understanding both p values and confidence intervals. In order to use p values in any way that approaches reasonability, we need to understand the "surprise" interpretation, and we need to realise that we can compute p values for many hypotheses, not just the null hypothesis. In order to interpret confidence intervals well, we need to understand the "fallacy of acceptance": Just because a value is in the CI, doesn't mean it is plausible; it only means that it has not yet been ruled out.

To see the real consequences of this fallacy, consider what we would infer if F(2,27)=0.001 (p=0.999). Any competent data analyst would notice that there is something wrong; the means are surprisingly similar. Under the null hypothesis, when all variance is due to error within the groups, we expect the means to vary. This F statistic indicates that the means are so similar that even under the null hypothesis -- where the true means are exactly the same -- we would expect more similar observed means only one time in a thousand.

In fact, the F statistic is so small that under all values of ω², the left-tailed p value is at most 0.001. Why? Because ω² can't be any lower than 0, and this represents the null hypothesis. If we built a 90% confidence interval, it would be empty because there are no values of ω² for which both one-sided tests yield p>0.05. For all true values of ω², the observed data are "surprising". Now this presents no particular problem for an interpretation of confidence intervals that rests solely on their relationship with p values. But note that the very high p value tells us more than the confidence interval; the CI depends on the confidence coefficient, and is simply empty. The p value and the F statistic have the information we want; they tell us that the means are much more similar than we would typically expect under any hypothesis. A competent data analyst would, at this point, check the procedure or data for problems. The entire model is suspect.
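The claim that every value of ω² is ruled out can be checked over a grid. As in the earlier example, this sketch assumes λ = Nω²/(1−ω²) with N=30; the grid stops at 0.9 only to keep the noncentrality parameter moderate.

```r
N = 30; df1 = 2; df2 = 27; F_obs = 0.001

omega2 = seq(0, 0.9, by = 0.01)
ncp = N * omega2 / (1 - omega2)   # assumed conversion to noncentrality

# Left-tailed p value of F = 0.001 under each candidate omega^2
left_p = pf(F_obs, df1, df2, ncp = ncp, lower.tail = TRUE)

max(left_p)          # the largest left-tailed p, attained at omega^2 = 0
any(left_p > 0.05)   # is any value on the grid not "surprising"?
```

The largest left-tailed p value is about 0.001 and occurs at ω²=0, so no value on the grid survives the left-tailed test.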

But what does this mean for a confidence interval advocate who is invested in the (incorrect) interpretation of the CI in terms of "plausible values" or "precision"? Consider Steiger (2004), who suggests replacing a missing bound with "0" in the CI for ω². This is an awful suggestion. In the example above with F=0.001, this would imply that the confidence interval includes a single value, 0. But the observed data F=0.001 would be very surprising if ω²=0. Under frequentist logic, that value -- and all other values -- should be ruled out. Moreover, a CI of (0) is infinitesimally thin. Steiger admits that this obviously does not imply infinite precision, but neither Steiger nor any other CI advocate gives a formal reason why CIs must, in general, have an interpretation in terms of precision. When the interpretation obviously fails, this should make us doubt whether the interpretation was correct in the first place. The p value tells the story much better than the CI, without encouraging us to fall into fallacies of acceptance or precision.

Where to go from here?

It is often claimed that confidence intervals are more informative than p values. This assertion is based on a flawed interpretation of confidence intervals, which we call the "likelihood" or "plausibility" fallacy, and is related to Mayo's "fallacy of acceptance". A proper interpretation of confidence intervals, in terms of the underlying significance tests, avoids this fallacy and prevents bad interpretations of the CIs, in particular when the model is suspect. The entire concept of the "confidence interval" encourages the fallacy of acceptance, and it is probably best if CIs were abandoned altogether. If one does not want to be Bayesian, one option that is more useful than confidence intervals -- where all values are either rejected or not at a fixed level of significance -- is viewing curves of p values (for similar use of p value curves, see Mayo's work on "severity").

Curves of right- and left-tailed p values for the two F statistics mentioned in this post.

Consider the plot on the left above, which shows all right- and left-tailed p values for F=8. The horizontal line at p=0.05 allows us to find the 90% confidence interval. For any value of ω² such that either the blue or red line is lower than the horizontal line, the observed data would be "surprising". It is easy to see that for p=0.05, these values are [0.108, 0.523]. The plot easily shows the necessary information without encouraging the fallacy of acceptance.

Now, consider the plot on the right. For F=0.001, all values of ω² yield a left-tailed p value of less than 0.05, and hence F=0.001 would be "surprising" under every one of them. There are no values for which both the red and blue lines are above p=0.05. The plot does not encourage us to believe that ω² is small or 0, nor does it encourage any interpretation in terms of precision; instead, it shows that all values are suspect.

The answer to fallacious interpretations of p values is not to move to confidence intervals; confidence intervals only encourage related fallacies, which one can find in any confidence interval advocacy paper. If we wish to rid people of fallacies involving p values, more p values are needed, not fewer. Confidence intervals are not "better" than p values. The only way to interpret CIs reasonably is in terms of p values, and considering entire p value curves enables us to jettison the reliance on an arbitrary confidence coefficient, and helps us avoid fallacies.

Numerical pitfalls in computing variance

2016-05-03T09:29:00.000-07:00

One of the most common tasks in statistical computing is computation of sample variance. This would seem to be straightforward; there are a number of algebraically equivalent ways of representing the sum of squares $S$, such as \[ S = \sum_{k=1}^n ( x_k - \bar{x})^2 \] or \[ S = \sum_{k=1}^n x_k^2 - n\bar{x}^2 \] and the sample variance is simply $S/(n-1)$.

What is straightforward algebraically, however, is sometimes not so straightforward in the floating-point arithmetic used by computers. Computers cannot represent numbers to infinite precision, and arithmetic operations can affect the precision of floating-point numbers in unexpected ways.

Consider the numbers .1 and .3/3. These two numbers are equal. However,

.1 - .3/3

## [1] 0.00000000000000001388

is not exactly 0, as one would expect it to be (for more, see "Falling into the Floating Point Trap", Chapter 4 in the "R Inferno" by Patrick Burns). Multiple ways of computing the variance that are algebraically equivalent do not necessarily yield equal answers in software such as R, and some ways are better than others.

In a series of posts, John D. Cook shows that the seemingly reasonable, commonly used second method above, which he calls the "sum of squares" method, can be extremely unstable in certain circumstances, even giving impossible negative values. He also discusses how to compute the sample variance in a numerically stable way using Welford's method. Both of these posts are well worth reading.

When I read them, I thought two things. First, I was reminded that I use [used; this was written some time ago] the "sum of squares" method in the BayesFactor package. Secondly, I thought I would not be affected by the problem, because I represent numbers as logarithms internally for numerical stability and ease of division. Logarithms make many things easier: very large and very small numbers become easier to work with; exponentiation becomes multiplication; and multiplication and division become addition and subtraction. The tricky part of working with logarithms is addition and subtraction. If we have two numbers, $\exp(a)$ and $\exp(b)$, represented by their logarithms $a$ and $b$, and we want to know the logarithm of their sum or difference, we can make use of the identities
\[
\begin{aligned}
\log(\exp(a) + \exp(b)) &= a + \log(1 + \exp(b - a))\\
\log(\exp(a) - \exp(b)) &= a + \log(1 - \exp(b - a))
\end{aligned}
\]
Now arithmetic with $a$ and $b$ is addition and subtraction, and we can use accurate floating point approximations of $\log(1+\exp(x))$ and $\log(1-\exp(x))$.
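As a concrete illustration of these identities, here is a minimal sketch using base R's log1p, which evaluates $\log(1+x)$ accurately for small $x$. The helper names log_add and log_sub are made up for this example and are not the functions used in the S4 class below; log_add also swaps the arguments so that the exponent is never positive, which is a design choice of the sketch.

```r
# Hypothetical helpers; a and b are the logarithms of two positive numbers

# log(exp(a) + exp(b)), factoring out the larger term to avoid overflow
log_add = function(a, b) {
  hi = max(a, b)
  lo = min(a, b)
  hi + log1p(exp(lo - hi))
}

# log(exp(a) - exp(b)); only defined when exp(a) > exp(b)
log_sub = function(a, b) {
  stopifnot(a > b)
  a + log1p(-exp(b - a))
}

log_add(1, 2)         # same as log(exp(1) + exp(2))
log_add(1000, 1000)   # exp(1000) overflows, but the answer is 1000 + log(2)
log_sub(2, 1)         # same as log(exp(2) - exp(1))
```

The second call shows the point of the representation: the direct computation log(exp(1000) + exp(1000)) returns Inf, while the log-space version stays finite.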

When logarithms don't help

But I wasn't really sure whether I would be affected by the instability of the "sum of squares" method, so I decided to check. It turns out, representing numbers logarithmically doesn't necessarily help. In order to demonstrate this easily, I created an R S4 class that eases arithmetic on logarithmically-represented values. First, we load necessary libraries/files:

# Install the BayesFactor and devtools packages, if you don't already have them

# Load my S4 class for representing real numbers with logarithms
# and performing arithmetic on them
# See the code at https://gist.github.com/richarddmorey/3c77d0065983e31241bff3807482443e
devtools::source_gist('3c77d0065983e31241bff3807482443e')

# set random seed so results are reproducible
set.seed(2)

[Click here to view the R file you'll be sourcing above]

To see the S4 class in action, we need to generate some numbers that are logarithmically-represented. The variables x and y below are equal to $\exp(1)=2.718$ and $\exp(2)=7.389$, respectively. The modulo argument gives the log-represented magnitude of the number, and the sign argument gives the sign (with 1L meaning the integer representation of 1):

x = logRepresentedReal(modulo = 1, sign = 1L)
y = logRepresentedReal(modulo = 2, sign = 1L)

We can add the two numbers together, for instance:

x + y

##  10.10734

Although the result does not look logarithmically-represented, we can verify that it is using the str function:

str( x + y )

## Formal class 'logRepresentedReal' [package ".GlobalEnv"] with 2 slots
##   ..@ modulo: num 2.31
##   ..@ sign  : int 1

The result is of class logRepresentedReal, and the modulo slot tells us $\log(x+y)$. With the arithmetic on the logarithmically-represented numbers defined using the logRepresentedReal class, we can test whether our logarithms help stabilize the estimate of the variance. Following Cook, we will sample values from a uniform distribution, making use of the fact that if $z$ has a uniform distribution, then $-\log(z)$ has a standard exponential distribution:

runif2 = function(n){
  # Sample log values from exponential distribution
  x = -rexp(n)
  # represent all values logarithmically in a list
  lapply(x, logRepresentedReal, sign = 1L)
}

n = 100
z = runif2(n)

We sampled $n=100$ values from a uniform distribution. We can now compute the variance in several ways. The first way is to use the “sum of squares” method on the exponentiated values:

# Sum of squares method
var.sumsq.exp = function(z)
{
  n = length(z)
  z = sapply(z, as.numeric)
  (sum(z^2) - n*mean(z)^2)/(n-1)
}

var.sumsq.exp(z)

## [1] 0.07419988

This presents no problem, since our uniformly-distributed values are rather moderate. We now use Welford’s method on the logarithmically-represented values to compute the variance:

var.welford <- function(z){
  n = length(z)
  M = list()
  S = list()
  M[[1]] = z[[1]]
  S[[1]] = 0

  for(k in 2:n){
    M[[k]] = M[[k-1]] + ( z[[k]] - M[[k-1]] ) / k
    S[[k]] = S[[k-1]] + ( z[[k]] - M[[k-1]] ) * ( z[[k]] - M[[k]] )
  }
  return(S[[n]] / (n - 1))
}

var.welford(z)

##  0.07419988

And finally, we can use the “sum of squares” method on the logarithmically-represented values:

var.sumsq = function(z){
  n = length(z)
  zsqr = sapply(z, function(x) x^2)

  sumz = 0
  sumz2 = 0
  for(k in 1:n){
    sumz = sumz + z[[k]]
    sumz2 = sumz2 + zsqr[[k]] 
  }
  mnz = sumz/n
  ssz = sumz2 - n * mnz^2 
  return(ssz / (n-1))  
}

var.sumsq(z)

##  0.07419988

Again, this presents no problem, since our uniformly-distributed values are moderate. So far, we see no signs of numerical instability, but none are expected. As Cook did in his example, we can add a very large number — in this case, one billion — to all sampled values. This makes the variance quite small compared to the mean, and would be expected to make the sum of squares estimates unstable.

const = 1e9
z = sapply(z, function(x) x + const)

var.sumsq.exp(z)

## [1] -165.4949

var.welford(z)

##  0.07419973

var.sumsq(z)

##  35886

Notice that both of the sum of squares estimates fail; the logarithmic representation used by var.sumsq does not help the numeric stability. The Welford method, on the other hand, yields an accurate value.
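The same contrast can be reproduced on ordinary double-precision numbers, without sourcing the gist. The sketch below is not the code used for the results above: it draws its own uniform sample, works on plain numeric vectors, and uses function names (var_sumsq, var_welford) chosen for this example. R's built-in var is included as a reference value.

```r
set.seed(2)
x = runif(100) + 1e9   # small variance relative to a very large mean

# "Sum of squares" method on ordinary doubles
var_sumsq = function(x) {
  n = length(x)
  (sum(x^2) - n * mean(x)^2) / (n - 1)
}

# Welford's method on ordinary doubles: update the running mean M and
# the running sum of squared deviations S one observation at a time
var_welford = function(x) {
  M = x[1]
  S = 0
  for (k in seq_along(x)[-1]) {
    M_old = M
    M = M_old + (x[k] - M_old) / k
    S = S + (x[k] - M_old) * (x[k] - M)
  }
  S / (length(x) - 1)
}

c(sumsq = var_sumsq(x), welford = var_welford(x), builtin = var(x))
```

The sum of squares method subtracts two numbers near 1e20, where adjacent doubles are thousands apart, so its result can only be a coarse multiple of that spacing divided by n-1. Welford's method only ever subtracts numbers of similar size, so it stays close to the built-in value.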

Conclusion

There are many ways of combating numerical instability. Representing numbers as logarithms is one important method. Although representing numbers as logarithms is effective at combating numerical instability from some sources, it does not necessarily help in all cases. The superiority of the Welford method of computing variance, even when numbers are represented logarithmically, shows this clearly.

How to train undergraduate psychologists to be post hoc BS generators

2016-04-03T05:33:00.001-07:00

Teaching undergraduate psychology is difficult for a variety of reasons. Students come in with preconceived notions about what psychological research is and are sometimes disappointed with the mismatch between their preconceptions and reality. Much of what psychologists do is highly specialized and requires skills that are difficult to teach, and psychologists-in-training can't offer much research-wise until they have years of experience. The assignments we ask undergraduates to complete are meant to train their critical thinking skills to prepare them for a more substantive contribution to research. Sometimes, however, they do exactly the opposite; instead, assignments can reward post hoc BS generation rather than actual critical thinking.

If the recent so-called crisis in psychology has highlighted anything, it is the prevalence and danger of post hoc narratives. Although statistical practices (e.g., use of significance testing) have gotten much of the blame -- at least in my corner of the research world -- the main problems are actually a level or two above that. Combining ill- or flexibly-defined theoretical concepts, post hoc reasoning, and publication bias yields a potent mixture that I would argue is responsible for the crisis.

I have been thinking about this in the context of assignments that we ask our undergraduates to do, and how we actually train post hoc reasoning early on. Here I'll offer examples of two undergraduate assignments that I think reward scientists-in-training for their BS generation skills. I'll also elaborate on what I think we can do about this.

The "Texas Sharpshooter" paints the target around his bullet holes.

Example assignment: Critique a peer-reviewed article

The assignment: Students are assigned an article from a peer-reviewed psychological journal and asked to critique it. Ideally, they choose a few critiques, which they argue for in their essay.

The basic problem with this assignment is that students are not particularly well-versed in any particular psychological topic, nor in psychological research methods. On the other hand, peer-reviewed articles have been reviewed by people who are, which means that whatever problems remain with the research have evaded skilled reviewers. This is not to say that peer-reviewed research does not have major problems, but it does mean that students who have had only a few basic courses and do not have much experience in reading peer-reviewed research are unlikely to be able to find good-quality critiques spontaneously.

Upon reading such an article and having difficulty finding a critique, a student is in an awkward position: they must write an essay. So what do they do? They come up with whatever critiques come to mind, which are likely to be low-quality critiques. I suspect readers of this blog have experienced these sorts of critiques in student assignments: maybe there are cultural differences? The sample seems small. Are these really the best stimuli to use? They must choose a number of these arguments, and argue for them, in spite of the fact that they don't have sufficient knowledge on which to base such a critique. We're training them in the fine art of bullshit.

This is not to say that these problems don't occur in some studies. But forming a good argument why takes specialized knowledge that they don't yet have, so we get back noise from the students. And who gets the best marks for such an assignment? Students who can write clearly about things of which they have little actual understanding.

We have to ask ourselves: is it any wonder that we have a replication crisis?

Example assignment: Do an experiment and interpret the results

The assignment: Students are asked to perform a simple experiment (often in groups), analyze the data, and report the results. They must interpret the results in light of the research they've read (often primarily the textbook).

Experience doing simple experiments and analyzing the results is critical to a psychologist-in-training. But how the assignment is framed and marked is critical to whether we are training the skills we want. Students in chemistry, biology, and physics all perform easy experiments and report the results; this is as it should be.

The difference between interpreting the results of a typical psychology experiment and those of a chemistry experiment is that there are very strong reasons to expect something specific to happen in the chemistry experiment. If the psychology experiment doesn't come out as the textbook predicts, though, the students must describe why that might be. There are, of course, a hundred possible reasons why this might be the case, including the possibility that the original study was wrong, statistical noise, and sloppiness in their experimental procedure.

But these explanations will not be the ones they will explore. We require students to show creativity and independent reading/thought. In an assignment like this, students know that the best way to get a good mark is to find a paper whose logic might predict the results obtained, and include a cogent argument why this might have caused the differences. The students turn in the paper, and will not test their hypothesis, of course; the argument is simply thrown in to get a better mark. The students who do the most independent reading and form the best-sounding argument will get the best mark.

This should all sound eerily familiar: we are training them in the time-honored tradition of post hoc arguments for "hidden moderators".

Fixing the problems

If we want to train good psychologists, we must be very sensitive to the skills we're actually teaching, as opposed to those we think we are teaching. The practices in the field will be a reflection of what students are taught. How might we use assignments to train critical thinking, without teaching bad practice?

Critiquing pop science

The problem with the critique of the peer-reviewed article is that students are unlikely to be able to spot the real problems with the article. This is somewhat like asking first-year sports therapists to critique a professional sports player's technique; the imperfections are simply too fine, because the professionals have been honing their craft with help for years. It would be better to ask them to critique amateur sports players' techniques, because they will have more glaring problems.

Unfortunately, there is no "amateur" peer-reviewed research. There is, however, a lot of very bad non-peer-reviewed pop science. Psychologists-in-training would benefit from critiquing bad popular science (not just popular psychology), assessing, for instance, spurious claims of causation (vs correlation), overblown effect sizes, and mismatches between what is claimed about a piece of research in a pop article versus what was actually done. Critiquing popular science develops skills similar to those developed by critiquing a peer-reviewed article, without the unfortunate side effect of asking students to BS their way to a good mark.

Separating critiques of method from critiques of results

Critiquing methods alongside the results leads to the unfortunate asymmetry that if an experiment yields the expected result, the methods are not critiqued, whereas if it doesn't, the students are encouraged to generate BS reasons why it might not work, with no expectation of testing those reasons. If students were asked to critique methods by themselves, then they would not be rewarded for such post hoc reasoning. Moreover, in an essay of typical length, this leaves more room to discuss reasons why the methods are problematic; for instance, if the sample size is problematic, a methods-only critique would allow space for a power analysis. In a methods+results critique, I often see critiques of sample sizes with no corresponding argument why the sample size is a problem.

Being specific about potential critiques

In whatever assignments we give to undergraduates, we should be specific about what sorts of critiques we are expecting, preferably giving a short list of possible critiques. The students will still have to read the target article, but instead of giving a shot in the dark and being forced to argue for it, students will be forced to ask, for instance, "Does this research suffer from a confound with X?", or "Is this experiment sufficiently powered to detect an effect size of Z?" or "Does this DV represent a good operationalization of W?"

Perhaps, for instance, power is not a problem; they would then be in a position to argue that, yes, the experiment is sufficiently powered, instead of always (vaguely) attacking an article. Always asking for critique teaches the students that critical thinking is about dreaming up as many ways to attack an article as possible, and then forming a plausible-seeming argument around them. In contrast, being very specific about possible critiques -- that may not, in fact, turn out to be problems -- will develop critical thinking and argumentation skills better.

Wrap up

If we believe psychology is in crisis, we should look at the way we train undergraduates to see if part of the problem lies there. I think the crisis in psychology is reflected in some ways in our training. Doing better is not just about better statistical training or better open science training; it is also about ensuring a match between what we think we are teaching and what we actually teach.

How to check Likert scale summaries for plausibility

2016-03-30T02:42:00.000-07:00

Suppose you are reading a paper that uses Likert scale responses. The paper reports the mean, standard deviation, and number of responses. If we are -- for some reason -- suspicious of a paper, we might ask, "Are these summary statistics possible for this number of responses, for this Likert scale?" Someone asked me this recently, so I wrote some simple code to help check. In this blog post, I outline how the code works.

Suppose we are reading a paper that uses a 5-category Likert scale response (0-4) and they report that for 100 responses, the mean response was .5, and the standard deviation of the responses was 5. I have intentionally chosen these numbers to be impossible: for the mean to be .5, the responses have to be on average near the bottom of the scale. But if the responses are near the bottom of the scale, the standard deviation also has to be very low, because there is a bound at 0. The standard deviation of 5 is much too large for the mean.
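This bound can be made explicit. For $n$ responses confined to the interval from lo to hi with mean $m$, the sum of squared deviations can be at most $n(m - \text{lo})(\text{hi} - m)$, which gives an upper limit on the sample standard deviation. The helper below is a small sketch of that calculation; the name max_sd is mine, and it is not part of the code linked later in this post. It assumes the reported SD uses the usual $n-1$ denominator.

```r
# Hypothetical helper: largest possible sample SD (n - 1 denominator) for
# n responses bounded between lo and hi whose mean is m
max_sd = function(m, n, lo, hi) {
  sqrt((m - lo) * (hi - m) * n / (n - 1))
}

max_sd(m = 0.5, n = 100, lo = 0, hi = 4)   # about 1.33, far below the reported 5
```

The limit ignores the discreteness of the scale, so a reported SD below it is not guaranteed to be possible; an SD above it is guaranteed to be impossible.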

Another possible inconsistency arises due to the discreteness of Likert scales. For 10 Likert responses on a 3 point scale (0-2), the mean must be a multiple of .1. This, in turn, imposes a complicated constraint on the standard deviation.
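The constraint on the mean is easy to check directly: the sum of the responses must be a whole number, so we can list every possible total and see which ones reproduce the reported mean after rounding. This is a sketch with a made-up helper name (consistent_sums); it assumes the mean was rounded to a fixed number of decimal places.

```r
# Hypothetical helper: integer response totals whose mean, rounded to
# `digits` decimal places, equals the reported mean M
consistent_sums = function(M, N, lo, hi, digits = 2) {
  sums = (N * lo):(N * hi)
  sums[abs(round(sums / N, digits) - M) < 1e-9]
}

consistent_sums(M = 0.5,  N = 10, lo = 0, hi = 2)   # a total of 5 gives this mean
consistent_sums(M = 0.45, N = 10, lo = 0, hi = 2)   # empty: no total gives this mean
```

An empty result means the reported mean cannot arise from that number of responses on that scale.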

Checking whether a response pattern is possible may be simple for low N, but it gets complex as N increases. Suppose we are wondering, for a 6-category Likert scale (0-5), whether N=94, M=.83, and SD=1.21 are possible summary statistics. There are several ways we could go about this using R.

The code for both is available here: https://gist.github.com/richarddmorey/787d7b7547a736a49c3f

Option 1: Brute force

In the brute force method, we create all possible response patterns (defined as the number of responses in each response category) and then check them, finding the ones closest to our desired response pattern. The code I linked above has two functions, count.ns and get.ns. The first, count.ns, counts the total number of response patterns for a given Likert scale and N. For the summaries above, this will be

> count.ns(N, nlev)

[1] 71523144

or about 72 million response patterns. Brute force isn't pretty, but this job is possible in a minute or two on a modern PC. A note: on some older Windows computers this may exhaust your memory.
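The count itself has a closed form, which is handy for deciding whether brute force is feasible before trying it. The sketch below does not show how count.ns is implemented in the gist; it uses the standard "stars and bars" result, and count_patterns is a hypothetical name.

```r
# Number of ways to place N responses into K ordered categories
# (compositions of N into K non-negative parts): choose(N + K - 1, K - 1).
count_patterns <- function(N, K) choose(N + K - 1, K - 1)

count_patterns(94, 6)   # 71523144, matching the count.ns output above
count_patterns(10, 3)   # 66 patterns for 10 responses on a 3-point scale
```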

The function get.ns will compute all possible response patterns and put them in a matrix. This may take a bit of time (on my Macbook Pro it takes about a minute and a half):

> x = get.ns( 94, nlev=6)
> x[1:10,]
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    0    0    0    0    0   94
 [2,]    0    0    0    0    1   93
 [3,]    0    0    0    0    2   92
 [4,]    0    0    0    0    3   91
 [5,]    0    0    0    0    4   90
 [6,]    0    0    0    0    5   89
 [7,]    0    0    0    0    6   88
 [8,]    0    0    0    0    7   87
 [9,]    0    0    0    0    8   86
[10,]    0    0    0    0    9   85

As you can see, x now contains the possible response patterns for N=94 responses, with K=6 Likert response categories. Above I show the first 10 patterns.

All that remains now is to compute the mean and standard deviation of each response pattern and compare it to the target. We can use the sum of the squared deviations from the target mean and standard deviation to sort the possible solutions. Note that if we wanted both deviations to be less than .005 (to account for rounding to the nearest .01) then we would want solutions that are no greater than 2 times .005^2 = .00005 in their summed squared error.
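As a rough illustration of this step for a single pattern (the gist does this for the whole matrix at once; pattern_summary is a hypothetical name, and the counts below are just one example pattern):

```r
# Mean, SD, and summed squared error for one response pattern.
# `counts` holds the number of responses in each category; `values` holds
# the score attached to each category.
pattern_summary <- function(counts, values, target_mean, target_sd) {
  N <- sum(counts)
  m <- sum(counts * values) / N
  s <- sqrt((sum(counts * values^2) - N * m^2) / (N - 1))
  c(mean = m, sd = s, error = (m - target_mean)^2 + (s - target_sd)^2)
}

out <- pattern_summary(c(50, 30, 0, 9, 4, 1), 0:5, 0.83, 1.21)
out
out[["error"]] <= 2 * 0.005^2   # TRUE for this pattern
```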

The code linked above does all the sorting and places the solutions in an object called res. After running the code, the first 10 rows of res are:

> res[1:10,]
      resp0 resp1 resp2 resp3 resp4 resp5      mean sum x^2 std. dev.    sum error
 [1,]    50    30     0     9     4     1 0.8297872     200  1.206063 1.554821e-05
 [2,]    54    22     0    17     0     1 0.8297872     200  1.206063 1.554821e-05
 [3,]    56    18     1    18     1     0 0.8297872     200  1.206063 1.554821e-05
 [4,]    57    15     4    17     1     0 0.8297872     200  1.206063 1.554821e-05
 [5,]    59     3    27     0     4     1 0.8297872     200  1.206063 1.554821e-05
 [6,]    59    10     7    18     0     0 0.8297872     200  1.206063 1.554821e-05
 [7,]    60     1    27     2     3     1 0.8297872     200  1.206063 1.554821e-05
 [8,]    60     7    10    17     0     0 0.8297872     200  1.206063 1.554821e-05
 [9,]    41    46     1     0     0     6 0.8297872     200  1.206063 1.554821e-05
[10,]    43    42     2     1     1     5 0.8297872     200  1.206063 1.554821e-05

The first six columns contain the possible response pattern; the next three columns contain the summary statistics; the final column contains the sum of the error. We wished to match M=.83 and SD=1.21, and there are many solutions that yield these summary statistics. There is no reason to be suspicious of these summaries.

Option 2: Linear Inverse Models

The brute force solution above does the job, but it is slow and memory intensive. If we had a 7-category Likert scale, the number of possible response patterns would be about 1.2 billion; add more, and you can see that the amount of time and memory for the brute force method becomes prohibitive. We can actually use an approximate method -- linear inverse models -- to get close.

The idea of linear inverse models is that we can try to minimize a quadratic function, subject to some constraints. In our case, suppose we would like to find proportions of responses that would yield summary statistics as close as possible to our given summary statistics. We have some reasonable constraints:

Proportions must sum to 1 (equality constraint).
Our summary statistics should be as close as possible to our target summaries (approximate equality constraint).
Proportions must be between 0 and 1 (inequality constraint).
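One plausible way to write these three constraints as matrices is sketched below. This is an assumption on my part rather than a copy of the gist: it follows limSolve's convention that E x = F are exact equalities, A x ≈ B are approximate equalities, and G x ≥ H are inequalities, and it matches the mean and the second raw moment (which is linear in the proportions) rather than the standard deviation itself. The matrices are built in base R and checked against one of the brute force solutions.

```r
K <- 6; N <- 94; M <- 0.83; SD <- 1.21
values <- 0:(K - 1)

# Equality constraint E %*% p = F: the proportions sum to 1.
# (F is named to mirror limSolve's argument; it masks R's shorthand for FALSE.)
E <- matrix(1, nrow = 1, ncol = K)
F <- 1

# Approximate equalities A %*% p ~ B: match the first two raw moments.
# The target second moment follows from SD^2 = (N / (N - 1)) * (m2 - M^2).
A <- rbind(values, values^2)
B <- c(M, SD^2 * (N - 1) / N + M^2)

# Inequalities G %*% p >= H: every proportion lies in [0, 1].
G <- rbind(diag(K), -diag(K))
H <- c(rep(0, K), rep(-1, K))

# Sanity check with a pattern known to reproduce the summaries
p <- c(50, 30, 0, 9, 4, 1) / N
drop(E %*% p) - F     # 0 (up to floating point error)
drop(A %*% p) - B     # both close to 0
all(G %*% p >= H)     # TRUE
```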

The limSolve package in R allows us to sample approximate solutions for our response proportions according to these constraints. I will not go into detail here about how to define these constraints for the limSolve package; see my code linked above and the limSolve manual. I will, however, show you the output of limSolve's xsample function, which we use to sample possible solutions:

> xs <- limSolve::xsample(A = A, B = B, E = E, F = F, G = G, H = H, sdB = 1)
> xs$X[1:10,]
           [,1]      [,2]       [,3]       [,4]       [,5]       [,6]
 [1,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [2,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [3,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [4,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [5,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [6,] 0.5085929 0.3226566 0.08151242 0.01938088 0.03202715 0.03582995
 [7,] 0.5085929 0.3226566 0.08151242 0.01938088 0.03202715 0.03582995
 [8,] 0.5085929 0.3226566 0.08151242 0.01938088 0.03202715 0.03582995
 [9,] 0.5860701 0.2541897 0.04073821 0.03497458 0.01822376 0.06580361
[10,] 0.5860701 0.2541897 0.04073821 0.03497458 0.01822376 0.06580361

The possible solutions for the proportions of responses in each response category are in each row of the xs$X element of the output. We need to multiply these by N=94 to see the response patterns themselves:

> xs$X[1:10,]*94
          [,1]     [,2]      [,3]     [,4]     [,5]     [,6]
 [1,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [2,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [3,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [4,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [5,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [6,] 47.80774 30.32972  7.662167 1.821802 3.010553 3.368016
 [7,] 47.80774 30.32972  7.662167 1.821802 3.010553 3.368016
 [8,] 47.80774 30.32972  7.662167 1.821802 3.010553 3.368016
 [9,] 55.09059 23.89383  3.829392 3.287611 1.713034 6.185539
[10,] 55.09059 23.89383  3.829392 3.287611 1.713034 6.185539

There are several things to notice here. First, there are some duplicates, which is to be expected from the sampling method. Second, these solutions are not integers. Since response patterns must be integers, we would have to round these and make sure they sum to N=94 before testing them to see if their summaries are acceptably close.

Take the first solution: we might round this to obtain

49 27 11 2 0 4

However, this only sums to N=93, so we need to add 1 somewhere. Suppose we add it to the fourth category, to obtain

49 27 11 3 0 4

The mean response for this response pattern is almost exactly .83, our target mean. The standard deviation is 1.20, which is only .01 away from our target standard deviation.
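The rounding-and-repair step can be automated. The sketch below is one possible approach, not something taken from the gist: it rounds the sampled proportions to counts and then, while the total is off, adds or removes one response in whichever category leaves the summaries closest to the targets. The names pattern_error and round_to_pattern are hypothetical.

```r
# Summed squared deviation of a pattern's mean and SD from the targets
pattern_error <- function(counts, values, target_mean, target_sd) {
  N <- sum(counts)
  m <- sum(counts * values) / N
  s <- sqrt((sum(counts * values^2) - N * m^2) / (N - 1))
  (m - target_mean)^2 + (s - target_sd)^2
}

# Round sampled proportions to counts, then repair the total one response
# at a time, each time changing the category that gives the smallest error.
round_to_pattern <- function(p, N, values, target_mean, target_sd) {
  counts <- round(p * N)
  while (sum(counts) != N) {
    step <- sign(N - sum(counts))   # +1 if responses are missing, -1 if too many
    errs <- sapply(seq_along(counts), function(i) {
      trial <- counts
      trial[i] <- trial[i] + step
      if (trial[i] < 0) Inf else
        pattern_error(trial, values, target_mean, target_sd)
    })
    best <- which.min(errs)
    counts[best] <- counts[best] + step
  }
  counts
}

p <- c(0.5231726, 0.2862201, 0.12002470, 0.02459116, 0, 0.04599141)
round_to_pattern(p, 94, 0:5, 0.83, 1.21)   # 49 27 11 3 0 4
```

For the first sampled solution this greedy repair lands on the same pattern as the manual adjustment above. It is only a local search, so the result should still be checked against the rounding tolerance.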

Finally, note the similarity of these solutions to the ones we obtained by brute force. The linear inverse model method has gotten us in the neighborhood of the good solutions without the inelegance, time, and memory hogging of the brute force method. However, unlike the brute force method, it is not constrained to integer solutions, so we need to do some post processing.

Conclusion

Both the brute force and linear inverse model solutions yield the same answer: for a 6-category Likert scale (0-5), N=94, M=.83, and SD=1.21 are not problematic summary statistics. Which of these methods one uses depends largely on the situation, but one could even combine them for a more efficient search. As a reminder, the complete code can be found here: https://gist.github.com/richarddmorey/787d7b7547a736a49c3f.

New digital feature editorship at the Psychonomic Society

2016-01-19T14:00:00.000-08:00

As of January 1, 2016, I am the new methods editor for the Psychonomic Society digital features. Steve Lewandowsky has written an introductory post with a bit of background, and my first post -- about Arsenault and Buchsbaum's recent article in Psychonomic Bulletin & Review on neuroimaging evidence for motor theories of speech -- is up as well.

"From Bach to Bayes and Wales: the Richard Morey challenge" (by Steve Lewandowsky)
"In the mind's ear: No connection between hearing and speaking in motor cortex" (my first post)

Asymmetric funnel plots without publication bias

2016-01-09T02:29:00.000-08:00

In my last post about standardized effect sizes, I showed how averaging across trials before computing standardized effect sizes such as partial $\eta^2$ and Cohen's d can produce arbitrary estimates of those quantities. This has drastic implications for meta-analysis, but also for the interpretations of these effect sizes. In this post, I use the same facts to show how one can obtain asymmetric funnel plots — commonly taken to indicate publication bias — without any publication bias at all. You should read the previous post if you haven't already.

A funnel plot is a commonly-used meta-analytic technique for the detection of bias in a subset of the scientific literature. The basic thinking is that if a literature is unbiased, the average estimates of an effect should not depend on the sample size (or some other measure of the "precision" of a study). For a given sample size, estimates of the effect size should be spread around the true effect size, with this spread decreasing as sample size gets larger.

Publication bias, which is often assumed to manifest itself as 1) a tendency for statistically significant results to be published, and 2) a tendency for researchers to publish effects consistent with their theoretical outlook, will result in asymmetric funnel plots. Read this Neuroskeptic post about a paper by Shanks and colleagues for an example of how asymmetric funnel plots are used to argue for publication bias. Notice that the plots use a standardized effect size on the x axis.

A (not so) hypothetical paradigm

Since many priming effects have been called into question recently, I will use a priming example. Suppose we are interested in emotional face priming: we ask participants to perform a lexical decision task, but prior to every trial we "subliminally" (ie, very quickly) present either an excited face or a sad face, thinking that the excited face will speed performance on the task. Participants perform a number of trials in both priming conditions, which are averaged to obtain two "observations" per participant: an average RT in each condition. This is very common in the psychological literature. A paired t test is used to assess the effect of the prime.

Now suppose this same paradigm is used across many labs, with only variation in sample sizes. Each lab reports the standard statistics: the mean difference in RTs across participants, its standard error, and the t statistic. A skeptic comes along, collects all the statistics across all the papers, and computes Hedges' g standardized effect size (a variation on the standardized difference score) from the t statistic. They produce the funnel plot shown below by plotting the sample size¹ (number of participants) against the standardized effect size:

This is a massively asymmetric funnel plot, and would likely be taken as strong evidence of publication bias. However, because I simulate the data, I know that there is no publication bias at all. This is merely an artifact of averaging and standardized effect sizes. You can obtain my simulation code here: github gist

Why is the funnel plot asymmetric? In all studies, the total number of trials performed was approximately the same: 2000 trials. The way these broke down across participants was different. Some studies had 100 trials per condition and 10 participants; others, 10 trials per condition and 100 participants. The standard deviation of the difference scores around their mean is a function of the number of trials performed per participant. When the number of trials is high, the standardized effect size is high, just as discussed in the previous blog post. But here, because the total amount of "effort" per study is conserved (that is, all studies have the same number of total trials), the studies with larger numbers of trials per participant have a smaller number of participants. The funnel plot therefore looks problematic, but it is an artifact.
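The mechanism can be seen without simulating anything. The sketch below is not the simulation from the gist; it is a back-of-the-envelope calculation under an assumed model in which each participant's true effect varies with standard deviation tau, each trial adds noise with standard deviation sigma, and the difference score averages over the trials in each condition. All numerical values are hypothetical.

```r
# Hypothetical values, chosen only for illustration
delta <- 0.03    # mean priming effect, in seconds
tau   <- 0.02    # SD of true effects across participants
sigma <- 0.30    # SD of trial-level noise within a participant

# Population standardized effect for difference scores that average
# `trials` trials per condition
std_effect <- function(trials) delta / sqrt(tau^2 + 2 * sigma^2 / trials)

# Studies with equal total effort: 2 conditions x trials x participants = 2000
designs <- data.frame(participants = c(10, 20, 50, 100))
designs$trials <- 1000 / designs$participants
designs$d <- std_effect(designs$trials)
designs
```

Under these assumptions the raw effect is identical in every design, yet the standardized effect falls steadily as the number of participants rises, which is the pattern a funnel plot would show as asymmetry.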

One wonders if this Cross Validated query was related to this artifact.

Creating a funnel plot from the raw effect sizes removes the asymmetry; a funnel plot with the standard error on the y axis also does so.

This does not mean that using the standard error on the y axis fixes the problem. Consider another way number of trials and number of participants can be divided: positively correlated, rather than negatively as before. That is, studies that run more participants also run more trials per condition. The funnel plots end up looking very strange, with an asymmetry that is the reverse of the one we expect. Larger effect sizes are obtained with larger numbers of participants.

Without reflection, this pattern might be offered as evidence that there was something very strange happening in a literature. But there's nothing strange here, except with the analysis. If there were publication bias, though, this artifact might actually mask it.

Wrap up

I suspect there are other artifacts one could generate using standardized effect sizes in a meta-analysis². How can we keep from getting fooled? In some cases, perhaps the correction I mentioned in the previous post might be of use. Since a funnel plot is often used for detecting problematic bias in a literature rather than estimating the effect size, the fact that there is no "true" effect size is not problematic.

For future research, data sharing and reporting of different effect size measures will help. Modifications of Cohen's d and Hedges' g exist which will reduce this problem (see "Computing d and g from studies that use pre-post scores or matched groups", for instance), but these modified statistics cannot be computed from typically-reported statistics. The fact that we need statistics that are not typically reported in order to perform reasonable meta-analyses raises the question of whether current reporting practices really allow a cumulative science.

Footnotes

¹Funnel plots can be created with a variety of statistics on the y axis. Different researchers make different recommendations for both axes (see, for instance, Peters et al 2006), and as we will see, this can have a dramatic effect on the conclusions.

² Sterne et al (2011) note minor asymmetries caused by a correlation between an effect and a standard error, as can be caused in estimation of extreme proportions or similar parameters, but nothing as dramatic or fundamental as shown here. Their asymmetries are mostly problematic for asymmetry tests, which can pick up minor asymmetries with larger samples.

Averaging can produce misleading standardized effect sizes

2016-01-07T14:21:00.001-08:00

Recently, there have been many calls for a focus on effect sizes in psychological research. In this post, I discuss how naively using standardized effect sizes with averaged data can be misleading. This is particularly problematic for meta-analysis, where differences in number of trials across studies could lead to very misleading results.

There are two main types of effect sizes in typical use: raw effect sizes and standardized effect sizes. Raw effect sizes are what you typically see in a plot: for instance, the effect of a priming manipulation might be 30ms. The advantage of raw effect sizes is that they are closer to the process of interest and more interpretable. We all know what it means for something to take 30ms to happen.

Another kind of effect size is the standardized effect size. With a standardized effect size, the raw effect is compared to some measure of variability in the population. For instance, if the standard deviation of children’s heights at age 10 were 3 inches, and a “good” diet had an effect of 1.5 inches on average, we could say that the effect of the diet was .5 standard deviations. This is the logic of Cohen’s $d$, for instance. The disadvantage of this is that it is more difficult to understand what an effect of “half a standard deviation” means (and the variance-accounted-for statistics such as $\eta^2$ and $\omega^2$ are even more difficult to interpret); but standardized effect sizes have many good properties, including a close relationship to the concept of statistical power, comparability across paradigms, and the fact that they can often be computed from reported statistics such as $t$ and $F$.

For these reasons, standardized effect sizes are very common in meta-analysis. However, the common practice of averaging over trials in cognitive psychology makes them difficult to compare or even interpret.

Consider a typical cognitive experiment with 30 participants, each performing 10 response time trials in two conditions. Typically each participant’s data will be averaged to form a single, average response time in each condition; these average RTs are then submitted to a repeated measures ANOVA (in fact, some R packages, such as afex, do this automatically).

Hypothetical data is shown in the plot below. These data represent two hypothetical experiments, one with 10 samples per participant and one with 50.

Error bars represent Morey (2008) adjusted within-subject error bars, computed using this code.

I have generated the data so that everything is the same across the two experiments except the scale of the “error”: the scale of the error in experiment 2 is $1/\sqrt{5}$ times that of experiment 1, because each averaged “observation” represents five times more data. The raw effect size is precisely the same, but our certainty about the effect size is greater in experiment 2. This is exactly as it should be.

We can now perform the typical repeated measures ANOVAs on these two data sets, using the afex package. The package will, if requested, compute the common partial $\eta^2$ standardized effect size statistic.

The results of the ANOVA for the “condition” in experiment 1 are:

Effect	df	MSE	F	pes	p.value
condition	1, 29	0.08	7.72	.21	.009

And the results of the ANOVA for the “condition” in experiment 2 are:

Effect	df	MSE	F	pes	p.value
condition	1, 29	0.02	38.58	.57	<.0001

This would not be unexpected by anyone who routinely uses repeated measures ANOVA. Typically, the whole point of running more trials is to get more power. We perform more trials, and we obtain a higher $F$ value for the comparison of interest.

Notice that the sum of squares for the effect is precisely the same for both experiments. That’s because the raw effect is precisely the same. What is driving the higher $F$ value is the lower residual mean square (MSE) for the comparison, which is about 5 times smaller in Experiment 2. Again, this is what we expect. More trials, less “noise”.

But notice what happens to partial $\eta^2$. Because the MSE is smaller, the proportion of variance accounted for by the condition effect is larger. This drives the partial $\eta^2$ from .21 in experiment 1 to .57 in experiment 2. Researchers have previously warned about using partial $\eta^2$ for comparisons across designs (see for instance, Olejnik & Algina, 2003), but these two experiments appear to have the same design; at least, from the perspective of someone used to only analysing averaged data, they do.
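The dependence of partial $\eta^2$ on the error term is easiest to see from its relationship to $F$. The sketch below uses the standard identity $\eta^2_p = F\,df_1/(F\,df_1 + df_2)$ and the values from the two ANOVA tables above; pes_from_F is a hypothetical helper name, not an afex function.

```r
# Partial eta squared recovered from an F statistic and its degrees of freedom
pes_from_F <- function(F_value, df1, df2) {
  (F_value * df1) / (F_value * df1 + df2)
}

round(pes_from_F(7.72, 1, 29), 2)    # 0.21 (experiment 1)
round(pes_from_F(38.58, 1, 29), 2)   # 0.57 (experiment 2)
```

Since the degrees of freedom are the same in both experiments, the fivefold increase in $F$ is the only thing moving the effect size.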

This has the potential to wreak havoc on meta-analyses. Suppose someone combs the literature looking for $F$ values and computing partial $\eta^2$ values from the $F$ values (or, alternatively, Cohen’s $d$ from $t$ values). Assume experiment 1 represents a patient group; due to time constraints, the patients only had time for 10 trials per condition. Suppose experiment 2, on the other hand, represents a group of college students, who had time for more trials. The figure below shows the standardized effects in the two experiments.

Although the raw effect size is precisely the same across the two experiments, the standardized effect size is radically different, possibly leading to erroneous conclusions. Even if there are no systematic differences in number of trials across experiments with different kinds of groups, this introduces a new source of variability into estimates, as well as making it nearly impossible to interpret the effect size. What is the “true” standardized effect size? It seems difficult to say. How can we solve this problem?

Solution 1: Generalized $\omega^2$

One option is generalized $\omega^2$ (see for instance, Olejnik & Algina, 2003). Instead of using the residual variance to standardize against, generalized $\omega^2$ standardizes against all measured (as opposed to manipulated) factors. For instance, the variability in participants is a measured source of variability. These sources of variability are assumed to be stable properties of populations and not affected by mere design choices. We can compute generalized $\omega^2$ again using afex, which yields $\omega^2_g=0.011$ for experiment 1 and $\omega^2_g=0.012$ for experiment 2; notice that these are very similar. The effect “looks” smaller, because participants vary quite a bit relative to the size of the effect.

One problem with this approach is that the statistics necessary to compute generalized $\omega^2$ are not typically reported, meaning that this solution is useless for meta-analysis of existing literature. Ideally, we’d like a way to use reported statistics to at least compare across studies, when designs are similar enough.

Solution 2: Adjust the MSE

If we know the relative numbers of trials across two studies that have the same basic design, we should be able to “adjust” the MSE in the formula for the effect size (whichever effect size it happens to be) for the number of trials. In our example, experiment 2 has five times as many trials as experiment 1; we therefore would expect the MSE of experiment 2 to be one-fifth as large as that for experiment 1. To make the effect size computed from experiment 2 comparable to that from experiment 1, we can multiply its MSE by 5 before applying the formula for the effect size of interest. For partial $\eta^2$, this leads to an easy adjustment:
\[ \eta^2_2 = \frac{1}{c/\eta^2_1 - c + 1} \] where $c$ is the adjustment factor, $\eta^2_1$ is the original partial $\eta^2$, and $\eta^2_2$ is the adjusted partial $\eta^2$.

As an example, take our experiment 2, which had a partial $\eta^2$ of 0.571 and five times as many trials per participant as experiment 1. Applying the formula above yields
\[ \begin{aligned} \eta^2_2 &= \frac{1}{5/0.571 - 5 + 1}\\ &\approx 0.21 \end{aligned} \] which matches the partial $\eta^2$ from experiment 1 very well.
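The adjustment is a one-liner in R. This is only a transcription of the formula above; adjust_pes and its argument names are hypothetical.

```r
# Adjust partial eta squared for an adj-fold difference in trials per
# participant: eta2_adjusted = 1 / (adj / eta2 - adj + 1)
adjust_pes <- function(pes, adj) 1 / (adj / pes - adj + 1)

adjust_pes(0.571, 5)   # about 0.21
adjust_pes(0.571, 1)   # 0.571: no adjustment when the trial counts match
```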

The problem with this approach is that it uses experiment 1 as a “reference” experiment. It is therefore not clear what the standardized effect size means in this case, except as a way to compare across experiments with similar designs. This may be enough for someone performing a meta-analysis (particularly if they can’t obtain the statistics to compute generalized $\omega^2$), but as a general reporting solution, it is unsatisfactory.

Wrap up

Although standardized effect sizes have been advocated as a general tool for science and are increasingly reported, they are difficult to interpret because they are affected by trivial, common design decisions. The issues I raise here should be of interest to anyone working with standardized effect sizes, particularly those performing meta-analysis. They affect repeated measures designs with averaged data most acutely; however, between-subjects designs are also affected if each participant contributes an “average” score to the analysis. In the between-subjects case the adjustment would have to be different, but with a large number of trials per participant the issue might be acceptably ignored, if the error in each participant’s score is small enough.

Added postscript

After a re-read, I want to ensure that I make clear that I'm not implying that the only problem here is with meta-analyses; that's just what drove me to write this post, and how I decided to frame it. But consider this: if an arbitrary decision (driven merely by the resources at hand, such as time or money, or even whim) such as "how many trials will we perform per cell in this experiment?" can cause the standardized effect size to increase almost 200%, that standardized effect size should not be taken to reveal any psychological "truth" and is useless for drawing substantive conclusions.

Confidence intervals: What they are and are not

2015-12-03T15:19:00.000-08:00

Over at the Psychonomic Society Featured Content blog, there are several new articles outlining some of our work on confidence intervals published previously in Psychonomic Bulletin & Review. In a three-part series, Steve Lewandowsky and Alexander Etz lay out our case for why confidence intervals are not what people think they are. I've written enough about confidence intervals lately, so I'll just link you to their articles.

Part 1 (Steve Lewandowsky): The 95% Stepford Interval: Confidently not what it appears to be. Lewandowsky lays out the basic idea of a confidence interval, what people expect it to be, and how that is wrong.
Part 2 (Steve Lewandowsky): When you could be sure that the submarine is yellow, it’ll frequentistly appear red, blue, or green. Lewandowsky digs deeper into some strangeness in confidence intervals, caused by the fact that they guarantee certain behavior in the long run, but do not guarantee any reasonable post-data inferences.
Part 3 (Alexander Etz): Confidence intervals? More like confusion intervals. Etz summarizes the problems with the confidence interval advocacy literature, including incorrect (even contradictory) interpretations of confidence intervals.

Check it out, and don't forget that Morey et al (2015, 2015a) are open access!

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2015). The Fallacy of Placing Confidence in Confidence Intervals. Psychonomic Bulletin & Review, doi: 10.3758/s13423-015-0947-8.

Morey, R. D., Hoekstra, R., Rouder, J. N., & Wagenmakers, E.-J. (2015). Continued misinterpretation of confidence intervals: response to Miller and Ulrich. Psychonomic Bulletin & Review, doi: 10.3758/s13423-015-0955-8.

Reviewers and open science: why PRO?

2015-12-02T05:00:00.001-08:00

As of yesterday, our paper outlining the PRO Initiative for open science was accepted for publication in the journal Royal Society Open Science. It marks the end of many tweaks to the basic idea, and hopefully the beginning of a new era in peer reviewing: the empowered peer reviewer. The basic idea behind the PRO Initiative is that the peer relationship is fundamental in science, and it is this relationship that should drive cultural change. Open science is necessary, possible, and overdue. As reviewers, we can make it happen.

What is the PRO Initiative? Signatories on the Initiative agree that after 1 January, 2017, they will not offer comprehensive review for any manuscript that does not have open data and open materials (ie, they are placed for all to access on a reliable host, such as those listed here) or at least justify the lack of openness within the manuscript itself. If a reviewer receives such a manuscript to review, they can contact the editor to ask the authors to fix the problem so that the review process can continue; if the authors refuse, then the reviewer should offer a simple review that focuses only on the lack of data, and does not recommend acceptance. The manuscript was incomplete, after all; it should not become part of the scientific record in that form.

Science is, of course, complicated, and so the Initiative is a bit more complicated too; read the full paper for more details. But the basic idea is that reviewers are the ones who define how to review a paper to ensure that good science is rewarded. There's nothing sacred about the current non-interactive, comprehensive review system; it made sense in a past world where physical manuscripts were being sent through the mail. Clearly, in that world you didn't want a back-and-forth between reviewers, because that would take ages. Now, however, reviewers can ask for a fix instantaneously. I have, for instance, noted that a manuscript was missing a figure, and gotten a new manuscript in a day. If a manuscript does not meet certain minimal standards, there is no reason for a reviewer to continue the review. Ask for a fix; it's your right as a reviewer.

Is open science important enough that reviewers should change the way they review? I say yes, absolutely. Ultimately, though, that is up to you as a reviewer to decide. I hope that you'll decide to stand with the other reviewers who have signed the PRO Initiative. In the coming year before the Initiative takes effect, we'll be organizing events at conferences to talk about open science. Let's change science together!

If you do support the Initiative, head over to the website to get social media avatars to show your support!

[Read the paper -- Sign the Initiative -- More resources for open science]

Habits and open data: Helping students develop a theory of scientific mind

2015-11-19T07:35:00.001-08:00

This post is related to my open science talk with Candice Morey at Psychonomics 2015 in Chicago; also read Candice's new post on the pragmatics: "A visit from the Ghost of Research Past". In this post, we suggest three ideas that can be implemented in a lab setting to improve scientific practices, and encourage habits that make openness easier. These ideas are designed to be minimally effortful for the adviser, but to have a big impact on practice:

* Data partners: young scientists have a partner in another lab, with whom they swap data. The goal is to see if their data documentation is good enough that their partner can reproduce their main analysis with minimal interaction.
* Five year plan: When a project is part-way through, students must give a brief report that details what they have done to ensure that the data and analyses will be comprehensible to members of the lab in five years' time, after they have left.
* Submission check: At first submission of an article based on the project, advisors should discuss with their advisees the pros and cons of opening their data, and how the data will be promoted online, if it will be open.

Betrayed by our habits

Science, like a lot of other things, is based largely on habit. We learn habits early on in our careers, and most of them serve us well. Habits like checking for problems in our data, such as bad coding or outliers, can keep us from getting fooled. Other habits, like doing a final, full read through of a paper before submission, save us work in the long run.

Other habits, however, can keep us from doing better science. Scientists value openness, at least in the abstract. Many scientists have had the frustrating experience of *closed* science: for instance, colleagues that do not share their data. Yet, most science is not open, in spite of the fact that many tools to facilitate open science are freely available.

To us, the reasons seem obvious. Open science does not bring great immediate reward, and open practices are not part of most scientists' habits. This is natural; many scientists were trained before openness was easy and expected. Our habits were formed without an expectation, for instance, that our data would be open to everyone. Analyses are messy, badly documented, and full of ad hoc solutions to problems that we decided to improve later. If you weren't expecting data to be open, then making it so requires work.

When we are faced with opening our data at paper submission or publication, then, our habits betray us. Our values may say "we should be open", but our real choice is *not* between open science and closed science; it is between "hours of work now with uncertain payoff" versus "no work now, and maybe no one will ask for it." The result is not a free choice about open science. Our habits have encumbered our choice with irrelevant issues, such as "I don't feel like doing this work right now. I'll do something else more fun," and everything else is more fun.

If we had habits that were more attuned to the expectation of scientific openness, we might be able to do better. Forming such habits later in a career takes work, but forming them early in one's career is much easier. We suggest here a number of things that senior researchers who run labs can do to help their advisees build better habits. None of these things require much work, but we believe that they can help ensure the next generation of scientists has better habits than the current one.

Helping young scientists form better habits: three ideas

The ideas presented here are designed with several features in mind. They all:
* Require minimal effort on the part of an advisor.
* Require little *marginal* effort from a young scientist. They may even save effort, since they will encourage good practices and help avoid mistakes.
* Encourage development of a "theory of scientific mind": How do other scientists think about data and materials? What would they expect of a data set? Will others understand what I've done?
* Help young scientists *truly* have a choice about whether to be open. By the time the choice must be made, no extra work is necessary. The decision can be driven by the arguments in favor of, or against, open science, instead of mere momentary pragmatic concerns.

These are roughly ordered by where they would appear in an advisee's training. We should emphasize that none of these require an advisor to promote them. Young scientists can do these things without their advisor's support, to help build good habits.

Data partners

In the "data partner" scheme, young scientists in one lab partner with young scientists in another lab working on related topics. The goal of the data partner scheme is to help build an understanding of what information is necessary when sharing data, and to help catch analysis errors early on.

When collecting and analyzing their data, students should plan to share their data with their data partner, along with a short report containing an initial methods section and a primary analysis (but without numbers). The data partner will be expected to reproduce the primary analysis *without* interaction with the student. The data will need to be well-documented, and the analysis sufficiently detailed, for the primary analysis to be reproducible. Details such as how the data are to be cleaned will be critical.

Once the data partner has attempted to reproduce the primary analysis, the two can discuss what was lacking. What could have been more clear? If the results could not be replicated, why? This will build the students' understanding of data analysis, develop their theory of scientific mind, and catch many mistakes early in a project. As a side benefit, the student now has created substantial documentation of their data set: precisely the information necessary for releasing data to others.

The five-year plan

One of the issues that often comes up when training students is turn-over. A student often has "ownership" of a project, and an adviser is less involved, guiding the student along but not having complete knowledge of the entire project. This can be problematic. When a student leaves, what if the adviser wants to send the data to someone? What if another student wants to re-analyze it to check a hunch? What if the lab wants to perform a meta-analysis?

A lab runs on data; old materials (including stimuli), data, and analyses should be archived in a clear way so that if someone from the lab, years later, wants to use the materials or data, or wants to reproduce the analysis, they should be able to do so. This is part of being a good lab citizen.

When a project is mature, advisers should give time to the student in a lab meeting to answer the question "What have you done to ensure that this project -- including the materials, data, and statistical analysis -- will be useable in five years?" This encourages students to think of the long-term usefulness of their data to others. Over the years, a formal meeting may become unnecessary as lab standards become more geared toward openness.

The submission check

As the name implies, the "submission check" is meant to occur before a project is submitted for publication. If all has gone well, the project should be well-documented and ready to release. The work has been done; all that needs to be decided is whether the project will be open. If the answer is not pre-determined by a journal or granting agency requiring open data and materials, then the adviser should have a conversation with the young scientist: should we open these data and materials? What are the arguments for opening them? Are there arguments against it?

If it is decided to open the data, then the next question should be, "How will you promote these data and materials from the project?" For a young scientist thinking of the next stage of their career, promotion is critical. One of the advantages of open data is that it yields another product of the research that can be promoted; open materials and open code provide others. The student should be encouraged to think about how these can be leveraged to their advantage, and to follow through with their promotion ideas.

Promoting good habits and open science

For many scientists, open science is a difficult choice because it is encumbered by a number of unnecessary pragmatic concerns flowing from habits formed over many years. Openness is not truly a free choice, driven by the merits of open science. This need not be the case for the next generation of researchers. Senior researchers have an important role to play in helping their advisees form good habits and develop a theory of scientific mind. The data partner scheme, the five-year plan, and the submission check can help establish good lab practices, with the benefit that students will be prepared for a more open science.

Neyman does science, part 2

2015-11-12T10:21:00.002-08:00

In part one of this series, we discussed the different philosophical viewpoints of Neyman and Fisher on the purposes of statistics. Neyman had a behavioral, decision-based view: the purpose of statistical inference is to select one of several possible decisions, enumerated before the data have been collected. To Fisher, and to Bayesians, the purpose of statistical inference is related to the quantification of evidence and rational belief. I agree with Fisher on this issue, and I was curious how Neyman -- with his pre-data inferential philosophy -- would actually tackle a problem with real data. In this second part of the series, we examine Neyman's team's analysis of the data from the Whitetop weather modification experiment in the 1960s.

First: Get the data!

I have saved the data in Table 2 in Neyman et al. (1969) online in a text format. It can be loaded into R using the following code (available as a gist):

## R code to get data and make plots
## You may have to install the devtools and RCurl packages first
library('devtools')
source_gist("https://gist.github.com/richarddmorey/5f22fc742535d078c25f")

The code will also regenerate the panels from Neyman et al.'s Figure 2.

Project Whitetop

Project Whitetop, a weather modification experiment performed during the summers in 1960-1964, was one of the first meticulously randomized, large-scale experiments of its kind. Before the experiment began, every day was designated as a "seed" day or a "non-seed" day (control). The designation was kept secret until the last moment. Every day in the morning, the experimenters would determine whether the conditions were good for seeding. If there were westerly winds and "high precipitable water" in Little Rock, Arkansas and Columbia, Missouri, then the day was designated as an "experimental" day, and the envelope containing the seed instructions was opened. On seeded days, an airplane dumped silver iodide into the clouds around West Plains, Missouri. The area at the center of the concentric circles in the figure below shows the experimental area.

Figure 1 from Neyman et al (1969) overlaid on a modern map. Original caption reads "Approximate map of the region around the Project Whitetop target. Solid circles mark the location of rain gages used for the evaluation. The radii of the concentric circles are multiples of 30 miles; the letters A, B, C, D, E, and F designate the region within the inner circle and the regions within the successive rings, respectively. For example, region B is the area bounded by the 30 mile (inner) circle and the 60 mile (second) circle. Additionally, the area within the outermost circle is designated as 'entire' (Tables 1 and 2)."

The original analysis as reported by Neyman et al (1969) was of the change in precipitation in the hour when the seeded plume was overhead. There appeared to be an unexpected decrease in the precipitation due to the seeding ("some" p<0.01, as it was reported by Neyman et al).

Neyman's team was interested in assessing the effect of seeding at longer time scales (24 hours) and at greater distances (up to 180 miles from the experimental area). In their minds, these longer-term, larger-range effects were much more interesting from a policy perspective.

What did Neyman think makes a good analysis?

In the same year, Neyman, Scott, and Wells wrote a paper outlining statistical inference for weather modification experiments ("Statistics in meteorology", 1969). The paper is important because it lays out what we might expect from Neyman's Whitetop analysis. They describe power as related to the notion of an "informative" experiment, and briefly mention the "optimal" class of tests that will be used to analyse the Whitetop data. The critical role of power is emphasized:

[The] rational planning of a rain stimulation experiment must emphasize the question whether, with this design, with this proposed duration and with this particular statistical test, the probability of detecting the effect of treatment that one wishes to detect is 0.2, or 0.5, or 0.8, etc. In other words, in experimentation with weather control, it is of paramount importance to estimate the power of the statistical test to be used on the data that may be provided by the contemplated design of the experiment. (p. 123)

And again on page 124: "[T]he power of the test to be used in the evaluation of a rain stimulation experiment is of prime importance." This makes it explicit: the optimality of the test is interesting, but the power of the test with reference to the experimental design is critical. An optimal test can give an uninformative experiment, if the design is bad (e.g., low sample size). It should not be a surprise that Neyman emphasizes the importance of ensuring the test is worthwhile before one undertakes it. This is especially critical if one faces interpreting null results ($p>\alpha$), as Neyman points out elsewhere.
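To make the dependence of power on design concrete, here is a minimal sketch in R using `power.t.test` from the base stats package. The two-group t test, the standardized effect of 0.5, and the per-group sample sizes are illustrative assumptions of mine, not values from Neyman, Scott, and Wells; the point is only that the same test can have low or high power depending on the design.

```r
## Power of a two-sample, two-sided t test at several per-group sample sizes.
## All numbers here are hypothetical, chosen only for illustration.
n_per_group <- c(10, 30, 64)
power <- sapply(n_per_group, function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05)$power
})
print(data.frame(n_per_group, power = round(power, 2)))
```

Under these assumptions the smallest design would detect the effect only a minority of the time, so a nonsignificant result from it would say very little.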

Neyman et al's Whitetop analysis

Here we discuss Neyman et al. (1969)'s analysis ("Areal Spread of the Effect of Cloud Seeding at the Whitetop Experiment"). In order to examine the effect of distance, they decided to use data from the 174 rain gages within 180 miles of the experimental area. The goal, according to Neyman et al, was two-fold:

Specifically, an effort was made to determine (i) the differences in the 24-hour precipitation amounts at different distances from the center of the Whitetop target, averaged over the 102 days with seeding and over the 96 experimental days without seeding, and (ii) the probability (P) of obtaining such differences, or larger, purely through unavoidable chance variation. (pp. 1445-1446)

Of critical interest to us is how P (the p value) is interpreted later on. The figure above (overlaid on the map) shows how Neyman et al. divided the area into 6 concentric rings. For each ring A-F, the percent change in precipitation was computed, along with a two-tailed p value. Note that a positive change was expected, and only the discovery of the negative effect in this same data set would lead one to look for a negative effect.

Before we look at the analysis results themselves, let me emphasize that I am not concerned with whether Neyman et al are correct; what I'm interested in is how they use statistics to support their case. Of special interest are some ideas that never appear in this paper. These include:

* An $\alpha$ level
* Power
* Error rates
* A significant or nonsignificant result (other than the already-mentioned, previously-published $p<.01$ result with the same data, which is described as "significant")
* Pre-determined decisions

With that in mind, we can look at the results, which I have combined from their Table 2 and Figure 2. The first panel shows the effect of the seeding on days that are "wet" (that is, given rain occurred, how much did it rain?). The second panel shows the effect of seeding on all days.

Recreation of top panel from Neyman et al's (1969) Figure 2. See bottom panel for original caption.

Recreation of bottom panel from Neyman et al's (1969) Figure 2. Note that these points are not independent. Original caption reads: "Average daily precipitation versus average distance from the target center. (Top) Precipitation averaged per wet day; (bottom) precipitation averaged per day, wet or dry. In each case the upper curve represents experimental days not seeded, the middle curve represents experimental days seeded, and the lower curve represents the 267 days of June, July, and August 1960-64, which were not classified as experimental." I have not added the lower line, since these numbers are not included in the tables (and they are irrelevant).

Neyman et al refer primarily to the results for all days. This is how Neyman et al describe the results:

The estimate of the average seeding effect in the entire region is a 21-percent loss of rain. In the absence of a real effect, chance alone could produce such an estimate of loss, or a larger one, about once in 15 independent trials. (p 1447)

Note the switch from a two-sided p value (p=0.13, lower panel) to a one-sided p value! This occurs without any a priori mention of this hypothesis, aside from the fact that it was found in these same data previously; no mention of any pre-determined decision criterion. In fact, they note that this magnitude of negative effect is supported by no "intelligible theory" (p. 1447). This is a purely evidential use of the one-sided p value. They continue:

From the point of view of the question as to whether the current state of weather modification technology justifies its use for alleviating water shortages, [the data] appear decisive. As already mentioned, the Whitetop experiment was conducted in a locality where summer precipitation is critical. In fact, the possibilities of increases due to seeding as modest as 5 to 10 percent have been mentioned as something to be hoped for. When instead of such gains the experimental results show losses averaging 20 percent over an area of some 100,000 square miles, then even the slightest possibility that these losses were caused by seeding must be considered as disqualifying the underlying technology. Actually, the evidence in support of the causal relation between seeding and loss of rain appears quite strong. (p 1447; emphasis mine)

The only mention of a decision is related to the evidence -- that is, it is decisive -- not any pre-planned decision. Moreover, the interpretation is concerned with the possibility, on a graded scale, that the effect is real and negative, and the evidence is "quite strong" -- again, a graded way of referring to the strength of the evidence.

Almost immediately, Battan (1969) responded to Neyman et al saying what probably most of the readers of this blog post were thinking, on seeing the mediocre p values Neyman et al used as evidence:

The two-tailed significance levels in the two tables are not so small as to make it self-evident that the rainfall differences were caused by seeding. Several hypotheses might be offered to explain effects of seeding downwind of the seeding area, but no plausible hypothesis has been offered to explain effects upwind and to the side to distances of 180 miles. (p. 618)

In the clearest sign that their use of statistical inference was evidential and post-data, rather than decision-based and pre-data, Neyman et al responded:

Battan is certainly entitled to his opinion that "significance levels . . . are not so small as to make it self-evident that the rainfall differences were caused by seeding." In fact, we agree about the lack of self-evidence. But, if there is anything in the contention that a gain in the rainfall of 5 to 10 percent is worth talking about, then a 20 percent loss, experienced over a vast area of some 100,000 square miles, must be a disaster. In these conditions, the odds of 14 to 1 that this loss was caused by seeding do not appear negligible to us. We feel that it is imperative that the general public and the government be informed of the situation. (p. 618, emphasis mine)

In a stunning misuse of statistics, Neyman et al have confused a p value with a posterior odds. Not just any p value; this was a p value from a post hoc one-tailed test. The need for a post-data mode of statistical inference is so great that Neyman -- who was famed for his pre-data theory of statistical inference -- is forced into a basic fallacy when responding to a critic. To me, this is quite remarkable.
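The arithmetic behind these numbers can be reconstructed in a few lines of R. This is my reconstruction, assuming that the one-sided p value is simply half of the two-sided p = 0.13 from the lower panel (which holds for a symmetric test statistic); Neyman et al do not spell out the calculation.

```r
## Reconstructing "once in 15 trials" and "odds of 14 to 1"
## Assumption: one-sided p is half the two-sided p (symmetric test statistic)
p_two_sided <- 0.13
p_one_sided <- p_two_sided / 2                  # 0.065

trials_per_event <- 1 / p_one_sided             # roughly 15
naive_odds <- (1 - p_one_sided) / p_one_sided   # roughly 14

print(round(c(p_one_sided = p_one_sided,
              trials_per_event = trials_per_event,
              naive_odds = naive_odds), 2))
```

The "odds" here are just (1 - p)/p: a restatement of the p value, which is a probability of data at least this extreme under the null hypothesis, and not the odds that the loss was caused by seeding.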

Wrapping up

Fisher and Neyman disagreed about the philosophy of statistical inference. To Fisher (and indeed to almost all scientists), statistical inference was post-data and evidential. Neyman, however, had a pre-data, behavioural view of statistical inference. Neyman's viewpoint (which, unfortunately, has stuck around a long time in training of scientists using Type I and Type II errors) is not one that is conducive to science. Neyman himself, when doing science, appears to have had a post-data mind-set.

In part 3 of the series, I will look at the aftermath of Neyman's Whitetop analysis, and how Neyman eventually abandoned the conclusions.

Neyman does science, part 1

2015-11-10T07:34:00.002-08:00

On reading Neyman's statistical and scientific philosophy (e.g., Neyman, 1957), one of the things that strikes a scientist is its extreme rejection of post-data reasoning. Neyman adopts the view that once data is obtained statistical inference is not about reasoning, but is rather about the automatic adoption of one of several decisions. Given the importance of post-data reasoning to scientists -- which can be confirmed by reading any scientific manuscript -- I wondered how Neyman would think and write about an actual, applied problem. This series of blog posts explores Neyman's work on the analysis of weather modification experiments. The (perhaps unsurprising) take-home message from this series of posts is this: not even Neyman applied Neyman's philosophy, when he was confronted with real data.

Consider the view of statistical inference put forward by Fisher in contrast to Neyman's perspective:

Decision [as opposed to reasoning] itself must properly be referred to a set of motives, the strength or weakness of which should have had no influence whatever on any estimate of probability. We aim, in fact, at methods of inference which should be equally convincing to all rational minds, irrespective of any intentions they may have in utilizing the knowledge inferred. (Fisher, 1955, p. 77)

The concept of evidence -- that is, information which warrants changes in belief -- is central to science. Under an evidential view, evidence can be strong or weak, or, in other words, convincing or unconvincing. This occurs as a matter of degrees, independent of any particular decisions one might have in mind. Evidence is a post-data concept, applying to the interpretation of data after it has been collected.

Neyman, on the other hand, appears to reject epistemology altogether. Post-data ideas like beliefs -- justified or otherwise -- are not a target of statistical analysis:

The beliefs of particular scientists are a very personal matter and it is useless to attempt to norm them by any dogmatic formula...The content of the concept of inductive behavior is the recognition that the purpose of every piece of serious research is to provide grounds for the selection of one of several contemplated courses of action. (Neyman, 1957, p. 16)

Neyman offers a pre-data philosophy. Decision criteria are set before the experiment, on the basis of considerations of long-run Type I and Type II errors. The outcome of a study is the selection of a decision, not a reasonable change in belief. For those of us in the sciences, Neyman's is a very strange outlook on science and statistics. Fisher, of course, was primarily a scientist; Neyman, a statistician. This difference shows in their respective ideas of how statistical inference is to be undertaken.

Neyman, however, goes to great lengths to show that Fisher, in fact, acted consistently with Neyman's own philosophy, and not with Fisher's stated one. In highlighting a case where Fisher has interpreted a low p value as indicating that a particular null hypothesis is not true, Neyman says:

The trouble is that the premise "P is less than .01" does not imply that "the departures are not fortuitous" [that is, did not arise by chance, under the null hypothesis]. In fact, even if the inheritance of the characteristics considered conformed exactly with the assumed model, the probability of observing $\chi^2$ corresponding to the value of P less than 0.01 is positive and approximately equal to 0.01. Thus, the assertion "the departures are not fortuitous" cannot be deduced from "P is less than .01". Yet, this assertion is made, and is made in very definite terms...[O]ne may presume that the assertion "the departures are not fortuitous" is interpreted by Fisher as equivalent to the adoption of the hypothesis of differential viability. (Neyman, 1957, p. 12)

And on this point, Neyman is right. A low p value does not imply that the null hypothesis is false (at least, not by itself). Neyman takes this to mean that Fisher was making a decision to reject the null hypothesis, rather than engaging in any sort of post-data "reasoning".
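Neyman's point that a value of P below 0.01 occurs with probability of about 0.01 even when the null hypothesis is exactly true is easy to check by simulation. The sketch below swaps in a one-sample t test on normal data for Fisher's $\chi^2$ test, because the t test's p value is exactly uniform under the null; the sample size, number of simulations, and seed are arbitrary choices of mine.

```r
## Simulate p values when the null hypothesis (mean = 0) is exactly true
set.seed(1)
n_sims <- 10000
p_values <- replicate(n_sims, t.test(rnorm(20))$p.value)

## Proportion of simulated p values below .01; should be close to .01
prop_below <- mean(p_values < 0.01)
print(prop_below)
```

Low p values turn up at the advertised rate even though nothing is going on, which is why "P is less than .01" cannot, by itself, deductively establish that the departures are not fortuitous.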

The main problem is that it is difficult to see how Neyman's philosophy is applicable to science, which, as Fisher pointed out, is primarily designed to incrementally increase knowledge, and concerned with graded evidence and beliefs. I wanted to understand how Neyman would interpret the results of an already-performed experiment. How could he avoid post-data evaluations of evidence?

Neyman's weather modification work

In a series of articles starting in the 1960s, Neyman was involved with the analysis of meteorological data from the Whitetop project, which was designed to evaluate the efficacy of cloud seeding to increase rainfall. In cloud seeding, particles (such as silver iodide) are scattered into clouds in the hope that the water or ice will condense around them. In theory, this should increase precipitation, because it is meant to mimic the natural processes underlying precipitation.

The Whitetop project was one of the first large-scale, randomized experiments in cloud seeding. It was designed to test whether silver iodide dropped from a plane in summer months could increase the probability of rain, or, given that rain occurred, how much rain actually fell. Silver iodide was dropped on random days within an area about 60 miles in radius around West Plains, Missouri in five summer seasons. This area at the time was suffering from low rainfall that threatened agriculture in the area.

There are a few relevant facts that we can lay out before we start:

The field of weather modification at the time (and perhaps still) was susceptible to widespread confirmation bias and what we would call today questionable research practices, including reliance on p hacking to explain away negative results in a primary outcome (Atlas, 1977).
In light of the previous point, it is perhaps not surprising that the efficacy of cloud seeding is still disputed. According to the National Research Council's Committee on the Status of and Future Directions in U.S. Weather Modification Research and Operations (2003): "The Committee concludes that there still is no convincing scientific proof of the efficacy of intentional weather modification efforts. In some instances there are strong indications of induced changes, but this evidence has not been subjected to tests of significance and reproducibility."
There was (is?) no known mechanism by which seeding could affect rainfall upwind at distances on the order of 100 miles (Braham, 1979).

Neyman's team of researchers at the University of California, Berkeley was not part of the original Whitetop project team. In a series of papers they analysed the Whitetop data with an interest in determining the long-distance, medium-term (1 day) effects of cloud seeding. In the next post, I will examine how Neyman presents data analyses in several papers, with a focus on how he and his team write about data analysis and statistical inference. The resulting analysis will be similar to how Neyman (1957) assessed Fisher's language, but in reverse: we will see that Neyman strongly favored evidential language and did not use decisions, error rates, or power to interpret the data.

Go to part 2 >>>

Bibliography for these posts

Atlas, D. (1977). The Paradox of Hail Suppression. Science, 195(4274), 139–145.

Braham, R. R. (1979). Field Experimentation in Weather Modification. Journal of the American Statistical Association, 74(365), 57–68.

Committee on the Status and Future Directions in U.S. Weather Modification Research and Operations, National Research Council. (2003). Critical issues in weather modification research. National Academies Press.

Fisher, R. A. (1955). Statistical Methods and Scientific Induction. Journal of the Royal Statistical Society. Series B (Methodological), 17, 69–78.

Lovasich, J. L., Neyman, J., Scott, E. L., & Wells, M. A. (1971). Hypothetical Explanations of the Negative Apparent Effects of Cloud Seeding in the Whitetop Experiment. Proceedings of the National Academy of Sciences of the United States of America, 68(11), 2643–2646.

Lovasich, J. L., Neyman, J., Scott, E. L., & Smith, J. A. (1969). Wind directions aloft and effects of seeding on precipitation in the Whitetop experiment. Proceedings of the National Academy of Sciences, 64(3), 810–817.

Lovasich, J. L., Neyman, J., Scott, E. L., & Wells, M. A. (1971). Further Studies of the Whitetop Cloud-Seeding Experiment. Proceedings of the National Academy of Sciences, 68(1), 147–151.

Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Review of the International Statistical Institute, 25, 7–22.

Neyman, J. (1977). A statistician’s view of weather modification technology (A Review). Proceedings of the National Academy of Sciences of the United States of America, 74(11), 4714–4721.

Neyman, J., Scott, E. L., & Smith, J. A. (1969). Whitetop Experiment (response to Battan). Science, 165(3893), 618.

Neyman, J., Scott, E. L., & Wells, M. A. (1969). Statistics in Meteorology. Review of the International Statistical Institute, 37(2), 119–148.

Neyman, J., Scott, E. L., & Smith, J. A. (1969). Areal Spread of the Effect of Cloud Seeding at the Whitetop Experiment. Science, 163(3874), 1445–1449.

BayesFactor version 0.9.12-2 released to CRAN

2015-09-24T13:52:00.000-07:00

I've released BayesFactor 0.9.12-2 to CRAN; it should be available on all platforms now. The changes include:

Added feature allowing fine-tuning of priors on a per-effect basis: see new argument rscaleEffects of lmBF, anovaBF, and generalTestBF

Fixed bug that disallowed logical indexing of probability objects

Fixed minor typos in documentation

Fixed bug causing regression Bayes factors to fail for very small R^2

Fixed bug disallowing expansion of dot (.) in generalTestBF model specifications

Fixed bug preventing cancelling of all analyses with interrupt

Restricted contingency prior to values >=1

All BFmodel objects have additional "analysis" slot giving details of analysis

Please help: BayesFactor testimonials

2015-09-09T12:52:00.001-07:00

I'm compiling a portfolio about the BayesFactor software, and I would love to have short comments (a few sentences to a paragraph) from people who have found the software useful. If you have used the software and you wouldn't mind sending me a short blurb about your experience, I'd love to hear from you! Please send your BayesFactor testimonial to richarddmorey@gmail.com. Thanks in advance!

On radical manuscript openness

2015-08-10T12:36:00.000-07:00

One of my papers that has attracted a lot of attention lately is "The Fallacy of Placing Confidence in Confidence Intervals," in which we describe some of the fallacies held by the proponents and users of confidence intervals. This paper has been discussed on twitter, reddit, on blogs (eg, here and here), and via email with people who found the paper in various places. A person unknown to me has used the article as the basis for edits to the Wikipedia article on confidence intervals. I have been told that several papers currently under review cite it. Perhaps this is a small sign that traditional publishers should be worried: this paper has not been "officially" published yet.

I am currently wrapping up the final revisions on the paper, which has been accepted pending minor revisions at Psychonomic Bulletin & Review. The paper has benefited from an extremely public revision process. When I had a new major version to submit, I published the text and all code on github, and shared it via social media. Some of the resulting discussions have been positive, others negative; some useful and enlightening, others not useful and frustrating. Most scientific publications almost exclusively reflect input from the coauthors and the editors and reviewers. This manuscript, in contrast, has been influenced by scores of people I've never met, and I think the paper is better for it.

This is all the result of my exploring ways to make my writing process more open, which led to the idea of releasing successive major versions of the text and R code on github with DOIs. But what about after it is published? How can manuscript openness continue after the magic moment of publication?

One of the downsides of the traditional scientific publishing model is that once the work is put into a "final" state, it becomes static. The PDF file format in which articles find their final form (and in which they are exchanged and read) enforces a certain rigidity, a rigor mortis. The document is dead and placed behind glass for the occasional passerby to view. It is of course good to have a citable version of record; we would not, after all, want a document to be a moving target, constantly changing on the whim of the authors. But it seems like we can do better than the current idea of a static, final document, and I'd like to try.

I have created a website for the paper that, on publication, will contain the text of the paper in its entirety, free to read for anyone. It also contains extra material, such as teaching ideas and interactive apps to assist in understanding the material in the paper. The version of the website corresponding to the "published" version of the paper will be versioned on github, along with the paper. But unlike the paper at the journal, a website is flexible, and I intend to take advantage of this in several ways.

First, I have enabled hypothes.is annotation across the entire text. If you open part of the text and look in the upper right hand corner, you will see three icons that can be used to annotate the text:

The hypothes.is annotation tools.

Moreover, highlighting a bit of text will open up further annotation tools:

Highlighting the text brings up more annotation tools.

Anyone can annotate the document, and others can see the annotations you make. Am I worried that on the Internet, some people might not add the highest quality annotations? A bit. But my curiosity to see how this will be used, and the potential benefits, outweighs my trepidation.

Second, I will update the site with new information, resources, and corrections. These changes will be versioned on github, so that anyone can see what the changes were. Due to the fact that the journal will have the version of record, there is no possibility of "hiding" changes to the website. So I get the best of both worlds: the trust that comes with having a clear record of the process, with the ability to change the document as the need arises. And the entire process can be open, through the magic of github.

Third, I have enabled together.js across every page of the manuscript. together.js allows collaboration between people looking at the same website. Unlike hypothes.is, together.js is meant for small groups to privately discuss the content, not for public annotation. This is mostly to explore its possibilities for teaching and discussion, but I also imagine it holds promise for post-publication review and drafting critiques of the manuscript.

The together.js collaboration tools allow making your mouse movements and clicks visible to others, text chat, and voice chat.

Critics could discuss the manuscript using together.js, chatting about the content of the manuscript. The communication in together.js is peer-to-peer, ensuring privacy; nothing is actually being managed by the website itself, except for making the collaboration tools available.

The best part of this is that it requires no action or support from the publisher. This is essentially a sophisticated version of a pre-print, which I would release anyway. We don't have to wait for the publishers to adopt policies and technologies friendly for post-publication peer review; we can do it ourselves. All of these tools are freely available, and anyone can use them. If you have any more ideas for tools that would be useful for me to add, let me know; the experiment hasn't even started yet!

Check out "The Fallacy of Placing Confidence in Confidence Intervals," play around with the tools, and let me know what you think.

Call for papers: Bayesian statistics, at Zeitschrift für Psychologie

2015-05-26T13:55:00.001-07:00

I am guest editing a special topical issue of Zeitschrift für Psychologie on Bayesian statistics. The complete call, with details, can be found here: [pdf]. Briefly:

As Bayesian statistics become part of standard analysis in psychology, the Zeitschrift für Psychologie invites papers to a topical issue highlighting Bayesian methods. We invite papers on a broad range of topics, including the benefits and limitations of Bayesian approaches to statistical inference, practical benefits of Bayesian methodologies, interesting applications of Bayesian statistics in psychology, and papers related to statistical education of psychologists from a Bayesian perspective. In addition to suggestions for full original or review articles, shorter research notes and opinion papers are also welcome.

We invite scholars from various areas of scholarship, including but not limited to psychology, statistics, philosophy, and mathematics, to submit their abstracts on potential papers.

Abstracts are due at the end of July. Critiques and articles about the history of Bayesian statistics are also welcome.

Visualizing statistical distributions with javascript

2015-05-10T02:37:00.000-07:00

For the past few years, I've been developing and using a library I created that allows me to easily generate visualizations of statistical distributions for teaching. One can specify a distribution along with a parametrization, and the library picks it up and generates a table of all the distributions, with links to interactive plots that let anyone see how changing the parameters affects the distribution. In addition, clicking on the plot allows finding areas under the distribution. Users can switch between PDF and CDF views. I've now opened the code on github.

You can also link directly to a visualization using URL parameters. For instance:

http://learnbayes.org/demo/stat-distributions-js/distributionDisplay.html?dist=normal&ptzn=2&plotxrng=50,150&rangesLo=50,3&rangesHi=150,45&starts=100,15
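Links like this can also be generated programmatically. The following is a minimal Python sketch, not part of the library itself: the parameter names are copied verbatim from the example link above, their meanings are documented only in the repository, and the helper name `distribution_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the example link in the post.
BASE_URL = "http://learnbayes.org/demo/stat-distributions-js/distributionDisplay.html"


def distribution_url(params):
    """Build a direct link from a dict of query parameters (hypothetical helper)."""
    # safe="," keeps the comma-separated values readable instead of %2C-escaped.
    return BASE_URL + "?" + urlencode(params, safe=",")


# Parameter names and values copied from the example link; nothing here is
# checked against what the page actually accepts.
example = {
    "dist": "normal",
    "ptzn": "2",
    "plotxrng": "50,150",
    "rangesLo": "50,3",
    "rangesHi": "150,45",
    "starts": "100,15",
}

url = distribution_url(example)
print(url)
```

Keeping the parameters in a dictionary makes it easy to produce a batch of links for a class handout by changing one entry at a time.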

See the live demo and the github repository for more details.

Example screenshots:

The fallacy of placing confidence in confidence intervals (version 2)

2015-04-20T23:53:00.000-07:00

I, with my coauthors, have submitted a new draft of our paper "The fallacy of placing confidence in confidence intervals". This paper is substantially modified from its previous incarnation. Here is the main argument:

"[C]onfidence intervals may not be used as suggested by modern proponents because this usage is not justified by confidence interval theory. If used in the way CI proponents suggest, some CIs will provide severely misleading inferences for the given data; other CIs will not. Because such considerations are outside of CI theory, developers of CIs do not test them, and it is therefore often not known whether a given CI yields a reasonable inference or not. For this reason, we believe that appeal to CI theory is redundant in the best cases, when inferences can be justified outside CI theory, and unwise in the worst cases, when they cannot."

The document, source code, and all supplementary material are available here on github.

Guidelines for reporting confidence intervals

2015-04-17T02:39:00.000-07:00

I'm working on a manuscript on confidence intervals, and I thought I'd share a draft section on the reporting of confidence intervals. The paper has several demonstrations of how CIs may, or may not, offer quality inferences, and how they can differ markedly from credible intervals, even ones with so-called "non-informative" priors.

Guidelines for reporting confidence intervals

Report credible intervals instead. We believe any author who chooses to use confidence intervals should ensure that the intervals correspond numerically with credible intervals under some reasonable prior. Many confidence intervals cannot be so interpreted, but if the authors know they can be, they should be called "credible intervals". This signals to readers that they can interpret the interval as they have been (incorrectly) told they can interpret confidence intervals. Of course, the corresponding prior must also be reported. This is not to say that one can't also call them confidence intervals if indeed they are; however, readers are likely more interested in the post-data properties of the procedure -- not the coverage -- if they are interested in arriving at substantive conclusions from the interval.

Do not use procedures whose Bayesian properties are not known. As Casella (1992) pointed out, the post-data properties of a procedure are necessary for understanding what can be inferred from an interval. Any procedure whose Bayesian properties have not been explored can have properties that make it unsuitable for post-data inference. Procedures whose properties have not been adequately studied are inappropriate for general use.

Warn readers if the confidence procedure does not correspond to a Bayesian procedure. If it is known that a confidence interval does not correspond to a Bayesian procedure, warn readers that the confidence interval cannot be interpreted as having an X% probability of containing the parameter, that it cannot be interpreted in terms of the precision of measurement, and that it cannot be said to contain the values that should be taken seriously: the interval is merely an interval that, prior to sampling, had an X% probability of containing the true value. Authors using confidence intervals have a responsibility to keep their readers from invalid inferences if they choose to use them, and it is almost sure that readers will misinterpret them without a warning (Hoekstra et al., 2014).

Never report a confidence interval without noting the procedure and the corresponding statistics. As we have described, there are many different ways to construct confidence intervals, and they will have different properties. Some will have better frequentist properties than others; some will correspond to credible intervals, and others will not. It is unfortunately common for authors to report confidence intervals without noting how they were constructed. As can be seen from the examples we've presented, this is a terrible practice because without knowing which confidence procedure was used, it is unclear what can be inferred. A narrow interval could correspond to very precise information or very imprecise information depending on which procedure was used. Not knowing which procedure was used could lead to very poor inferences. In addition, enough information should be presented so that any reader can compute a different confidence interval or credible interval. In most cases, this is covered by standard reporting practices, but in other cases more information may need to be given.

Consider reporting likelihoods or posteriors instead. An interval provides fairly impoverished information. Just as proponents of confidence intervals argue that CIs provide more information than a significance test (although this is debatable for many CIs), a likelihood or a posterior provides much more information than an interval. Recently, Cumming (2014) [see also here] has proposed so-called "cat's eye" intervals which are either fiducial distributions or Bayesian posteriors under a "non-informative" prior (the shape is the likelihood, but he interprets the area, so it must be a posterior or a fiducial distribution). With modern scientific graphics so easy to create, along with the fact that likelihoods are often approximately normal, we see no reason why likelihoods and posteriors cannot replace intervals in most circumstances. With a likelihood or a posterior, the arbitrariness of the confidence or credibility coefficient is avoided altogether.
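To make the first guideline concrete, here is a small Python sketch of what "corresponding numerically" can look like in the simplest textbook case: a normal mean with known standard deviation. This is an illustration under assumed numbers (a sample mean of 103, sigma of 15, N=25, and the two priors are all hypothetical), not an example from the manuscript. With a very diffuse normal prior the credible interval coincides with the usual confidence interval; with an informative prior it does not.

```python
from statistics import NormalDist


def confidence_interval(xbar, sigma, n, level=0.95):
    """Standard z interval for a normal mean with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5
    return (xbar - half_width, xbar + half_width)


def credible_interval(xbar, sigma, n, prior_mean, prior_sd, level=0.95):
    """Central posterior interval under a conjugate normal prior on the mean."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * xbar)
    posterior = NormalDist(post_mean, post_var ** 0.5)
    return (posterior.inv_cdf(0.5 - level / 2), posterior.inv_cdf(0.5 + level / 2))


# Hypothetical data summary, chosen only for illustration.
xbar, sigma, n = 103.0, 15.0, 25

ci = confidence_interval(xbar, sigma, n)
diffuse = credible_interval(xbar, sigma, n, prior_mean=100.0, prior_sd=1e6)
informative = credible_interval(xbar, sigma, n, prior_mean=100.0, prior_sd=5.0)

print("confidence interval:        ", ci)
print("credible, diffuse prior:    ", diffuse)
print("credible, informative prior:", informative)
```

In this particular model the interval can honestly be reported as a credible interval as long as the diffuse prior is reported alongside it; that will not be true of confidence procedures in general.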

All about that "bias, bias, bias" (it's no trouble)

2015-04-10T02:34:00.001-07:00

At some point, everyone who fiddles around with Bayes factors with point nulls notices something that, at first blush, seems strange: small effect sizes seem “biased” toward the null hypothesis. In null hypothesis significance testing, power simply increases when you change the true effect size. With Bayes factors, there is a non-monotonicity where increasing the sample size will slightly increase the degree to which a small effect size favors the null, then the small effect size becomes evidence for the alternative. I recall puzzling over this with Jeff Rouder years ago when drafting our 2009 paper on Bayesian t tests.

Uri Simonsohn has a blog post critiquing default Bayes factors for their supposed “bias” toward the null hypothesis for small sample sizes. I have several brief responses:

1. We do not suggest a “default” prior; we suggest a family of default priors, with an adjustable prior parameter (see also this paper describing our view, which is conditionally accepted at Multivariate Behavioral Research). If you're looking for a small effect, adjust the prior.
2. The whole point of a Bayes factor analysis is that you do not know what the true effect size is (see Jeff Rouder and Joe Hilgard's response here). Noting that the Bayes factor will mislead when you know there's a small effect, but you use a prior that says the effect size is probably moderate to large, is not useful. Bayes factors just do what you ask them to do!
3. More broadly, though, I think it is helpful to think about this supposed “bias”. Is it what we would expect for a reasonable method? Sometimes our intuitions fail us, and we end up thinking something undesirable, when actually we should be worried if that thing didn't happen.

The third point is what this blog post is about. Here, I show that the “bias” toward the null for small effect sizes is exactly what must happen for any reasonable method that meets four simple desiderata.

We start with the idea of a measure of evidence comparing some composite alternative hypothesis to the null hypothesis. For our purposes here, it could be any measure of evidence; it does not have to be a Bayes factor. What we will do is set a number of reasonable desiderata on the properties of this evidence measure, and show that the so-called “bias” in favor of the null for small effect sizes must occur.

We assume that our data can be summarized in terms of an effective sample size and an (observed) effect size measure. This effect size should have a “nullest” member (for instance, d=0, or R²=0). For any given sample size, the evidence against the null will be an increasing function of this observed effect size. We also need the concept of “no”, or equivocal, evidence; that is, that the data do not favor either hypothesis. This defines a 0 point on the evidence scale, whatever it is.

The important concept for our demonstration is the idea of a bivariate space of sample size vs evidence. Sample size begins at 0 and increases along the x axis, and “no” evidence is marked on the y axis. We can think of sample size abstractly as indexing the amount of information in the data. We are going to imagine fixing an observed effect size and varying the sample size, which will trace a curve through this bivariate space:

A bivariate sample size / evidence space.

We can now give four desired properties that any evidence measure should have.

Desideratum 1: The evidence with no data is “equivocal”.

If we observe no data, the strength of the evidence does not favor either hypothesis. Whatever the “0 evidence” point in the evidence space, having no data must put you there.

[For a Bayes factor, this means that prior odds and the posterior odds are the same — with no data, they don't change — and the log Bayes factor is 0.]

Desideratum 2: The evidence for a “null” observed effect size is an increasing function of sample size, in favor of the null. However much evidence a “null” observed effect provides, no other observed effect size can exceed it.

For instance, if we observe d=0 with N=1000 participants, this is more convincing evidence in favor of the null than if we had observed d=0 with N=10. Obviously, this null observed effect should offer the most evidence possible, for a given sample size.

Desideratum 3: A fixed non-null observed effect size must yield arbitrarily large amounts of evidence as sample size increases.

If we observe d=.3, with 10 participants, this isn't terribly convincing; but if we observed d=.3 with more and more participants, we are increasingly sure that the null hypothesis is false. In the bivariate space, this means that all non-null effect size curves eventually must end up either at -∞ or at an asymptote at some large value in favor of the alternative.

Desideratum 4: The closer an observed effect size is to the null effect size, the more its curve “looks like” the null.

This is just a smoothness assumption. The conclusions we obtain from observing d=0 should be very close to the ones we obtain from d=.001 and even closer to those we obtain from d=.0000001. Of course, this smoothness should also hold for all other observed effect sizes, not just the null, but for our purposes here the observed null is what is important.

For small sample sizes, this means that the curves for small effect sizes must be near the null effect size lines in the bivariate space. As we increase the sample size, of course, those lines must diverge downward.

The effect of these four desiderata is to ensure that small effect sizes “look” null. This is not a consequence of the Bayes factor, or the prior, but rather of very reasonable conditions that any evidence measure would fulfil. For a Bayes factor, of course, how these lines move through the bivariate space (and how small an effect size will need to be in order to “look” null) will be sensitive to the prior on the alternative, as it must be. But the behaviour described by Simonsohn is a natural consequence of very reasonable assumptions.

Although it is counterintuitive, we would be worried if it didn't happen for some measure of evidence.
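The four desiderata can be checked numerically on a toy evidence measure. The sketch below is an assumption-laden simplification, not the default Bayes factor from our papers: it treats the observed effect size d as the mean of n unit-variance normal observations and places a normal prior with standard deviation `tau` on the true effect under the alternative (the default priors discussed above are a different family). Under those assumptions the log Bayes factor has a closed form. Positive values here favor the alternative, so these curves rise where the description above has the lines diverging downward.

```python
import math


def log_bf10(n, d, tau=1.0):
    """Log Bayes factor, alternative over null, in a simplified model.

    Assumes the observed effect size d is the mean of n unit-variance normal
    observations and the true effect is Normal(0, tau^2) under the alternative.
    """
    shrink = 1.0 + n * tau ** 2
    return -0.5 * math.log(shrink) + (n * d * tau) ** 2 / (2.0 * shrink)


sample_sizes = (0, 10, 100, 1000, 10000)
for d in (0.0, 0.001, 0.1, 0.3):
    curve = [round(log_bf10(n, d), 2) for n in sample_sizes]
    print(f"d={d}: {curve}")
```

Every curve starts at 0 with no data, the d=0 curve moves steadily toward the null, the d=0.001 curve is nearly indistinguishable from it at small n, and the d=0.1 curve first dips toward the null before turning around and heading off in favor of the alternative. Changing `tau` moves the turning point, which is the sensitivity to the prior described above.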

Some thoughts on replication

2015-04-09T15:23:00.001-07:00

In a recent blog post, Simine Vazire discusses the problem with the logic of requiring replicators to explain when they reach different conclusions to the original authors. She frames it, correctly, as asking people to over-interpret random noise. Vazire identifies the issue as a problem with our thinking: that we under-estimate randomness. I'd like to explore other ways in which our biases interfere with clear thinking about replication, and perhaps suggest some ways we can clarify it.

I suggest two ways in which we fool ourselves in thinking about replication: first, the concept of "replication" is unnecessarily asymmetric and an example of overly-linear thinking; second, a lack of distinction in practice causes a lack of distinction in theory.

Fooled by language: the asymmetry of "replication"

Imagine that a celebrated scientist, Dr. Smith, dies, and within her notes is discovered a half-written paper. Building on her previous work, this paper clearly lays out a creative experiment to test a theory. To avoid any complications such as post hoc theorising, assume the link between the theory and experiment is clear and follows from her previous work. On Dr. Smith's computer, along with the paper, is found a data set. Dr. Smith's colleagues decide to finish the paper and publish it in her honor.

Given the strange circumstances of this particular paper's history, another scientist, Dr. Jones, decides to replicate the study. Dr. Jones does his best to match the methods described in the paper, but obtains a different result. Dr. Jones tries to publish, but editors and reviewers demand an explanation: why is the replication different? Dr. Jones' result is doubted until he can explain the difference.

Now suppose -- unbeknownst to everyone -- that the first experiment was never done. Dr. Smith simulated the data set as a pedagogical exercise to learn a new analysis technique. She never told anyone because she did not anticipate dying, of course, but everyone assumed the data was real. The second experiment is no replication at all; it is the first experiment done.

Does this change the evidential value of Dr. Jones' experiment at all? Of course not. The fact that Dr. Smith's experiment was not done is irrelevant to the evidence in Dr. Jones' experiment. The evidence contained in a first experiment is the same, regardless of whether a second experiment is done (assuming, of course, that the methods are all sound). "Replication" is a useless label.

Calling Dr. Jones' experiment a "replication" focuses our attention on the wrong relationship. One replicates an actual experiment that was done. However, the evidence that an experiment provides for a theory depends not on the relationship between the experiment's methods and an experiment that was done in the past. Rather, the evidence depends on the relationship between the experiment's methods and a hypothetical experiment that is designed to test the theory. One cannot replicate a hypothetical experiment, of course, because hypothetical experiments cannot be performed. Instead, one realizes a hypothetical experiment, and there may be several realizations of the same hypothetical experiment.

Thinking in this manner eliminates the asymmetric relationship between the two experiments. If both experiments can be realizations of the same hypothetical experiment designed to test a theory, which one came first is immaterial.* The burden is no longer on the second experimenter to explain why the results are different; the burden is on the advocates of the theory to explain the extant data, which now includes two differing results. (Vazire's caution about random noise still applies here, as we still don't want to over-explain differences; it is assumed that any post hoc explanation will be tested.)

Three hypothetical experiments that are tests of the same theory, along with five actually-run experiments. Hypothetical experiments B and C may be so-called "conceptual replications" of A, or tests of other aspects of the theory.

The conceptual distinction between a hypothetical experiment -- that is, the experiment that is planned -- and the actual experiment is critical. That hypothetical experiment can be realized in many ways: different times, different labs, different participants, even different stimuli, if these are randomly generated or are selected from a large collection of interchangeable stimuli. Importantly, when the first realization of the hypothetical experiment is done, it does not get methodological priority. It is temporally first, but is simply one way in which the experiment could have been realized.

Conceptualizing the scientific process in this way prevents researchers who did an experiment first from claiming that their experiment takes priority. If you are "replicating" their actual experiment, then it makes sense that your results will get compared to theirs, in the same way a "copy" might be compared to the "original". But conceptually, the two are siblings, not parent and child.

Lack of distinction in practice vs. theory

The critical distinction above is the one between a hypothetical experiment and an actual one. I think this is an instance where modern scientific practice causes problems. Although the idea of a hypothetical experiment arises in any experimental planning process, consider the typical scientific paper, which has an introduction, then a brief (maybe even just a few sentences!) segue describing the logic of the experiment, into the methods of an actually-performed experiment.

This structure means that the hypothetical experiment and the actual experiment are impossible to disentangle. This is one of the reasons, I think, why we talk about "replication" so much, rather than performing another realization of the hypothetical experiment. We have no hypothetical experiment to work from, because it is almost completely conflated with the actual experiment.

One initiative that will help with this problem is public pre-registration. A hypothetical experiment is laid out in a pre-registration document. Note that from a pre-registration document, the structure in the figure becomes clear. If someone posts a public pre-registration document, why does it matter who does the experiment first (aside from the ethical issue of "scooping", etc)? No one is "replicating" anyone else; they are each separately realizing the hypothetical experiment that was planned.

But in current practice, which does not typically distinguish between a hypothetical experiment and an actual one, the only way to add to the scientific literature about hypothetical experiment A is to try to "redo" one of its realizations. Any subsequent experiment is then logically dependent on the first actually performed experiment, and the unhelpful asymmetry crops up again.

I think it would be useful to have a different word than "replication", because the connotation of the word "replication", as a facsimile or a copy of something already existing, focuses our attention in unhelpful ways.

* Although logically which came first is immaterial, there may be statistical considerations to keep in mind, like the "statistical significance filter" that is more likely to affect a first study than a second. Also, as Vazire points out in the comments, the second study has fewer researcher degrees of freedom.

My favorite Neyman passage: on confidence intervals

2015-04-09T08:38:00.000-07:00

I've been doing a lot of reading on confidence interval theory. Some of the reading is more interesting than others. There is one passage from Neyman's (1952) book "Lectures and Conferences on Mathematical Statistics and Probability" (available here) that stands above the rest in terms of clarity, style, and humor. I had not read this before the last draft of our confidence interval paper, but for those of you who have read it, you'll recognize that this is the style I was going for. Maybe you have to be Jerzy Neyman to get away with it.

Neyman gets bonus points for the footnote suggesting the "eminent", "elderly" boss is so obtuse (a reference to Fisher?) and that the young frequentists should be "remind[ed] of the glory" of being burned at the stake. This is just absolutely fantastic writing. I hope you enjoy it as much as I did.

[begin excerpt, p. 211-215]

[Neyman is discussing using "sampling experiments" (Monte Carlo experiments with tables of random numbers) in order to gain insight into confidence intervals. $\theta$ is a true parameter of a probability distribution to be estimated.]

The sampling experiments are more easily performed than described in detail. Therefore, let us make a start with $\theta_1 = 1$, $\theta_2 = 2$, $\theta_3 = 3$ and $\theta_4 = 4$. We imagine that, perhaps within a week, a practical statistician is faced four times with the problem of estimating $\theta$, each time from twelve observations, and that the true values of $\theta$ are as above [ie, $\theta_1,\ldots,\theta_4$] although the statistician does not know this. We imagine further that the statistician is an elderly gentleman, greatly attached to the arithmetic mean and that he wishes to use formulae (22). However, the statistician has a young assistant who may have read (and understood) modern literature and prefers formulae (21). Thus, for each of the four instances, we shall give two confidence intervals for $\theta$, one computed by the elderly Boss, the other by his young Assistant.

[Formulae 21 and 22 are simply different 95% confidence procedures. Formula 21 has better frequentist properties; Formula 22 is inferior, but the Boss likes it because it is intuitive to him.]

Using the first column on the first page of Tippett's tables of random numbers and performing the indicated multiplications, we obtain the following four sets of figures.

The last two lines give the assertions regarding the true value of $\theta$ made by the Boss and by the Assistant, respectively. The purpose of the sampling experiment is to verify the theoretical result that the long run relative frequency of cases in which these assertions will be correct is, approximately, equal to $\alpha = .95$.

You will notice that in three out of the four cases considered, both assertions (the Boss' and the Assistant's) regarding the true value of $\theta$ are correct and that in the last case both assertions are wrong. In fact, in this last case the true $\theta$ is 4 while the Boss asserts that it is between 2.026 and 3.993 and the Assistant asserts that it is between 2.996 and 3.846. Although the probability of success in estimating $\theta$ has been fixed at $\alpha = .95$, the failure on the fourth trial need not discourage us. In reality, a set of four trials is plainly too short to serve for an estimate of a long run relative frequency. Furthermore, a simple calculation shows that the probability of at least one failure in the course of four independent trials is equal to .1855. Therefore, a group of four consecutive samples like the above, with at least one wrong estimate of $\theta$, may be expected one time in six or even somewhat oftener. The situation is, more or less, similar to betting on a particular side of a die and seeing it win. However, if you continue the sampling experiment and count the cases in which the assertion regarding the true value of $\theta$, made by either method, is correct, you will find that the relative frequency of such cases converges gradually to its theoretical value, $\alpha= .95$.

Let us put this into more precise terms. Suppose you decide on a number $N$ of samples which you will take and use for estimating the true value of $\theta$. The true values of the parameter $\theta$ may be the same in all $N$ cases or they may vary from one case to another. This is absolutely immaterial as far as the relative frequency of successes in estimation is concerned. In each case the probability that your assertion will be correct is exactly equal to $\alpha = .95$. Since the samples are taken in a manner insuring independence (this, of course, depends on the goodness of the table of random numbers used), the total number $Z(N)$ of successes in estimating $\theta$ is the familiar binomial variable with expectation equal to $N\alpha$ and with variance equal to $N\alpha(1 - \alpha)$. Thus, if $N = 100$, $\alpha = .95$, it is rather improbable that the relative frequency $Z(N)/N$ of successes in estimating $\theta$ will differ from $\alpha$ by more than

\[
2\sqrt{\frac{\alpha(1-\alpha)}{N}} = .042
\]

This is the exact meaning of the colloquial description that the long run relative frequency of successes in estimating $\theta$ is equal to the preassigned $\alpha$. Your knowledge of the theory of confidence intervals will not be influenced by the sampling experiment described, nor will the experiment prove anything. However, if you perform it, you will get an intuitive feeling of the machinery behind the method which is an excellent complement to the understanding of the theory. This is like learning to drive an automobile: gaining experience by actually driving a car compared with learning the theory by reading a book about driving.

Among other things, the sampling experiment will attract attention to the frequent difference in the precision of estimating $\theta$ by means of the two alternative confidence intervals (21) and (22). You will notice, in fact, that the confidence intervals based on $X$, the greatest observation in the sample, are frequently shorter than those based on the arithmetic mean $\bar{X}$. If we continue to discuss the sampling experiment in terms of cooperation between the eminent elderly statistician and his young assistant, we shall have occasion to visualize quite amusing scenes of indignation on the one hand and of despair before the impenetrable wall of stiffness of mind and routine of thought on the other.^{[See footnote]} For example, one can imagine the conversation between the two men in connection with the first and third samples reproduced above. You will notice that in both cases the confidence interval of the Assistant is not only shorter than that of the Boss but is completely included in it. Thus, as a result of observing the first sample, the Assistant asserts that

\[
.956 \leq \theta \leq 1.227.
\]

On the other hand, the assertion of the Boss is far more conservative and admits the possibility that $\theta$ may be as small as .688 and as large as 1.355. And both assertions correspond to the same confidence coefficient, $\alpha = .95$! I can just see the face of my eminent colleague redden with indignation and hear the following colloquy.

Boss: "Now, how can this be true? I am to assert that $\theta$ is between .688 and 1.355 and you tell me that the probability of my being correct is .95. At the same time, you assert that $\theta$ is between .956 and 1.227 and claim the same probability of success in estimation. We both admit the possibility that $\theta$ may be some number between .688 and .956 or between 1.227 and 1.355. Thus, the probability of $\theta$ falling within these intervals is certainly greater than zero. In these circumstances, you have to be a nit-wit to believe that
\[
\begin{eqnarray*}
P\{.688 \leq \theta \leq 1.355\} &=& P\{.688 \leq \theta < .956\} + P\{.956 \leq \theta \leq 1.227\}\\
&& + P\{1.227 \leq \theta \leq 1.355\}\\
&=& P\{.956 \leq \theta \leq 1.227\}.\mbox{''}
\end{eqnarray*}
\]

Assistant: "But, Sir, the theory of confidence intervals does not assert anything about the probability that the unknown parameter $\theta$ will fall within any specified limits. What it does assert is that the probability of success in estimation using either of the two formulae (21) or (22) is equal to $\alpha$."

Boss: "Stuff and nonsense! I use one of the blessed pair of formulae and come up with the assertion that $.688 \leq \theta \leq 1.355$. This assertion is a success only if $\theta$ falls within the limits indicated. Hence, the probability of success is equal to the probability of $\theta$ falling within these limits —."

Assistant: "No, Sir, it is not. The probability you describe is the a posteriori probability regarding $\theta$, while we are concerned with something else. Suppose that we continue with the sampling experiment until we have, say, $N = 100$ samples. You will see, Sir, that the relative frequency of successful estimations using formulae (21) will be about the same as that using formulae (22) and that both will be approximately equal to .95."

I do hope that the Assistant will not get fired. However, if he does, I would remind him of the glory of Giordano Bruno who was burned at the stake by the Holy Inquisition for believing in the Copernican theory of the solar system. Furthermore, I would advise him to have a talk with a physicist or a biologist or, maybe, with an engineer. They might fail to understand the theory but, if he performs for them the sampling experiment described above, they are likely to be convinced and give him a new job. In due course, the eminent statistical Boss will die or retire and then —.

[footnote] Sad as it is, your mind does become less flexible and less receptive to novel ideas as the years go by. The more mature members of the audience should not take offense. I, myself, am not young and have young assistants. Besides, unreasonable and stubborn individuals are found not only among the elderly but also frequently among young people.

[end excerpt]
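
The Assistant's sampling experiment is easy to run on a computer. The sketch below is not Neyman's code, and formulae (21) and (22) are not reproduced in this excerpt, so both interval rules here are assumptions: the observations are taken to be uniform on $(0, \theta)$ with $n = 12$ per sample, the Boss's interval is taken to come from the normal approximation to $\bar{X}$, and the Assistant's from the greatest observation. These choices appear consistent with the endpoints quoted above, but treat them as a reconstruction. The function names are hypothetical.

```python
import random

def mean_interval(sample, z=1.96):
    """Assumed form of the Boss's interval: normal approximation to the sample mean."""
    n = len(sample)
    xbar = sum(sample) / n
    # Standard deviation of the mean of n Uniform(0, theta) draws, in units of theta.
    c = z / (12 * n) ** 0.5
    return xbar / (0.5 + c), xbar / (0.5 - c)

def max_interval(sample, conf=0.95):
    """Assumed form of the Assistant's interval: based on the greatest observation."""
    n = len(sample)
    m = max(sample)
    return m, m / (1 - conf) ** (1 / n)

def sampling_experiment(theta=1.0, n=12, n_samples=20000, seed=1):
    """Draw many samples and record how often each rule covers the true theta."""
    rng = random.Random(seed)
    hits = {"mean": 0, "max": 0}
    total_len = {"mean": 0.0, "max": 0.0}
    for _ in range(n_samples):
        sample = [rng.uniform(0, theta) for _ in range(n)]
        for name, rule in (("mean", mean_interval), ("max", max_interval)):
            lo, hi = rule(sample)
            hits[name] += lo <= theta <= hi
            total_len[name] += hi - lo
    coverage = {k: v / n_samples for k, v in hits.items()}
    avg_len = {k: v / n_samples for k, v in total_len.items()}
    return coverage, avg_len

coverage, avg_len = sampling_experiment()
print("relative frequency of success:", coverage)
print("average interval length:", avg_len)
```

Under these assumptions both rules succeed in about 95% of samples, which is all the confidence coefficient promises, while the interval based on the greatest observation is much shorter on average. Neither number says anything about the probability that $\theta$ lies in one particular computed interval.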

The TES Challenge to Greg Francis

2015-03-29T05:26:00.000-07:00

This post is a follow-up to my previous post, “Statistical alchemy and the 'test for excess significance'”. In the comments on that post, Greg Francis objected to my points about the Test for Excess Significance. I laid out a challenge in which I would use simulation to demonstrate these points. Greg Francis agreed to the details; this post is about the results of the simulations (with links to the code, etc.).

A challenge

In my previous post, I said this:

Morey: “…we have bit of a mystery. That $E$ [the expected number of non-significant studies in a set of $n$ studies] equals the sum of the expected [Type II error] probabilities is merely asserted [by Ioannidis and Trikalinos]. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false.”

Greg Francis replied:

Francis: “…none of your examples of the falseness of the equation are valid because you fix the number of studies to be n, which is inconsistent with your proposed study generation process. Your study generation process works if you let n vary, but then the Ioannidis & Trikalinos formula is shown to be correct…[i]n short, you present impossible sampling procedures and then complain that the formula proposed by Ioannidis & Trikalinos does not handle your impossible situations.”

To which I replied,

Morey: “If you don’t believe me, here’s a challenge: you pick a power and a random seed. I will simulate a very large ‘literature’ according to the ‘experimenter behaviour’ of my choice, importantly with no publication bias or other selection of studies. I will guarantee that I will use a behaviour that will generate experiment set sizes of 5. I will save the code and the ‘literature’ coded in terms of ‘sets’ of studies and how many significant and nonsignificant studies there are. You get to guess what the average number of significant studies are in sets of 5 via I&T’s model, along with a 95% CI (I’ll tell you the total number of such studies). That is, we’re just using Monte Carlo to estimate the expected number of significant studies in sets of experiments n=5; that is, precisely what I&T use as the basis of their model (for the special case of n=5).” “This will answer the question of ‘what is the expected number of nonsignificant studies in a set of n?’”

This challenge will very clearly show that my situations are not “impossible”. I can sample them in a very simple simulation. Greg Francis agreed to the simulation:

Francis: “Clearly at least one of us is confused. Maybe we can sort it out by trying your challenge. Power=0.5, random seed= 19374013”

I further clarified:

Morey: “Before I do this, though, I want to make sure that we agree on what this will show. I want to show that the expected number of nonsignificant studies in a set of n (=5) studies is not what I&T say it is, and hence, the reasoning behind the test is flawed (because ‘excess significance’ is defined as deviation from this expected number). I also want to be clear what the prediction is here: Since the power of the test is .5, according to I&T, the expected number of nonsignificant studies in a set of 5 is 2.5. Agreed?”

…to which Greg Francis agreed.

I have performed this simulation. Before reading on, you should read the web page containing the results:

Web page (with code) outlining the results: http://learnbayes.org/talks/TES/TESsimulation.html
Source .Rmd file: http://learnbayes.org/talks/TES/TESsimulation.Rmd

The table below shows the results of simulating 1,000,000 “sets” of studies. All simulated “studies” are published in this simulation; no questionable research practices are involved. The first column shows $n$, and the second column shows the average number of non-significant studies for sets of $n$, which is a Monte Carlo estimate of I&T's $E$. As you can see, for $n=5$ it is 1, not 2.5.

Total studies (n)	Mean nonsig. studies	Expected by TES (E)	Count
1	1	0.5	499917
2	1	1.0	249690
3	1	1.5	125269
4	1	2.0	62570
5	1	2.5	31309
6	1	3.0	15640
7	1	3.5	7718
8	1	4.0	3958
9	1	4.5	1986
10	1	5.0	975

(I have truncated the table at $n=10$; see the HTML file for the full table.)

I also showed that you can change the experimenter's behaviour and make the average 2.5. This indicates that the assumptions one makes about experimenter behaviour matter to the expected number of non-significant studies in a particular set. Across all sets of studies, the proportion of significant studies is expected to equal the power. However, how the significant and non-significant studies are distributed across sets of different sizes is a function of the decision rule.
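
To see how the decision rule drives the result, here is a minimal sketch in Python. It is not the code behind the table (that is in the linked .Rmd file), and the stopping rule is an assumption: the experimenter keeps running studies until the first non-significant result and then stops, which is one rule that is consistent with the pattern in the table. The second rule simply fixes the set size at 5 in advance. Each study is significant with probability equal to the power (0.5), every study is kept, and the seed is arbitrary. All function names are hypothetical.

```python
import random
from collections import defaultdict

POWER = 0.5

def run_until_nonsig(rng, power=POWER):
    """Assumed rule: run studies until the first non-significant one, then stop."""
    outcomes = []
    while True:
        significant = rng.random() < power
        outcomes.append(significant)
        if not significant:
            return outcomes

def run_fixed(rng, n=5, power=POWER):
    """Alternative rule: decide on n studies in advance."""
    return [rng.random() < power for _ in range(n)]

def summarize(sets):
    """Map set size n -> (mean number of non-significant studies, count of sets)."""
    totals = defaultdict(lambda: [0, 0])
    for outcomes in sets:
        entry = totals[len(outcomes)]
        entry[0] += outcomes.count(False)
        entry[1] += 1
    return {n: (nonsig / count, count) for n, (nonsig, count) in sorted(totals.items())}

rng = random.Random(12345)  # arbitrary seed, not the one used for the table
sequential = [run_until_nonsig(rng) for _ in range(100_000)]
fixed = [run_fixed(rng) for _ in range(100_000)]

seq_table = summarize(sequential)
fixed_table = summarize(fixed)
all_studies = [s for outcomes in sequential for s in outcomes]
prop_significant = sum(all_studies) / len(all_studies)

print("sets of 5, sequential rule:", seq_table[5])
print("sets of 5, fixed-n rule:   ", fixed_table[5])
print("overall proportion significant (sequential):", prop_significant)
```

Under the sequential rule every set contains exactly one non-significant study by construction, whatever its size, yet the overall proportion of significant studies still comes out near the power. Under the fixed-$n$ rule the mean for sets of 5 is close to 2.5. Neither rule involves any selection of studies.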

The expression for the expected number of non-significant studies in a set of $n$ is not correct (without further very strong, unwarranted assumptions).

Two things to stop saying about null hypotheses

2015-03-28T03:18:00.000-07:00

There is a currently fashionable way of describing Bayes factors that resonates with experimental psychologists. I hear it often, particularly as a way to describe a particular use of Bayes factors. For example, one might say, “I needed to prove the null, so I used a Bayes factor,” or “Bayes factors are great because with them, you can prove the null.” I understand the motivation behind this sort of language but please: stop saying one can “prove the null” with Bayes factors.

I also often hear other people say “but the null is never true.” I'd like to explain why we should avoid saying both of these things.

Null hypotheses are tired of your jibber jabber

Why you shouldn't say “prove the null”

Statistics is complicated. People often come up with colloquial ways of describing what a particular method is doing: for instance, one might say a significance test gives us “evidence against the null”; one might say that a “confidence interval tells us the 95% most plausible values”; or one might say that a Bayes factor helps us “prove the null.” Bayesians often are quick to correct misconceptions that people use to justify their use of classical or frequentist methods. It is just as important to correct misconceptions about Bayesian methods.

In order to understand why we shouldn't say “prove the null”, consider the following situation: You have a friend who claims that they can affect the moon with their mind. You, of course, think this is preposterous. Your friend looks up at the moon and says “See, I'm using my abilities right now!” You check the time.

You then decide to head to the local lunar seismologist, who has good records of subtle moon tremors. You ask her about what happened at the time your friend was looking at the moon, and she reports back to you that lunar activity at that time was stronger than it typically is 95% of the time (and thus passes the bar for “statistical significance”).

Does this mean that there is evidence for your friend's assertion? The answer is “no.” Your friend made no statement about what one would expect from the seismic data. In fact, your friend's statement is completely unfalsifiable (as is the case with the typical “alternative” in a significance test, $\mu\neq0$).

But consider the following alternative statements your friend could have made: “I will destroy the moon with my mind”; “I will make very large tremors (with magnitude $Y$)”; “I will make small tremors (with magnitude $X$).” How do we now regard your friend's claims in light of what happened?

“I will destroy the moon with my mind” is clearly inconsistent with the data. You (the null) are supported by an infinite amount, because you have completely falsified your friend's statement that they would destroy the moon (the alternative).
“I will make very large tremors (with magnitude $Y$)” is also inconsistent with the data, but if we allow a range of uncertainty around the claim, it may not be completely falsified. Thus you (the null) are supported, but not by as much as in the first situation.
“I will make small tremors (with magnitude $X$)” may support you (the null) or your friend (the alternative), depending on how the predicted and observed magnitudes compare.

Here we can see that the support for the null depends on the alternative at hand. This is, of course, as it must be. Scientific evidence is relative. We can never “prove the null”: we can only “find evidence for a specified null hypothesis against a reasonable, well-specified alternative”. That's quite a mouthful, it's true, but “prove the null” creates misunderstandings about Bayesian statistics, and makes it appear that it is doing something it cannot do.
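
A small numerical sketch may make the relativity of evidence concrete. Everything here is hypothetical: the observed tremor magnitude, the magnitudes $X$ and $Y$ claimed by your friend, and the assumption that the measurement is normally distributed around the true magnitude with a known standard deviation. The function `bf01` is an illustrative helper, not part of the BayesFactor package; it returns the Bayes factor for the null (no effect) against one specified alternative.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf01(observed, alt_mean, alt_sd=0.0, noise_sd=1.0):
    """Bayes factor for the null (true magnitude 0) over one alternative.

    The alternative says the true magnitude is Normal(alt_mean, alt_sd);
    alt_sd = 0 is a point prediction. Marginally, the observation is then
    Normal(alt_mean, sqrt(noise_sd^2 + alt_sd^2)).
    """
    null_density = normal_pdf(observed, 0.0, noise_sd)
    alt_density = normal_pdf(observed, alt_mean, math.hypot(noise_sd, alt_sd))
    return null_density / alt_density

observed = 0.3               # hypothetical observed magnitude, in units of noise_sd
small_X, large_Y = 1.0, 5.0  # hypothetical claimed magnitudes

print("null vs. small tremors X:          ", bf01(observed, small_X))
print("null vs. large tremors Y:          ", bf01(observed, large_Y))
print("null vs. large Y, with uncertainty:", bf01(observed, large_Y, alt_sd=2.0))
print("small X, but a larger observation: ", bf01(1.2, small_X))
```

The same observation supports the null strongly against the large-tremor claim, less strongly when that claim is given a range of uncertainty, and only weakly against the small-tremor claim; a somewhat larger observation would favor the small-tremor claim instead. The “destroy the moon” claim corresponds to the limiting case in which the alternative gives the observed data no probability at all, so the Bayes factor for the null grows without bound.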

In a Bayesian setup, the null and alternative are both models and the relative evidence between them will change based on how we specify them. If we specify them in a reasonable manner, such that the null and alternative correspond to relevant theoretical viewpoints or encode information about the question at hand, the relative statistical evidence will be informative for our research ends. If we don't specify reasonable models, then the relative evidence between the models may be correct, but useless.

We never “prove the null” or “compute the probability of the null hypothesis”. We can only compare a null model to an alternative model, and determine the relative evidence.

[See also Gelman and Shalizi (2013) and Morey, Romeijn and Rouder (2013)]

Why you shouldn't say “the null is never true”

A common retort to tests including a point null (often called a 'null' hypothesis) is that “the null is never true.” This is backed up by four sorts of “evidence”:

A quote from an authority: “Tukey or Cohen said so!” (Tukey was smart, but this is not an argument.)
Common knowledge / “experience”: “We all know the null is impossible.” (This was Tukey's “argument”.)
Circular: “The area under a point in a density curve is 0.” (Of course if your model doesn't have a point null, the point null will be impossible.)
All models are “false” (even if this were true --- I think it is actually a category error --- it would equally apply to all alternatives as well)

The most attractive seems to be the second, but it should be noted that people almost never use techniques that allow finding evidence for null hypotheses. Under these conditions, how is one determining that the null is never true? If a null were ever true, we would not be able to accumulate evidence for it, so the second argument definitely has a hint of circularity as well.

When someone says “The null hypothesis is impossible/implausible/irrelevant”, what they are saying in reality is “I don't believe the null hypothesis can possibly be true.” This is a totally fine statement, as long as we recognize it for what it is: an a priori commitment. We should not pretend that it is anything else; I cannot see any way that one can find universal evidence for the statement “the null is impossible”.

If you find the null hypothesis implausible, that's OK. Others might not find it implausible. It is ultimately up to substantive experts to decide what hypotheses they want to consider in their data analysis, and not up to methodologists or statisticians to tell experts what to think.

Any automatic behavior, whether automatically rejecting all null hypotheses or automatically testing null hypotheses, is bad. Hypothesis testing and estimation should be considered and deliberate. Luckily, Bayesian statistics allows both to be done in a principled, coherent manner, so informed choices can be made by the analyst and not by the restrictions of the method.

BayesFactor updated to version 0.9.11-1

2015-03-23T07:58:00.001-07:00

The BayesFactor package has been updated to version 0.9.11-1. The changes are:

CHANGES IN BayesFactor VERSION 0.9.11-1

CHANGES
* Fixed memory bug causing importance sampling to fail.

CHANGES IN BayesFactor VERSION 0.9.11

CHANGES
* Added support for prior/posterior odds and probabilities. See the new vignette for details.
* Added approximation for t test in case of large t
* Made some error messages clearer
* Use callbacks at least once in all cases
* Fix bug preventing continuous interactions from showing in regression Gibbs sampler
* Removed unexported function oneWayAOV.Gibbs(), and related C functions, due to redundancy
* gMap from model.matrix is now 0-indexed vector (for compatibility with C functions)
* Substantial changes to backend, to Rcpp and RcppEigen for speed
* Removed redundant struc argument from nWayAOV (use gMap instead)
