## Monday, March 9, 2015

### The frequentist case against the significance test, part 1

It is unfortunate that today, we tend to think about statistical theory in terms of Bayesianism vs frequentism. Modern practice is a blend of Fisher's and Neyman's ideas, with the characteristics of the blend driven by convenience rather than principle. Significance tests are lumped in as a “frequentist” technique by Bayesians in an unfortunate rhetorical shorthand.

In recent years, the significance test has been critiqued on several grounds, but these critiques are often made from a Bayesian or pragmatic standpoint. In this two-part post, I will outline the frequentist case against the null hypothesis significance test developed by Jerzy Neyman.

Neyman deployed two frequentist arguments against significance tests; the first is philosophical and the second is statistical:

1. The significance test fails as an epistemic procedure. There is no relationship between the $p$ value and rational belief. More broadly, the goal of statistical inference is tests with good error properties, not beliefs.
2. The significance test fails as a test. The lack of an alternative means that a significance test can yield arbitrary results.

In this post I will describe the significance test and outline Neyman's first, philosophical objection to it. In part 2, I will develop Neyman's statistical objection.

### Significance testing

Suppose that a company is developing a drug for depression. We randomly assign participants to a placebo control group or an experimental group, and then measure the change over time on a depression inventory. (For simplicity, we will consider two-sided tests, but the points generalize to one-sided tests as well.)

If the drug had no effect at all, then we would expect the difference between the two groups to be 0 on average. However, depression scores vary due to factors other than the drug, so we can't simply take an observed difference between the two conditions as evidence that the drug has an effect. We always expect some observed difference.

We therefore need to take into account the variability we expect, even if there is no effect, when assessing the effect of the drug. The most common way of doing this is with a significance test; in our example, a $t$ test would typically be used. The logic goes like this:

1. Develop a null hypothesis that is to be (possibly) rejected (e.g., the drug has no effect).
2. Collect data and compute a test statistic $T$ (e.g., a $t$ statistic) whose probability distribution is known, assuming that the null hypothesis is true.
3. Compute the probability $p$ of obtaining a test statistic as extreme as or more extreme than the one observed, assuming the null hypothesis is true.
4. Interpret $p$ as a measure of evidence against the null hypothesis; the lower $p$ is, the less we should believe the null hypothesis.

Or, as Neyman (1952) described the logic:

> When an “improbable sample” was obtained, the usual way of reasoning was this: “Were the hypothesis $H$ true, then the probability of getting a value of [test statistic] $T$ as or more improbable than that actually observed would be (e.g.) $p = 0.00001$. It follows that if the hypothesis $H$ be true, what we actually observed would be a miracle. We don't believe in miracles nowadays and therefore we do not believe in $H$ being true.”

Sometimes Step 4 will be accompanied by a decision to reject the null hypothesis, but what is important to us now is that the $p$ value supposedly gives us reason to disbelieve the null hypothesis.
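
As a rough illustration of the four steps above, here is a minimal Python sketch. It swaps the $t$ test for a permutation test so that only the standard library is needed. The change scores, the function name `permutation_p_value`, and its parameters are all made up for illustration; nothing here comes from a real trial.

```python
import random


def permutation_p_value(control, treated, n_permutations=10_000, seed=0):
    """Two-sided permutation p value for a difference in group means.

    Hypothetical helper: a permutation test stands in for the t test.
    """
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    # Step 2: the test statistic is the observed difference in means.
    observed = mean(treated) - mean(control)

    pooled = list(control) + list(treated)
    n_control = len(control)
    at_least_as_extreme = 0

    # Step 3: under the null hypothesis the group labels are arbitrary,
    # so reshuffling them shows which differences chance alone produces.
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1

    # Add-one correction keeps the Monte Carlo estimate above zero.
    p = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p


# Made-up change scores on a depression inventory (negative = improvement).
control = [-1.0, 0.5, -2.0, 1.0, -0.5, 0.0, -1.5, 0.5]
treated = [-3.0, -2.5, -4.0, -1.0, -3.5, -2.0, -4.5, -1.5]

# Step 1: the null hypothesis is that the drug has no effect.
observed, p = permutation_p_value(control, treated)
print(f"observed difference in means: {observed:.3f}")
print(f"two-sided permutation p value: {p:.4f}")
```

Step 4, the interpretation of `p` as grounds for disbelief, is the part that no code can supply, and it is the step Neyman objected to.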

### Preliminaries: deductive validity

The significance test is often taught as a probabilistic version of the following argument:

- If the theory is true, $P$ will not occur.
- $P$ did occur.
- Therefore, the theory is false.

This is a deductively valid argument; that is, if the premises are true, then the conclusion necessarily follows. The use of such a deductive argument forms the basis of the falsificationist model of the scientific process. Given the intuitive nature of falsificationist logic, it is not surprising that a similar logic is often used to describe the significance test:

- If the null hypothesis were true, we would probably not observe a “small” $p$ value.
- We observed a “small” $p$ value.
- Therefore, the null hypothesis is probably not true.

Although the above argument seems parallel, it does not share the deductive validity of the non-probabilistic version. We can easily see its deductive invalidity by adding a new premise that does not contradict the other premises, yet contradicts the conclusion:

- The null hypothesis is certainly true.

Obviously, this does not contradict the premise about what we would expect if the null hypothesis were true; neither does it contradict the premise stating our observation. It does, however, contradict our conclusion that the null hypothesis is improbable. With a deductively valid argument, any time the premises are true the conclusion must also be true. Since here the premises can all be true while the conclusion is false, the familiar significance testing argument cannot be deductively valid.
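
To make the counterexample concrete, the following sketch simulates many experiments in which the null hypothesis is true by construction. It uses a $z$ test with known unit variance rather than a $t$ test, purely to stay within Python's standard library; the function name, sample sizes, and seed are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean


def small_p_rate_under_true_null(n_experiments=5000, n_per_group=20,
                                 alpha=0.05, seed=1):
    """Fraction of simulated null experiments that give p < alpha.

    Hypothetical helper: both groups come from the same N(0, 1)
    distribution, so the null hypothesis is true in every experiment.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard deviation of a difference in means when both groups
    # have known unit variance (assumed for this sketch).
    se = (2 / n_per_group) ** 0.5

    small = 0
    for _ in range(n_experiments):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        # Two-sided p value from the standard normal distribution.
        p = 2 * (1 - std_normal.cdf(abs(z)))
        if p < alpha:
            small += 1
    return small / n_experiments


rate = small_p_rate_under_true_null()
print(f"share of true-null experiments with p < 0.05: {rate:.3f}")
```

Roughly a fraction `alpha` of the simulated experiments yield a “small” $p$ value, and in every one of them the null hypothesis is true, so both premises of the probabilistic argument hold while its conclusion fails.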

That the logic of the significance test is not deductively valid is not news; if it were, Fisher and Neyman would certainly have made use of that fact, since deductive logic plays a major role in both Fisher's and Neyman's theories. This isn't necessarily a problem, however, if another justification for the logic can be found. In fact, Fisher justified the significance test as an example of inductive logic. This is the logic to which Neyman would object.

### Neyman's first objection: Epistemology

Central to Fisher's view of the significance test is the idea that the $p$ value is a measure of inductive evidence against the null hypothesis:

> “Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength that the evidence has reached, or to ignore the fact that with further trial it might come to be stronger, or weaker.” (Fisher, 1971, The Design of Experiments)

The evidence from the $p$ value was supposed by Fisher to be a rational measure of the disbelief someone should have regarding the null hypothesis:

> “…the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to and verifiable by other rational minds. The level of significance in such cases fulfills the conditions of a measure of the rational grounds for the disbelief [in the null hypothesis] it engenders.” (Fisher, 1959, Statistical Methods and Scientific Inference)

This will resonate with almost everyone who has taken an introductory statistics course, and certainly with scientists who use $p$ values. Scientists use $p$ values to call into doubt a particular hypothesis. The concept of statistics as an epistemic endeavour (that is, that statistics is, in some sense, about what it is reasonable to believe) was central to Fisher and the Bayesians. Neyman was philosophically aligned against this idea.

In a fantastic 1957 paper entitled “'Inductive Behavior' as a Basic Concept of Philosophy of Science”, Neyman outlined the philosophy underlying his frequentist theory of inference. I believe this paper should be required reading for all users of statistics; it is clear, non-technical, and raises important points that they should think about. Historically, it is a very important paper because in it we read Neyman's reaction to three important viewpoints developed in the previous decades: Fisher's, Jeffreys' objective Bayesianism, and de Finetti's subjective Bayesianism.

Neyman, with characteristic clarity, identified the major problem with Fisher's viewpoint (and, incidentally, with Jeffreys' Bayesian viewpoint): How much doubt should a particular $p$ value yield? This is not a question that Fisher ever answered, nor one he could have answered. As Neyman says,

> “[I]f a scientist inquires why should he reject or accept hypotheses in accordance with the calculated values of $p$ the unequivocal answer is: because these values of $p$ are the ultimate measures of beliefs especially designed for the scientist to adjust his attitudes to. If one inquires why should one use the normative formulae of one school rather than those of some other, one becomes involved in a fruitless argument.”

Fisher never formalized the connection between the $p$ value and rational “disbelief.” Every scientist is assumed to have a “feeling” of disbelief from the $p$ value. Whose feeling is the rational one? And why use $p$, and not, say, $\sqrt{1-p^4}$? Fisher's entire argument is built on intuition. Neyman continues:

> It must be obvious that [inductive inference's] use as a basic principle underlying research is unsatisfactory. The beliefs of particular scientists are a very personal matter and it is useless to attempt to norm them by any dogmatic formula.

That is, rational belief is not the target of statistical analysis. Neyman took this idea to an extreme. He went so far as to deny that interpreting the results of a statistical test involves beliefs, reasoning, knowledge, or even any conclusions at all (see, for instance, Neyman, 1941)! What mattered to Neyman was setting up tests that had good long-run error properties, and acting according to a plan derived on the basis of these tests, not arriving at a “rational belief.”

Fisher found Neyman's viewpoint completely alien, and I suspect most scientists today would as well. In my experience, scientists agree more with Fisher's response to Neyman's philosophy:

> In fact, scientific research…is an attempt to improve public knowledge undertaken as an act of faith to the effect that, as more becomes known, or surely known, the intelligent pursuit of a great variety of aims, by a great variety of men, or groups of men, will be facilitated. (Fisher, 1956)

Although one might agree or disagree with Neyman's take on the philosophy of science, he was correct in his critique of Fisher: Fisher failed to provide any link between $p$ values and rational (dis)belief, and no such link exists. The epistemic use of the significance test, championed by Fisher and deployed by scientists all over the world for decades, has no foundation.

### Conclusion, and on to part 2

The significance test is not deductively valid, in spite of its being sometimes taught as having a falsificationist foundation. Fisher justified the use of the $p$ value as an epistemic statistic and as an example of inductive inference. Neyman points out that there is simply no foundation for the “rational” feelings Fisher associates with the $p$ value, and emphasizes the frequentist view that beliefs aren't the target of inference; rather, tests with good long-run error properties are.

In part 2, we will explore Neyman's second argument against the logic of the significance test: it fails to consider what makes a “good” frequentist test, and actually can lead to tests that produce arbitrary results.