BayesFactor: Software for Bayesian inference: The TES Challenge to Greg Francis

Sunday, March 29, 2015

The TES Challenge to Greg Francis

This post is a follow-up to my previous post, “Statistical alchemy and the 'test for excess significance'”. In the comments on that post, Greg Francis objected to my points about the Test for Excess Significance (TES). I laid out a challenge in which I would use simulation to demonstrate these points. Greg Francis agreed to the details; this post is about the results of the simulations (with links to the code, etc.).

A challenge

In my previous post, I said this:

Morey: “…we have a bit of a mystery. That $E$ [the expected number of non-significant studies in a set of $n$ studies] equals the sum of the expected [Type II error] probabilities is merely asserted [by Ioannidis and Trikalinos]. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false.”

Greg Francis replied:

Francis: “…none of your examples of the falseness of the equation are valid because you fix the number of studies to be n, which is inconsistent with your proposed study generation process. Your study generation process works if you let n vary, but then the Ioannidis & Trikalinos formula is shown to be correct…[i]n short, you present impossible sampling procedures and then complain that the formula proposed by Ioannidis & Trikalinos does not handle your impossible situations.”

To which I replied,

Morey: “If you don’t believe me, here’s a challenge: you pick a power and a random seed. I will simulate a very large ‘literature’ according to the ‘experimenter behaviour’ of my choice, importantly with no publication bias or other selection of studies. I will guarantee that I will use a behaviour that will generate experiment set sizes of 5. I will save the code and the ‘literature’ coded in terms of ‘sets’ of studies and how many significant and nonsignificant studies there are. You get to guess what the average number of significant studies are in sets of 5 via I&T’s model, along with a 95% CI (I’ll tell you the total number of such studies). That is, we’re just using Monte Carlo to estimate the expected number of significant studies in sets of experiments n=5; that is, precisely what I&T use as the basis of their model (for the special case of n=5).” “This will answer the question of ‘what is the expected number of nonsignificant studies in a set of n?’”

This challenge will very clearly show that my situations are not “impossible”. I can sample them in a very simple simulation. Greg Francis agreed to the simulation:

Francis: “Clearly at least one of us is confused. Maybe we can sort it out by trying your challenge. Power=0.5, random seed= 19374013”

I further clarified:

Morey: “Before I do this, though, I want to make sure that we agree on what this will show. I want to show that the expected number of nonsignificant studies in a set of n (=5) studies is not what I&T say it is, and hence, the reasoning behind the test is flawed (because ‘excess significance’ is defined as deviation from this expected number). I also want to be clear what the prediction is here: Since the power of the test is .5, according to I&T, the expected number of nonsignificant studies in a set of 5 is 2.5. Agreed?”

…to which Greg Francis agreed.

I have performed this simulation. Before reading on, you should read the web page containing the results:

Web page (with code) outlining the results: http://learnbayes.org/talks/TES/TESsimulation.html
Source .Rmd file: http://learnbayes.org/talks/TES/TESsimulation.Rmd

The table below shows the results of simulating 1,000,000 “sets” of studies. All simulated “studies” are published in this simulation, and no questionable research practices are involved. The first column shows $n$, and the second column shows the average number of non-significant studies for sets of size $n$, which is a Monte Carlo estimate of I&T's $E$. As you can see, for $n=5$ it is 1, not 2.5.

Total studies (n)	Mean nonsig. studies	Expected by TES (E)	Count
1	1	0.5	499917
2	1	1.0	249690
3	1	1.5	125269
4	1	2.0	62570
5	1	2.5	31309
6	1	3.0	15640
7	1	3.5	7718
8	1	4.0	3958
9	1	4.5	1986
10	1	5.0	975

(I have truncated the table at $n=10$; see the HTML file for the full table.)
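The pattern in the table can be reproduced with a minimal Python sketch. It assumes one particular experimenter behaviour: keep running studies until the first non-significant result, then stop and report every study in the set. That rule is an assumption suggested by the table and the discussion in the comments; this is not the R code linked above, the function name `simulate_stop_at_first_nonsig` is made up for illustration, and Python's random number generator will not reproduce the exact counts even with the same seed.

```python
import random
from collections import defaultdict


def simulate_stop_at_first_nonsig(n_sets, power, seed):
    """Simulate sets of studies that end at the first non-significant result.

    Assumed behaviour (hypothetical, for illustration): every study is
    reported, and a set ends as soon as one study is non-significant.
    """
    rng = random.Random(seed)
    counts = defaultdict(int)         # set size n -> number of sets of that size
    nonsig_totals = defaultdict(int)  # set size n -> total non-significant studies
    for _ in range(n_sets):
        n = 0
        nonsig = 0
        while True:
            n += 1
            # A study is significant with probability `power`.
            if rng.random() >= power:
                nonsig += 1
                break
        counts[n] += 1
        nonsig_totals[n] += nonsig
    return counts, nonsig_totals


counts, nonsig_totals = simulate_stop_at_first_nonsig(100_000, 0.5, 19374013)
total_studies = sum(n * c for n, c in counts.items())
total_nonsig = sum(nonsig_totals.values())

# Columns: n, mean non-significant studies, I&T expectation n * (1 - power), count
for n in sorted(counts)[:10]:
    print(n, nonsig_totals[n] / counts[n], n * 0.5, counts[n])

print("overall proportion non-significant:", total_nonsig / total_studies)
```

Under this assumed rule, every set contains exactly one non-significant study regardless of its size, while the overall proportion of non-significant studies still comes out close to 0.5.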

I also showed that you can change the experimenter's behaviour and make the expected number 2.5. This indicates that the assumptions one makes about experimenter behaviour matter to the expected number of non-significant studies in a particular set. Across all sets of studies, the expected proportion of significant studies is equal to the power. However, how this is distributed across sets of different sizes is a function of the decision rule.
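As a contrast, here is a minimal Python sketch of one behaviour under which sets of 5 do average 2.5 non-significant studies: the experimenter commits to exactly 5 studies in advance. This fixed-$n$ rule is an assumption chosen for illustration and is not necessarily the alternative behaviour used in the linked simulation; `mean_nonsig_fixed_n` is a hypothetical helper name.

```python
import random


def mean_nonsig_fixed_n(n_sets, n, power, seed):
    """Average number of non-significant studies when every set has exactly n studies."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sets):
        # Each study is non-significant with probability 1 - power.
        total += sum(1 for _ in range(n) if rng.random() >= power)
    return total / n_sets


fixed_mean = mean_nonsig_fixed_n(100_000, 5, 0.5, 19374013)
print("mean non-significant studies in sets of 5:", fixed_mean)
```

With the set size fixed in advance, the number of non-significant studies is binomial, so its mean is $5 \times 0.5 = 2.5$ and the estimate should land close to that value.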

The expression for the expected number of non-significant studies in a set of $n$ is not correct (without further very strong, unwarranted assumptions).
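
To make the gap between the two expectations concrete, consider a sketch under one assumed decision rule: studies are run until the first non-significant result, and all of them are reported. The notation is introduced here for illustration and is not taken from I&T: $\beta$ is the Type II error probability of each study, $N$ is the (random) set size, and $K$ is the number of non-significant studies in the set. Then

$$P(N = n) = (1-\beta)^{n-1}\beta, \qquad E[K \mid N = n] = 1 \quad \text{for every } n,$$

which equals $n\beta$ only when $n = 1/\beta$. Marginally, however,

$$\frac{E[K]}{E[N]} = \frac{1}{1/\beta} = \beta,$$

so the overall proportion of non-significant studies still matches the Type II error rate. With $\beta = 0.5$ this gives $E[N] = 2$ and a conditional expectation of 1 rather than 2.5 for $n = 5$.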

39 comments:

Anonymous, March 29, 2015 at 8:10 AM
The only assumption that is needed is that the set of studies is a representative set of studies with 50% power. Assume that there is an urn with 100 studies and an experimenter draws at random from this urn. The experimenter is expected to draw 50% significant results and 50% non-significant results. That is a standard assumption in statistics and not "a very strong, unwarranted assumption".

Richard Morey, March 29, 2015 at 8:16 AM
Yes, but you're not understanding the point. The point is that the test assumes that n is fixed. That's why we agreed in the simulation that n=5. In your example, you don't specify a number of balls drawn from the urn. You're marginalizing over n, which is NOT what the TES does. No one is disputing that half the studies are expected to be significant with power = .5; the simulations clearly show that, and it can be easily proved. What is being disputed, if you read all the material, is that half the studies will be expected to be significant *for every n*.

Greg Francis, March 29, 2015 at 8:54 AM
In your description of the challenge, you left out a crucial condition of my agreement. I said, "As long as your procedure for producing studies reports all the studies that are relevant for a theoretical claim and do not use some kind of questionable research practice, then I think we are in agreement."

So what does it mean for your procedure to report "all studies that are relevant for a theoretical claim"? Well, you ran 2,000,661 simulated studies, so that's what would have to be reported with regard to a (hypothetical) theoretical claim being made about this (hypothetical) effect (or set of effects). If all 2,000,661 studies are reported, then we find that the proportion of significant and non-significant studies is 0.500165195 and 0.499834805, respectively. This matches the 0.5 power as specified by the I&T analysis (nothing special here, it's just what we mean by power).

If someone takes a subset of these experiments, say, one of the sets with 7 experiments (6 significant and 1 non-significant) and uses them as support for a theoretical claim about some effect, then that person is ignoring the thousands (or millions) of other studies (half of which are non-significant) that are relevant to the theoretical claim. That person is cherry-picking study results and thereby is making a poor scientific argument for their theoretical claim (indeed, their 7 experiments will overestimate the effect size). A TES analysis (based on the true power of 0.5) would properly report that we should be skeptical about the presented relationship between the theoretical claim and these 7 experiments because, in isolation, they appear too good to be true. If these 7 experiments were combined with the thousands of other experiments, then the scientist could (probably) make a convincing case for the theoretical claim. A TES analysis of the full set would not report any problem.

To summarize, if scientists are working in an environment where there are many investigations of an effect, then their theoretical claims have to reflect that environment. They should not isolate their findings from the larger set of studies. Such isolation is essentially "cherry-picking" of results, and the TES analysis will sometimes pick up on this kind of bias.

Just to be clear, this kind of cherry-picking of results does not imply malicious intent to deceive. It can be very difficult for a scientist to judge whether studies in another lab, with different stimuli, or different subjects really investigate the same topic. I can easily see how a scientist might mistakenly convince themselves that their set of 7 experiments is a good set for their theory while other experiments do not apply. This mistake is similar to model overfitting, and a TES analysis can raise a flag to warn a scientist that they might be engaging in something like that.

Richard Morey, March 29, 2015 at 9:49 AM
Please stick with the question the simulation was meant to answer. The only reason there are millions of "studies" is that these are *theoretical replications* meant to estimate the expected value through Monte Carlo simulation. Even if I had done the simulation once, the expected value remains the same.

So here's the question that I put to you, and that you agreed to: What is the expected number of nonsignificant studies in a set of 5, in a situation with no QRPs and all studies published? Is it necessarily what I&T said? Yes or no?

We can move on to other issues when we've addressed the question the simulation was meant to answer.

Richard Morey, March 29, 2015 at 11:35 AM
What you are saying is clearly inconsistent with I&T's "derivation" of the test. n is the number of "already published" studies. It is therefore known and fixed, but of arbitrary value. Hence, I&T's test assumes that for *each n* the expected value is n(1-beta) (if every study has the same beta). This is just straight from I&T's text.

Why would you agree to a simulation where the question is about n=5? Do "all studies that are relevant to a theoretical claim" exist in set sizes 5? What were you thinking the simulation would show?

[previous post deleted for copy-paste error....]

Richard Morey, March 29, 2015 at 11:41 AM
...and by the same token, why are you saying "if you run your simulation one time, then you will most likely not produce a set of 5 studies", implying that the possible outcomes from the experiment are something other than a set of 5 studies, yet when you compute the p values for the TES, you do not include any other ns as possible outcomes?

Greg Francis, March 29, 2015 at 2:16 PM
I said it because it is obviously true for your experiment set generating process. When I compute Ptes, I use the set that corresponds to what was reported by the authors. When they say that the outcomes from 5 experiments provide support for their theoretical claim, I take that for what it is. I have never seen a case where the authors say that the outcomes from 3 experiments would have been sufficient but they go ahead and unnecessarily report 2 more experiments. Maybe the TES would not apply in such a case; I would have to think about it.

Your experiment set generating process seems fine if all studies are included when considering theoretical claims, and the TES will not find fault with such a set of studies. But you only get the properties you wanted if you do not include a large number of studies (whether published or not) when considering theoretical claims. That's publication bias, and the TES will (sometimes) find fault with such a set of studies.

Anonymous, March 29, 2015 at 3:29 PM
If your method works for power, how about gambling? You go to a casino and keep playing while you are winning and stop after you lose (50% red, 50% black). Some nights you break even (1 win, 1 loss); all other nights you win more than you lose (2:1, 3:1, etc.). I think you just found a sure way to beat the odds and get rich. Go and try it.

Anonymous, March 30, 2015 at 12:54 PM
If you go to a casino and play 5 rounds a night for a year (50% black, 50% red), what is your expected number of wins on an average night? Do you really think it is 4? No, it is 2.5 wins on average. Your sets of 5 studies are not representative sets of 5 studies. They are created by a sampling strategy that leads to a systematic bias for all set sizes. It simply does not follow that the TES is useless because it predicts the expected number of successes in a representative sample of studies (i.e., go to the casino and play 5 games).

Richard Morey, March 30, 2015 at 1:34 PM
"With regard to the theoretical claim about some effect that is being measured by the experiments, your full set of experiments is not n=5 but much larger." No, it isn't; there is no set of experiments! The Monte Carlo simulation was designed for one purpose: to estimate the conditional expectation. This is trivial and could have been done without sampling. Expectation does not require any "full set" of experiments. Suppose I had instead assumed a distribution on n, sampled that, and then sampled significant "studies" based on a binomial model. If it took me 100 studies to sample n=5, would that have any bearing at all on my estimation of the conditional expectation? No, of course not. This is a silly argument.

"If the authors making a theoretical claim say that there were n=5 experiments, then that is what is used in the TES analysis." Yes! This is correct! And the question is, in those 5 experiments, what is the expected number of nonsignificant studies? The answer: it depends! This is a fact.

I really wish you'd interface with the *statistical* argument I'm making. Talking about "replication", "theoretical claims", "pool[ing] together", "relevant studies", is just obfuscating the issue. I'm talking about the properties of a statistical model in the abstract. We're modelling studies as Bernoulli random variables, and "sets" of studies are just sequences of Bernoulli random variables. There's nothing about "theoretical claims" here.

I can't see this conversation continuing unless you choose to discuss the statistical argument in precise, statistical terms. Prove the necessary conditions under which I&T's expression will be applicable. Note that n is random but observed, but of arbitrary value, and therefore the expectation they describe is a conditional expectation. Use a statistical argument, not vague ideas like "theoretical claims" and "relevant studies". These are all irrelevant to the abstract argument at hand; bringing them up simply obfuscates things.

As I noted above, I think that requirement has something to do with exchangeability: that is, that the ordering of the experiments is arbitrary (perhaps in the case of different powers, conditional exchangeability, given a population of true powers). This would make sense; the binomial distribution arises when the underlying Bernoulli RVs are exchangeable, hence any ordering is equally likely. But you can't address my argument without describing the conditions under which I&T's expression is applicable. That's the whole question.

Also, if you could define "tak[ing] it for what it is" for me that would be great. I note that this was in response to a statistical question of mine, so I assume it has a statistical meaning.

Greg Francis, March 30, 2015 at 2:02 PM
I agree that we seem to be at an impasse again. You seem to think that conditional expectation is what the TES deals with, and I say that it does not. You show a method for producing n=5 studies with certain properties, and when I show that the method is biased for the purpose scientists would use it for, you say that is irrelevant.

I think we are basically back where we started. That's a shame because I think we both put in an honest effort to explain our views and try to understand the other person's argument. I am puzzled that my explanation does not clarify things for you, and I suspect you feel the same.

I do not think this is a situation where simply agreeing to disagree is an appropriate end point, but I think further discussion along the present lines is not going to be fruitful. Maybe someone else can join in with a new characterization of the issues.

Anonymous, March 30, 2015 at 2:26 PM
Final post:

"Prove the necessary conditions under which I&T's expression will be applicable."

I am not a statistician, so excuse me if I am not using your language, but I understand the basic concepts of statistics well enough.

In my words, any time the set of studies is a random sample of studies from the population of studies with a given power (say 50%), the expected value of significant results is the number of studies multiplied by power (say 10 studies, expected value = 10 * .5 = 5).

This is similar to a coin flip. If a coin flip has a 50% chance to show heads or tails and you flip a coin 10 times, you are expected to get 5 heads and 5 tails.

Now I am aware that you understand this, but you claim that this has nothing to do with the TES and that I am missing your point. So, I agree with you that further discussion is pointless because you don't seem to think that this logic applies to the TES and I don't understand why your simulation is supposed to show a major flaw of the TES.

I think you need to make your argument clearer if you want to make a case that the TES is misleading. Ultimately, the success of the TES or other methods will rest on the ability to make correct predictions about actual experimental outcomes in the real world.
