## Thursday, January 7, 2016

### Averaging can produce misleading standardized effect sizes

Recently, there have been many calls for a focus on effect sizes in psychological research. In this post, I discuss how naively using standardized effect sizes with averaged data can be misleading. This is particularly problematic for meta-analysis, where differences in number of trials across studies could lead to very misleading results.

There are two main types of effect sizes in typical use: raw effect sizes and standardized effect sizes. Raw effect sizes are what you typically see in a plot: for instance, the effect of a priming manipulation might be 30ms. The advantage of raw effect sizes is that they are closer to the process of interest and more interpretable. We all know what it means for something to take 30ms to happen.

Another kind of effect size is the standardized effect size. With a standardized effect size, the raw effect is compared to some measure of variability in the population. For instance, if the standard deviation of children’s heights at age 10 were 3 inches, and a “good” diet had an effect of 1.5 inches on average, we could say that the effect of the diet was .5 standard deviations. This is the logic of Cohen’s $d$, for instance. The disadvantage of this is that it is more difficult to understand what an effect of “half a standard deviation” means (and the variance-accounted-for statistics such as $\eta^2$ and $\omega^2$ are even more difficult to interpret); but standardized effect sizes have many good properties, including a close relationship to the concept of statistical power, comparability across paradigms, and the fact that they can often be computed from reported statistics such as $t$ and $F$.
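The diet example above can be written out as a one-line computation; the numbers are the hypothetical ones from the text.

```python
# Cohen's d for the hypothetical diet example: raw effect divided by
# the population standard deviation (numbers taken from the text).
raw_effect_inches = 1.5   # average effect of the "good" diet
population_sd = 3.0       # SD of children's heights at age 10

cohens_d = raw_effect_inches / population_sd
print(cohens_d)  # 0.5 standard deviations
```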

For these reasons, standardized effect sizes are very common in meta-analysis. However, the common practice of averaging over trials in cognitive psychology makes them difficult to compare or even interpret.

Consider a typical cognitive experiment with 30 participants, each performing 10 response time trials in each of two conditions. Typically, each participant’s data will be averaged to form a single, average response time in each condition; these average RTs are then submitted to a repeated measures ANOVA (in fact, some R packages, such as afex, do this automatically).
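The post does not share its simulation code, but the design it describes can be sketched in a few lines. This is a minimal Python sketch under assumed, purely illustrative parameters (the participant and trial SDs and the 30ms effect are my choices, not the post’s actual values); the point is only the averaging step at the end.

```python
import numpy as np

rng = np.random.default_rng(1)

n_subjects, n_trials = 30, 10
cond_effect = 0.03             # hypothetical 30 ms effect, in seconds
subj_sd, trial_sd = 0.1, 0.2   # illustrative variability parameters

# Per-participant intercepts, plus trial-level noise in each condition.
subj_means = 0.7 + rng.normal(0, subj_sd, n_subjects)
cond_a = subj_means[:, None] + rng.normal(0, trial_sd, (n_subjects, n_trials))
cond_b = subj_means[:, None] + cond_effect + rng.normal(0, trial_sd, (n_subjects, n_trials))

# The usual averaging step: one mean RT per participant per condition.
avg_a, avg_b = cond_a.mean(axis=1), cond_b.mean(axis=1)
print(avg_a.shape)  # (30,) -- these means are what enter the ANOVA
```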

Hypothetical data are shown in the plot below. These data represent two hypothetical experiments, one with 10 samples per participant and one with 50.

Error bars represent Morey (2008) adjusted within-subject error bars, computed using this code.

I have generated the data so that everything is the same across the two experiments except the scale of the “error”: the scale of the error in experiment 2 is $1/\sqrt{5}$ times that of experiment 1, because each averaged “observation” represents five times more data. The raw effect size is precisely the same, but our certainty about the effect size is greater in experiment 2. This is exactly as it should be.
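The $1/\sqrt{5}$ scaling follows directly from the standard error of a mean: averaging $n$ trials shrinks the trial-level noise by $1/\sqrt{n}$, so going from 10 to 50 trials shrinks it by $\sqrt{10}/\sqrt{50} = 1/\sqrt{5}$. A quick check (the 0.2 trial SD is just an illustrative value):

```python
import numpy as np

trial_sd = 0.2  # illustrative trial-level SD

# Standard error of a participant's condition mean with 10 vs. 50 trials:
se_10 = trial_sd / np.sqrt(10)
se_50 = trial_sd / np.sqrt(50)
print(se_50 / se_10)  # 1/sqrt(5), about 0.447
```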

We can now perform the typical repeated measures ANOVAs on these two data sets using the afex package. The package will, if requested, compute the common partial $\eta^2$ standardized effect size statistic.

The results of the ANOVA for the “condition” in experiment 1 are:

| Effect    | df    | MSE  | F    | pes | p.value |
|-----------|-------|------|------|-----|---------|
| condition | 1, 29 | 0.08 | 7.72 | .21 | .009    |

And the results of the ANOVA for the “condition” in experiment 2 are:

| Effect    | df    | MSE  | F     | pes | p.value |
|-----------|-------|------|-------|-----|---------|
| condition | 1, 29 | 0.02 | 38.58 | .57 | <.0001  |

This would not be unexpected by anyone who routinely uses repeated measures ANOVA. Typically, the whole point of running more trials is to get more power: we perform more trials, and we obtain a higher $F$ value for the comparison of interest.

Notice that the sum of squares for the effect is precisely the same for both experiments. That’s because the raw effect is precisely the same. What is driving the higher $F$ value is the lower residual mean square (MSE) for the comparison, which is about 5 times smaller in Experiment 2. Again, this is what we expect. More trials, less “noise”.

But notice what happens to partial $\eta^2$. Because the MSE is smaller, the proportion of variance accounted for by the condition effect is larger. This drives the partial $\eta^2$ from .21 in experiment 1 to .57 in experiment 2. Researchers have previously warned about using partial $\eta^2$ for comparisons across designs (see for instance, Olejnik & Algina, 2003), but these two experiments appear to have the same design; at least, from the perspective of someone used to only analysing averaged data, they do.
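For a one-degree-of-freedom effect, partial $\eta^2$ can be recovered from the reported $F$ and error degrees of freedom via $\eta^2_p = F \, df_1 / (F \, df_1 + df_2)$, which makes the dependence on the shrinking MSE explicit. Plugging in the two ANOVA tables above:

```python
# Partial eta^2 recovered from a reported F statistic and its degrees
# of freedom: pes = (F * df1) / (F * df1 + df2).
def partial_eta_sq(F, df1, df2):
    return (F * df1) / (F * df1 + df2)

# Values from the two ANOVA tables in the post:
print(round(partial_eta_sq(7.72, 1, 29), 2))   # 0.21 (experiment 1)
print(round(partial_eta_sq(38.58, 1, 29), 2))  # 0.57 (experiment 2)
```

Same raw effect, same design on paper; only the number of trials behind each averaged “observation” differs.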

This has the potential to wreak havoc on meta-analyses. Suppose someone combs the literature looking for $F$ values and computing partial $\eta^2$ values from the $F$ values (or, alternatively, Cohen’s $d$ from $t$ values). Assume experiment 1 represents a patient group; due to time constraints, the patients only had time for 10 trials per condition. Suppose experiment 2, on the other hand, represents a group of college students, who had time for more trials. The figure below shows the standardized effects in the two experiments.

Although the raw effect size is precisely the same across the two experiments, the standardized effect size is radically different, possibly leading to erroneous conclusions. Even if there are no systematic differences in the number of trials across experiments with different kinds of groups, this introduces a new source of variability into estimates, as well as making it nearly impossible to interpret the effect size. What is the “true” standardized effect size? It seems difficult to say. How can we solve this problem?

### Solution 1: Generalized $\omega^2$

One option is generalized $\omega^2$ (see, for instance, Olejnik & Algina, 2003). Instead of using the residual variance to standardize against, generalized $\omega^2$ standardizes against all measured (as opposed to manipulated) sources of variability. For instance, the variability across participants is a measured source of variability. These sources of variability are assumed to be stable properties of populations and not affected by mere design choices. We can compute generalized $\omega^2$, again using the afex package, which yields $\omega^2_g=0.011$ for experiment 1 and $\omega^2_g=0.012$ for experiment 2; notice that these are very similar. The effect “looks” smaller because participants vary quite a bit relative to the size of the effect.

One problem with this approach is that the statistics necessary to compute generalized $\omega^2$ are not typically reported, meaning that this solution is useless for meta-analysis of existing literature. Ideally, we’d like a way to use reported statistics to at least compare across studies, when designs are similar enough.

### Solution 2: Adjust the MSE

If we know the relative numbers of trials across two studies that have the same basic design, we should be able to “adjust” the MSE in the formula for the effect size (whichever effect size it happens to be) for the number of trials. In our example, experiment 2 has five times as many trials as experiment 1; we therefore would expect the MSE of experiment 2 to be one-fifth as large as that for experiment 1. To make the effect size computed from experiment 2 comparable to that from experiment 1, we can multiply its MSE by 5 before applying the formula for the effect size of interest. For partial $\eta^2$, this leads to an easy adjustment:
$\eta^2_2 = \frac{1}{c/\eta^2_1 - c + 1}$ where $c$ is the adjustment factor, $\eta^2_1$ is the original partial $\eta^2$, and $\eta^2_2$ is the adjusted partial $\eta^2$.

As an example, take our experiment 2, which had a partial $\eta^2$ of 0.571 and a sample size 5 times larger than experiment 1. Applying the formula above yields
$\begin{eqnarray*} \eta^2_2 &=& \frac{1}{5/0.571 - 5 + 1}\\ &\approx&0.21 \end{eqnarray*}$ which matches the partial $\eta^2$ from experiment 1 very well.
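The adjustment formula above is simple enough to implement directly; the check below reproduces the worked example using the post’s numbers.

```python
# Adjust partial eta^2 for a factor-of-c difference in trials per cell,
# using the formula from the post: 1 / (c / pes - c + 1).
def adjust_partial_eta_sq(pes, c):
    return 1.0 / (c / pes - c + 1.0)

# Experiment 2 (pes = 0.571) had 5x the trials of experiment 1:
print(round(adjust_partial_eta_sq(0.571, 5), 2))  # 0.21, matching experiment 1
```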

The problem with this approach is that it uses experiment 1 as a “reference” experiment. It is therefore not clear what the standardized effect size means in this case, except as a way to compare across experiments with similar designs. This may be enough for someone performing a meta-analysis — particularly if they can’t obtain the statistics to compute generalized $\omega^2$ — but as a general reporting solution, it is unsatisfactory.

### Wrap up

Although standardized effect sizes have been advocated as a general tool for science and are increasingly reported, they are difficult to interpret because they are affected by trivial, common design decisions. The issues I raise here should be of interest to anyone working with standardized effect sizes, particularly those performing meta-analysis. They affect repeated measures designs with averaged data most acutely; however, between-subjects designs are also affected if each participant contributes an “average” score to the analysis. In the between-subjects case the adjustment would have to be different, but for a large number of trials per participant it might be acceptably ignored, if the error in each participant’s score is small enough.

After a re-read, I want to ensure that I make clear that I'm not implying that the only problem here is with meta-analyses; that's just what drove me to write this post, and how I decided to frame it. But consider this: if an arbitrary decision (driven merely by the resources at hand, such as time or money, or even whim) such as "how many trials will we perform per cell in this experiment?" can cause the standardized effect size to increase almost 200%, that standardized effect size should not be taken to reveal any psychological "truth" and is useless for drawing substantive conclusions.

1. Great post. You might also be interested in:

http://janhove.github.io/reporting/2015/02/05/standardised-vs-unstandardised-es/

http://janhove.github.io/design/2015/03/16/standardised-es-revisited/

2. Thanks! You make some nice observations there, which are particularly relevant in this comment: here. I didn't expand on it too much above, but I think you're right about the difficulties inherent in standardised effect sizes. Although standardised effect sizes do allow for comparison across paradigms, given the difficulties, I wonder if the meaningfulness of such comparisons is merely illusory. It would definitely be profitable to at least consider what meta-analysis would look like based solely on raw effect sizes.

1. Actually, those are not my blog posts, and so not my interesting observations (though I agree with them). So sadly, I cannot take credit.

2. With a mere 14-month delay: Thanks to jwdink for linking to my blog, and thanks to Richard for his comments!

3. Interesting post!
I had recently come across this issue myself, and observed that including random intercepts and slopes in the simulations attenuates the overestimation of the effect sizes (depending on the ratio between the within- and between subjects variability).

I have presented this at our labmeeting only (and do not have a blog to summarize this), but I thought it might be useful to share these slides here, in case you would be interested: https://www.dropbox.com/s/o2ss3iehjnl808f/Labmeeting_12112015.pptx?dl=0

As you highlight in the post-script, I agree that one of the most important implications of this is that standardized effect sizes behave in unanticipated ways when researchers first average across trials for each participant before analyzing the data; this has implications for power analysis and meta-analysis alike.

1. Very nice. It looks like we're thinking along the same lines.

4. This comment has been removed by a blog administrator.

5. Richard, can you provide intuition/explanation for why the standardized effect sizes are so much lower in Expt 1 vs. Expt 2? I could see it if it were Cohen's d because Expt 1 would have higher same mean difference but higher standard deviation, thus Expt 1's d would be lower than Expt 2's. How does this work for eta-squared though, which according to Wikipedia is SS_treatment / SS_total? https://en.wikipedia.org/wiki/Effect_size#Eta-squared.2C_.CE.B72

There, you have SS in both numerator and denominator, so why aren't the ratios in both experiments roughly the same? Is it because the SS_total is similar/same in both experiments, and Expt 1 has higher SS_treatment than does Expt 2?

Also, would be interested in seeing code that underlies this running example.

1. edit: "Expt 1 would have higher same mean difference" should be "... have same mean difference"

2. Still wondering about the specifics behind why the standardized effect size is lower in Expt 1, if anyone has an idea...

6. This comment has been removed by a blog administrator.

7. Can Solution 2 also be applied for adjusting reliability coefficients?
e.g. If I measure reliability using the split-half method, I am really only measuring the reliability of a measure with half as many trials as the actual measure used in analysis.
