Here's a question for
researchers who use analysis of variance (ANOVA). Suppose I set up a study to
see if one group (e.g. men) differs from another (women) on brain response
to auditory stimuli (e.g. standard sounds vs deviant sounds – a classic
mismatch negativity paradigm). I measure
the brain response at frontal and central electrodes located on two sides of the head.
The nerds among my readers will see that I have
here a four-way ANOVA, with one between-subjects factor (sex) and three
within-subjects factors (stimulus, hemisphere, electrode location). My
hypothesis is that women have bigger mismatch effects than men, so I predict an
interaction between sex and stimulus, but the only result significant at p <
.05 is a three-way interaction between sex, stimulus and electrode location.
What should I do?
a) Describe this as my
main effect of interest, revising my hypothesis to argue for a site-specific
sex effect
b) Describe the result as
an exploratory finding in need of replication
c) Ignore the result as
it was not predicted and is likely to be a false positive
I'd love to do a survey
to see how people respond to these choices; my guess is many would opt for a)
and few would opt for c). Yet in this situation, the likelihood of the result
being a false positive is very high – much higher than many people realise.
Many people assume that
if an ANOVA output is significant at the .05 level, there's only a one in
twenty chance of it being a spurious chance effect. We have been taught that we do ANOVA rather
than numerous t-tests because ANOVA adjusts for multiple comparisons. But this
interpretation is quite wrong. ANOVA adjusts for the number of levels within a factor, so, for instance, the probability
of finding a significant effect of group is the same regardless of how many
groups you have. ANOVA makes no
adjustment to p-values for the number of factors and interactions in your
design. The more of these you have, the greater the chance of turning up a
"significant" result.
So, for the example given above, what is the probability of finding something significant at the .05 level? We have 15 terms (four main effects, six 2-way interactions, four 3-way interactions and one 4-way interaction). Treating these as independent tests, the probability of finding no significant effect is .95^15 = .46, so the probability of finding at least one significant effect is .54.
And for a three-way ANOVA
there are seven terms (three main effects, three 2-way interactions and one
3-way interaction), and p (something significant) = .30.
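These figures are easy to verify in R; the little function below is just my own back-of-the-envelope calculation, treating the terms as independent tests (not exactly true, as the postscript notes, but close enough for present purposes):

# Probability that at least one ANOVA term comes out "significant" at .05
# when every null hypothesis is true, treating the tests as independent.
p_any_significant <- function(n_factors, alpha = .05) {
  n_terms <- 2 ^ n_factors - 1        # main effects plus all interactions
  1 - (1 - alpha) ^ n_terms
}
p_any_significant(4)   # 15 terms: about .54
p_any_significant(3)   #  7 terms: about .30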
So, basically, if you do a four-way ANOVA and you don't care which result comes out, provided something is significant, you have a slightly better than 50% chance of being satisfied. This might seem like an implausible example: after all, who uses ANOVA like this? Well, unfortunately,
this example corresponds rather closely to what often happens in
electrophysiological research using event-related potentials (ERPs). In this field, the interest is often in
comparing a clinical and a control group, and so some results are more
interesting than others: the main effect of group, and the seven interactions
with group are the principal focus of attention. But hypotheses about exactly what will be
found are seldom clearcut: excitement is generated by any p-value associated
with a group term that falls below .05. There's roughly a one in three chance that at least one of the eight terms involving group will have a p-value this low. This means that the
potential for 'false positive psychology' in this field is enormous (Simmons et
al., 2011).
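The same arithmetic gives the one-in-three figure:

# Eight of the fifteen terms involve group: the main effect plus its
# interactions with every combination of the three within-subject factors.
1 - .95 ^ 8   # about .34, i.e. roughly one in three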
A corollary
of this is that researchers can modify the likelihood of finding a
"significant" result by selecting one ANOVA design rather than
another. Suppose I'm interested in comparing brain responses to standard and
deviant sounds. One way of doing this is to compute the difference between ERPs
to the two auditory stimuli and use this difference score as the dependent
variable:
this reduces my ANOVA from a
4-way to a 3-way design, and gives fewer opportunities for spurious findings. So
you will get a different risk of a false positive,
depending on how you analyse the data.
Another feature of ERP
research is that there is flexibility in how electrodes are handled in an ANOVA
design: since there is symmetry in electrode placement, it is not uncommon to
treat hemisphere as one factor, and electrode site as another. The alternative
is just to treat electrode as a repeated measure. This is not a neutral choice:
the chance of spurious findings is greater if one adopts the first approach,
simply because it adds a factor to the analysis, plus all the interactions with
that factor.
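To make these two design choices concrete, here is a small sketch in base R on made-up noise data; the variable and factor names are illustrative rather than taken from any particular study:

# Toy dataset: 24 subjects (12 per group), pure noise, fully crossed design.
set.seed(42)
d <- expand.grid(subject = factor(1:24),
                 stim = factor(c("standard", "deviant")),
                 hemi = factor(c("left", "right")),
                 site = factor(c("frontal", "central")))
d$group <- factor(ifelse(as.integer(d$subject) <= 12, "F", "M"))
d$amplitude <- rnorm(nrow(d))

# Choice 1: analyse deviant - standard difference scores. The stimulus factor
# disappears, so the 4-way ANOVA (15 terms) becomes a 3-way ANOVA (7 terms).
wide <- reshape(d, idvar = c("subject", "group", "hemi", "site"),
                timevar = "stim", v.names = "amplitude", direction = "wide")
wide$mmn <- wide$amplitude.deviant - wide$amplitude.standard
fit_diff <- aov(mmn ~ group * hemi * site + Error(subject / (hemi * site)),
                data = wide)

# Choice 2: treat the four channels as a single electrode factor rather than
# as hemisphere x site, which removes a factor and all its interactions.
d$electrode <- interaction(d$hemi, d$site)
fit_collapsed <- aov(amplitude ~ group * stim * electrode +
                       Error(subject / (stim * electrode)), data = d)
summary(fit_diff)        # 7 terms to tempt you
summary(fit_collapsed)   # 7 terms here too, versus 15 in the full 4-way design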
I stumbled across these
insights into ANOVA when I was simulating data using a design adopted in a
recent PLOS One paper that I'd commented on. I was initially interested in looking at the
impact of adopting an unbalanced design in ANOVA: this study had a group factor
with sample sizes of 20, 12 and 12.
Unbalanced designs are known to be problematic for repeated measures ANOVA and I initially thought this might be
the reason why simulated random numbers were giving such a lot of
"significant" p-values. However, when I modified the simulation to
use equal sample sizes across groups, the analysis continued to generate far
more low p-values than I had anticipated, and I eventually twigged that this is simply what you get if you use a 4-way ANOVA. For any one main
effect or interaction, the probability of p < .05 was one in twenty: but the
probability that at least one term in the analysis would give p < .05 was closer
to 50%.
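For anyone who wants to try this, here is a stripped-down sketch of the kind of simulation I mean, using base R's aov and the two-group example from the top of this post rather than the three-group design of the PLOS One paper. It is not the exact code I used, and it takes a little while to run:

# How often does a 4-way mixed ANOVA on pure noise give at least one term
# with p < .05? (2 groups of 20; stimulus, hemisphere and site within-subject.)
set.seed(2013)
n_sims <- 500
hits <- 0   # simulations with at least one "significant" term

for (i in seq_len(n_sims)) {
  d <- expand.grid(subject = factor(1:40),
                   stim = factor(c("standard", "deviant")),
                   hemi = factor(c("left", "right")),
                   site = factor(c("frontal", "central")))
  d$group <- factor(ifelse(as.integer(d$subject) <= 20, "F", "M"))
  d$amplitude <- rnorm(nrow(d))   # no real effects anywhere

  fit <- aov(amplitude ~ group * stim * hemi * site +
               Error(subject / (stim * hemi * site)), data = d)
  pvals <- unlist(lapply(summary(fit), function(s) s[[1]][["Pr(>F)"]]))
  if (any(pvals < .05, na.rm = TRUE)) hits <- hits + 1
}

hits / n_sims   # roughly half the null datasets yield something "significant"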
The analytic approach
adopted in the PLOS One paper is pretty standard in the field of ERP. Indeed, I have
seen papers where 5-way or even 6-way
repeated measures ANOVA is used. When
you do an ANOVA and it spews out the results, it's tempting to home in on the
results that achieve the magical significance level of .05 and then formulate
some kind of explanation for the findings. Alas, this is an approach that has
left the field swamped by spurious results.
There have been various
critiques of analytic methods in ERP, but I haven't yet found any that have
focussed on this point.
Kilner (2013) has noted the bias that arises when
electrodes or windows are selected for analysis post hoc, on the basis that
they give big effects. Others have noted
problems with using electrode as a repeated measure, given that ERPs at different electrodes are often highly
correlated. More generally,
statisticians are urging psychologists to
move away from using ANOVA to adopt multi-level modelling, which makes different assumptions and can cope, for
instance, with unbalanced designs. However, we're not going to fix the problem
of "false positive ERP" by adopting a different form of analysis. The
problem is not just with the statistics, but with the use of statistics for what
are, in effect, unconstrained exploratory analyses. Researchers in this field urgently need
educating in the perils of post hoc interpretation of p-values and the
importance of a priori specification of predictions.
I've
argued before that
the best way to teach people about statistics is to get them to generate their
own random data sets. In the past, this was difficult, but these days it can be
achieved using the free statistical software R.
There's no better way of persuading someone to be less impressed by p
< .05 than to show them just how readily a random dataset can generate
"significant" findings. Those who want to explore this approach may
find
my blog on twin analysis in R useful for getting started (you don't need
to get into the twin bits!).
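A minimal first exercise, far simpler than the ERP simulation above, is to watch "significant" results emerge from pure noise:

# 1000 two-group "experiments" (n = 15 per group) on random numbers.
set.seed(99)
pvals <- replicate(1000, t.test(rnorm(15), rnorm(15))$p.value)
hist(pvals)        # p-values from null data are spread evenly between 0 and 1
sum(pvals < .05)   # around 50 of these experiments look "significant"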
The field of ERP is
particularly at risk of spurious findings because of the way in which ANOVA is
often used, but the problem of false positives is not restricted to this area,
nor indeed to psychology. The mindset of researchers needs to change radically,
with a recognition that our statistical methods only allow us to distinguish
signal from noise in the data if we understand the nature of chance.
Education about
probability is one way forward. Another is to change how we do science to make
a clear distinction between planned and exploratory analyses. This post was
stimulated by
a letter that appeared in the Guardian this week, of which I was a
signatory. The authors argued that we should encourage a system of
pre-registration of research, to avoid the kind of post hoc interpretation of
findings that is so widespread yet so damaging to science.
Reference
Simmons, Joseph P., Nelson, Leif D., & Simonsohn, Uri (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366. DOI: 10.1037/e636412012-001

This article (Figshare version) can be cited as:
Bishop, Dorothy V M (2014): Interpreting unexpected significant findings. figshare. http://dx.doi.org/10.6084/m9.figshare.1030406

PS. 2nd July 2013
There's remarkably little coverage of this issue in statistics texts, but Mark Baxter pointed me to a 1996 manual for SYSTAT that does explain it clearly. See: http://www.slideshare.net/deevybishop/multiway-anova-and-spurious-results-syt
The authors noted: "Some authors devote entire chapters to fine distinctions between multiple comparison procedures and then illustrate them within a multi-factorial design not corrected for the experiment-wise error rate." They recommend doing a Q-Q plot to see whether the distribution of p-values departs from expectation, and using Bonferroni correction to guard against type I error.
They also note that the different outputs from an ANOVA are not independent if they are based on the same mean squares denominator, a point that is discussed here:
Hurlburt, R. T., & Spiegel, D. K. (1976). Dependence of F Ratios Sharing a Common Denominator Mean Square. The American Statistician, 30(2), 74-78. DOI: 10.2307/2683798
These authors conclude (p. 76): "It is important to realize that the appearance of two significant F ratios sharing the same denominator should decrease one's confidence in rejecting either of the null hypotheses. Under the null hypothesis, significance can be attained either by the numerator mean square being 'unusually' large, or by the denominator mean square being 'unusually' small. When the denominator is small, all F ratios sharing that denominator are more likely to be significant. Thus when two F ratios with a common denominator mean square are both significant, one should realize that both significances may be the result of unusually small error mean squares. This is especially true when the numerator degrees of freedom are not small compared to the denominator degrees of freedom."