This week, a paper by Woodley et al (2013) was widely quoted in the media (e.g. Daily Mail, Telegraph). The authors dramatically announced that the average intelligence of populations of Western industrialised societies has fallen since the Victorian era. This is provocative because previous analyses of large archived datasets of intelligence test scores by Flynn and others show the opposite. However, Woodley et al did not examine average intelligence test scores obtained from different generations. They compared 16 sets of data from Simple Reaction Time (SRT) experiments made on groups of people at various times between 1884 and 2002. In all of these experiments volunteers responded to a single light signal by pressing a single response key. Data for women are incomplete, but averages of SRTs for men increase significantly with year of testing. Because Woodley et al regard SRTs as good inverse proxies for intelligence test scores, and in some senses as “purer” measures of intelligence than pencil-and-paper tests, they concluded that more recent samples are less intelligent than earlier ones.
Throughout their paper the authors argue that the higher intelligence of persons alive during the Victorian era can explain why their creativity and achievements were markedly greater than those of later, duller generations. We can leave aside the important question of whether there is any sound evidence that creativity and intellectual achievements have declined since a Great Victorian Flowering, because only two of the 16 datasets they compared were collected before Victoria’s death in 1901. The remaining 14 datasets date between 1941 and 2004 and, of these, only four were collected before 1970. So most of the studies analysed were made within my personal working lifespan. This provokes both nostalgia and distrust. Between 1959 and 2004 I collected reaction times (RTs) from many large samples of people, but it would make no sense for me to compare absolute values of group mean RTs that I obtained before and after 1975. This is because, until 1975, like nearly all of my colleagues, the only apparatus I had (Dekatron counters, the Birren Psychomet or the SPARTA apparatus) measured no intervals shorter than 100 msec. Consequently, when my apparatus gave a reading of 200 msec the actual reaction time might be anywhere between 200 and 299 msec. Like most of my colleagues I always computed and published mean RTs to three decimal places, but this was pretentious because all the RTs I had collected had been, in effect, rounded down by my equipment. After 1975, easier access to computers and better programs gradually began to allow true millisecond resolution. More investigators took advantage of new equipment and our reports of millisecond averages became less misleading. I am unsurprised that mean RTs computed from post-1975 data were consistently and significantly longer than those for pre-1975 data.
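A back-of-envelope calculation (my own addition here, assuming only that true RTs fall roughly uniformly within each 100 msec counter bin) shows how large this rounding-down bias should be:

    R = 100\,\lfloor T/100 \rfloor, \qquad R - T = -(T \bmod 100) \in (-100,\,0]\ \text{ms}
    \mathbb{E}[R - T] \approx -50\ \text{ms} \quad \text{(if } T \bmod 100 \text{ is roughly uniform on } [0,100)\text{)}

In other words, pre-1975 group means should be expected to read on the order of 50 msec faster than the same responses timed at true millisecond resolution, a bias of the same order as the differences under discussion.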
Changes in recording accuracy are a sufficient reason to withhold excitement at Woodley et al’s comparison. It is worth noticing that other methodological issues also make it tricky to compare absolute values for means of RTs that were collected at different times and so with different kinds of equipment. For example, RTs are affected by differences in signal visibility and in rise-times to maximum brightness between tungsten lamps, computer monitor displays, neon bulbs and LCDs. The stiffness and “throw” of response buttons will also have varied between the set-ups that investigators used. When comparing absolute values of SRTs, another important factor is whether or not each signal to respond is preceded by a warning signal, whether the periods between warning signals and response signals are constant or variable, and just how long they are (intervals of, approximately, 200 to 800 ms allow faster RTs than shorter or longer ones). Knowing these methodological quirks makes us realise that, in marked contrast to intelligence tests, methodologies for measuring RT have been thoroughly explored but never standardised.
So I do not yet believe that Woodley et al’s analyses show that psychologists of my generation were probably (once!) smarter than our young colleagues (now) are. This seems unlikely, but perhaps if I read further publications by these industrious investigators I may become convinced that this is really the case.
References
Flynn, J. R. (1987). Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101(2), 171-191. doi: 10.1037/0033-2909.101.2.171
Woodley, M. A., te Nijenhuis, J., & Murphy, R. (2013). Were the Victorians cleverer than us? The decline in general intelligence estimated from a meta-analysis of the slowing of simple reaction time. Intelligence. http://dx.doi.org/10.1016/j.intell.2013.04.006
POSTSCRIPT, 24th May 2013
Dr Woodley has published a response to my critique on James Thompson's blog. He asks me to answer, and I am glad to do so. My sluggishness has been due only to the pleasure of reading the many articles to which Woodley drew my attention. Dorothy’s remorseless archaeology of this trove, summarised in the table below, has provoked much domestic merriment during the past few days. We are grateful to Dr Woodley for this diversion. Here are my thoughts on his comments on my post.
Woodley et al used data from a meta-analysis by Silverman (2010). I am grateful to Prof Silverman for very rapid access to his paper, in which he compared average times to make a single response to a light signal from large samples tested in Francis Galton's anthropometric laboratories with those from several later, smaller samples dating from 1941 to 2006. To these Woodley et al added a dataset from Helen Bradford Thompson's 1903 monograph "The Mental Traits of Sex".
As Silverman (2010) trenchantly points out, there is a limit to possible comparisons from these datasets: “In principle, it would be possible to uncover the rate at which RT increased (since the Galton studies) by controlling for potentially confounding variables in a multiple regression analysis. However, this requires that each of these variables be represented by multiple data points, but this requirement cannot be met by the present dataset. Accurately describing change over time also requires that both ends of the temporal dimension be well represented in the dataset and that the dataset be free of outliers (Cohen, Cohen, West, & Aiken, 2003); neither of these requirements can be met … Thus, it is important to reiterate that the purpose … is not to show that RT has changed according to a specific function over time but rather to show that modern studies have obtained RTs that are far longer than those obtained by Galton."
Neither Silverman nor Woodley et al seem much concerned that results of comparisons might depend on differences between studies in apparatus and methods, which are shown here, together with temporal resolution where reported.
Since Galton's dataset is the key baseline for the conclusion that population mean RT is increasing, it is worth considering details of his equipment described here and in a wonderful archival paper “Galton’s Data a Century Later” by Johnson et al (1985): “… during its descent the pendulum gives a sight-signal by brushing against a very light and small mirror which reflects a light off or onto a screen, or, on the other hand, it gives a sound-signal by a light weight being thrown off the pendulum by impact with a hollow box. The position of the pendulum at either of these occurrences is known. The position of the pendulum when the response is made is obtained by means of a thread stretched parallel to the axis of the pendulum by two elastic bands one above and one below, the thread being in a plane through the axes of the pendulum, perpendicular to the plane of the pendulum's motion. This thread moves freely between two parallel bars in a horizontal plane, and the pressing of a key causes the horizontal bars to clamp the thread. Thus the clamped thread gives the position of the pendulum on striking the key. The elastic bands provide for the pendulum not being suddenly checked on the clamping. The horizontal bars are just below a horizontal scale, 800 mm. below the point of suspension of the pendulum. Galton provided a table for reading off the distance along the scale from the vertical position of the pendulum in terms of the time taken from the vertical position to the position in which the thread is clamped." (p. 347).
Contemporary journal referees would press authors for reassurance that the apparatus produced consistent values over trials and had no bias to over- or under-estimate. Obviously this would have been very difficult for Galton to achieve.
In my earlier post I noted that, over the mid- to late 20th century, it became obvious that reporting reaction times (RTs) to three decimal places is misleading if equipment records only in 100 ms steps. In the latter case a reading of 200 ms will remain until a further 100 ms have elapsed, effectively "rounding down" the RT. Woodley argues that we cannot assume that rounding down occurred. I do not follow his reasoning on this point. He also offers a statistical analysis to confirm that, if the temporal resolution of the measure is the only difference between studies, this would not systematically underestimate RT. Disagreement on whether rounding occurred may only be resolved with empirical data comparing recorded and true RTs across different kinds of equipment.
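Since the disagreement turns on what old counters actually did with the digits they could not display, a toy simulation (entirely my own sketch, using an invented but plausible RT distribution) illustrates why the distinction matters: truncating to the bin floor shifts a group mean by roughly half a bin, while rounding to the nearest bin leaves it nearly unbiased.

    import random

    # Toy sketch, not Woodley's analysis: how 100 ms timer resolution
    # shifts a group mean under two assumptions about what old counters
    # did: truncate to the bin floor, or round to the nearest bin.
    random.seed(1)

    # An invented but plausible simple-RT distribution:
    # a ~250 ms Gaussian core plus an exponential slow tail.
    rts = [random.gauss(250, 20) + random.expovariate(1 / 40) for _ in range(10_000)]

    truncated = [100 * (t // 100) for t in rts]    # floor to 100 ms bin
    rounded = [100 * round(t / 100) for t in rts]  # round to nearest bin

    def mean(xs):
        return sum(xs) / len(xs)

    print(f"true mean      : {mean(rts):6.1f} ms")
    print(f"truncated mean : {mean(truncated):6.1f} ms")  # reads ~50 ms too fast
    print(f"rounded mean   : {mean(rounded):6.1f} ms")    # roughly unbiased

If surviving specimens of the old devices could be bench-tested against a modern millisecond timer, that is exactly the comparison which would settle the disagreement.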
A general concern with comparisons of RTs between studies is that they are significantly affected by the methodology and apparatus used to collect them. This is not only a matter of differences in resolution; apparatus and procedure can also introduce systematic bias into the timing of trials. For a comprehensive account of how minor differences between different 21st century computers and commercial software can flaw comparisons between studies see Plant and Quinlan (2013), who write: "All that we can state with absolute certainty is that all studies are likely to suffer from varying degrees of presentation, synchronization, and response time errors if timing error was not specifically controlled for."

I earlier suggested that apparently trivial procedural details can markedly affect RTs. Among these are whether or not participants are given warning signals, whether the intervals between warning signals and response signals are constant or vary across trials and how long these intervals are, the brightness of signal lamps, and the mechanical properties of response keys. A further point also turns out to be relevant to assessment of Woodley et al's argument: average values will also depend on the number of trials recorded, and averaged, for each person, and on whether outliers are excluded. Note, for instance, that the equipment used in the studies by Deary and Der, though appropriate for the comparisons that they made and reported, did not record RTs for individual trials but only an averaged RT for an entire session. This makes it impossible to exclude outliers, as is normal good practice. The point is that comparisons that are satisfactory within the context of a single well-run experiment may be seriously misleading if made between equally scrupulous experiments using different apparatus and procedures. Johnson et al (1985) and Silverman (2010) stress that Galton’s data were wonderfully internally consistent. This reassures us that equipment and methods were well standardised within his own study. It gives no assurance that his data can sensibly be compared with those obtained with other, very diverse equipment and methodologies.
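To put numbers on the Deary and Der point (a toy sketch of my own; the lapse rate and the 1000 ms exclusion rule are invented purely for illustration), a handful of slow attentional lapses that trial-level recording would let us trim can noticeably inflate a mean that the equipment reports only as a single session average:

    import random

    # Toy sketch: why trial-level recording matters. Equipment that
    # returns only a whole-session average makes the usual trimming of
    # slow outliers impossible.
    random.seed(2)

    trials = [random.gauss(300, 30) for _ in range(100)]
    trials[::20] = [1200] * 5      # five attentional lapses among 100 trials

    session_mean = sum(trials) / len(trials)   # all a session-level device returns
    kept = [t for t in trials if t < 1000]     # hypothetical 1000 ms exclusion rule
    trimmed_mean = sum(kept) / len(kept)

    print(f"session mean (no trimming possible): {session_mean:.0f} ms")
    print(f"mean after excluding slow outliers : {trimmed_mean:.0f} ms")

With these invented figures the untrimmable session mean runs some tens of milliseconds slower than the edited one, which is again of the same order as the between-era differences being debated.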
Another excellent feature of the Galton dataset is that re-testing of part of his large initial sample allowed estimates of the reliability of his measures. With his large sample sizes even low test-retest correlations were significantly better than chance. Nevertheless it is interesting that the test-retest correlation for visual RT, at .17, on which Silverman’s and Woodley’s conclusions depend, was lower than the next lowest (high-frequency auditory acuity, .28), and far lower than those for the Snellen eye-chart (.58) and visual acuity (.76 to .79) (Johnson et al, 1985, Table 2).
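As a reasoning step of my own (assuming nothing beyond classical test theory), the familiar attenuation formula shows why such a low reliability matters: unreliability caps the correlations a measure can display with anything else,

    r_{XY}^{\mathrm{obs}} \;=\; r_{XY}^{\mathrm{true}}\,\sqrt{r_{XX}\,r_{YY}}

so with r_XX = .17, even a perfectly reliable criterion (r_YY = 1) and a perfect true correlation would surface as an observed value of only √.17 ≈ .41.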
We do not know whether warning signals were used in Galton's RT studies, or, if so, how long the preparatory intervals between warning and response signals might have been. Silverman (2010) had earlier acknowledged that preparatory interval duration might be an issue, but felt that he could ignore it because Teichner had reported that Wundt’s discovery of fore-period duration effects could not be independently substantiated, and also because he accepted Seashore et al’s (1941) reassurance that fore-period duration has no effect on RT.
Ever since a convincing study by Klemmer (1957) it has been recognised that the durations of preparatory intervals do significantly affect reaction times, that the effects of fore-period variation are large, and that results cannot be usefully compared unless these factors are taken into consideration. Indeed, during the 1960s fore-period effects were the staple topic of a veritable academic industry (see the review by Niemi and Näätänen, 1981, and far too many other papers by Bertelson, Nickerson, Sanders, Rabbitt, etc.). In this context Seashore et al’s (1941) failure to find fore-period effects does not increase our confidence in their results as one of the data points on which Woodley et al’s analysis is based.
Silverman’s lack of interest in fore-period duration was also reinforced by Johnson et al’s (1985) comment that, as far as they were able to discover, each of Galton’s volunteers was given only one trial. Silverman implies that if each of Galton’s volunteers only recorded a single RT, variations in preparatory intervals are hardly an issue. It is also arguable that this relaxed procedure might have lengthened rather than shortened RTs. Well… yes and no. First, it would be nice to know just how volunteers were alerted that their single trial was imminent. By a nod or a wink? A friendly pat on the shoulder? A verbal “Ready”? Second, an important point of using warning signals, and of recording many trials rather than just one, is that the first thing that all of us who run simple RT experiments discover is that volunteers are very prone to “jump the gun” and begin to respond before any signal appears, so recording absurdly fast “RTs” that can be as low as 10 to 60 ms. 20th and 21st century investigators are usually (!) careful to identify and exclude such observations. Many also edit out what they regard as implausibly slow responses. We do not know whether or how either kind of editing occurred in the Galton laboratories. Many participants would have jumped the gun and, if this was their sole recorded reaction, the effects on group means would have been considerable. Both retention of impulsive responses and, if Galton’s staff did edit RTs, dismissal of very slow ones would reduce means and favour the idea of “Speedy Victorians”.
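A toy calculation (mine; the 10 per cent guess rate is invented purely for illustration) shows how far even a modest proportion of unedited anticipations would drag down a single-trial group mean:

    import random

    # Toy sketch: unedited anticipations in a single-trial design.
    # Each simulated volunteer contributes exactly one RT, as Galton's
    # volunteers apparently did.
    random.seed(3)

    def one_trial(p_guess=0.10):
        # With (invented) probability p_guess the volunteer jumps the
        # gun, producing a 10-60 ms "RT"; otherwise a genuine response.
        if random.random() < p_guess:
            return random.uniform(10, 60)
        return random.gauss(250, 30)

    sample = [one_trial() for _ in range(5000)]
    print(f"group mean, anticipations kept: {sum(sample) / len(sample):.0f} ms")

    genuine = [t for t in sample if t > 100]   # an editing rule a modern lab might use
    print(f"group mean after editing      : {sum(genuine) / len(genuine):.0f} ms")

With these invented numbers the contaminated mean comes out roughly 20 ms faster than the edited one; a higher guess rate would bias it further still.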
I would like to stress that my concerns are methodological rather than dogmatic. Investigators of reaction times try to test models of information processing by making small changes in single variables in tasks run on the same apparatus and with exactly the same procedures. This makes us wary of conclusions from comparisons between datasets collected with wildly different equipment, procedures and groups of people. My concerns were shared by some of those whose data are used by Silverman and Woodley et al. For example, the operational guide for the Datico Terry 84 device used by Anger et al states that "A single device has been chosen because it is very difficult to compare reaction time data from different test devices".
Because I have spent most of my working life using RTs to compare the mental abilities of people of different ages, I am very much in favour of using RT measurements as a research tool for individual differences. (For my personal interpretation of the relationships between people’s calendar ages and gross brain status and their performance on measures of mental speed, of fluid intelligence, of executive function, and of memory see e.g. Rabbitt et al, 2007.) I also strongly believe that mining archived data is a very valuable scientific endeavour, and one that becomes more valuable as the volume of available data increases exponentially. Flynn’s dogged analyses of archived intelligence test scores, for example, show that data mining can raise provocative and surprising questions. I also believe, with Silverman, that large population studies provide good epidemiological evidence of the effects of changes in the incidence of malnutrition or of misuse of pesticides or antibiotics. I am more amused than concerned when, in line with Galton’s strange eugenic obsessions, they are also discussed as potential illustrations of the growing degeneracy of our species due to increased survival odds for the biologically unfit. As I noted in my original post, my only concern is that it is a time-wasting mistake to treat measurements of reaction times uncritically as being, in some sense, “purer”, more direct and more trustworthy indices of individual differences than other measures such as intelligence tests. Of course RTs can be sensitive and reliable measures of individual differences but, as things stand, equipment and procedures are not standardised and, because RTs are liable to many methodological quirks, we obtain widely different mean values from different population samples even on apparently very similar tasks.