picking blindfolded, you might pull out three red ones in a row, just by chance. The standard cut-off point for statistical significance is a p-value of 0.05, which is just another way of saying, ‘If I did this experiment a hundred times, I’d expect a spurious positive result on five occasions, just by chance.’
To go back to our concrete example of the kids in the playground, let’s imagine that there was definitely no difference in cocaine use, but you conducted the same survey a hundred times: you might get a difference like the one we have seen here, just by chance, just because you randomly happened to pick up more of the kids who had taken cocaine this time around. But you would expect this to happen less than five times out of your hundred surveys.
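If you like, you can watch this happen in a few lines of code. This is a sketch, not the report's real data: I've assumed a true cocaine-use rate of 1.4 per cent in both years, a sample of 9,000 children each time, and a simple two-proportion z-test as the significance test.

```python
import math
import random

random.seed(0)  # so the sketch is reproducible

def two_prop_p_value(k1, n1, k2, n2):
    """Two-sided z-test p-value for a difference between two proportions."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = ((k2 / n2) - (k1 / n1)) / se
    # convert |z| to a two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Run the 'same survey' 200 times with NO real change between the two
# years: both years use the same assumed true rate of 1.4 per cent.
TRUE_RATE, N, TRIALS = 0.014, 9000, 200
false_positives = 0
for _ in range(TRIALS):
    year1 = sum(random.random() < TRUE_RATE for _ in range(N))
    year2 = sum(random.random() < TRUE_RATE for _ in range(N))
    if two_prop_p_value(year1, N, year2, N) < 0.05:
        false_positives += 1

print(false_positives / TRIALS)  # hovers around 0.05: five in a hundred
```

Even though nothing changed between the two imaginary years, roughly one survey in twenty comes out 'significant' — which is exactly what a 0.05 threshold means.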
So we have a risk increase of 35.7 per cent, which seems at face value to be statistically significant; but it is an isolated figure. To ‘data mine’, taking it out of its real-world context, and saying it is significant, is misleading. The statistical test for significance assumes that every data point is independent, but here the data is ‘clustered’, as statisticians say. They are not data points, they are real children, in 305 schools. They hang out together, they copy each other, they buy drugs from each other, there are crazes, epidemics, group interactions.
The increase of forty-five kids taking cocaine could have been a massive epidemic of cocaine use in one school, or a few groups of a dozen kids in a few different schools, or mini-epidemics in a handful of schools. Or forty-five kids independently sourcing and consuming cocaine alone without their friends, which seems pretty unlikely to me.
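The figures above can be reconciled with a little arithmetic. One caveat: the baseline rate isn't stated in this passage, so the sketch back-derives it from the absolute rise (0.5 percentage points) and the relative rise (35.7 per cent).

```python
# Reconciling the numbers in the text. The baseline rate below is not
# given here; it is back-derived from the other two figures.

n = 9000
absolute_rise = 0.005             # the 0.5 percentage-point increase
print(n * absolute_rise)          # 45.0: the forty-five extra kids

relative_rise = 0.357             # the 35.7 per cent relative increase
baseline = absolute_rise / relative_rise
print(round(baseline, 3))         # about 0.014, i.e. roughly 1.4 per cent
```

A small absolute change on a small baseline is how 0.5 percentage points becomes a dramatic-sounding 35.7 per cent.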
This immediately makes our increase less statistically significant. The small increase of 0.5 per cent was only significant because it came from a large sample of 9,000 data points – like 9,000 tosses of a coin – and the one thing almost everyone knows about studies like this is that a bigger sample size means the results are probably more significant. But if they’re not independent data points, then you have to treat it, in some respects, like a smaller sample, so the results become less significant. As statisticians would say, you must ‘correct for clustering’. This is done with clever maths which makes everyone’s head hurt. All you need to know is that the reasons why you must ‘correct for clustering’ are transparent, obvious and easy, as we have just seen (in fact, as with many implements, knowing when to use a statistical tool is a different and equally important skill from understanding how it is built). When you correct for clustering, you greatly reduce the significance of the results. Will our increase in cocaine use, already down from ‘doubled’ to ‘35.7 per cent’, even survive?
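The clever maths need not make your head hurt, because the simplest version of it fits in a few lines. Statisticians often use a 'design effect': how much the similarity of kids within one school inflates the variance, and therefore how much smaller the sample effectively is. The intra-cluster correlation below is an invented figure for illustration; a real analysis would estimate it from the survey's own data.

```python
# A sketch of the standard 'design effect' correction for clustering.
# The intra-cluster correlation (ICC) below is an assumption made up
# for illustration, not a figure from the real survey.

def effective_sample_size(n, avg_cluster_size, icc):
    """Shrink a clustered sample to the independent sample it behaves like."""
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return n / design_effect

n = 9000                       # children surveyed
avg_cluster_size = 9000 / 305  # children per school, about 30
icc = 0.1                      # assumed: kids in one school resemble each other

print(round(effective_sample_size(n, avg_cluster_size, icc)))  # 2337
```

With even a modest correlation within schools, your 9,000 children behave, statistically, like well under 3,000 independent ones – and significance calculated from 9,000 coin tosses evaporates.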
No. Because there is a final problem with this data: there is so much of it to choose from. There are dozens of data points in the report: on solvents, cigarettes, ketamine, cannabis, and so on. It is standard practice in research that we only accept a finding as significant if it has a p-value of 0.05 or less. But as we said, a p-value of 0.05 means that for every hundred comparisons you do, five will be positive by chance alone. From this report you could have done dozens of comparisons, and some of them would indeed have shown increases in usage – but by chance alone, and the cocaine figure could be one of those. If you roll a pair of dice often enough, you will get a double six three times in a row on many occasions. This is why statisticians do a ‘correction for multiple comparisons’, a correction for ‘rolling the dice’ lots of times. This, like correcting for clustering, is particularly brutal on the data, and often reduces the significance of findings dramatically.
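The dice-rolling arithmetic is easy to do yourself. The number of comparisons below is an assumption – the text says only 'dozens' – but the shape of the result is the same for any large number of tests.

```python
# Why many comparisons need correcting. The number of comparisons (k)
# is an assumption; the report offered 'dozens' of possible tests.

k = 40  # assumed number of drug/age-group comparisons available

# With k independent tests at the 0.05 threshold, the chance of at
# least one spurious 'significant' result is 1 - 0.95**k.
p_any_false_positive = 1 - 0.95 ** k
print(round(p_any_false_positive, 2))  # 0.87

# Bonferroni's correction: demand p < 0.05/k from each individual test.
bonferroni_threshold = 0.05 / k
print(round(bonferroni_threshold, 5))  # 0.00125
```

With forty comparisons on offer, you are more likely than not to find something 'significant' by chance alone; after Bonferroni's correction, each finding has to clear a far higher bar.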
Data dredging is a dangerous profession. You could – at face value, knowing nothing about how stats works – have said that this government report showed a significant increase of 35.7 per cent in cocaine use. But the stats nerds who compiled it knew about clustering, and Bonferroni’s correction for multiple comparisons. They are not stupid; they do stats for a living.
That, presumably, is why they said quite clearly in their summary, in their press release and in the full report that there was no change from 2004 to 2005. But the journalists did not want to believe this: they tried to reinterpret the data for themselves, they looked under the bonnet, and they thought they’d found the news. The increase went from 0.5 per cent – a figure that might be a gradual trend, but it could equally well be an entirely chance finding – to a front-page story in
There are also some perfectly simple ways to generate ridiculous statistics, and two common favourites are to select an unusual sample group, and to ask them a stupid question. Let’s say 70 per cent of all women want Prince Charles to be told to stop interfering in public life. Oh, hang on – 70 per cent of all women
There was an excellent example of this in the
Where did these figures come from? A systematic survey of all GPs, with lots of chasing to catch the non-responders? Telephoning them at work? A postal survey, at least? No. It was an online vote on a doctors’ chat site that produced this major news story. Here is the question, and the options given:
‘GPs should carry out abortions in their surgeries’
Strongly agree, agree, don’t know, disagree, strongly disagree.
We should be clear: I myself do not fully understand this question. Is that ‘should’ as in ‘should’? As in ‘ought to’? And in what circumstances? With extra training, time and money? With extra systems in place for adverse outcomes? And remember, this is a website where doctors – bless them – go to moan. Are they just saying no because they’re grumbling about more work and low morale?
More than that, what exactly does ‘abortion’ mean here? Looking at the comments in the chat forum, I can tell you that plenty of the doctors seemed to think it was about surgical abortions, not the relatively safe oral pill for termination of pregnancy. Doctors aren’t that bright, you see. Here are some quotes:
This is a preposterous idea. How can GPs ever carry out abortions in their own surgeries. What if there was a major complication like uterine and bowel perforation?
GP surgeries are the places par excellence where infective disorders present. The idea of undertaking there any sort of sterile procedure involving an abdominal organ is anathema.
The only way it would or rather should happen is if GP practices have a surgical day care facility as part of their premises which is staffed by appropriately trained staff,
What are we all going on about? Let’s all carry out abortions in our surgeries, living rooms, kitchens, garages, corner shops, you know, just like in the old days.
And here’s my favourite:
I think that the question is poorly worded and I hope that [the doctors’ website] do not release the results of this poll to the
It would be wrong to assume that the kinds of oversights we’ve covered so far are limited to the lower echelons of society, like doctors and journalists. Some of the most sobering examples come from the very top.
In 2006, after a major government report, the media reported that one murder a week is committed by someone with psychiatric problems. Psychiatrists should do better, the newspapers told us, and prevent more of these murders. All of us would agree, I’m sure, with any sensible measure to improve risk management and reduce violence, and it’s always timely to have a public debate about the ethics of detaining psychiatric patients (although in the name of fairness I’d like to see preventive detention discussed for all other potentially risky groups too – like alcoholics, the repeatedly violent, people who have abused staff in the job centre, and so on).
But to engage in this discussion, you need to understand the maths of predicting very rare events. Let’s take a very concrete example, and look at the HIV test. What features of any diagnostic procedure do we measure in order to judge how useful it might be? Statisticians would say the blood test for HIV has a very high ‘sensitivity’, at 0.999. That means that if you do have the virus, there is a 99.9 per cent chance that the blood test will be positive. They would also say the test has a high ‘specificity’ of 0.9999 – so, if you are not infected, there is a 99.99 per cent chance that the test will be negative. What a smashing blood test.
But if you look at it from the perspective of the person being tested, the maths gets slightly counterintuitive.
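To see where that counterintuitive feeling comes from, here is a sketch with one extra, assumed number: how common the infection is in the population being screened. I've assumed one infection per 10,000 people, purely for illustration.

```python
# The 'person being tested' view, via Bayes' rule. The prevalence
# figure is an assumption for illustration, not a real-world estimate.

sensitivity = 0.999    # P(test positive | infected)
specificity = 0.9999   # P(test negative | not infected)
prevalence = 0.0001    # assumed: 1 infection per 10,000 people screened

true_pos = sensitivity * prevalence            # infected AND positive
false_pos = (1 - specificity) * (1 - prevalence)  # healthy AND positive
p_infected_given_positive = true_pos / (true_pos + false_pos)

print(round(p_infected_given_positive, 2))  # 0.5
```

With a rare enough condition, the tiny fraction of healthy people who falsely test positive can outnumber the genuinely infected – so even this smashing blood test, on these assumed numbers, leaves a positive result meaning only about a 50/50 chance of infection.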