the more tests you do, the sooner you'll see a fluke on average
In other words, if you toss a coin 1,000 times, then it's hideously unlikely that you'll see a run of 100 consecutive heads. But if you toss the coin 100,000,000 times, you shouldn't be too surprised to see that 100-toss run buried in there somewhere, even though the odds of getting 100 in a row are so small.
If you toss 100,000,000 (roughly 2^27) times, you should only expect a run of around 27 or so, since the expected longest run grows like log2 of the number of tosses. To have a good chance of getting 100 in a row, you need roughly 2^73 (about 10^22) times more tosses.
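If you want to see that log2 scaling for yourself, a quick simulation makes the point. This is just my own illustration in Python (the function name and toss counts are arbitrary):

    # Rough illustration of the point above: the longest run of heads in n fair
    # tosses hovers around log2(n). (Pure Python, so I stop at 10 million tosses.)
    import random
    from math import log2

    def longest_head_run(n, seed=0):
        rng = random.Random(seed)
        longest = current = 0
        for _ in range(n):
            if rng.random() < 0.5:        # heads
                current += 1
                longest = max(longest, current)
            else:                          # tails
                current = 0
        return longest

    for n in (1_000, 100_000, 10_000_000):
        print(f"{n:>10,} tosses: log2(n) ~ {log2(n):.0f}, longest run = {longest_head_run(n)}")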
And the problem is MUCH worse than described above: let's say you test 1,000 wrong hypotheses at p=0.05; about 50 of those will be accepted as true, even though all are wrong. If you test 980 wrong hypotheses and 20 right ones, you get roughly 49 false positives against at most 20 true positives, so more than half of the results that pass the p=0.05 "golden" significance test will in fact be wrong.
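To put rough numbers on that 980/20 scenario, here's my own back-of-the-envelope version; the 80% power figure is my assumption (a fairly typical study design target), not anything stated above:

    # Back-of-the-envelope version of the 980/20 scenario above.
    wrong, right = 980, 20
    alpha, power = 0.05, 0.80          # power is an assumed, typical value

    false_positives = wrong * alpha    # ~49 wrong results that pass p < 0.05
    true_positives  = right * power    # ~16 right results that pass

    total_significant = false_positives + true_positives
    print(f"significant results: {total_significant:.0f}")
    print(f"fraction of them that are wrong: {false_positives / total_significant:.0%}")
    # -> roughly 75%: most of what clears the 0.05 bar is still wrong.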
Now, when you see a medical journal with 20 articles using p=0.05, which do you think is more probable - that 19 are right and one is wrong, or 19 are wrong and one is right? The latter has a much higher likelihood.
Clinical researchers too. Because lives are at stake.
The whole field of systematic reviews and meta-analyses has developed around the need to aggregate results from multiple studies of the same disease or treatment, because you can't just trust one isolated result -- it's probably wrong.
Statisticians working in EBM have developed techniques for detecting the 'file-drawer problem' of unpublished negative studies, and correcting for multiple tests (data-dredging). Other fields have a lot to learn...
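For the multiple-tests part, the textbook fix (inside EBM or out) is something like Bonferroni or Benjamini-Hochberg. Here's a minimal sketch of the latter, just to show how little code the correction takes; I'm not claiming this is specifically what EBM practitioners use:

    # Minimal sketch of Benjamini-Hochberg FDR control for a batch of p-values.
    def benjamini_hochberg(p_values, q=0.05):
        """Return indices of hypotheses rejected at false-discovery rate q."""
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])
        cutoff = 0
        for rank, idx in enumerate(order, start=1):
            if p_values[idx] <= q * rank / m:
                cutoff = rank          # largest rank whose p-value clears its threshold
        return sorted(order[:cutoff])

    # Example: p-values from, say, a batch of exploratory correlations.
    ps = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.74]
    print(benjamini_hochberg(ps))      # only the comparisons that survive correction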
Clinical researchers working for non-profits / universities do, occasionally. I suspect it has become popular recently not because lives are at stake, but because it lets you publish something meaningful without having to run complex, error-prone and lengthy experiments.
Regardless of the true reason, these are never carried out before a new drug or treatment is approved (because there are usually only one or two studies supporting said treatment, all of them positive).
And if you have pointers to techniques developed for/by EBM practitioners, I would be grateful. Being a Bayesian guy myself and having spent some time reading Lancet, NEJM and BMJ papers, I'm so far unimpressed, to say the least.
Ugh, reminds me of the undergrad psych research I participated in. When your original hypothesis doesn't turn out, just run correlations on your data until you find something to write about. Publish or perish, right?
You've got it. The other half of the problem is that there's a chance that when you start flipping coins the first 100 flips will all turn up heads. Now, does this mean that the universe is bent and has begun preferentially treating Mr. Lincoln's head differently than his backside? Does this mean your testing apparatus is biased? Is this an inherent property of coins? If you stop flipping at 100, it'd be very tempting to conclude that this is the case.
The only way to find out is to do enough flips to reduce the chance that your final result is just a statistical fluke. Measuring small differences, like trying to answer "does a coin preferentially land on one side vs the other?", usually takes hundreds of thousands of tests before you can be reasonably confident you're seeing a real effect rather than patterns in noise.
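For a ballpark of "how many flips": the standard normal-approximation sample-size formula, with 5% significance and 80% power as my assumed targets, gives numbers like these:

    # Rough sample-size arithmetic for "does the coin prefer one side?".
    # z_alpha and z_beta correspond to 5% two-sided significance and 80% power
    # (my assumptions, not anything stated in the thread).
    from math import sqrt

    def flips_needed(p_true, p_null=0.5, z_alpha=1.96, z_beta=0.84):
        """Approximate tosses needed to detect a bias p_true vs. a fair coin."""
        num = z_alpha * sqrt(p_null * (1 - p_null)) + z_beta * sqrt(p_true * (1 - p_true))
        return int((num / abs(p_true - p_null)) ** 2)

    for p in (0.55, 0.51, 0.501):
        print(f"bias {p}: ~{flips_needed(p):,} flips")
    # A 55/45 coin shows up after a few hundred flips;
    # a 50.1/49.9 one takes a couple of million.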
actually you ought to have selected the number of tests, n, beforehand, rather than see the fluke and, after the fact, "continue testing" until it goes away.
the very moment you peek, your data is tainted for any future testing.
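to make the peeking point concrete, here's a small simulation I'd sketch it with: a perfectly fair coin, re-tested after every batch of flips, gets "discovered" to be biased far more often than the nominal 5%:

    # Why peeking hurts: simulate a fair coin, but run a z-test at p < 0.05 after
    # every extra 100 flips and stop the moment it "looks significant". The 5%
    # false-positive rate only holds if the sample size is fixed in advance.
    import random
    from math import sqrt

    def peeking_finds_bias(max_flips=5000, batch=100, rng=None):
        rng = rng or random.Random()
        heads = flips = 0
        while flips < max_flips:
            heads += sum(rng.random() < 0.5 for _ in range(batch))
            flips += batch
            z = (heads - 0.5 * flips) / sqrt(0.25 * flips)
            if abs(z) > 1.96:          # "significant!" -- stop and publish
                return True
        return False

    rng = random.Random(1)
    trials = 2000
    false_alarms = sum(peeking_finds_bias(rng=rng) for _ in range(trials))
    print(f"fair coins flagged as biased: {false_alarms / trials:.0%}")  # well above 5%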
In other words, if you toss a coin 1,000 times, then it's hideously unlikely that you'll see a run of 100 consecutive heads. But if you toss the coin 100,000,000 times, you shouldn't be too surprised to see that 100-toss run buried in there somewhere, even though the odds of getting 100 in a row are so small.
Right?