Math phobia — Is that evidence for real?
One of the most important scientific and medical research discussions in recent history has been taking place in the British Medical Journal this year. Yet, despite how much the information could help consumers, it has received no media attention. The topic: the misuse and misinterpretation of studies and the adoption of new, more reliable criteria for showing true results. Far too many studies are acted upon and reacted to — by public health officials, medical practitioners and consumers — when credible evidence has not been demonstrated.
Hearing that a study found some food, exposure or physical characteristic is associated with a 5% to 200% higher risk for some health problem can sound like a frightening lot. It’s easy to scare people half to death by citing relative risks that sound big but aren’t actually tenable. Such modest risks (RR = 1.05 to 3.0) don’t go beyond a null finding by more than chance (a roll of the dice or random coincidence) or a mathematical or modeling error, even when they’re reported as “statistically significant” in an underpowered study. Larger increases in risk are less likely to have happened by chance. False positives are also often due to various biases and confounding factors. Regular JFS readers understand that relative risks below 3 aren’t considered tenable, and this knowledge is one of our best defenses against letting the news of the day get our goat. But even these thresholds may be conservative.
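To make concrete how little a headline-sized relative risk can mean in absolute terms, here is a minimal sketch in Python. Every count in it is invented purely for illustration and comes from no study.

```python
# Invented counts illustrating how a "50% higher risk" headline (RR = 1.5) can
# describe a very small absolute difference. None of these numbers are from a study.

exposed_cases, exposed_total = 15, 10_000       # 0.15% of the exposed group affected
unexposed_cases, unexposed_total = 10, 10_000   # 0.10% of the unexposed group affected

risk_exposed = exposed_cases / exposed_total
risk_unexposed = unexposed_cases / unexposed_total

relative_risk = risk_exposed / risk_unexposed        # ~1.5, reported as "50% higher risk"
absolute_difference = risk_exposed - risk_unexposed  # ~0.0005, about 5 extra cases per 10,000 people

print(relative_risk, absolute_difference)
```

The same arithmetic works in reverse: the dramatic-sounding percentage and the tiny absolute difference describe exactly the same data.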
For years now, researchers have found that when an effect proves to be genuine, in real life and in subsequent research, the relative risks are usually extraordinarily larger than these. Just how much larger will come as a surprise to most readers.
Seek and ye shall find

The public, and probably a fair number of doctors, would be amazed to realize just how much studies can be tweaked to reach the findings the investigators want to conclude — even clinical trials. A few months ago, JFS examined several expert reviews of clinical trials published in peer-reviewed journals that found most studies were false. Investigators can manipulate their study design, analyses and reporting in countless ways so that more relationships cross the threshold of statistical significance (p = 0.05), even though they wouldn’t have otherwise. As Dr. John P. A. Ioannidis explained:

Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available “data mining” packages actually are proud of their ability to yield statistically significant results through data dredging. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10^-4 — hardly any higher than the probability we had before any of this extensive research was undertaken!

Population data dredges

Observational studies — you know, the various epidemiological studies that dredge through data on a group of people and use computer models to find correlations with a health outcome — are most rife with misinterpreted statistics, errors and biases, and are most easily manipulated to arrive at whatever conclusions researchers set out to find. These are the studies most popularly reported as the scare of the week.

Epidemiological studies make up the bulk of what’s funded, published and reported nowadays. These computer studies are cheap and easy, and are ideal for those wanting to get published or advance some interest. They are the vehicle for never-ending efforts to link lifestyles, foods or the environment to disease, and a bottomless source of scares. “Over the past 20 years statistical associations have implicated almost every aspect of people's everyday lives in some lethal disease or other,” wrote Dr. James Le Fanu, physician and author of The Rise and Fall of Modern Medicine:

But most of these alleged hazards, about which we read every day in the newspapers, cannot possibly be true. The human organism is — as it has to be [for the species to have survived!] — robust and impervious to small changes in the external world. The notion that subtle alterations in patterns of food consumption or undetectable levels of pollutants can be harmful is contrary to the fundamental laws of human biology.

Sadly, epidemiological studies are also the most poorly understood. The biggest misconception — besides not knowing when a correlation is big enough to suggest a true effect that even deserves our attention (hint: it is much bigger than most people would guess) — is mistaking that correlation for causation. Epidemiological studies can never show causation because they can’t account for all sorts of confounding factors — related factors that may be the true cause. Yet we see this mistake every day.
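As a rough sketch of how a confounder can manufacture an association out of nothing, here is a worked example in Python. The scenario and every number in it are invented for illustration: smoking drives both coffee drinking and heart disease, coffee itself does nothing, yet the naive comparison makes coffee look risky.

```python
# Invented scenario: smoking (the confounder) makes people more likely to drink
# coffee and more likely to develop heart disease; coffee itself has no effect.
# All figures are made up for illustration.

groups = [
    # (label, share of population, fraction who drink coffee, disease risk in group)
    ("smokers", 0.5, 0.8, 0.20),
    ("non-smokers", 0.5, 0.2, 0.05),
]

coffee_drinkers = coffee_cases = 0.0
abstainers = abstainer_cases = 0.0

for _label, share, coffee_rate, risk in groups:
    # Within each group the disease risk is the same with or without coffee.
    coffee_drinkers += share * coffee_rate
    coffee_cases += share * coffee_rate * risk
    abstainers += share * (1 - coffee_rate)
    abstainer_cases += share * (1 - coffee_rate) * risk

crude_relative_risk = (coffee_cases / coffee_drinkers) / (abstainer_cases / abstainers)
print(round(crude_relative_risk, 2))  # about 2.1 -- coffee appears to "double the risk" purely because of smoking
```

A data dredge that never measured smoking would report a statistically significant doubling of risk from coffee, and the headline would write itself.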
This past week, untenable associations (RR = 1.17 to 1.83) between extreme obesity and mortality were found in studies, yet reported in the news as significant, and even doctors and professors leapt to conclude causation and advise weight loss.

STROBE — Giving it to us straight

Before reviewers, medical professionals or the public can begin to critically evaluate epidemiological studies, these studies first need to be reported transparently. But they are rarely reported in a way that enables readers to evaluate them and know what the investigators set out to do, what they found and how the conclusions were reached. These were the findings of an international group of methodologists, researchers and medical editors who began the “Strengthening the Reporting of Observational Studies in Epidemiology” (STROBE) project in 2004. Problems are compounded when those studies are used in other reviews, they found:

The credibility of research depends on a critical assessment by others of the strengths and weaknesses in study design, conduct, and analysis. Transparent reporting is also needed to judge whether and how results can be included in systematic reviews. However, in published observational research important information is often missing or unclear. An analysis of epidemiological studies published in general medical and specialist journals found that the rationale behind the choice of potential confounding variables was often not reported. Only few reports of case-control studies in psychiatry explained the methods used to identify cases and controls. In a survey of longitudinal studies in stroke research, 17 of 49 articles (35%) did not specify the eligibility criteria.

Neurologists Peter M Rothwell and Meena Bhatia from the University Department of Clinical Neurology at Oxford were among those finding the design and reporting of epidemiological studies to be poor, resulting in unreliable results. As they wrote in a BMJ editorial: “Quality control is unlikely to improve in the near future, given the ever increasing number of medical journals, and the consequently reduced influence of peer review on the likelihood that poor quality research will be published.”

As a first step, STROBE sought to tighten up the reporting of these studies and developed a checklist that can be used in evaluating a paper. While the STROBE guidelines, published in the October 20 issue, won’t solve everything, they may help epidemiological studies begin to be interpreted more judiciously. The STROBE experts said observational studies have a role in medical research: in beginning to explore and develop hypotheses on the causes of disease, or the benefits or harms of interventions, for later, more in-depth research; and in helping to detect rare or late adverse effects of treatments in actual practice among the population.

The bigger issues

But these guidelines don’t address two larger issues important for us when looking at an epidemiological study. How do we know when a study has been appropriately applied — or when a randomized clinical trial would be more appropriate for assessing a particular health risk? And how do we know when the results being reported show an actual effect versus statistical noise? When is a risk derived from an epidemiological study significant enough to be credible?

These studies should come with government health warnings, wrote Dr. Guy Lloyd, a cardiologist in Eastbourne.
Epidemiological studies are often used inappropriately for common illnesses like cardiovascular disease and cancer, he said, where randomized controlled trials are more reliable. Epidemiology is most effective in identifying large risks in rare diseases. Just in the field of cardiology, he wrote, “the results of observational studies are often seriously flawed:”

Observational studies of the cardioprotective effects of female sex hormones, the usefulness of antioxidants or homocysteine lowering strategies, and rhythm control for atrial fibrillation suggested a clear treatment effect and greatly influenced practice. But subsequent randomised trials refuted each hypothesis.

The main problem, he explained, is all of the interacting factors among cohorts that can’t be statistically accounted for in an epidemiological study. If we let ourselves decide truths or causations based on epidemiology, and devote limited research resources and funding to continual epidemiological research, it can sidetrack us from following more viable lines of research, needlessly worrying people and costing lives in the process.

When is evidence for real?

Evidence isn’t an absolute notion that can be based simply on statistically-derived findings. It requires clinical knowledge and experience, and carefully weighing all the solid evidence. It also calls upon us to understand when evidence is tenable. Dr. Paul Glasziou, professor at the Centre for Evidence-Based Medicine at the University of Oxford, and colleagues investigated this major issue. They found that higher standards are needed and that the P-values and relative risks popularly used often misguide us.

In the February issue, Dr. Glasziou and colleagues examined the variety of types of evidence doctors use in clinical practice. Clearly, not every treatment decision is based on randomized clinical trials, or always need be, as they wrote:

Our knowledge of the effects of treatments comes from various sources ranging from personal clinical experience to carefully controlled trials. Although we are often wary of inferring the effects of treatments from evidence other than that from randomised controlled trials, we are all familiar with examples of situations in which confident inferences about treatments have been based on other kinds of evidence.

They examined multiple examples of treatments that are widely accepted based on case studies and cohort studies, finding situations where randomized clinical trials may be unnecessary and others where they are necessary. In their evaluation of the strength of a treatment, they examined relative risks using rate ratios: the ratio between how fast a condition responded to an intervention and how fast it progressed without treatment (a simple worked example follows below). Among single case studies, they explained, there is as much noise as signal, because some conditions will resolve spontaneously during a treatment period, and the effectiveness of a treatment is likely to be overestimated and more likely to be reported as a success than as a failure. In other words, anecdotal evidence isn’t reliable. Before generalizing to other patients, data from several carefully evaluated case studies are needed, they said. When a treatment is seen to fail sometimes, then randomized trials are needed to figure out why and to compare the effects of different techniques.
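Here is a minimal sketch of that rate ratio. The condition, the units and the recovery rates are invented purely for illustration and are not taken from the paper.

```python
# Invented example of a rate ratio: how often patients recover while treated,
# compared with how often the condition resolves on its own.

recoveries_per_100_patient_months_treated = 40   # hypothetical treated group
recoveries_per_100_patient_months_untreated = 2  # hypothetical untreated course of the condition

rate_ratio = (recoveries_per_100_patient_months_treated
              / recoveries_per_100_patient_months_untreated)
print(rate_ratio)  # 20.0 -- a condition that rarely resolves untreated but usually does with treatment
```

The larger that ratio, the harder it is for chance, bias or confounding to explain the difference away.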
When a treatment has a big, fast effect in stable or progressive conditions (for example, relieving a large pneumothorax, or collapsed lung; giving glucose for insulin shock; or applying pressure to stop bleeding), it’s pretty easy to predict what the outcomes would be for other patients without treatment. It’s harder to judge the effects of treatments that have gradual or delayed effects. These require bigger rate ratios to be convincing, but can be figured out by blinded observations looking for solid, objective effects over periods of time after treatments. Yet even among procedures that obviously work because fast and significant effects are seen using objective measures, it’s unknown whether they work better than other interventions. These are other situations when randomized trials are needed.

But outcomes of treatments for conditions that fluctuate (arthritic flare-ups, eczema), have spontaneous remissions (colds) or are intermittent (migraines, asthma) are much more unpredictable, and proof becomes more difficult, they said. “Here, individual cases and experience are liable to be misleading as there is as much noise as signal....[C]onfounding is common and often not obvious....In these circumstances, we usually need randomization and other measures to reduce biases in order to distinguish treatment effects from the effects of biases, unless the effect is very large.”

The money shot

Here’s where they get to the nitty gritty.... We often hear that small risks across a large population can have big public health consequences — true, but not when those relative risks aren’t real, are only creations of statistical manipulation, and aren’t the real cause. How large of an effect is large enough to know a treatment really works or a relationship to a health outcome may be credible? As a result of their analysis, the Oxford researchers suggested two rules for defining sufficient differences between treated and untreated patients, or afflicted and unafflicted patients:

(a) that the conventionally calculated probability of the two groups of observations coming from the same population should be less than 0.01 and (b) that the estimate of the treatment effect (rate ratio) should be large. In our examples it was at least 20. Simulations have suggested that implausibly large associations, both between treatment and confounding factor and between confounding factor and outcome, are generally required to explain risks beyond relative rates of 5-10. One empirical study that compared randomly selected control groups in multicentre trials also found that, while modest confounding is very likely, such extremes are unlikely. We therefore suggest that rate ratios beyond 10 are highly likely to reflect real treatment effects, even if confounding factors associated with the treatment may have contributed to the size of the observed associations.

Other criteria help when making inferences about causation and the effectiveness of an intervention, such as a temporal relationship (treatment precedes effect), plausibility (based on known disease mechanisms), consistency (across different settings and methods), a dose-related response, etc. Cohort studies can lead us astray when they’re acted upon based on findings unlikely to be true effects, and before randomized trials have been completed and the totality of the evidence is considered. Yet it happens all of the time, and numerous popular beliefs based on weak observational studies have later been disproven in clinical trials.
As they wrote:

The recent examples of hormone replacement therapy and beta-carotene show how evidence from sources other than randomised trials can lead us badly astray. In both these cases, however, the signal to noise ratio was modest, with relative risks of around 2 (or 0.5, depending on which way the comparison is framed). Relative risks of this order would not meet our requirements for judging a treatment effect to be dramatic.

These researchers are not the first to make these observations and call for higher criteria, either. As Dr. Ioannidis explained:

The smaller the effect sizes in a scientific field, the less likely the research findings are to be true. Power is also related to the effect size. Thus, research findings are more likely true in scientific fields with large effects... Modern epidemiology is increasingly obliged to target smaller effect sizes. Consequently, the proportion of true research findings is expected to decrease. In the same line of thinking, if the true effect sizes are very small in a scientific field, this field is likely to be plagued by almost ubiquitous false positive claims.

While these new guidelines will need further refinement to avoid false conclusions or being caught by other biases like the Texas sharpshooter bias, they can go a long way in helping prevent clinicians, policy makers and the public from jumping to faulty conclusions based on untenable nonrandomized evidence. But relative risks receive the most mainstream media attention (rarely the confidence intervals and p-values), which is why having a rudimentary understanding of them can be especially helpful for the average headline reader. Most worries among the public would already be assuaged if more people simply understood that relative risks less than 3 (a 200% increase) have long been recognized as untenable. Just imagine how many popular fears and health agendas would disintegrate in an instant if the public realized that relative risks less than 10 — that is, less than a 10-fold, or 900%, increase in risk — with p-values above 0.01 are often not real or tenable and are generally explained by confounding factors.

We’re so used to hearing inconsequential relative risks reported as NEWS that we’ve come to believe they are real and worth acting upon. Instead, if we sat back when we hear reports of these small relative risks and waited for the science to work itself out, we’d be a lot less likely to get caught up in the claim of the day and be taken advantage of. This new guideline may sound like an extreme idea, but perhaps not when we look at the relative risks derived in epidemiology that have later proven out in clinical studies, versus all of those that haven’t.

A few examples may help put risks into perspective. “Studies of heavy smoking and lung cancer report a relative risk of about 20; those of aspirin and Reye's syndrome in children report a relative risk of 35,” said Steve Milloy of Junkscience and author of Science Without Sense. The FDA Center for Food Safety and Applied Nutrition reports relative risks for listeriosis associated with raw seafood of 17 among most adults, rising to 20 in the elderly, and of 15 with unpasteurized milk. The relative risk for Kaposi sarcoma associated with HIV infection is 192, and relative risks for non-Hodgkin lymphoma with HIV have been reported as high as 76.4. Among carriers of BRCA1 mutations, the relative risks of breast cancer have been estimated to be 21.6 in women under 40 years of age, 9.6 in women 40-49 years of age and 7.6 in older women.
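Measured against the rate-ratio threshold the Oxford authors suggested, these well-established risks clear the bar by a wide margin, while the typical headline-sized risk does not. Here is a minimal sketch of that screen; the “typical food-scare headline” figure is an invented stand-in, and p-values are left out since only the effect sizes are being compared.

```python
# A rough screen of the relative risks mentioned above against the rate-ratio
# threshold of 10 discussed by Glasziou and colleagues. The last entry is an
# invented stand-in for the usual weekly scare.

REPORTED_RELATIVE_RISKS = {
    "heavy smoking and lung cancer": 20,
    "aspirin and Reye's syndrome in children": 35,
    "HIV infection and Kaposi sarcoma": 192,
    "BRCA1 mutation and breast cancer, under age 40": 21.6,
    "typical food-scare headline": 1.4,
}

THRESHOLD = 10  # suggested minimum rate ratio for a credible, dramatic effect

for association, rr in REPORTED_RELATIVE_RISKS.items():
    verdict = "clears the bar" if rr >= THRESHOLD else "does not clear the bar"
    print(f"{association}: RR {rr} -> {verdict}")
```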
As Dr. Lloyd wrote, in commenting on how cohort studies are often wrong, yet acted upon in isolation:

Glasziou et al suggested that a combined rates ratio of at least 10 and a P value of <0.01 should be used to distinguish between a true effect and background population “noise.” Few of our current favourite targets — mild [sic] obesity, salt intake or passive smoking — would pass this test. The findings of cohort studies should start rather than close the debate. Experts are too hasty to present a hypothesis as a proven fact, and the medical profession is too willing to accept such findings uncritically.
© 2007 Sandy Szwarc