A recent compilation of papers in Nature Medicine from investigators affiliated with the venerable Global Burden of Disease Study purports to tell us what we know about salient health matters - tobacco use, dietary patterns, and various outcomes from minor back pain to myocardial infarction - and how reliably we know it.
In the service of their objective, they offer us a 5-star rating system; the more stars, the more reliably we know what we thought we knew. They refer to all of this under the rubric: “burden of proof studies.” Media attention has been generous.
I appreciate the intentions of those involved in this, having no reason to think them other than virtuous. The Global Burden of Disease Study is nothing less than a treasure trove, and I cite it routinely. But the idea that there is some universal, standard “burden of proof” – a term, obviously, borrowed from the legal system, where innocence is presumed and guilt must be proved beyond reasonable doubt – attached to everything humans can, and do, know is, in a word, preposterous.
What, for instance, is the burden of proof in securing the relationship between putting your hand into the flames of a campfire, and getting burned? There are no studies on the topic, randomized or otherwise, that I can find. Failing, thus, to meet entry-level criteria for that standard burden, are we all now invited to doubt what we thought we knew regarding fire and burns?
My example, silly as it seems, is powerful exactly because it is trivial. If we can so obviously know something in the absence of any targeted “science,” it is all the proof we need that science is in no position to tell us there is just one way for us to know something, and just one way to measure how reliably we know it.
The authors are right that confidence matters. We all have decisions to make- as individuals, and as citizens configured into the body politic- and decisions are well served by priorities. Priorities, in turn, are well served by confidence. When, for instance, we know with confidence that something is harmful or beneficial, we are more likely to prioritize avoidance, or adoption, respectively- than if we have only the vaguest hints of such effects.
But the merits in this effort end there, unfortunately. By applying not just a single scale of stars, but a single means of grading evidence that is simply not suited to all important questions, the authors’ own results challenge, and arguably undermine, their contentions. They accord the same 3 stars to the relationship between smoking and heart disease, where mechanistic pathways are clearly known, as to that between smoking and low back pain- where the causal pathway would be…anybody’s guess.
Based on exchanges with expert colleagues- exchanges not easily translated into facile column-speak- we may note that there are serious questions about the subtleties of the statistical methodology here, too. From altitude, there is the fundamental problem that causal relationships vary by context. Recall that these authors have been examining the “global” burden of disease. There are many factors in common across countries, but many that diverge.
For example, the association between beef intake and adverse health outcomes (to say nothing of calamitous impacts on planetary health) in the many countries around the world where meat consumption is high (ours- the U.S.- is notoriously near the top of that list) is perfectly clear and robustly evidence-based. But in countries prone to famine and protein insufficiency, individuals affluent or resourceful enough to acquire meat would, of course, fare better than their counterparts. Mix these two populations together and you have the potential for gibberish- like studying the benefits of swimming as exercise in a mixed group of those who do, and don’t, know how to swim. The former would benefit, the latter would drown- and you might lump those together (i.e., good half the time, bad half the time) and arrive at “0” net effect.
In this case, the authors handled such “heterogeneity” as if it were merely a statistical factor, and on that basis, whenever they found it- it reduced the star rating. But if “A” improves health in population “1” because that population is deficient in “A,” while impairing it in population “2” that is awash in “A,” that is not statistical noise; it is the ineluctable importance of context. That is probably obvious to everyone other than a statistician. Fortunately, statisticians have means to deal with it, too, by accounting for what is known as “effect modification;” but they must first choose that option.
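The swimming analogy can be made concrete with a minimal numeric sketch (all numbers hypothetical): when exposure “A” helps one population and harms another, pooling the two averages real, opposite effects into an apparent null.

```python
# A minimal sketch of effect modification, with made-up numbers:
# exposure "A" helps a deficient population and harms a replete one,
# so pooling the two averages the real effects away toward zero.

def mean_effect(effects):
    return sum(effects) / len(effects)

# Hypothetical per-person health effects of exposure "A"
deficient_pop = [+2.0, +1.5, +2.5, +2.0]   # population 1: lacks "A", benefits
replete_pop   = [-2.0, -1.5, -2.5, -2.0]   # population 2: awash in "A", is harmed

# Stratified (context-aware) estimates reveal opposite, real effects
print(mean_effect(deficient_pop))                 # +2.0
print(mean_effect(replete_pop))                   # -2.0

# Pooled estimate, ignoring context: a "0" net effect, i.e., gibberish
print(mean_effect(deficient_pop + replete_pop))   # 0.0
```

Stratifying by context- the statistician’s “effect modification”- recovers both real effects; pooling buries them.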
There is another potential source of statistical “noise” in the “burden of proof” methods that is readily explained. Consider that for most of us - parents, grandparents, teachers, and so on- the question “is it dangerous for kids to run with scissors?” – in addition to having a rather obvious answer - is a pertinent, reasonable, and complete query. But the reductionisms of meta-analysis would invite the following concerns: running where, and in what manner?; “with” scissors how, exactly? (i.e., in the left hand, in the right hand, in a pocket, etc.?); what brand of scissors?; what specific injury?
Leaving aside the apparent fact that there is no science whatsoever addressing the presumed dangers of running with scissors, the “burden of proof” would view running with sewing shears in the left hand as distinct from running with craft scissors in the right, to say nothing of injury to left leg or right arm. All of these are sources of heterogeneity, and such variation would bias whatever “proof” we could assemble toward the null. The burden of proof entirely unmet, we are left to wonder: was running with scissors ever really a bad idea?
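The cost of this subdivision can be sketched with made-up numbers: the standard error of an estimate scales as 1 over the square root of sample size, so carving one pertinent question into a hundred “distinct” ones widens each stratum’s uncertainty until a real harm no longer looks “proven.”

```python
# A sketch, with hypothetical numbers, of how slicing one question into
# many narrow subgroups (which scissors? which hand? which injury?)
# dilutes the evidence for a real effect in every stratum.

import math

true_effect = 1.0   # assumed real harm (arbitrary units)
sd = 2.0            # assumed person-to-person variability
total_n = 400       # total observations available

def standard_error(n):
    # standard error of a mean shrinks as 1/sqrt(n)
    return sd / math.sqrt(n)

# One pertinent, complete question, using all the data:
se_pooled = standard_error(total_n)             # 0.1
print(true_effect / se_pooled)                  # z = 10.0 -> unmistakable

# The same data split into 100 "distinct" questions:
subgroups = 100
se_each = standard_error(total_n / subgroups)   # 1.0
print(true_effect / se_each)                    # z = 1.0 -> "unproven" everywhere
```

The harm is unchanged; only the slicing differs- yet no single stratum meets any conventional evidentiary bar.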
Let’s leave the statistics, and scissors, there, and move on.
One aspect of nutrition is of unique and particular importance. Eating more of X as a proportion of some total means eating less of Y. Separate analysis of meat intake, vegetable intake, and so on is apt to underestimate the effects of overall dietary pattern and quality. Eating more meat tends to mean eating fewer legumes, whole grains, and vegetables; eating more legumes generally means less meat. The health effects of what is eaten, and what it is displacing, are conjoined- but the methods applied here overlook, and distort, that reality.
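A toy model (all coefficients hypothetical) makes the substitution point: within a fixed total intake, the apparent “effect of meat” is really the effect of meat plus the effect of whatever it displaces.

```python
# A toy substitution model with hypothetical coefficients: total daily
# servings are fixed, so eating more meat necessarily means eating less
# of something else (here, legumes).

TOTAL_SERVINGS = 10  # fixed daily total

def health_score(meat, legumes):
    # hypothetical per-serving effects: meat -1, legumes +1
    return -1.0 * meat + 1.0 * legumes

def swap_one_serving_to_meat(meat):
    # add a serving of meat; the displaced serving comes out of legumes
    legumes = TOTAL_SERVINGS - meat
    before = health_score(meat, legumes)
    after = health_score(meat + 1, legumes - 1)
    return after - before

# Each serving of meat that displaces legumes costs 2 points, not 1:
# -1 for the meat itself, and -1 for the legumes no longer eaten.
print(swap_one_serving_to_meat(3))   # -2.0
```

Analyzing meat “in isolation” would credit only half of that conjoined effect- precisely the distortion described above.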
In general, there are several factors that inform our reliable understanding. One is the recognition of relevant mechanisms. Another is consistency of apparent effect. A third is the implication of counterfactuals.
Counterfactuals are so fundamental to how humans “know” that they form the very bedrock of work in artificial intelligence. In simplest form, they go like this: there was some exposure to “A,” and then “B” happened; would “B” have happened had there not been exposure to “A”? A macabre example might be: a young man bleeds out on the street, or in the emergency department following a gunshot wound to the chest. But for the bullet hole, would there have been a hemorrhage? A “no” to this question- based on relevant mechanism and the consistency of such associations- is clearly all the burden of proof we need. It is not the job of science to devise criteria to invalidate what we actually do know very reliably.
This has run long, even though I’ve left out many of my musings; let’s draw it to a close. Do you, or do you not, know that looking both ways before crossing a busy street is a good idea? What is the “burden of proof” required to justify your conviction? In so homely an example, we see readily that sometimes the burden of proof is rather light, because no sophisticated methods of science are required to confirm what is self-evident by mere observation. So it goes for crossing busy streets, running with scissors, or putting our hands into a flame.
I am not suggesting that important truths about nutrition, or tobacco, or physical activity are as self-evident as these; but I am arguing that enthroning a universal “burden of proof” metric- built on some preferred set of study methods and a grading system- as the one, great arbiter of what we know is intrinsically flawed.
By the bright light of day, we humans know a great deal about what matters most to our daily routines predicated on simple sense and the power of observation. We know a great deal more courtesy of science and its capacity to extend us beyond our unaided senses. We know as well that context is not the mere statistical noise of heterogeneity; context genuinely matters. We know that across these expanses of sense and science, no single, standard burden of proof prevails, nor possibly could.
In comparison to all this shared illumination, the light of several stars, while perhaps lovely, is unlikely to help us see much.
David L. Katz, MD, MPH, FACPM, FACP, FACLM, is the Founding Director (1998) of Yale University’s Yale-Griffin Prevention Research Center, and former President of the American College of Lifestyle Medicine. He has published roughly 200 scientific articles and textbook chapters, and 15 books to date, including multiple editions of leading textbooks in both preventive medicine, and nutrition. He has made important contributions in the areas of lifestyle interventions for health promotion; nutrient profiling; behavior modification; holistic care; and evidence-based medicine. David earned his BA degree from Dartmouth College (1984); his MD from the Albert Einstein College of Medicine (1988); and his MPH from the Yale University School of Public Health (1993). He completed sequential residency training in Internal Medicine, and Preventive Medicine/Public Health. He is a two-time diplomate of the American Board of Internal Medicine, and a board-certified specialist in Preventive Medicine/Public Health. He has received two Honorary Doctorates.