The idea of "statistical significance" has been a basic concept in introductory statistics courses for decades. If you spend any time looking at quantitative research, you will often see in tables of results that certain numbers are marked with an asterisk or some other symbol to show that they are "statistically significant."
For the uninitiated, "statistical significance" is a way of summarizing whether a certain statistical result is likely to have happened by chance, or not. For example, if I flip a coin 10 times and get six heads and four tails, this could easily happen by chance even with a fair and evenly balanced coin. But if I flip a coin 10 times and get 10 heads, this is extremely unlikely to happen by chance. Or if I flip a coin 10,000 times, with a result of 6,000 heads and 4,000 tails (essentially, repeating the 10-flip coin experiment 1,000 times), I can be quite confident that the coin is not a fair one. A common rule of thumb has been that if the probability of an outcome occurring by chance is 5% or less--in the jargon, has a p-value of 5% or less--then the result is statistically significant. However, it's also pretty common to see studies that report a range of other p-values like 1% or 10%.
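For readers who like to check the arithmetic, here is a minimal sketch of the coin-flip examples above. It assumes a perfectly fair coin and computes a one-sided probability (that many heads or more), which is only one of several ways to turn "happened by chance" into a number. The function name tail_prob_at_least is made up for this illustration.

```python
from math import comb


def tail_prob_at_least(heads: int, flips: int) -> float:
    """Chance of at least `heads` heads in `flips` tosses of a fair coin."""
    # Count the equally likely head/tail sequences with `heads` or more heads,
    # then divide by the total number of sequences, 2**flips.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips


# Six or more heads in 10 flips: unremarkable for a fair coin.
print(round(tail_prob_at_least(6, 10), 3))    # 0.377

# Ten heads in 10 flips: rare, about 1 chance in 1,024.
print(round(tail_prob_at_least(10, 10), 5))   # 0.00098

# 6,000 or more heads in 10,000 flips: a vanishingly small probability.
print(tail_prob_at_least(6000, 10000))
```

The share of heads is 60% in both the first and third cases; only the number of flips differs, and that alone moves the result from "could easily be chance" to "almost certainly not chance."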
Given the omnipresence of "statistical significance" in pedagogy and the research literature, it was interesting last year when the American Statistical Association made an official statement "ASA Statement on Statistical Significance and P-Values" (discussed here) which includes comments like: "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. ... A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. ... By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis."
Now, the ASA has followed up with a special supplemental issue of its journal The American Statistician on the theme "Statistical Inference in the 21st Century: A World Beyond p < 0.05" (January 2019). The issue has a useful overview essay, "Moving to a World Beyond “p < 0.05”," by Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar. They write:
We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way. Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. ... In sum, `statistically significant'—don’t say it and don’t use it.
The special issue is then packed with 43 essays from a wide array of experts and fields on the general theme of "if we eliminate the language of statistical significance, what comes next?"
To understand the arguments here, it's perhaps useful to have a brief and partial review of some main reasons why the emphasis on "statistical significance" can be so misleading: namely, it can lead one to dismiss useful and true connections; it can lead one to draw false implications; and it can cause researchers to play around with their results. A few words on each of these.
The question of whether a result is "statistically significant" is related to the size of the sample. As noted above, 6 out of 10 heads can easily happen by chance, but 6,000 out of 10,000 heads is extraordinarily unlikely to happen by chance. So say that you do a study which finds an effect which is fairly large in size, but where the sample size isn't large enough for it to be statistically significant by a standard test. In practical terms, it would be foolish to ignore this large result; instead, you should presumably start trying to find ways to run the test with a much larger sample size. But in academic terms, the study you just did may be unpublishable: after all, a lot of journals will tend to decide against publishing a study with negative results--a study that doesn't find a statistically significant effect.
Knowing that journals are looking to publish "statistically significant" results, researchers will be tempted to look for ways to jigger their results. Studies in economics, for example, aren't about simple probability examples like flipping coins. Instead, one might be looking at Census data on households that can be divided up in roughly a jillion ways: not just the basic categories like age, income, wealth, education, health, occupation, ethnicity, geography, urban/rural, during recession or not, and others, but also various interactions of these factors looking at two or three or more at a time. Then, researchers make choices about whether to assume that connections between these variables should be thought of as linear relationships, curved relationships (curving up or down), relationships that are U-shaped or inverted-U, and others. Now add in all the different time periods and events and places and before-and-after legislation that can be considered. For this fairly basic data, one is quickly looking at thousands or tens of thousands of possible relationships.
Remember that the idea of statistical significance relates to whether something has a 5% probability or less of happening by chance. To put that another way, it's whether something would have happened only one time out of 20 by chance. So if a researcher takes the same basic data and looks at thousands of possible equations, then even if no real relationships exist, roughly one in 20 of those equations will clear the 5% threshold by chance alone: dozens of results for every thousand equations tried. When there are thousands of researchers acting in this way, there will be a steady stream of hundreds of results every month that appear to be "statistically significant," but are just a result of the general situation that if you try enough equations, some of them will look significant by chance.
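A small simulation makes the one-in-20 logic concrete. This is only a sketch under simplifying assumptions: each "equation" is a batch of pure random noise with no true effect at all, the tests are independent of each other, and a textbook two-sided z-test with a known standard deviation stands in for whatever test a real study would use. The function name and default settings are invented for this illustration.

```python
import random
from statistics import NormalDist, fmean


def count_false_positives(num_specs=2000, sample_size=50, alpha=0.05, seed=0):
    """Test many pure-noise 'equations' and count how many look significant."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    std_normal = NormalDist()          # standard normal, mean 0 and sd 1
    hits = 0
    for _ in range(num_specs):
        # Data with no real effect: draws from a normal with true mean zero.
        noise = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        # z-statistic for the null hypothesis "the true mean is zero".
        z = fmean(noise) * sample_size ** 0.5
        # Two-sided p-value: chance of a z at least this extreme under the null.
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            hits += 1
    return hits


hits = count_false_positives()
print(f"{hits} of 2000 pure-noise tests came out 'significant' at the 5% level")
```

With 2,000 tests and a 5% threshold, the expected count is about 100, even though by construction there is nothing to find. A researcher who reports only those results, without mentioning the roughly 1,900 that went nowhere, would have a long list of "findings."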
A classic statement of this issue arises in Edward Leamer's 1983 article, "Taking the Con out of Econometrics" (American Economic Review, March 1983, pp. 31-43). Leamer wrote:
The econometric art as it is practiced at the computer terminal involves fitting many, perhaps thousands, of statistical models. One or several that the researcher finds pleasing are selected for reporting purposes. This searching for a model is often well intentioned, but there can be no doubt that such a specification search invalidates the traditional theories of inference. ... [I]n fact, all the concepts of traditional theory, utterly lose their meaning by the time an applied researcher pulls from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose. The consuming public is hardly fooled by this chicanery. The econometrician's shabby art is humorously and disparagingly labelled "data mining," "fishing," "grubbing," "number crunching." A joke evokes the Inquisition: "If you torture the data long enough, Nature will confess" ... This is a sad and decidedly unscientific state of affairs we find ourselves in. Hardly anyone takes data analyses seriously. Or perhaps more accurately, hardly anyone takes anyone else's data analyses seriously.
Economists and other social scientists have become much more aware of these issues over the decades, but Leamer was still writing in 2010 ("Tantalus on the Road to Asymptopia," Journal of Economic Perspectives, 24: 2, pp. 31-46):
Since I wrote my “con in econometrics” challenge much progress has been made in economic theory and in econometric theory and in experimental design, but there has been little progress technically or procedurally on this subject of sensitivity analyses in econometrics. Most authors still support their conclusions with the results implied by several models, and they leave the rest of us wondering how hard they had to work to find their favorite outcomes ... It’s like a court of law in which we hear only the experts on the plaintiff’s side, but are wise enough to know that there are abundant for the defense.
Taken together, these issues suggest that a lot of the findings in social science research shouldn't be believed with too much firmness. The results might be true. Or they might be a result of a researcher pulling out "from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose." And given the realities of real-world research, it seems goofy to say that a result with, say, only a 4.8% probability of happening by chance is "significant," while if the result had a 5.2% probability of happening by chance it is "not significant." Uncertainty is a continuum, not a black-and-white difference.
So let's accept that the "statistical significance" label has some severe problems, as Wasserstein, Schirm, and Lazar write:
Timothy Taylor is an American economist. He is managing editor of the Journal of Economic Perspectives, a quarterly academic journal produced at Macalester College and published by the American Economic Association. Taylor received his Bachelor of Arts degree from Haverford College and a master's degree in economics from Stanford University. At Stanford, he was winner of the award for excellent teaching in a large class (more than 30 students) given by the Associated Students of Stanford University. At Minnesota, he was named a Distinguished Lecturer by the Department of Economics and voted Teacher of the Year by the master's degree students at the Hubert H. Humphrey Institute of Public Affairs. Taylor has been a guest speaker for groups of teachers of high school economics, visiting diplomats from eastern Europe, talk-radio shows, and community groups. From 1989 to 1997, Professor Taylor wrote an economics opinion column for the San Jose Mercury-News. He has published multiple lectures on economics through The Teaching Company. With Rudolph Penner and Isabel Sawhill, he is co-author of Updating America's Social Contract (2000), whose first chapter provided an early radical centrist perspective, "An Agenda for the Radical Middle". Taylor is also the author of The Instant Economist: Everything You Need to Know About How the Economy Works, published by the Penguin Group in 2012. The fourth edition of Taylor's Principles of Economics textbook was published by Textbook Media in 2017.