Scientists call for more replicability and better publishing criteria to solve the p-value debate

By Myriam Vidal Valero. Mentored and edited by Christina Couch.

In the spring of 2019, the American Statistical Association published an editorial that pushed for scientists and policymakers to stop using the term “statistically significant,” asserting that the term, along with the widespread practice of using p-values to determine whether the results of an experiment are consequential, “has today become meaningless.” A year later, an ASA task force clarified that the article wasn’t an official policy and that the use of p-values and significance testing shouldn’t be abandoned.

At the American Association for the Advancement of Science annual meeting last month, a panel of statisticians resurrected the debate, underscoring the importance of p-values and significance testing. Some asserted that scientists need to create better ways of explaining their methodologies and emphasized the need for replicability studies, which, they say, are disincentivized in the publishing world.

Journals use the p-value, which estimates how likely a result at least as extreme as the one observed would be if there were no real effect, as a metric for research publication. Scientific results are often publishable only if p-values fall below 5% (0.05) or 1% (0.01), depending on the journal.
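
To make that concrete, here is a minimal sketch, in Python with the SciPy library, of how such a p-value is commonly produced with a two-sample t-test and compared against the 0.05 cutoff; the measurements below are invented for illustration and are not taken from any study mentioned in this article.

```python
# Illustrative only: made-up measurements for two hypothetical groups.
from scipy import stats

control = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.9]
treated = [5.6, 5.4, 5.8, 5.5, 5.3, 5.7, 5.6, 5.4]

# A two-sample t-test returns a test statistic and a p-value: how likely a
# difference at least this large would be if the two groups truly did not differ.
result = stats.ttest_ind(control, treated)
print(f"p-value: {result.pvalue:.4f}")
print("below the 0.05 threshold:", result.pvalue < 0.05)
```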

However, critics argue, the p-value can change depending on the size of the sample. Using it as a measure of success can lead to false positive and false negative results, in some instances misleading people about the value of scientific findings. It can also make experiments difficult to replicate when papers report only the p-values that fall below the .05 threshold.
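
A rough simulation makes both criticisms concrete (Python with NumPy and SciPy; all values here are invented for illustration): the same modest effect can cross or miss the .05 line depending only on how many samples are collected, and when there is no real effect at all, roughly 5 percent of experiments still come out “significant.”

```python
# Illustrative simulation of two criticisms of p-value thresholds.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 1) The same true effect (a mean shift of 0.3) gives different p-values
#    depending on sample size, so "significance" can hinge on n alone.
for n in (20, 200, 2000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.3, 1.0, n)
    p = stats.ttest_ind(a, b).pvalue
    print(f"n = {n:4d}  p = {p:.4f}  significant at .05: {p < 0.05}")

# 2) 1,000 experiments comparing two truly identical groups: about 5% of
#    them still fall below the .05 threshold, i.e., false positives.
false_positives = sum(
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue < 0.05
    for _ in range(1000)
)
print(f"'significant' results with no real effect: {false_positives} out of 1000")
```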

AAAS panelists Donald Macnaughton, statistician and president of MatStat Research Consulting Inc.; Karen Kafadar, a statistician at the University of Virginia; and Yoav Benjamini, a statistician at Tel Aviv University, dove deep into this controversy.

Macnaughton expressed the necessity of using statistical significance in science, stating that people who ditch p-values might be unaware of the usefulness of threshold values in balancing the rates of false positive and false negative results.

If used correctly, p-values can help determine whether an experiment’s results show a real difference or only an apparent one.

“These people may also be unaware that scientific journal readers are generally not interested in reading about negative results,” Macnaughton said.

The debate over using the p-value to determine significant findings is further complicated by the publish-or-perish culture scientists work in, which heavily incentivizes producing novel studies rather than replications, said Penny S. Reynolds, a researcher in statistical experimental design at Virginia Commonwealth University who was not on the AAAS panel. “You've got to have publications,” she said, adding that publications (or the lack thereof) can advance or stall a scientist’s career. Experimental results are “difficult to publish unless it's really new, it's novel,” she said. Reynolds emphasized that scientists need better training in statistics and in how to use the p-value. Throughout her career, she has found a lot of “poorly reported” studies.

Macnaughton said that despite the limitations of p-values, journals need statistical significance as a publishing parameter. Without it, a lot of papers reporting weak evidence and false positive results would be published, worsening the replication crisis.

“It would lead to open season, the Wild West, people not thinking hard about specifying hypotheses or declaring every little finding as interesting,” Kafadar added.

Kafadar and Benjamini also recommended that, instead of ditching p-values entirely, journals set their own p-value parameters, which might vary by discipline. It’s unclear whether this arrangement would be feasible or beneficial. Macnaughton argued that a paper-by-paper p-value analysis would be time-consuming and prone to bias. But, in Kafadar’s opinion, there are multiple ways of doing things, and the idea should be investigated further.

Science shouldn’t give up on the p-value, the three panelists asserted. Still, “papers should be considered for publication with all their merits, not simply because of a threshold like p less than .05 may or may not have been met,” said Kafadar.

Myriam Vidal Valero is a freelance science journalist from Mexico City. Her primary interests are health, the environment, and scientific policy, although she loves to report about any scientific story that feeds her curiosity and helps social justice. She has written for The New York Times, Science, Slate and The Open Notebook, among others. She also received the 2019 Rosalynn Carter Fellowship for Mental Health Journalism and is a member of the NASW, the Mexican Network of Science Journalists, and the National Association of Hispanic Journalists. She’s currently pursuing a M.A. at the Craig Newmark Graduate School of Journalism at CUNY. Follow her on Twitter @myriam_vidalv or email her at myriam.vidalvalero@gmail.com.

March 2, 2022
