Cover Story: The Fading Bright Line of p < 0.05
The Debate Over Statistical Significance
By Debra L. Beck

CardioSource WorldNews | The innocent p value has been abused to within an inch of its life. It’s been misused, misinterpreted, miscommunicated, and misunderstood. But missing it is not, as researchers, journal editors, journalists, and readers grow ever more dependent on a simple p value to make snap judgments of good or bad. Yet using the p value as it has been used in traditional research fundamentally miscasts the metric and undermines the validity and reproducibility of the science it is being asked to prove.

In an effort to shed light on the plight of the poor p value and the damage its misuse appears to be doing to the scientific endeavor, the American Statistical Association (ASA), for the first time in its 177-year history, released a position statement on a specific matter of statistical practice.1

“We are really looking to have a conversation […] with lots of people about how to take what we know about what is good using p values and what is not,” said ASA executive director Ronald L. Wasserstein, PhD, in an interview with CardioSource WorldNews. What they want, he added, is to “more effectively pass that along to scientists everywhere so that we can do a better job of making inferences using statistics.”

After months of debate among a group of experts representing a wide variety of viewpoints, the ASA released its statement in March 2016, conceding that the hard-fought ‘agreement’ among the experts breaks no new ground but rather frames some of the issues that have been debated for years. The statement comprises six guiding principles with accompanying explanations that seek to bring clarity to the issue. (See Six P-rinciples sidebar.)

According to the ASA, “The issues touched on here affect not only research, but research funding, journal practices, career advancement, scientific education, public policy, journalism, and law.” The question (and we don’t mean hypothesis-generating): how can the system be significantly changed?

p < 0.05: Not a Yes, Rather a Firm “Maybe”

“P values are a tremendously valuable tool,” said Stuart Pocock, MSc, PhD, from the London School of Hygiene and Tropical Medicine, in an interview with CSWN. “There are those who think we should get rid of them and one journal actually banned p values for a while, which is a bit over the top. The change needs to be around how we evaluate p values.”

Dr. Pocock is a leading name in statistics in cardiology and has collaborated on numerous important trials. In late 2015, he authored a four-part practical guide to the essentials of statistical analysis and reporting of randomized clinical trials (RCTs) in the Journal of the American College of Cardiology,2-5 which included extensive discussion of statistical controversies in RCT reporting and interpretation.

Because the masses of people who read scientific papers understand statistics less well than statisticians do (and, let’s face it, most readers are way too busy and distracted to dig deeply into the science of statistical probability), a clear trend has been to reduce all data analysis down to whether the p value was significant. This kind of “bright line” (such as p < 0.05) conclusion making, said the ASA, can lead to poor choices. “A conclusion does not immediately become ‘true’ on one side of the divide and ‘false’ on the other.”

Rather, “there is a difference between there being a bright line at which at some point you have to make a decision and there being a bright line about how much you are learning from a particular piece of evidence,” said Dr. Wasserstein.

Ironically, the p value was never meant to determine good from bad or true from false. When first introduced in the 1920s by the British statistician Ronald Fisher in his book Statistical Methods for Research Workers, the idea was simply to use the test to determine whether the data were worthy of further analysis and not a product of randomness. (It probably didn’t help his argument when his effort was dubbed Fisher’s exact test; but, face it, Fisher’s worthiness test doesn’t have quite the same ring of authority.)

According to Fisher, the p value is “the probability of the observed result, plus more extreme results, if the null hypothesis were true.” There’s that word “probability” again.
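For readers who like to see that definition in action, here is a rough sketch (our toy example, not Fisher’s and not from the ASA statement): count how often a fair coin, tossed 100 times, produces a result at least as extreme as a hypothetical observed 60 heads.

```python
# Toy illustration of the definition above (our example): the p value is the
# probability, assuming the null hypothesis is true, of a result at least as
# extreme as the one observed.
import random

observed_heads = 60      # hypothetical observed result
n_tosses = 100           # null hypothesis: a fair coin
n_sims = 100_000

at_least_as_extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    # two-sided: at least as far from the expected 50 heads as 60 is
    if abs(heads - 50) >= abs(observed_heads - 50):
        at_least_as_extreme += 1

print("simulated p value:", at_least_as_extreme / n_sims)  # roughly 0.057
```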

“Fisher offered the idea of p values as a means of protecting researchers from declaring truth based on patterns in noise,” wrote Andrew Gelman, PhD, and Eric Loken, PhD, in American Scientist.6 The title of their 2014 paper: The Statistical Crisis in Science. “In an ironic twist, p values are now often used to lend credence to noisy claims based on small samples.”

Dr. Gelman is the director of the Applied Statistics Center at Columbia University; Dr. Loken is a research associate professor at Pennsylvania State University.

Who Crowned 0.05 King Anyway?

Importantly, we need to keep in mind that the p value is not the probability of the null hypothesis being true. Nor is it correct to say that if the p value is greater than 0.05, then the null hypothesis is true. (Dr. Pocock calls this widespread misinterpretation “utter rubbish.”) Rather, a nonsignificant p value means that the data have provided little or no evidence that the null hypothesis is false. Certainly, when the p value is very close to the standard 0.05 cutoff, it’s easy to see that further analysis or data are needed to differentiate whether there is a null effect or just a small effect.

Nor is it necessarily true that p values provide a measure of the strength of the evidence against the null hypothesis. Some argue that while a p value of 0.05 does not provide strong evidence against the null hypothesis, a p value < 0.001 does. However, others have cautioned that because p values are dependent on sample size, a p value of 0.001 should not be interpreted as providing more support for rejecting the null hypothesis than one of 0.05.
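A quick illustration of that sample-size dependence (our hypothetical numbers, not drawn from any study cited here): hold a modest effect of 0.2 standard deviations fixed and watch the p value change with nothing but the number of participants.

```python
# Hypothetical numbers, for illustration only: the same 0.2-standard-deviation
# effect crosses from "nonsignificant" to "highly significant" purely because
# the sample grows.
import math

def two_sided_p(z):
    # two-sided p value for a standard normal test statistic
    return math.erfc(abs(z) / math.sqrt(2))

effect_in_sd = 0.2                    # assumed true effect size
for n in (50, 200, 1000):
    z = effect_in_sd * math.sqrt(n)   # z statistic for a one-sample test
    print(f"n = {n:4d}   z = {z:.2f}   p = {two_sided_p(z):.4f}")
# n =   50   z = 1.41   p = 0.1573
# n =  200   z = 2.83   p = 0.0047
# n = 1000   z = 6.32   p = 0.0000
```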

“Making decisions with such limited flexibility is usually neither realistic nor prudent,” wrote Journal of the American Medical Association senior editor Demetrios N. Kyriacou, MD, PhD, from Northwestern University Feinberg School of Medicine, Chicago, IL, in a recent editorial.7

“For example, it would be unreasonable to decide that a new cancer medication was ineffective because the calculated p value from a phase II trial was 0.051 and the predetermined level of statistical significance was considered to be less than 0.05.”

Added Dr. Wasserstein: “Part of our message is that p = 0.049 is not qualitatively different in any respect from p = 0.05 or 0.051.” Furthermore, what sometimes appears to happen when the data yield a p = 0.05 or 0.051 “is to play with your analysis a little bit until you get something a little safer.”

If we fix the p value issue, he noted, that tinkering wouldn’t be necessary. Say, the result garners a p = 0.08, but the effect size is “massive,” then, he said, “you still have something to talk about, you don’t have anything to excuse anymore, and you don’t have to try to talk your way out of p = 0.08.”

Get More Sophisticated

When considering p values, we need not be constricted by “rigid thinking,” according to Dr. Kyriacou. Instead, consider the options. One such option is Bayesian inference, which is particularly useful when, for example, the data show an association between a particular exposure and a specific health-related outcome. Utilizing this method, the investigator can infer the possibility of a causal relationship using the data in conjunction with data from prior similar studies.

Bayes factors have been proposed as more principled replacements for p values. Think of them as a mini deadlift test, measuring the strength of the relative evidence. Technically (and pardon the jargon), they are a weighted average likelihood ratio, representing the weight of evidence for competing hypotheses. Bayes factors represent the degree to which the data shift the relative odds between two hypotheses.
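In symbols (standard Bayesian bookkeeping, our summary rather than anything spelled out in the ASA statement), the Bayes factor is the ratio of how well each hypothesis predicts the observed data D, and it is the multiplier that turns prior odds into posterior odds:

```latex
% Standard Bayesian notation (our sketch, not quoted from any source cited
% here). P(D|H1) is averaged over the prior for the parameter under H1,
% which is why the Bayes factor is a weighted average likelihood ratio.
\[
  \mathrm{BF}_{10} = \frac{P(D \mid H_1)}{P(D \mid H_0)},
  \qquad
  \underbrace{\frac{P(H_1 \mid D)}{P(H_0 \mid D)}}_{\text{posterior odds}}
  = \mathrm{BF}_{10} \times
  \underbrace{\frac{P(H_1)}{P(H_0)}}_{\text{prior odds}}
\]
```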

According to Dr. Kyriacou, calculating Bayes factors requires more computational steps than the p value and the technique is not as widely known. Additionally, using Bayesian inference requires that prior information is known and precisely quantified, and Bayes factors have been criticized as being biased towards the null hypothesis in small samples.

Regardless of the technique(s) used, experts tend to agree that automatic dichotomized hypothesis testing using a prearranged level of significance (i.e., p < 0.05) needs to be supplemented (not supplanted) by more sophisticated methods, which might include effect sizes and a range of other tests; plus, some straightforward scientific reasoning and judgment, even if that injects a certain subjectivity. Or, as Stone and Pocock put it: “A p value is no substitute for a brain.”8 They used that statement as a reminder that interpretation of a seemingly “positive” trial rests on more than just a significant p value.

Cherry-picking p Values

Ever notice how the abstract for a ‘positive’ journal article often doesn’t show any non-significant results? Some may call it keeping an abstract to the journal’s specified length, but the reality is it’s cheating.

John P. A. Ioannidis, MD, DSc, has been decrying bad statistics in biomedical research for a long time now, starting with a 2005 article refreshingly titled: “Why most published research findings are false.”9 Dr. Ioannidis is co-director of the Meta-Research Innovation Center at Stanford (METRICS) and holds the C.F. Rehnborg Chair in Disease Prevention at Stanford University.

In a recent JAMA article, Dr. Ioannidis and colleagues studied how p values are reported in abstracts and full text of biomedical research articles over the past 25 years.10 They used automated text mining to identify more than 4.5 million p values in 1.6 million Medline abstracts and some 3.4 million p values from more than 385,000 full-text articles. In addition, the researchers manually assessed reporting of p values in 1,000 sample abstracts (analyzing only the 796 that reported empirical data) and 100 full-text articles.

The resulting abstract proved to be the most technical and dense opening this writer has seen in a long time, but the bottom line was clear: they found a strong “selection bias” towards significant p values in the abstracts versus the text of the study. They also found that, in abstracts, p values of 0.001 or less were “far more commonly reported than values of 0.05.”

Abstracts “appear to provide a somewhat distorted picture of the evidence,” wrote the authors, particularly as “many readers focus primarily on the abstracts.” The tendency to cherry pick lower p values was particularly evident in meta-analyses and reviews, “segments of the literature [that] are influential in clinical medicine and practice,” and in core medical journals that also carry extra influence.

Dr. Pocock suggested that this practice is less of a problem in cardiology than in other fields, particularly in the major journals. “I hope they would never let you get away with it,” he said. “At the same time, we don’t want to deny what is often called ‘exploratory data analysis.’ We want to look at new ideas, we want to look at secondary endpoints, and at subgroups and get ideas for future research, but if you do p values at that exploratory realm, they are more used as descriptive feelers to see if something is worth taking seriously as opposed to leading to direct conclusions.”

Furthermore, the average p values reported overall are getting lower (more significant). This, Chavalarias et al. acknowledge, may be a result of big data offering larger sample sizes. Maybe, but one statistician we admire finds it extremely unlikely that big data alone will drive p values down en masse. More likely is the explanation Chavalarias and colleagues themselves offer: the trend “may reflect a combination of increasing pressure to deliver (ever more) significant results in the competitive publish-or-perish scientific environment as well as the recent conduct of more studies that test a very large number of hypotheses and thus can reach lower p values simply by chance.”10

They concluded their study by saying that the p < 0.05 threshold has “lost its discriminating ability for separating false from true hypotheses; more stringent p value thresholds are probably warranted across scientific fields.”

What the authors do not suggest is that p values be abandoned, but rather that they not be reported in isolation: “Articles should include effect sizes and uncertainty metrics.”

P-hacking

While you would be hard pressed to find much statistical slang in Urban Dictionary, the term “p-hacking” was added by someone calling him/herself PProf in Jan. 2012.

The definition: “Exploiting—perhaps unconsciously—researcher degrees of freedom until p < 0.05.” The examples clarify what they mean: “That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05,” or “She is a p-hacker, she always monitors data while it is being collected.” (One other statistical term found in this authoritative source is “statsporn: An arbitrarily detailed statistical breakdown of information which provides no greater understanding but fills reports or especially weak assignments designed primarily to give the reader something to look at.” As in, “This report is not great—is there no statsporn we can fill it with?”) Some investigators have suggested that when reported p values cluster around the 0.041 to 0.049 range, p-hacking may be to blame.

Also called data dredging, data fishing, data snooping, and equation fitting, p-hacking appears to be relatively widespread11 and may be especially likely in today’s world, where big data offer tantalizing opportunities to draw conclusions where, in fact, none may lie. Remember: looking for patterns in data is legitimate. Applying hypothesis testing to the same data from which the pattern was detected is data dredging.

Dr. Pocock noted that a good amount of cardiology research is observational in nature, which does lend itself to less strict analysis. “Because you probably didn’t have a clearly defined a priori research hypothesis in the first place, observational research and epidemiological research tend to be much looser on these things.”

It’s important to understand that p-hacking isn’t just one thing. There are several ways to manipulate data. Leif D. Nelson, PhD, from the Haas School of Business at the University of California, Berkeley, CA, gave a talk on the topic in 2014. His slides were posted on Twitter by the UC Berkeley Initiative for Transparency in the Social Sciences (@UCBITSS) and listed easy ways to p-hack (see list).

Six Ways to p-Hack

  1. Stop collecting data once p < 0.05.
  2. Analyze many measures, but report only those with p < 0.05.
  3. Collect and analyze many conditions, but only report those with p < 0.05.
  4. Use covariates to get p < 0.05.
  5. Exclude participants to get p < 0.05.
  6. Transform the data to get p < 0.05.

Dr. Nelson, who holds an endowed professorship in Business Administration and Marketing, suggested that while we nominally accept the threshold of p < 0.05 (in other words, a 5% false-positive rate), if we allow p-hacking, the actual false-positive rate can be calculated at 61%. That makes p-hacking, he said, a “potential catastrophe to scientific inference” that, in part at least, can be solved through complete and transparent reporting.
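A toy simulation (ours, not Dr. Nelson’s calculation, which combines several of the moves above) shows how quickly the nominal 5% rate erodes: take just one item from the list, measuring three independent outcomes and reporting any that clears p < 0.05, and the false-positive rate nearly triples even when there is no true effect at all.

```python
# Toy simulation of a single p-hacking move (our illustration): test 3
# independent outcomes per "study" and count a false positive whenever any
# one of them reaches p < 0.05, despite there being no true effect.
import math, random

def z_test_p(a, b):
    # two-sided p value for a difference in means of unit-variance normal data
    z = (sum(a) - sum(b)) / math.sqrt(2 * len(a))
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)
n_per_group, n_outcomes, n_studies = 20, 3, 20_000
false_positives = 0
for _ in range(n_studies):
    # no true effect: both groups drawn from the same distribution
    significant = 0
    for _ in range(n_outcomes):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p(a, b) < 0.05:
            significant += 1
    if significant > 0:
        false_positives += 1

print("false-positive rate:", false_positives / n_studies)  # ~0.14, not 0.05
```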

Dr. Nelson was coauthor of a paper that introduced the “p curve,” defined as the distribution of statistically significant p values for a set of studies.12 As they explained, “Because only true effects are expected to generate right-skewed p curves—containing more low (0.01s) than high (0.04s) significant p values—only right-skewed p curves are diagnostic of evidential value.”
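A rough simulation (ours, not the authors’ method) makes that shape concrete: with a real effect, the statistically significant p values pile up near zero, while with no effect they spread out evenly between 0 and 0.05.

```python
# Toy p-curve sketch (our illustration of the idea): compare the distribution
# of *significant* p values when a true effect exists versus when it does not.
import math, random

def one_study_p(effect, n=30):
    # two-sided p value from a one-sample z test on n unit-variance draws
    sample_mean = sum(random.gauss(effect, 1) for _ in range(n)) / n
    z = sample_mean * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(2)
for effect, label in [(0.5, "true effect"), (0.0, "no effect")]:
    sig = [p for p in (one_study_p(effect) for _ in range(50_000)) if p < 0.05]
    low = sum(p < 0.01 for p in sig) / len(sig)    # the "0.01s"
    high = sum(p >= 0.04 for p in sig) / len(sig)  # the "0.04s"
    print(f"{label}: {low:.0%} of significant p values are below 0.01, "
          f"{high:.0%} are between 0.04 and 0.05")
```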

Who’s to Blame for This p-Fiasco?

In a blog post commenting on the ASA statement, Dr. Gelman places much of the blame for this problem on statistical education (generously including his own courses and textbook), which reduces statistics to a mathematical and computational science rather than combining computation with a thought process that emphasizes valid and reliable measurement. With less time for thinking today, it is probably not surprising that a bright line separating ‘yes’ from ‘no’ has gained prominence in reporting study results.

Blame also belongs to the tendency to sell statistics “as a sort of alchemy that transmutes randomness into certainty,” an “uncertainty laundering” that “begins with data and concludes with success as measured by statistical significance,” as Dr. Gelman puts it. The culprits here are everyone: educators, journals, granting agencies, researchers, and journalists. “Just try publishing a result with p = 0.20,” he noted.

“If researchers have been trained with the expectation that they will get statistical significance if they work hard and play by the rules, if granting agencies demand power analyses in which researchers must claim 80% certainty that they will attain statistical significance, and if that threshold is required for publication, it is no surprise that researchers will routinely satisfy this criterion, and publish, and publish, and publish, even in the absence of any real effects, or in the context of effects that are so variable as to be undetectable in the studies that are being conducted,” wrote Dr. Gelman.

For his part, Dr. Pocock suggested that the problem started early on, not so much with Dr. Fisher, but with his contemporaries and rivals, Polish mathematician Jerzy Neyman and UK statistician Egon Pearson, who introduced the concept of an “accept-reject philosophy of p values.”

The journals and regulators also share blame, said Dr. Pocock: the journals because they perpetuate this “collective certification,” and the U.S. Food and Drug Administration (FDA) not because it strictly applies a p < 0.05 rule to make its decisions. “The FDA is smarter than that in general,” Dr. Pocock said. Rather, he thinks the issue is that no one has disabused researchers of the “myth that to get a drug approved you must prove p < 0.05.” This is starting to change, however, according to Dr. Pocock, as journals become more sensitive to the issue.

Dr. Wasserstein noted that hypothesis testing and inference are tricky concepts. His hope is simply that the most recent brouhaha will inspire people to reject the use of a single index as a “substitute for scientific reasoning” and “see how we can make the science better as we go into the post p < 0.05 era.”

How Do We p-Fix This?

The trend toward simplistically using and abusing p values is alluring to the many individuals in the field of medicine who lack deep understanding of statistics. Many physicians, researchers, journal editors, and science writers possess a limited understanding of statistical sciences, and likely aren’t all that interested in learning more. After all, who really understands confidence, credibility, or prediction intervals (alternative approaches to analysis suggested by the ASA statement)? How are you on Bayesian methods? (See sidebar interview with statistician Jan Tijssen, PhD, for his comments on how to improve data analysis.)

“It is a mass re-education that is required to get p values seen as strength of evidence with no magic cutoff,” said Dr. Pocock.

To be fair, even the ASA acknowledges that the practice of statistics is not a straightforward endeavor with clearly defined rules of engagement. The ASA team noted in their statement that finding points of agreement among the two dozen experts gathered “turned out to be relatively easy to do, but it was just as easy to find points of intense disagreement.” The final statement, which was hard-won and long-deliberated, is not riveting reading (have you ever read a riveting consensus statement?), but it came with the caveat that it “does not necessarily reflect the viewpoint of all [the listed parties], and in fact some have views that are in opposition to all or part of the statement.”

Perhaps architect Daniel Libeskind summed it up best when he said, “Life is not just a series of calculations and a sum total of statistics, it’s about experience, it’s about participation, it is something more complex and more interesting than what is obvious.”

References:

  1. Wasserstein RL, Lazar NA. Am Stat. 2016 [Epub ahead of print].
  2. Pocock SJ, McMurray JV, Collier TJ. J Am Coll Cardiol. 2015;66:2536-49. http://content.onlinejacc.org/article.aspx?articleID=2473760
  3. Pocock SJ, McMurray JV, Collier TJ. J Am Coll Cardiol. 2015;66:2648-62. http://content.onlinejacc.org/article.aspx?articleID=2474634
  4. Pocock SJ, Clayton TC, Stone GW. J Am Coll Cardiol. 2015;66:2757-66. http://content.onlinejacc.org/article.aspx?articleID=2476071
  5. Pocock SJ, Clayton TC, Stone GW. J Am Coll Cardiol. 2015;66:2886-98. http://content.onlinejacc.org/article.aspx?articleID=2476636
  6. Gelman A, Loken E. American Scientist. 2014;102:460.
  7. Kyriacou DN. The Enduring Evolution of the P Value. JAMA. 2016;315:1113-5.
  8. Stone GW, Pocock SJ. J Am Coll Cardiol. 2010;55:428-31. http://content.onlinejacc.org/article.aspx?articleID=1140401
  9. Ioannidis JP. PLoS Med. 2005;2:e124.
  10. Chavalarias D, Wallach JD, Li AH, Ioannidis JP. JAMA. 2016;315:1141-8.
  11. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. PLoS Biol. 2015;13(3):e1002106.
  12. Simonsohn U, Nelson LD, Simmons JP. J Exp Psychol Gen. 2014;143:534-47.
Read the full May issue of CardioSource WorldNews at ACC.org/CSWN


