A paper, with an associated web-based tool, proposes a method for detecting publication bias and p-hacking in the literature. From the paper:
The practices of p-hacking and file-drawering mean that a statistically significant finding may reflect selective reporting rather than a true effect. In this paper, we introduce p-curve as a way to distinguish between selective reporting and truth. P-curve is the distribution of statistically significant p-values for a set of independent findings. Its shape is diagnostic of the evidential value of that set of findings. We say that a set of significant findings contains evidential value when we can rule out selective reporting as the sole explanation of those findings. As detailed below, only right-skewed p-curves, those with more low (e.g., .01s) than high (e.g., .04s) significant p-values, are diagnostic of evidential value. P-curves that are not right-skewed suggest that the set of findings lacks evidential value, and p-curves that are left-skewed suggest the presence of intense p-hacking.
The basic intuition is this: if a studied effect does not actually exist (that is, if the null hypothesis is true), then the p-values found should be uniformly distributed, meaning any p-value is as likely as any other. While publication bias ensures that p-values greater than .05 are unlikely to be reported, we can still look at the distribution of values less than .05, and that distribution, when no effect exists, should also be uniform. When a studied effect *does* exist, we should see more low p-values than high ones - that is, the distribution should be right-skewed.
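To make that intuition concrete, here is a minimal simulation sketch. It is not the authors' code: it assumes every study is a simple two-sided z-test, and the effect size of 2.0 (in z units), the study count, and the function names are hypothetical choices made for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_curve(true_effect, n_studies=20000, seed=0):
    """Count significant p-values in five bins of width .01.

    true_effect is the (hypothetical) mean of the z statistic;
    0 means the null hypothesis is true.
    """
    rng = random.Random(seed)
    counts = [0] * 5
    for _ in range(n_studies):
        p = two_sided_p(rng.gauss(true_effect, 1.0))
        if p < 0.05:  # mimic publication bias: keep only significant results
            counts[min(int(p * 100), 4)] += 1
    return counts


null_counts = simulate_p_curve(true_effect=0.0)
effect_counts = simulate_p_curve(true_effect=2.0)
for label, counts in [("no effect", null_counts), ("true effect", effect_counts)]:
    total = sum(counts)
    print(label, [round(c / total, 2) for c in counts])
```

With no effect, the five bins should each hold roughly a fifth of the significant results. With a true effect, the lowest bin should hold the largest share.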
Only uniform and right-skewed distributions should occur naturally. A left-skewed distribution therefore indicates p-hacking. P-hacking is a term which describes a wide range of unethical yet sadly common practices used by researchers to achieve significance so that they can get published. Because p-hacking typically stops once researchers reach the magic .05 threshold, the distribution of results between .00 and .05 will be left-skewed, with more results closer to .05.
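Here is a rough sketch of how one form of p-hacking can produce that left skew. It simulates optional stopping only: a hypothetical researcher studying a nonexistent effect checks the p-value, and if it is not yet significant, collects a few more observations and checks again. The sample sizes and step size are assumptions, and real p-hacking takes many other forms.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def optional_stopping_p(rng, start_n=20, step=5, max_n=100):
    """Run one null study, adding data until p < .05 or max_n is reached.

    Observations are N(0, 1) with known variance, so z = sum / sqrt(n).
    Returns the final p-value if significant, otherwise None.
    """
    total = sum(rng.gauss(0.0, 1.0) for _ in range(start_n))
    n = start_n
    while True:
        p = two_sided_p(total / math.sqrt(n))
        if p < 0.05:
            return p  # stop as soon as the threshold is crossed
        if n >= max_n:
            return None  # give up; the study goes in the file drawer
        total += sum(rng.gauss(0.0, 1.0) for _ in range(step))
        n += step


def hacked_p_curve(n_studies=5000, seed=1):
    """Count the significant p-values from p-hacked null studies in five bins."""
    rng = random.Random(seed)
    counts = [0] * 5
    for _ in range(n_studies):
        p = optional_stopping_p(rng)
        if p is not None:
            counts[min(int(p * 100), 4)] += 1
    return counts


hacked_counts = hacked_p_curve()
print(hacked_counts)
```

Because the simulated researcher stops the moment the threshold is crossed, most of the "significant" results should land in the bin just under .05.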
The authors demonstrate p-curve analysis by collecting p-values from two sets of twenty studies. One set was chosen by looking for signs of p-hacking - in this case, by selecting studies which reported their analysis with covariates. The authors explain, "We were suspicious of experiments reporting an effect only with a covariate because we suspect that many researchers make the decision to include a covariate only when and if the simpler analysis without such a covariate, the one they conduct first, is nonsignificant." The second set of studies was chosen by searching for articles lacking words typically associated with p-hacking such as “excluded”, “covariate”, and “transform”.
You can see their results below.
The "suspected p-hacking" dataset is left-skewed (that is, its tail points left and most of its data points sit on the right -- I know, I find skewness labels confusing too). The other dataset, thought to be clean, is right-skewed.
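If you want a number rather than an eyeballed histogram, one simple option is a binomial sign test: among the significant p-values, count how many fall below .025 and ask how surprising that count would be if low and high values were equally likely. This is a minimal sketch of that idea, not necessarily the test the authors' tool runs, and the p-values in the example are made up.

```python
import math


def right_skew_binomial_p(p_values, alpha=0.05):
    """One-sided binomial test for right skew among significant p-values.

    Under a flat p-curve, each significant p-value falls below alpha / 2
    with probability 0.5. Returns P(X >= observed low count) for
    X ~ Binomial(n, 0.5); small values suggest right skew.
    """
    significant = [p for p in p_values if p < alpha]
    n = len(significant)
    low = sum(1 for p in significant if p < alpha / 2)
    return sum(math.comb(n, k) for k in range(low, n + 1)) / 2 ** n


# Hypothetical p-values, for illustration only.
example = [0.001, 0.002, 0.01, 0.02, 0.03, 0.04]
print(right_skew_binomial_p(example))
```

To check for left skew instead, the same logic applies with the count of p-values above .025.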
You might have noticed the teal dotted line labelled "null of 33% power". That is the distribution you'd expect to see if the effect exists but the studies testing it are very underpowered. (The consensus is that 80% power is ideal, but see this recent write-up about chronic underpowering of studies.) The clean dataset matches the "null of 33% power" distribution.
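To get a feel for what a "33% power" curve looks like, here is a sketch that computes the expected share of significant p-values in each .01-wide bin for a given power. It assumes a two-sided z-test, which is a simplification; the function names are hypothetical, and the paper's own calculations may use different test distributions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_p_below(x, delta):
    """P(two-sided p-value <= x) when the z statistic is N(delta, 1)."""
    if x <= 0:
        return 0.0
    crit = STD_NORMAL.inv_cdf(1 - x / 2)
    return (1 - STD_NORMAL.cdf(crit - delta)) + STD_NORMAL.cdf(-crit - delta)


def effect_for_power(power, alpha=0.05):
    """Find, by bisection, the effect (in z units) that yields the given power."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if prob_p_below(alpha, mid) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def expected_p_curve(power, alpha=0.05, bins=5):
    """Expected share of significant p-values in each equal-width bin."""
    delta = effect_for_power(power, alpha)
    edges = [alpha * i / bins for i in range(bins + 1)]
    cumulative = [prob_p_below(edge, delta) for edge in edges]
    return [(cumulative[i + 1] - cumulative[i]) / cumulative[-1] for i in range(bins)]


curve_33 = expected_p_curve(0.33)
print([round(share, 3) for share in curve_33])
```

Even at this low power the expected curve slopes downward from the lowest bin to the highest, which is why it serves as a useful benchmark for "an effect exists, but barely detectably".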
The authors detail the necessary steps to perform a p-curve analysis. When selecting studies, they instruct:
1) Create a rule. Rather than decide on a case-by-case basis whether a study should be included, one should minimize subjectivity by deciding on an inclusion rule in advance...
2) Disclose the selection rule...
3) ... When the implementation of the rule generated ambiguity as to whether a given study should be included or not, results with and without those studies should be reported.
4) ... replicate [single-paper p-curves]. Given the risk of cherry-picking analyses that are based on single papers – for example, a researcher may decide to p-curve a paper precisely because s/he has already observed that it has many significant p-values greater than .025 – we recommend that such analyses be accompanied by a properly powered direct replication of at least one of the studies in the paper.
They also provide guidelines for selecting p-values from selected studies, to be documented in a six-column "disclosure table":
Step 1. Identify researchers’ stated hypothesis and study design (Columns 1 and 2)...
Step 2. Identify the statistical result testing the stated hypothesis (Column 3)...
Step 3. Report the statistical result(s) of interest (Column 4)...
Step 4. Recompute precise p-values based on reported test statistics (Column 5)... Recomputation is necessary because p-values are often reported merely as smaller than a particular benchmark (e.g., p<.01) and because they are sometimes reported incorrectly...
Step 5. Report robustness p-values (Column 6). Some experiments report results on two or more correlated dependent variables (e.g., how much people like a product and how much they are willing to pay for it)... P-curvers should not simultaneously include all of these p-values because they must be statistically independent for inference from p-curve to be valid. Instead, p-curvers should use selection rules and report robustness of the results to such rules.
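Step 4 above is easy to do by hand for common tests. As an illustration, here is a sketch that recomputes a two-sided p-value from a reported t statistic and its degrees of freedom, using only the standard library (the t density is integrated numerically with Simpson's rule). The function names and the example values are hypothetical, and a statistics package would normally do this for you.

```python
import math


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_from_t(t, df, steps=2000):
    """Two-sided p-value for a t statistic via Simpson's rule (steps must be even).

    Integrates the density from 0 to |t| and returns 1 - 2 * area.
    Precision degrades for extremely small p-values.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    area = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_density(i * h, df)
    area *= h / 3
    return max(0.0, 1 - 2 * area)


# Hypothetical reported result: "t(38) = 2.41, p < .05"
print(two_sided_p_from_t(2.41, 38))
```

The recomputed value is what would go in Column 5 of the disclosure table, in place of the imprecise "p < .05".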
The authors also provide documentation on how to use their web-based tool, which generates p-curves when given p-values.