metascience @meta-science - Tumblr Blog

These problems are not new

From a paper published in 1945 by Vannevar Bush, head of the U.S. Office of Scientific Research and Development (OSRD) during World War II:

Professionally our methods of transmitting and reviewing the results of research are generations old and by now are totally inadequate for their purpose. If the aggregate time spent in writing scholarly works and in reading them could be evaluated, the ratio between these amounts of time might well be startling. Those who conscientiously attempt to keep abreast of current thought, even in restricted fields, by close and continuous reading might well shy away from an examination calculated to show how much of the previous month's efforts could be produced on call. Mendel's concept of the laws of genetics was lost to the world for a generation because his publication did not reach the few who were capable of grasping and extending it; and this sort of catastrophe is undoubtedly being repeated all about us, as truly significant attainments become lost in the mass of the inconsequential.

The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships.

Sadly, the rest of this rather long article details potential fixes, many of which have been achieved and none of which have eliminated the above problem.

#author: Bush #source: The Atlantic #meta-analysis #organization of knowledge

University of California opens one door and closes another

The recent announcement that the University of California - the world's largest public university system - will be embracing open access has been met with celebration. However, the new policy only encourages faculty to share their work - there are no requirements. Meanwhile, the UC Regents have voted to create a new corporate entity to manage research funding:

The purpose of Newco is to completely revamp how scientific discoveries made in UC laboratories — from new treatments for cancer to apps for smartphones — come to be used by the public. Traditionally, UC campuses have used their own technology transfer offices to make these decisions. But under Newco, decisions about the fate of academic research will be taken away from university employees and faculty, and put in the hands of a powerful board of businesspeople who will be separate from the university. This nonprofit board will decide which UC inventions to patent and how to structure licensing deals with private industry. It also will have control over how to spend public funds on these activities.

...

But if last month's regents meeting in Sacramento is any indication, UC oversight of Newco may be less than robust. Several regents, in fact, objected to creating an oversight committee that would keep tabs on the new entity.

...

Records show that wealthy investors and influential businessmen with close ties to UCLA and one of the UC Regents — Alan C. Mendelson — are financially invested in companies that currently license university-owned patents under exclusive financial arrangements. Mendelson, who also is a trustee for the UC Berkeley Foundation and has investments of his own in businesses that profit from university-produced research, was one of the main backers of the Newco proposal and cast a vote in favor of it.

The Newco program also could benefit companies like the ones Mendelson and his network of friends and investors own and work for. Many of the UC Regents are also close friends of investors who want greater access to university inventions under more favorable terms, and who want the university to subsidize early-stage business expenses and take financial risks by investing in technology startups.

And under Newco, they may be able to get exactly what they want.

The above is an excerpt from a well-written, longform article by Darwin BondGraham - well worth your time.

#open access #money in science #author: BondGraham #source: East Bay Express #conflict of interest

Comprehensive review of US's clinical trials system needed

An editorial in Nature by Arthur Ammann points out a number of problems with our current clinical trial system:

Inadequate expertise: the composition of IRBs has not kept pace with the complexities of ethics and science. Expert opinions are often derived from individuals who lack sufficient expertise to make an informed decision.

Conflicts of interest: individual IRB members may gain salary, health and retirement benefits from approval of research studies conducted at their institutions, which may also make gains.

Exclusivity issues: the design and ethical review of federally funded research is often undertaken by a homogeneous group of individuals with congruent interests at the same or similar academic institutions. Individuals from the public, advocacy groups and non-academic organizations are often excluded...

Marked increases in funding: the NIH budget for research in 2011 was more than US$30 billion. Large amounts of money can distort priorities for research and shift the focus away from urgent public-health needs on the basis of the belief that all research products merit clinical evaluation. The number of products in the therapeutic pipeline is rising and there is no informed method for prioritizing those which should move into clinical research. This increases the risk for people who participate in research.

Increased cost of clinical research and fewer treatment-naive individuals... in the United States: the number of research participants required to obtain statistically significant results for new products has increased drastically because of the need to compare these products with ones that are known to work. A ‘mining’ approach to obtaining treatment-naive people for research in poor countries has evolved, enlisting vulnerable populations... The shift to resource-poor countries is often accomplished by reducing standard of care, exaggerating potential benefits, the use of inferior treatment comparisons and the enrolment of vulnerable people not fully informed of their legal or ethical rights.

#source: Nature #author: Ammann #research ethics #ethnocentrism

Daisy-chained replications

This is a little old - I'm going through my bookmarks - but:

Nobel prize-winner Daniel Kahneman has issued a strongly worded call to one group of psychologists to restore the credibility of their field by creating a replication ring to check each others’ results.

...

To address this problem, Kahneman recommends that established social psychologists set up a “daisy chain” of replications. Each lab would try to repeat a priming effect demonstrated by its neighbour, supervised by someone from the replicated lab. Both parties would record every detail of the methods, commit beforehand to publish the results, and make all data openly available.

I've seen the daisy-chain methodology a few times in medical research. I very much like the idea of a group of labs getting together and attempting to replicate, openly and rigorously, a specific finding, or to solve an open question. As I've mentioned previously, only one in every thousand psychology studies has a published replication. We're a long way from fixing this problem across all of psychology, but within a specific sub-field a little daisy-chaining could go a long way.

You can see Kahneman's letter here.

#replications #openness #journal: Nature #author: Kahneman

Retractions as political statement

From Retraction Watch, a story about a retraction that is really a story about Nature's failure to publish a refutation:

Knowing that Nature had an explicit editorial policy to publish, in some form, work which refutes an important conclusion of any paper which appears in its pages, we submitted our findings describing the transgenic mice and our failure to replicate the work from Bellgrau et al. to Nature. We received two very positive reviews, but based on a third, very negative one, from Bellgrau et al., the editors decided not to publish our findings as a letter or as correspondence.

Although the authors were able to publish their work in another journal, they still wanted Nature to acknowledge the new evidence. Eventually they decided to retract a 'News and Views' piece they'd published in Nature lauding the original finding by Bellgrau et al.

I added “I regret having to take this course, but as Nature refuses to abide by its own ethical policy, namely to “publish refutations of any important conclusion that appears in its pages,” I am left with no other option.”

Thankfully, Nature did agree to publish the retraction, but, perhaps unsurprisingly, they were unhappy with the wording. The retraction included just two sentences.

...

The retraction was published in 1998, and has attracted 16 citations of its own. However, of the 976 citations of the Bellgrau et al. paper, about 700 were subsequent to publication of the retraction, so it’s clear many remain unaware that its findings are questionable. Clearly, the processes that allow the scientific record to self-correct can be improved, not least by Nature.

I'm not sure how it's a good idea to let the original authors (in this case Bellgrau et al) be veto-holding members of the review process.

It would be interesting if journals were required to also publish all attempted replications of original work first published in their pages. Given the editorial burden that's not really fair to the journals, but it would be great for the robustness and quality of the literature at large.

#retractions #journal: Nature #author: Vaux #editorial policy #publication bias

Pressure to publish and the prevalence of positive results

This PLoS One study attempted to look at how pressure to publish might influence the prevalence of positive results. The author, Daniele Fanelli, made an odd choice by using 'papers per capita by state' as the measure of pressure to publish. The state seems way too macro a level to look at. I would expect pressure to vary strongly between schools and between departments within schools. There's no reasoning for this given in the paper - I suspect that it was just simpler to use this measure, already provided by the NSF, than to come up with a method for looking at individual institutions.

The author reports a significant correlation of the percentage of papers that reported positive results with the per capita academic productivity of the state the author was from (b = 1.383±0.682, Wald test = 4.108, df = 1, p = 0.043, Odds-Ratio (95%CI) = 3.988(1.047–15.193)). I have very little formal statistical training, so take my opinions with a grain of salt:

That p-value is very modestly significant.

The confidence interval for the odds ratio is quite broad, stretching almost to 1 (or "no difference") on the lefthand side.

Why does the author report the slope coefficient (b) instead of the correlation coefficient (R)?

The slope or regression coefficient (b) represents the slope of the regression line. So I'd interpret the statistic as saying that for every "unit" of pressure increase, there's an increase of 1.383 "units" of positive publication bias. However it's not clear from the paper what the units are. Also, note that the standard error of b is almost half the size of b.
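
As a rough check on the figures quoted above, the sketch below recomputes the odds ratio, its 95% interval, and the Wald statistic from the reported b and its standard error. It assumes (the paper's summary line does not say so explicitly) that b is a log-odds coefficient from a logistic regression and that the interval uses the usual normal approximation. The variable names are mine.

```python
import math

# Values as reported in the paper's summary line.
b = 1.383   # slope coefficient
se = 0.682  # standard error of b

# Assumption: b is on the log-odds scale, so exponentiating gives an odds ratio.
odds_ratio = math.exp(b)

# Assumption: normal-approximation 95% interval, b +/- 1.96 * SE, then exponentiated.
z_crit = 1.96
ci_low = math.exp(b - z_crit * se)
ci_high = math.exp(b + z_crit * se)

# Wald statistic for a single coefficient: (b / SE) squared.
wald = (b / se) ** 2

print(f"odds ratio = {odds_ratio:.3f}")
print(f"95% CI     = ({ci_low:.3f}, {ci_high:.3f})")
print(f"Wald       = {wald:.3f}")
```

The results land close to the reported 3.988 (1.047–15.193) and 4.108, with small differences that look like rounding of b and its standard error. If the assumption holds, b is a change in log-odds per unit of the productivity measure, which would explain why an odds ratio is reported alongside it.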

My final critique is of the dataset. The author analyzed a random sample of 1316 papers. The distribution was such that some states had a large number of papers to analyze (such as California, which had 150) and some had a very small number (such as Wyoming, with 1). One state, Delaware, was excluded from the analysis because no papers were found in the random sampling. You can see the full set here:

Wyoming and South Dakota get listed as having a 100% positive publication rate, based on an n of 1. Nebraska and DC also get listed as having a 100% positive rate, based on 13 and 18 papers, respectively. Michigan, which seems to have about a 97% positive rate (eyeballing, the author does not provide the actual stats), is based on 54 papers. It doesn't seem sensible to treat these data points equally. While it would doubtless have taken the author a lot of time and energy to random-sample his way up to reasonable sample sizes, he could also have used a stratified random sampling approach that I don't think would have compromised the results.
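
To make the sample-size concern concrete, here is a minimal sketch that puts a 95% interval around each of the 100% rates mentioned above. The Wilson score interval is my choice for illustration, not something the paper uses, and the function name is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Sample sizes and 100% positive rates as described in the post.
states = {"Wyoming": (1, 1), "Nebraska": (13, 13), "DC": (18, 18)}

for name, (positives, n) in states.items():
    low, high = wilson_interval(positives, n)
    print(f"{name}: {positives}/{n} positive, 95% interval {low:.2f} to {high:.2f}")
```

With a single paper, the interval runs from roughly 0.21 up to 1, so a "100% positive rate" from Wyoming says very little compared with the same rate from 18 papers.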

In the end, I don't put much stock in the results of this study, though I'm very sympathetic to the author's overall points. Please do let me know if you spot any flaws in my analysis.

#journal: PLOS one #author: Fanelli #publication bias #publish or perish

Solutions: Open Science Data Cloud

This article about Bionimbus, an effort to efficiently and securely share cancer genomics data, mentions the Open Science Data Cloud. The OSDC is a platform for sharing large datasets, and although their current set of projects, including Bionimbus, are mostly genomics-focused, they seem to be open to datasets of all types.

From the article:

Megan McNerney, an instructor of pathology at the University of Chicago, used Bionimbus to analyze data that led to her discovery that gene CUX1, which acts as a tumor suppressor, is frequently inactivated in acute myeloid leukemia.

"Bionimbus was critical for my work, as it was used for all aspects of the project, including secure storage of protected data, quality control of next-generation sequencing results, alignments, expression analysis, and algorithm development," she said. "The strength of Bionimbus, however, is the support that is provided for end users, which enabled both expert and non-expert team members to use the cloud."

This seems very useful, as one of the primary objections to open sharing of data is security. Not that these projects are completely open - and it's not clear how one gets access, and how much these will be used to enable already existing small collaborations vs more "crowd-sourced" projects - but it's a proof of concept, at least.

Although their raison d'être seems to be secure data, they also host a small number of public data sets.

#solutions #source: phys.org #open data #security #platforms

Softer sciences publish more positive results

A colleague referenced this 2010 paper, which measures the proportion of positive results by scientific discipline.

I'm not sure how much value there is in the "hard" vs "soft" framing of the disciplines. I feel like it's a bit of a rabbit hole. If that's something that interests you, though, there are plenty of references in the paper, which is open access.

I am interested in the general results for each discipline:

As you can see, psychiatry & psychology had the highest proportion of positive results, at 91.5%, while space sciences had the lowest at 70.2%. The average was 84%.

They also found that applied sciences were more likely to report positive results than "pure" sciences, and that studies with human subjects were more likely to report positive results than studies with non-human subjects:

Another interesting note was the frequency of negative results in papers that reported multiple hypotheses:

The frequency of negative results in papers that tested multiple hypotheses (N = 151, in which only the first hypothesis was considered), was significantly higher than in papers testing only one hypothesis (χ² = 13.591, df = 1, p<0.001). Multiple-hypotheses papers were more frequent in the social than in the biological and the physical sciences (respectively, 18.47% (number of multiple papers N = 76), 4.46% (N = 62) and 1.87% (N = 12), χ² = 140.308, df = 2, p<0.001, Cramer's V = 0.240), and were most frequent in the discipline of Economics and Business (47%, N = 55).

... When correcting for the confounding effect of presence/absence of multiple hypotheses, the odds of reporting a positive result were around five times higher for papers published in Psychology and Psychiatry and Economics and Business than in Space Science (Table 1, Nagelkerke R2N = 0.051).

In the discussion section, the researchers offer two broad explanations for what's causing these differences: the hypotheses tested in softer sciences might be more likely to be true, and/or the testing of hypotheses in the softer sciences might be less rigorous.

They give a few reasons why the first explanation might be the case. Hypotheses in the softer sciences might be based on more personal experience and observation, leading to an informal weeding out of bad theories. (This is a common argument in favor of hypotheses in psychology having a higher success rate than average.) Alternatively, and less flatteringly, hypotheses tested in the soft sciences may be less "deep":

Younger, less developed fields of research should tend to produce and test hypotheses about observable relationships between variables (“phenomenological” theories). The more a field develops and “matures”, the more it tends to develop and test hypotheses about non-observable phenomena underlying the observed relationships (“mechanistic” theories). These latter kinds of hypotheses reach deeper levels of reality, are logically stronger, less likely to be true, and are more conclusively testable.

The second set of explanations, which put forth the idea that soft sciences are less rigorous, touch upon a number of ideas talked about frequently on this tumblr: "Flexibility in definitions, design, analysis and interpretation of a research"; "Prevalence and strength of experimenter effects and self-fulfilling prophecies"; "Non-publication of negative and/or statistically non-significant results"; "Prevalence and strength of manipulation of data and results".

Again, I'm not sure how useful the framing of hard vs soft is, but this paper does make clear that there are methodological and cultural differences between different scientific disciplines, and that we'd do well to examine them in order to determine best practices.

#journal: PLOS one #author: Fanelli #publication bias #meta-analysis #comparing disciplines

Copyleft for experimental data

A quick note on an interesting point raised by Roger Peng:

But what’s the problem with the three follow-up scenarios described? The one thing that they have in common is that none of the three responding people were subjected to the same standards to which the original investigator (me) was subjected. I was required to register my trial and state the outcomes in advance. In an ideal world you might argue I should have stated my hypotheses in advance too. That’s fine, but the point is that the people analyzing the data subsequently were not required to do any of this. Why should they be held to a lower standard of scrutiny?

The first person analyzed a different outcome that was not a primary or secondary outcome. How many outcomes did they test before they came to that one negatively significant one? The second person examined a subset of the participants. Was the study designed (or powered) to look at this subset? Probably not. The third person claims fraud, but does not provide any details of what they did.

I think it’s easy to take care of the third person–just require that they make their work reproducible too. That way we can all see what they did and verify that there was in fact fraud. But the first two people are a little more difficult. If there are no barriers to obtaining the data, then they can just get the data and run a bunch of analyses. If the results don’t go their way, they can just move on and no one would be the wiser. If they did, they can try to publish something.

What I think a good reproducibility policy should have is a type of “viral” clause. For example, the GNU General Public License (GPL) is an open source software license that requires, among other things, that anyone who writes their own software, but links to or integrates software covered under the GPL, must publish their software under the GPL too. This “viral” requirement ensures that people cannot make use of the efforts of the open source community without also giving back to that community.

One can argue that a culture of responsible, reproducible research would simply not value those actions in the same way they'd value pre-registered research. I view the actions of the first and second people in Peng's hypothetical as much like the critic passing by commenting that a different analysis might be better, or did you realize there's a confound, or maybe we could try with a different population - part of the brainstorming process, not the actual research process. (The third person in Peng's hypothetical is just annoying.)

But maybe that's giving the scientific community too much credit. Unfortunately the large, recent survey of academics' opinions re: open access licensing didn't ask about "share-alike", so it's not clear what kind of support this would have.

#open access #copyleft #pre-registration #source: the scholarly kitchen #source: simply statistics

Solutions: P-Curves

A paper and associated web-based tool have been proposed to detect publication bias and p-hacking in the literature. From the paper:

The practices of p-hacking and file-drawering mean that a statistically significant finding may reflect selective reporting rather than a true effect. In this paper, we introduce p-curve as a way to distinguish between selective reporting and truth. P-curve is the distribution of statistically significant p-values for a set of independent findings. Its shape is diagnostic of the evidential value of that set of findings. We say that a set of significant findings contains evidential value when we can rule out selective reporting as the sole explanation of those findings. As detailed below, only right-skewed p-curves, those with more low (e.g., .01s) than high (e.g., .04s) significant p-values, are diagnostic of evidential value. P-curves that are not right-skewed suggest that the set of findings lacks evidential value, and p-curves that are left-skewed suggest the presence of intense p-hacking.

The basic intuition is this: if a studied effect does not actually exist (that is, if the null hypothesis is true) then the distribution of p-values found should be uniform. In a uniform distribution any value is equally likely. When there is no effect, any p-value is equally likely. While publication bias ensures that p-values greater than .05 are unlikely to be reported, we can still look at the distribution of values less than .05, and that distribution, when no effect exists, should be uniform. When a studied effect *does* exist, we should see more low p-values - that is, the distribution should be right-skewed.
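
The intuition above can be checked with a small simulation. This is only a sketch under simplifying assumptions of my own: each study is reduced to a single z statistic, the "true effect" is an arbitrary expected z of 2.2, and the function names are hypothetical. It is not the authors' tool or their test.

```python
import math
import random

def simulate_significant_p_values(effect_z, n_studies, rng):
    """Draw z statistics centred on effect_z and return the two-sided
    p-values that fall below .05 (the only ones p-curve looks at)."""
    significant = []
    for _ in range(n_studies):
        z = rng.gauss(effect_z, 1.0)
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value for a z test
        if p < 0.05:
            significant.append(p)
    return significant

def share_below(p_values, cutoff=0.025):
    """Fraction of significant p-values in the lower half of the 0 to .05 range."""
    return sum(p < cutoff for p in p_values) / len(p_values)

rng = random.Random(0)  # fixed seed so the run is repeatable

# No true effect: expected z is 0, so significant p-values should be flat.
null_share = share_below(simulate_significant_p_values(0.0, 100_000, rng))

# Hypothetical true effect: expected z of 2.2 (an arbitrary illustrative value).
effect_share = share_below(simulate_significant_p_values(2.2, 100_000, rng))

print(f"share of significant p < .025, no effect:   {null_share:.2f}")
print(f"share of significant p < .025, true effect: {effect_share:.2f}")
```

Under no effect, about half of the significant p-values should fall below .025, which is what a uniform distribution implies. With a real effect, clearly more than half should, which is the right skew the paper describes.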

Only uniform and right-skewed distributions should occur naturally. A left-skewed distribution therefore indicates p-hacking. P-hacking is a term which describes a wide range of unethical yet sadly common practices used by researchers to achieve significance so that they can get published. Because p-hacking typically stops once researchers reach the magic .05 threshold, the distribution of results between .00 and .05 will be left-skewed, with more results closer to .05.

The authors demonstrate p-curve analysis by collecting p values from two sets of twenty studies. One set was chosen by looking for signs of p-hacking - in this case, by selecting studies which reported their analysis with covariates. The authors explain, "We were suspicious of experiments reporting an effect only with a covariate because we suspect that many researchers make the decision to include a covariate only when and if the simpler analysis without such a covariate, the one they conduct first, is nonsignificant." The second set of studies was chosen by searching for articles lacking words typically associated with p-hacking such as “excluded”, “covariate”, and “transform”.

You can see their results below.

The "suspected p-hacking" data set is left-skewed (that is, it has more data points on the right -- I know, I find skewness labels confusing too). The other dataset, thought to be clean, is right-skewed.

You might have noticed the teal dotted line labelled "null of 33% power". That is the distribution you'd expect to see if the effect exists but the studies testing it are very underpowered. (The consensus is that 80% power is ideal, but see this recent write-up about chronic underpowering of studies.) The clean dataset matches the "null of 33% power" distribution.

The authors detail the necessary steps to perform a p-curve analysis. When selecting studies, they instruct:

1) Create a rule. Rather than decide on a case-by-case basis whether a study should be included, one should minimize subjectivity by deciding on an inclusion rule in advance...

2) Disclose the selection rule...

3) ... When the implementation of the rule generated ambiguity as to whether a given study should be included or not, results with and without those studies should be reported.

4) ... replicate [single-paper p-curves]. Given the risk of cherry-picking analyses that are based on single papers – for example, a researcher may decide to p-curve a paper precisely because s/he has already observed that it has many significant p-values greater than .025 – we recommend that such analyses be accompanied by a properly powered direct replication of at least one of the studies in the paper.

They also provide guidelines for selecting p-values from selected studies, to be documented in a six-column "disclosure table":

Step 1. Identify researchers’ stated hypothesis and study design (Columns 1 and 2)...

Step 2. Identify the statistical result testing the stated hypothesis (Column 3)...

Step 3. Report the statistical result(s) of interest (Column 4)...

Step 4. Recompute precise p-values based on reported test statistics (Column 5)... Recomputation is necessary because p-values are often reported merely as smaller than a particular benchmark (e.g., p<.01) and because they are sometimes reported incorrectly...

Step 5. Report robustness p-values (Column 6). Some experiments report results on two or more correlated dependent variables (e.g., how much people like a product and how much they are willing to pay for it)... P-curvers should not simultaneously include all of these p-values because they must be statistically independent for inference from p-curve to be valid. Instead, p-curvers should use selection rules and report robustness of the results to such rules.
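
Step 4 is easy to illustrate for one simple case. The sketch below recomputes exact p-values from chi-square statistics with one degree of freedom, using two statistics quoted elsewhere on this page. The helper name is mine, and other tests (t, F, chi-square with more degrees of freedom) would need their own distributions, which this sketch does not cover.

```python
import math

def p_from_chi2_df1(chi2):
    """Exact upper-tail p-value for a chi-square statistic with 1 degree of freedom.
    A chi-square(1) variable is a squared standard normal, so the tail area is
    erfc(sqrt(chi2 / 2))."""
    return math.erfc(math.sqrt(chi2 / 2))

# Both statistics are quoted elsewhere on this page.
p_wald = p_from_chi2_df1(4.108)    # reported as p = 0.043
p_multi = p_from_chi2_df1(13.591)  # reported only as p < 0.001

print(f"Wald = 4.108   -> p = {p_wald:.4f}")
print(f"chi2 = 13.591  -> p = {p_multi:.6f}")
```

The first value should round to the reported 0.043. The second turns a bare "p<0.001" into a precise number that could be placed on a p-curve.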

The authors also provide documentation for how to use their web-based tool which generates p-curves when given p-values.

#solutions #publication bias #p-hacking #statistics #statistical power #author: Simonsohn #author: Simmons

Methods reporting in the fMRI literature

I spent three years working in fMRI labs. To this day it's not clear to me if the field has exceptionally ambiguous standards, or if it's only one of many scientific subfields based around new technology struggling to define good practice. Whether it's got company or not, neuroimaging certainly has issues.

A paper by Joshua Carp in NeuroImage reviews the methodology of 241 recent fMRI articles:

The present study evaluated research reports according to a checklist adapted from the guidelines promulgated by Poldrack et al. (2008). Checklist items were grouped into five categories: experimental design, data acquisition, processing, statistical modeling, and visualization. In all, 179 methodological decisions were collected for each article. Parameters that could not be determined from the published reports were classified as missing. Parameters that were not relevant to particular papers (e.g., smoothing kernel for studies that did not report using spatial smoothing) were classified as not applicable.

You can view reporting rates of these 179 decisions, without subscribing to the journal, here and here.

They also looked at sample sizes across studies, and found that they average 15 subjects per group. They calculated power for each study and found that only 2.3% of the studies would have 80% power to find even a large (.8) effect size.
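
For a sense of scale, here is a rough power calculation for 15 subjects per group and a large effect. It assumes a plain two-group comparison of means and uses a normal approximation. Carp's own calculations may rest on different designs and methods, so treat this as an illustration rather than a reproduction of the paper's numbers. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

norm = NormalDist()  # standard normal

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means,
    using the normal approximation (slightly optimistic for small samples)."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    return norm.cdf(noncentrality - z_crit)

def approx_n_per_group(d, power=0.80, alpha=0.05):
    """Approximate subjects per group needed to reach the target power."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)

print(f"power at d = 0.8, n = 15 per group: {approx_power(0.8, 15):.2f}")
print(f"n per group for 80% power at d = 0.8: {approx_n_per_group(0.8)}")
```

Under these assumptions, 15 per group falls well short of 80% power even for a large effect, and the group size needed is noticeably larger than the reported average.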

As described by Ioannidis (Ioannidis, 2005b), the rate of false positive results increases with the flexibility of analysis procedures. Thus, it is important to determine how many analytic strategies can be used to explore a given experiment.

As described above, the studies in the present sample were coded for 21 optional analysis procedures, ranging from experimental design to data visualization. Across the 241 studies, 223 unique combinations of analytic techniques were observed. In other words, there were nearly as many unique analysis pipelines as studies in the sample. Arguably, however, data visualization techniques like activation figures and tables do not constitute analysis. After collapsing across these visualization procedures, 207 unique pipelines were observed—again, nearly as many analysis methods as studies.

The order of processing procedures also permits substantial flexibility in analysis.

From the discussion:

Critically, however, the present results show that many published reports omit a number of key data collection and analysis parameters (Fig. 2 and Fig. 3). Over one third of studies did not describe the number of trials, trial duration, and the range and distribution of inter-trial intervals. Fewer than half reported the number of subjects rejected from analysis; the reasons for rejection; how or whether subjects were compensated for participation; and the resolution, coverage, and slice order of functional brain images.

Important methodological details were also omitted from descriptions of data analysis. Less than half of the studies in the present sample reported whether images were corrected for differences in slice acquisition timing or coregistered to high-resolution scans. Nearly half did not report whether temporal filtering was conducted; less than one fifth reported whether temporal autocorrelations were modeled. A minority of studies described the reference slice used for slice-timing correction, the reference image used for motion realignment or spatial normalization, and how or whether images were corrected for nuisance variables like head motion or physiological artifacts.

Countless studies have demonstrated that these methodological choices can have profound effects on research outcomes (e.g., Dale, 1999, Lund et al., 2005, Mumford and Nichols, 2008, Sladky et al., 2011 and Zhang et al., 2009). The widespread omission of these parameters from research reports, documented here, poses a serious challenge to researchers who seek to replicate and build on published studies. Changing even a single critical methodological decision may qualitatively alter the results of an experiment; changing many decisions at once may exert profound and unpredictable effects on research outcomes. For example, following the failure of one research group (Nieuwenhuis et al., 2007) to replicate a high-profile finding from another lab (Brown and Braver, 2005 and Brown and Braver, 2007), both groups cited differences in methodological parameters like sample size, number of trials, inter-stimulus interval, and even subject nationality in explaining the divergent results. In sum, methodological choices matter, and reporting them matters, too.

#neuroimaging #journal: NeuroImage #statistical power #degrees of freedom #researcher degrees of freedom #standards #methods

Coding error influences public policy

There's an op-ed in the New York Times about how a coding error in an economics article may have had a profoundly negative influence on economic policy:

The other paper, which has had immense influence — largely because in the VSP world it is taken to have established a definitive result — was Reinhart/Rogoff on the negative effects of debt on growth. Very quickly, everyone “knew” that terrible things happen when debt passes 90 percent of GDP.

Some of us never bought it, arguing that the observed correlation between debt and growth probably reflected reverse causation. But even I never dreamed that a large part of the alleged result might reflect nothing more profound than bad arithmetic.

One can't help but wonder whether the error would have been caught sooner had the paper made its raw data and methods publicly available when it was published.

More detail here and here.

#open access #raw data #transparency #methods #source: new york times #journal: national bureau of economic research

Power failure

There's a new article out in Nature Reviews Neuroscience about the failure of scientific studies in general (and neuroscience and fMRI studies in particular) to achieve adequate statistical power. The NRN paper isn't open access, but you can email the authors for a pre-print. There's a good write-up at National Geographic.

The paper discusses the effects of low-powered studies, both in an ideal world and in the world we actually live in. Even in a best-case scenario, underpowered studies harm research: "low power, by definition, means that the chance of discovering effects that are genuinely true is low." By decreasing the number of true positives in the literature, low-powered studies increase the share of false positives among all positive results. Again, from the article:

For example, suppose that we work in a scientific field where one in five of the effects we test are expected to be truly non-null (i.e., R = 1 / (5-1) = 0.25) and that we claim to have discovered an effect when we reach p < 0.05; if our studies have 20% power, then PPV = 0.20 × 0.25 / (0.20 × 0.25 + 0.05) = 0.05 / 0.10 = 0.50; that is, only half of our claims for discoveries will be correct. If our studies have 80% power, then PPV = 0.80 × 0.25 / (0.80 × 0.25 + 0.05) = 0.20 / 0.25 = 0.80; that is, 80% of our claims for discoveries will be correct.
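The arithmetic in that passage is easy to check. Here is a minimal sketch that uses only the formula as quoted, PPV = (power × R) / (power × R + alpha); the function and variable names are my own, not the authors'.

```python
def positive_predictive_value(power: float, pre_study_odds: float, alpha: float = 0.05) -> float:
    """PPV = (power * R) / (power * R + alpha), as in the quoted passage."""
    true_positives = power * pre_study_odds
    return true_positives / (true_positives + alpha)


# One in five tested effects is real, so the pre-study odds R are 1 / (5 - 1) = 0.25.
R = 1 / (5 - 1)

for power in (0.20, 0.80):
    ppv = positive_predictive_value(power, R)
    print(f"power = {power:.0%}, PPV = {ppv:.2f}")
```

With the quoted inputs this reproduces the 0.50 and 0.80 figures from the article.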

They also discuss the "Winner's Curse". An underpowered study is unlikely to reach statistical significance unless it happens to observe an abnormally strong effect, and those are the studies that tend to get published.

To illustrate the Winner’s Curse, suppose that an association truly exists with an effect size that is equivalent to an odds ratio of 1.20, and we are trying to discover it by performing a small (i.e., underpowered) study. Suppose also that our study only has the power to detect an odds ratio of 1.20 on average 20% of the time. The results of any study are subject to sampling variation and random error in the measurements of the variables and outcomes of interest. Therefore, on average our small study will find an odds ratio of 1.20 but, because of random errors, our study may in fact find an odds ratio smaller than 1.20 (e.g., 1.00) or an odds ratio larger than 1.20 (e.g., 1.60). Odds ratios of 1.00 or 1.20 will not reach statistical significance because of the small sample size. We can only claim the association as nominally significant in the third case, where random error creates an odds ratio of 1.60. The Winner’s Curse means, therefore, that the ‘lucky’ scientist who makes the discovery in a small study is cursed by finding an inflated effect.
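A quick simulation makes the inflation visible. This is only a sketch under my own assumptions, not the authors' analysis: each study's estimated log odds ratio is drawn from a normal distribution centered on the true value, and the standard error is chosen so that the true effect sits 1.12 standard errors from zero, which gives roughly 20% power at p < 0.05.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_OR = 1.20                      # true odds ratio from the quoted example
TRUE_LOG_OR = math.log(TRUE_OR)
# Assumption: standard error chosen to give roughly 20% power (see lead-in).
STD_ERROR = TRUE_LOG_OR / 1.12
Z_CRIT = 1.96                       # two-sided 5% critical value


def simulate(n_studies: int = 100_000):
    """Return (share of significant studies, typical OR among significant positive ones)."""
    n_significant = 0
    significant_positive = []
    for _ in range(n_studies):
        estimate = random.gauss(TRUE_LOG_OR, STD_ERROR)
        if abs(estimate) / STD_ERROR > Z_CRIT:
            n_significant += 1
            if estimate > 0:
                significant_positive.append(estimate)
    share = n_significant / n_studies
    # Geometric mean of the odds ratios reported by the "lucky" studies.
    typical_or = math.exp(sum(significant_positive) / len(significant_positive))
    return share, typical_or


share, typical_or = simulate()
print(f"significant: {share:.1%}, typical significant OR: {typical_or:.2f}")
```

Under these assumptions a positive estimate only reaches significance if its odds ratio is above about 1.38, so every "discovery" overstates the true value of 1.20.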

These are major problems - and of course, we don't live in an ideal world. There is publication bias:

Smaller studies more readily disappear into a file drawer than very large studies that are widely known and visible and the results of which are eagerly anticipated (although this correlation is far from perfect). A ‘negative’ result in a high-powered study cannot be explained away as being due to low power, and thus reviewers and editors may be more willing to publish it, whereas they more easily reject a small ’negative’ study as being inconclusive or uninformative. The protocols of large studies are also more likely to have been registered or otherwise made publicly available, so that deviations in the analysis plans and choice of outcomes may become obvious more easily. Small studies, conversely, are often subject to a higher level of exploration of their results and selective reporting thereof.

In addition to making a compelling case about the danger of low-powered studies, the article also provides a meta-analysis of neuroscience studies showing that, yup, they tend to be pretty underpowered.

The authors identified 730 studies by searching 49 meta-analyses which included them. They then calculated each study's power by assuming a significance level of .05 and an effect size equal to that found in the meta-analysis that contained the study. They found that the median statistical power was 21%. (For contrast, the 'standard' taught in intro stats classes is 80%.) Interestingly, the studies fell into two groups - 42 low-powered meta-analyses with an average of 18% power, and 7 high-powered meta-analyses with an average of >90% power.

(The authors admit that these calculations rely on the summary effect sizes reported in the meta-analyses being correct, and agree that it is not an unassailable assumption.)
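To get a feel for what this kind of calculation involves, here is a rough sketch of a power computation for a two-group comparison. It uses a normal approximation and a standardized mean difference; both are my assumptions, the sample sizes are hypothetical, and the paper's own procedure may well differ.

```python
from statistics import NormalDist


def two_sample_power(effect_size_d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test for a standardized mean difference."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    # Expected z-statistic if the true standardized difference is effect_size_d.
    expected_z = effect_size_d * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either critical value.
    return normal.cdf(expected_z - z_crit) + normal.cdf(-expected_z - z_crit)


# Hypothetical inputs: a medium effect (d = 0.5) at two sample sizes.
for n in (15, 64):
    print(f"n per group = {n}, power = {two_sample_power(0.5, n):.2f}")
```

Plugging in the summary effect size from a meta-analysis and each study's sample size gives the study's power under the assumption that the meta-analytic estimate is right, which is exactly the caveat the authors flag.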

The authors also looked at specific subfields. In neuroimaging, the median statistical power was 8%, across 461 individual studies contributing to 41 separate meta-analyses. A look at rat studies found that "the median statistical power for the water maze studies and the radial arm maze studies to detect these medium to large effects was 18% and 31%, respectively".

The article finishes up with a discussion of the ethical consequences of underpowered studies, particularly for animal studies and for clinical trials. It also discusses potential solutions, including raising standards at both the IRB/grant approval stage and at publication, pre-registration of studies, incentivizing replication, and open access to data and materials.

#journal: Nature Reviews Neuroscience #source: National Geographic #statistical errors #statistical power #publication bias #neuroimaging #author: Nosek #author: Ioannidis

Fraud on the rise, especially in the United States and in high impact journals

An article in the Washington Post details an alleged case of fraud that led to an as-yet-uncorrected Nature paper and a suicide:

And within hours of this discovery, a note was sent from Lin’s e-mail account to Yuan. The e-mail, which Yuan saved, essentially blamed him for driving Lin to suicide. Yuan had written to Nature’s editors, saying that the paper’s results were overstated and that he found no evidence that the analyses described had actually been conducted. On the day of his death, Lin, 38, the father of three young daughters, was supposed to have finished writing a response to Yuan’s criticisms.

The article cites a 2012 study published in PNAS which found that two thirds (67.4%) of all retractions were due to misconduct - either fraud (43.4%), duplication (14.2%) or plagiarism (9.8%). They found that the total number of retractions, as well as the percentage of retractions due to fraud, have grown dramatically in recent years.

They looked at misconduct rates in various countries, and although they didn't do any sort of rigorous analysis, they did find that the United States and Germany were responsible for a disproportionate number of retractions due to fraud, whereas China and India were overrepresented in cases of duplication or plagiarism.

They also found that impact factor correlated with retractions due to fraud - a modest effect (R² = 0.08664) but a highly significant one (P < 0.0001).

The authors discuss these results:

The recent increase in the incidence of retractions and the differing patterns by region (Fig. 2) argue that incentives may vary with the type of misconduct. Most articles retracted for fraud have originated in countries with longstanding research traditions (e.g., United States, Germany, Japan) and are particularly problematic for high-impact journals. In contrast, plagiarism and duplicate publication often arise from countries that lack a longstanding research tradition, and such infractions often are associated with lower-impact journals (Fig. 3 and Table 1). A highly significant correlation was found between the journal-impact factor and the number of retractions for fraud or suspected fraud and error (Fig. 3 A and B); the mean impact factor was found to be significantly higher for articles retracted for fraud, suspected fraud, or error, compared with those retracted for plagiarism or duplicate publication (Fig. 3D). An association between impact factor and retraction for fraud or error has been noted previously (4, 6, 29, 30). This finding may reflect the greater scrutiny accorded to articles in high-impact journals and the greater uncertainty associated with cutting-edge research. Alternatively, the disproportionately high payoffs to scientists for publication in prestigious venues can be an incentive to perform work with excessive haste (31) or to engage in unethical practices (4). The modest correlation between impact factor and time-to-retraction argues against an explanation based on increased scrutiny alone, but the higher proportion of fraud in highly prestigious journals is consistent with the suggestion that the benefits of publishing in such venues are powerful incentives for fraud (4, 6, 32). The 20 most highly cited retracted articles (Table 3) include no articles retracted for plagiarism or duplicate publication.

The original Washington Post article gives a more personal sense of the pressures and incentives that researchers face:

During Yuan’s time there, the lab received millions in NIH funding, and according to internal e-mails, the people in the lab were under pressure to show results. Yuan felt the pressure, too, he says, but as the point person for analyzing the statistical data emerging from the experiments, he felt compelled to raise his concerns.

As far back as 2007, as the group was developing the methodology that would eventually form the basis of the Nature paper, Yuan wrote an anguished e-mail to another senior member of the lab, Pamela Meluh.

“I continue to be in a state of chronic alarm,” he wrote in August 2007. “The denial that I am hearing from almost everyone in the group as a consensus is troubling to me.”

Meluh quickly wrote back: “I have the same level of concern as you in terms of data quality, but I have less basis to think it can be better. . . . I’m always torn between addressing your and my own concerns and being ‘productive.’ ”

Then Boeke weighed in, telling Yuan that if he could improve the data analysis, he should, but that “the clock is ticking.”

“NIH has already given us way more time than we thought we needed and at some point we’ve got to suck it up and run with what we have,” Boeke wrote to Meluh and Yuan.

#fraud #publish or perish #funding #retractions #source: the washington post #journal: PNAS

Avoidable waste in the production and reporting of research evidence

An opinion article from Iain Chalmers (of the very well regarded though not fully open access Cochrane Collaboration) and Paul Glasziou (of the Center for Evidence-based Medicine) discusses the high potential for waste in much research.

They list a number of barriers:

Poor engagement of end users of research in research questions and design

Incentives in fellowships and career paths to do primary research even if of low relevance

Poor training in research methods and research reporting

Lack of methodological input to research design and review of research

Incentives for primary research ignore the need to use and improve on existing research on the same question.

Published research fails to set the study in the context of all previous similar research

Non-registration of trials

Failure of sponsors and authors to submit full reports of completed research

Poor awareness and use by authors and editors of reporting guidelines

Many journal reviews focus on expert judgments about contribution to knowledge, rather than methods and usability

Space restrictions in journals prevent publication of details of interventions and tests

For each of these obstacles, the authors recommend one or more solutions.

They quote a medical researcher with myeloma in the introduction:

“Research results should be easily accessible to people who need to make decisions about their own health… Why was I forced to make my decision knowing that information was somewhere but not available? Was the delay because the results were less exciting than expected? Or because in the evolving field of myeloma research there are now new exciting hypotheses (or drugs) to look at? How far can we tolerate the butterfly behaviour of researchers, moving on to the next flower well before the previous one has been fully exploited?”

"This experience is not unusual: a recently updated systematic review of 79 follow-up studies of research reported in abstracts estimated the rate of publication of full reports after 9 years to be only 53%."

#clinical trials #waste #standards #meta-analysis #study registration #conflict of interest #academic culture #publish or perish #journal: the Lancet

Obama administration enacts (qualified) open access

The Obama administration has announced that all research funded by federal agencies (with R&D budgets greater than $100 million) should be made open access within one year of publication:

To that end, I have issued a memorandum today (.pdf) to Federal agencies that directs those with more than $100 million in research and development expenditures to develop plans to make the results of federally-funded research publically available free of charge within 12 months after original publication. As you pointed out, the public access policy adopted by the National Institutes of Health has been a great success. And while this new policy call does not insist that every agency copy the NIH approach exactly, it does ensure that similar policies will appear across government.

It's not clear which agencies are covered by this memo, but at the very least it includes NIH (which already has an open access policy in place), NSF, DARPA, NASA, NIST, and NOAA.

I've also been trying to get a sense of how much scientific research is funded by government agencies, to see what percentage of published research is likely to be affected. I imagine it varies quite a bit from field to field. Industry funds a huge amount of, say, pharmacological research, while social psychology and physics seem to rely more on government funding.

It may also be worthwhile to put pressure on major private donors to adopt an open access policy now that the government has done so.

#open access #open science #government #funding

The harm done by tests of significance

An interesting 2004 article from Accident Analysis and Prevention goes over three case studies in which null hypothesis significance testing (NHST) may have cost lives.

Case 1: Right Turns on Red

Looking at the data in Table 1, persons without training in statistics would think that after RTOR was allowed, these intersections were somewhat less safe. However, the consultant concluded, quite correctly, that the change was not statistically significant. The Commissioner of the Virginia Department of Highways and Transportation sent the consultant’s report to the Governor and in the letter of transmittal wrote: “we can discern no significant hazard to motorists or pedestrians from implementation of the general permissive rule (i.e. of RTOR). No significant increase in traffic crashes has been noted following adoption of right-turn-on-red in any state including Virginia”. Obviously, there was miscommunication. In English ‘significant’ means ‘having or likely to have considerable influence or effect’; the synonym of ‘significant’ is ‘important’. In statistics ‘not significant’ means that the data is insufficient to reject the (null) hypothesis of ‘no effect’. Thus, the consultant said one thing and the Commissioner transmitted something entirely different.

... And so the sequence of small studies all pointing in the same direction but with statistically not significant results continued to accumulate, till that last study which I followed was published in 1983. While 287 crashes to right turning vehicles were expected, 313 were counted. The authors concluded, once again, that there was no significant difference in vehicular crashes.

...The problem is clear. Researchers obtain real data which, while noisy, time and again point in a certain direction. However, instead of saying: “here is my estimate of the safety effect, here is its precision, and this is how what I found relates to previous findings”, the data is processed by NHST, and the researcher says, correctly but pointlessly: “I cannot be sure that the safety effect is not zero”. Occasionally, the researcher adds, this time incorrectly and unjustifiably, a statement to the effect that: “since the result is not statistically significant, it is best to assume the safety effect to be zero”. In this manner, good data are drained of real content, the direction of empirical conclusions reversed, and ordinary human and scientific reasoning is turned on its head for the sake of a venerable ritual.
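The "estimate plus precision" style of reporting the author asks for takes only a few lines. Below is a rough sketch using the 1983 figures quoted above (313 crashes counted, 287 expected). The interval rests on my own simplifying assumptions, not the article's method: the observed count is treated as Poisson, and the expected count is treated as a fixed number with no uncertainty of its own.

```python
import math

observed = 313   # crashes counted after RTOR (from the quoted passage)
expected = 287   # crashes expected without RTOR (from the quoted passage)


def ratio_with_interval(observed: int, expected: float, z: float = 1.96):
    """Point estimate and approximate 95% interval for observed / expected.

    Assumption: Poisson-distributed observed count, expected count taken as exact.
    """
    estimate = observed / expected
    half_width = z * math.sqrt(observed) / expected
    return estimate, estimate - half_width, estimate + half_width


estimate, lower, upper = ratio_with_interval(observed, expected)
print(f"estimated change: {estimate - 1:+.0%} (interval {lower - 1:+.0%} to {upper - 1:+.0%})")
```

Under these assumptions the estimate is roughly a 9% increase, with an interval running from about a 3% decrease to about a 21% increase. That interval includes zero, so the result is "not significant", yet most of it lies on the side of more crashes, which is a very different message from "no effect".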

Case 2: Paved shoulders on rural roads

Once again common sense and statistical ritual point in opposite directions. The figures show that, e.g. after a two-foot paved shoulder has been added, the crash rate has declined for all crash types and all severities. Therefore, ordinary reasoning would lead to the conclusions that paving shoulders has reduced crashes. And yet, because of the paucity of the data, none of these reductions proved statistically significant. But quasi-science wins again; and so, in their Conclusion section the authors write:

The study could not discern any statistically significant differences in either crash rate or severity rate between two- and four-foot shoulder installations. Unless (other) benefits … are considered important to practitioners, this study does not show the increased construction cost of four-foot shoulders on state routes to be justified by an increase in traffic safety (p. 37).

Case 3: Speed Limit Increases

The two above cases could be seen as researchers failing to appropriately communicate their findings to lawmakers. In Case 3, we see researchers themselves misusing NHST to deadly effect:

Table 3. Predicted percentage increase in the number of fatal crashes attributed to the speed-limit increases on rural interstates (from Balkin and Ord, p. 10, Table 3)

State            First % (1987)   Second % (1995)
Alabama          0.0              24.8
Arizona          41.0             0.0
………
Missouri         13.0             42.2
Nebraska         35.5             0.0
………
West Virginia    46.2             0.0
Wisconsin        24.3             0.0

It is obvious that 0.0 is not the best estimate of the change in fatal crashes in all these instances. Why the authors decided to enter 0.0 can perhaps be understood from the numerical example by which they explain their method. In their paper there is a graph of the monthly time series of fatal crashes from 1975 to 1998 for rural interstates in Arizona and, referring to this graph, the authors say (p. 6) that:

“We see a significant increase in the level around 1987 but none around 1995. … Statistically it is estimated that the 1987 speed-limit increase resulted in a 41% increase in rural interstate crashes in Arizona. There is no statistical evidence that the 1995 speed-limit increase has any additional effect on the number of crashes.”

That is, failure to reject the null hypothesis of zero effect at the 10% level of significance was equated with the absence of statistical evidence for an increase in the expected number of crashes. In all these cases, 0.0 was entered in the table. Thus, the table contains two kinds of entries: either estimates of percentage change when the increase was statistically significant, or 0.0 by NHST convention but unsupported by either data or prior-knowledge when the increase was not statistically significant.

The article is behind a paywall. Feel free to message me for a copy of it.

#significance testing #journal: Accident Analysis and Prevention #statistics #statistical errors #meta-analysis
