Abstract
The ARRIVE (Animal Research: Reporting of In Vivo Experiments) guidelines are widely endorsed but compliance is limited. We sought to determine whether journal-requested completion of an ARRIVE checklist improves full compliance with the guidelines. In a randomised controlled trial, manuscripts reporting in vivo animal research submitted to PLOS ONE (March-June 2015) were allocated to either requested completion of an ARRIVE checklist or current standard practice. We measured the change in proportion of manuscripts meeting all ARRIVE guideline checklist items between groups. We randomised 1,689 manuscripts, 1,269 were sent for peer review and 762 accepted for publication. The request to complete an ARRIVE checklist had no effect on full compliance with the ARRIVE guidelines. Details of animal husbandry (ARRIVE sub-item 9a) was the only item to show improved reporting, from 52.1% to 74.1% (X2=34.0, df=1, p=2.1×10-7). These results suggest that other approaches are required to secure greater implementation of the ARRIVE guidelines.
Background There are widespread failures across in vivo animal research to adequately describe and report research methods, including critical measures to reduce the risk of experimental bias (Kilkenny et al., 2009, Macleod et al., 2015). Such omissions have been shown to be associated with overestimation of effect sizes (Macleod et al., 2015, Hirst et al., 2014) and are likely to contribute, in part, to translational failure. In an effort to improve reporting standards, an expert working group coordinated by the National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs) developed the Animal Research: Reporting of In Vivo Experiments (ARRIVE) guidelines (Kilkenny et al., 2010), published in 2010.
Since the ARRIVE guidelines were first published, they have been endorsed by many journals in their instructions to authors, but this has not been accompanied by substantial improvements in reporting (Baker et al., 2014, McGrath and Lilley, 2015, Gulin et al., 2015a, Avey et al., 2016). Simply endorsing the guidelines does not appear to be sufficient to encourage compliance. Recent findings suggest that following the introduction of mandated completion of a distinct reporting checklist at ten Nature Journals at the stage of first revision significantly improved the quality in reporting versus that of comparator journals (Han et al., 2017, Macleod and The NPQIP Collaborative Group, 2017)
PLOS ONE is an open access online only journal which at the time this study began published around 32,000 research articles per year. Of these, some 5,000 described in vivo research. At present, PLOS ONE instructions to authors encourage compliance with the ARRIVE guidelines, but do not mandate checklist completion. Journals have an important role to play in ensuring that the quality of reporting in the research they publish is robust, yet the most effective mechanism by which they can achieve this remains unclear.
Our aim was to test the impact on the quality of published reports of an intervention which would request, at the time of manuscript submission, that authors complete a checklist detailing where in the manuscript the various components of the ARRIVE checklist were met. This study, to our knowledge, is the first randomised controlled trial of requested ARRIVE guideline completion.
Methods
Methodology and open data
Our protocol, data analysis plan, analysis code, data validation code, and complete dataset are available on the Open Science Framework (http://dx.doi.org/10.17605/OSF.IO/XSJBV)
Ethical approval
We sought an informal ethical opinion from the BMJ Ethics Committee, who were prepared to consider our proposal although it was slightly out of scope. We did this because we were unable at the time to identify an institutional ethics committee who considered this research to fall within their remit. The majority view of the committee was that it was ethical for manuscripts to be randomised between different handling methods; that it was ethical for authors, peer reviewers and academic editors to be kept unaware of the existence of the study while it was in progress; and that it was ethical for the study to receive funding from the NC3Rs.
Randomisation of Manuscripts
We developed (http://doi.org/10.5281/zenodo.1188821) an online platform to support each stage of the project (https://ecrf1.clinicaltrials.ed.ac.uk/iicarus/).
The PLOS ONE editorial process involves an initial screening process, including a determination of whether a manuscript describes animal studies, whether it describes human studies (one manuscript might describe both); and categorises the area of research according to an established taxonomy. For studies reporting the use of animals, checks are carried out to ensure that appropriate institutional animal care and use committee/ethical approvals were in place, and authors of studies perceived to be at high risk – for instance those animal studies which used death as an endpoint - are contacted to provide a valid justification. Manuscripts are then allocated to an academic editor (AE), who assigns peer reviewers as appropriate
Manuscripts submitted to PLOS ONE between March and June 2015 describing in vivo animal research were randomised using the IICARus web platform to receive standard editorial processing (control group) or checklist completion requests (intervention group). The randomisation procedure used minimisation (weighted at 0.75) to ensure that country of origin (of the corresponding author) was balanced between groups.
On submission, authors receive an automated acknowledgement from the publisher that their submission had entered a screening phase. For manuscripts identified during screening to include in vivo research and which were randomised to the intervention, corresponding authors were informed in the post screening email that a completed ARRIVE checklist must be completed before the manuscript could advance through the review process. The email advised that this should include details of the page of their manuscript on which each ARRIVE item was addressed. If the PLOS editorial team did not receive a checklist, it was sent back to authors once more for completion. Manuscripts by authors who did not complete the checklist after the second contact, for any reason, were still passed to the next stage and continued in the study. The content of completed checklists were not checked against the manuscript for compliance at any stage.
Blinded Manuscript Processing
Authors, AEs, and peer reviewers were blinded to the existence of the study. Study personnel took care, in their public comments, not to disclose details of the study or the journal at which the study was being conducted. The journal was not named in the study protocol. If authors enquired as to why their manuscript was being processed differently, they were to be advised that these differences were due to variation within the editorial team in the intensity with which they pursued efforts to improve the review process.
For studies randomised to the control group, PLOS ONE processed the manuscript according to their normal editorial processes.
Once a final decision regarding publication was made, the pre-publication materials for accepted manuscripts were collated by the PLOS editorial team. Where an ARRIVE checklist was included in the accepted materials for publication this was redacted, along with any reference in the text to the submission of a completed ARRIVE checklist. The format of manuscripts largely excluded any evidence that the manuscript was submitted to PLOS ONE. If a reference to PLOS ONE was discovered in the text by internal outcome assessors (within our research group), this was also redacted to prevent any change in behaviour which may result from external outcome assessors knowing which publisher was involved in the study. Where authors stated that the work complies with the ARRIVE guidelines this statement was not redacted. Redacted PDFs of all materials were provided to our research team and uploaded to the IICARus web platform.
Outcome assessment
Our primary outcome was to assess whether the proportion of publications in each group considered to fully comply with all of the ARRIVE criteria was independent of group allocation.
Our secondary outcome measures were to assess whether:
the proportion of publications meeting each of the individual 38 ARRIVE sub-items was independent of group allocation (intervention/ control)
the proportion of studies reporting all Landis criteria (Landis et al., 2012) risk of bias items (randomisation, blinded assessment of outcome, sample size calculation and criteria for exclusion of experimental subjects) was independent of group allocation the proportion of submitted manuscripts accepted for publication was independent of group allocation
Our tertiary outcomes, we assessed whether:
the proportion of publications meeting each of the 38 ARRIVE sub-items was independent of group allocation, stratified by experimental animal
the proportion of studies reporting all of the Landis criteria risk of bias items (blinded assessment of outcome, sample size calculation and criteria for exclusion of experimental subjects), stratified by experimental animal, was independent of group allocation
the proportion of publications meeting each of the 38 ARRIVE sub-items was independent of group allocation, stratified by the country of the address of the corresponding author the proportion of publications meeting each of the 38 ARRIVE sub-items was independent of group allocation, stratified by whether or not the research also contains human data
To examine the feasibility of implementing requests for ARRIVE checklist completion at PLOS ONE, we also assessed the following for accepted manuscripts in each group:
Time (days) spent in PLOS editorial office in handling the manuscript (prior to editor assignment).
Time (days) from manuscript submission to AE assignment.
Time (days) from AE assignment to first reviewer agreed.
Time (days) from AE assignment to first decision
Time (days) from receipt of last review to AE decision In addition, we assessed the following outcome measures for manuscripts which were accepted following resubmission:
Time (days) from initial decision letter to resubmission.
Number of cycles of resubmission.
Time (days) from resubmission to final decision.
We also conducted some exploratory analyses not defined in the study protocol to investigate:
The proportion of publications meeting each of the 38 ARRIVE sub-items for publications in the intervention group where authors had completed an ARRIVE checklist (equivalent to an “on treatment” analysis)
The proportion of publications meeting each Landis criteria item in each group
We operationalised the 38 subitems of the ARRIVE checklist into 108 questions which were scored by trained outcome assessors on the web platform (Appendix 1). It was later determined by the steering committee that 7 questions from the original 108 were not strictly required to comply with the ARRIVE checklist and were therefore excluded from the analysis.
PDF files of manuscripts were available alongside the scoring questions. Each manuscript was scored by two independent reviewers who were blinded to both intervention status and to the score given by the alternative reviewer. Manuscripts were presented to reviewers in random order, and the platform did not allow the same user to review the same manuscript twice. Discrepancies between reviewers were reconciled by a third reviewer, who could view both previous scores.
There were several deviations from the outcome measures specified our study protocol. The time spent in the PLOS editorial office was not disentangled from time with the authors, therefore it includes time for the authors to follow any copyediting changes and requests for documents (including the request to complete an ARRIVE checklist). Similarly, the time spent with authors was also included in the time from manuscript submission to AE assignment. In addition, we had originally intended to analyse the time in the PLOS editorial office in minutes, but the measurement of this was not feasible. We were unable to analyse “The proportion of submitted manuscripts accepted for publication, stratified by experimental animal” (Secondary outcome measure) as we did not receive species categorisation data for studies which were not accepted. PLOS ONE were unable to provide us with one of our specified feasibility measures “Time (days) for each reviewer, from solicitation of reviews to receipt of reviews)” and instead provided “Time (days) from AE assignment to first decision”. “Time (days) from AE assignment to first reviewer agreed” also differs to the outcome described in our study protocol (“Time (days) from AE assignment to solicitation of reviews”). In addition, we had originally set out in our protocol that we would look at the following feasibility outcome measures for manuscripts when the decision was other than “Accept” or “Reject”: whether a revised manuscript is submitted, time (days) from initial decision letter to resubmission, number of cycles of resubmission, time (days) from resubmission to final decision. However, we did not attain this information for manuscripts which were not eventually accepted. Therefore, all feasibility measures apply to accepted manuscripts only.
Reviewer Training
This was a challenging project, and we used crowdsourcing to recruit additional reviewers external to our research group. We used our research networks and social media to identify researchers and students across the biomedical sciences and recruit them as outcome assessors for the project. As an incentive, rewards were given to external reviewers who reached a pre-specified number of manuscript reviews or completed the most reviews in a certain time period.
To ensure that reviewer quality was high, we required reviewers to complete online training prior to reviewing manuscripts as part of the project. We developed a training program with a pool of 10 manuscripts for which we described “Gold standard” correct answers with explanations; and an accompanying document with further elaboration (Appendix 2). To successfully complete the training, external reviewers had to score 80% against these gold standard answers overall and score 100% on gold standard questions relating to the Landis criteria items for three consecutive training papers. The training platform remains available (https://ecrf1.clinicaltrials.ed.ac.uk/iicarus/) and can still be used as a training tool for assessing manuscripts against the ARRIVE guidelines.
Power Calculations
When the study was being designed the PLOS ONE editorial team estimated that complete compliance with the ARRIVE guidelines was close to zero. To have 80% power with an alpha of 0.05 to detect an increase in full compliance from 1% to 10% (the primary outcome) would require 100 published manuscripts per group. To examine each of the individual 38 ARRIVE subitems (Secondary outcome), after correction for multiplicity of testing (alpha = 0.0013) we would require 200 published manuscripts per group to detect with 80% power an increase from 30% to 50% in the prevalence of reporting of an individual subitem. It was estimated that at time of the trial PLOS ONE accepted around 70% of manuscripts, and to account for some drop out because of the use of the same academic editor, we increased our group estimate to 150 manuscripts per group for the primary outcome, and 300 manuscripts in each group for secondary outcomes. During the course of the study it appeared that acceptance rates were lower than the estimate, and so we increased target recruitment to 1000 manuscripts, of which we estimated 600 would be accepted for publication. We did not curtail the study when we had reached the required number of manuscripts accepted for publication because we were concerned that manuscripts with short submission to acceptance times would be enriched in the study population and might not be representative of all manuscripts.
Data Validation
We validated the dataset, blinded to group allocation, to minimise errors. For example, where a paper was assessed as not including fish experiments, “Yes” or “No” responses to IICARus questions only relevant to fish species (e.g. Appendix 1, Questions 9.2.3 - 9.2.4), should not have been recorded. The R code for validation, with explanations of each response validated, and the changes we made to the data are available on the OSF. These were uploaded prior to the unblinding of the final results, at which point database lock occurred and the data were not subsequently altered in any way.
Statistical Analysis
All analyses were carried out using RStudio v1.0.143 with the level of statistical significance set at p<0.05, corrected as appropriate for multiple comparisons. Our full statistical analysis plan and accompanying R code was uploaded to the OSF prior to database lock.
We performed logistic regression with group allocation and corresponding author country of origin included as independent variables to determine any effects on full compliance (Primary Outcome) and compliance with each of the 38 items (Secondary Outcome), adjusting for stratified randomisation. For our primary, secondary, and tertiary outcome measures we used the Chi Squared Test of Independence to test whether compliance was independent of group membership (Intervention/ Control). To determine if the differences between proportions is meaningful for each outcome measure, effect sizes were calculated using Cohen’s H. For feasibility outcomes, medians and inter-quartile ranges were calculated and the Mann-Whitney U test was used to test whether a significant difference existed between groups. To control the familywise error rate of multiple comparisons, the Holm-Bonferroni method (Aickin and Gensler, 1996) was used to adjust p-values for secondary, tertiary, and feasibility outcomes. Only means and confidence intervals were calculated for our exploratory analyses.
Statistical Considerations
The proportion of compliant manuscripts was assessed based on the number compliant manuscripts divided by the number of applicable manuscripts. In some cases, particularly when stratifying by manuscript country of origin or animal species, the number of manuscripts in each group is very low. If the number of applicable manuscripts for any subitem (with or without stratification) was less than 10, we did not perform statistical analysis.
The Chi Squared Test of Independence relies on the assumption that no more than 20% of expected counts are less than 5 and that no individual expected counts are less than 1. In cases where counts were less than 5, a Fisher’s exact test was used.
Unpaired t-tests rely on a normal distribution, therefore if the distribution was non-normal the Mann-Whitney U test, a non-parametric alternative, was used and summary medians and inter-quartile ranges were presented. In the case of parametric data with unequal variance between groups Welch’s t-test was used due to higher reliability.
Results
We randomised 1689 PLOS ONE manuscripts; 845 manuscripts to the intervention and 844 to control. We later excluded 420 manuscripts which in retrospect did not meet the inclusion criteria (largely because they described ex vivo rather than in vivo research). Of the remaining 1269 manuscripts, 672 were accepted for publication (340 control, 332 intervention) and underwent web based outcome assessment (Figure 1). Manuscript allocation to group, and the corresponding number of manuscripts from each country post-randomisation and post-acceptance is shown in Table 1. No authors questioned the differences in manuscript processing occurring within the intervention group. A complete dataset detailing the proportion compliance for each of the 108 questions is available online (http://dx.doi.org/10.17605/OSF.IO/XSJBV).
Quality of Outcome Assessment
360 individuals registered with the online platform; 47 completed reviewer training and 42 contributed at least one outcome assessment. The percentage agreement between the first and second reviewer for each manuscript was high. For the majority (71.6%) of manuscripts, reviewers were in agreement on at least 80% of the questions. The agreement of reviewers varied considerably at the level of each of the 108 individual questions (Supplementary Table 1), from a kappa coefficient of 0.90 (0.86 - 0.93) for Question 1.1 (Is the species of animal model studied reported in the title?) to a worse than chance kappa coefficient of −0.03 (−0.10 - −0.04) for Question 13.2 (Is the unit of analysis for at least one test explicitly specified?). This distribution of kappa agreement is displayed in a histogram (Supplementary Figure 1)
Primary Outcome
No manuscript achieved full compliance with the ARRIVE checklist, therefore there was no difference between intervention and control groups. Compliance with individual ARRIVE items ranged from 8% to 65%. The median compliance was 36.8 % and 39.5% of relevant items in the control and intervention groups respectively.
Secondary Outcomes
Logistic Regression
Manuscript country of corresponding author had no influence on compliance either overall or for any individual subitems. Only one subitem had improved reporting in the intervention group, subitem 9b (Provide details of husbandry conditions e.g. breeding programme, light/dark cycle, temperature, quality of water etc for fish, type of food, access to food and water, environmental enrichment) (increased log odds of compliance by 1.03 (p<0.0001).
Compliance with individual ARRIVE Subitems
Only one ARRIVE item had improved compliance in the intervention group. Reporting of ARRIVE subitem, 9b increased significantly from 52.1% (177/340) in the control group to 74.1% (246/332) in the intervention group (X2 = 34.0, df=1, p<0.0001). Reporting of animal characteristics and health status (Item 14) was very low, with 0.29% (1/339) and 0% (0/332) compliance in the control and intervention groups respectively. Similarly, reporting of animal housing (Item 9a), adverse events (Item 17b), the order of treatment and assessment (Item 11b), implications for replacement, refinement, or reduction (Item 18c), defining primary and secondary outcomes (Item 12), and rationale for experimental procedures (Item 7d) was low, with less than 5% of manuscripts reporting each of these items in both groups. Figure 2 shows the percentage compliance in each group for each ARRIVE subitem in each section of the manuscript.
Reporting of Landis 4 items
Reporting of the Landis 4 criteria (blinding, randomisation, animal exclusions, and use of a sample size calculation) was low and did not differ between groups (X2 = 16.8, df=1, p=0.003). 1.5% of the control group manuscripts (5/340) and 0.9% (3/332) of intervention group manuscripts reported all 4 items of the Landis criteria (Fisher’s estimate for difference = 0.61, df=1, p=0.73).
Manuscript Acceptance
There was no difference in the proportion of accepted manuscripts between the control and intervention groups, being 54.7% (340/622) and 51.3% (322/647) respectively.
Tertiary Outcomes
Compliance by Animal Species
In studies involving mice, reporting of one ARRIVE subitem, 9b (husbandry related) increased significantly from 49.5% (105/211) in the control group to 70.2% (135/192) in the intervention group (X2 = 16.8, df=1, p=0.003). No subitem had significant differences between groups in rat studies. Results are summarised in Table 3a and Table 3b. There was no difference in Landis 4 compliance between animal species.
Feasibility Measures
Re-assignment of academic editors occurred in a small number of cases (7/672), which confounds the recorded time in each stage and prevented us from analysing the feasibility outcomes for these manuscripts. The time from receipt of last review to final AE decision was missing from a significant proportion of the remaining manuscripts (342/665) and so this analysis was not performed. 10 additional manuscripts were also excluded from the feasibility dataset due to missing data on one or more feasibility outcome measures. After these exclusions, the feasibility analysis was performed on 328/340 manuscripts in the control group and 327/332 in the intervention group. For analysis of resubmitted articles, 7 manuscripts were removed as these were accepted at first decision leaving 323/340 in the control group and 325/332 in the intervention group.
Data were skewed so we used Mann-Whitney U test to compare timings between groups. Time spent in the PLOS editorial office was significantly higher (p<0.0001) for manuscripts in the intervention group with a median of 9 days (IQR=6-16.5) compared to the control group with a median of 6 days (3-10). Time from submission to academic editor assignment was also significantly higher in the intervention group (13 days, range 9-22) than in the control group (9 days, range 7-14) (p<0.0001). No significant differences were identified for other feasibility outcomes (Table 4).
Compliance by Country
There were no differences in compliance between control and intervention groups across any corresponding author country of origin. Although we did not set out to compare differences in compliance with different ARRIVE subitems across countries, we present these data in Figure 3.
Human Studies Compliance
In manuscripts without human subjects, reporting of one ARRIVE subitem, 9b (husbandry related) increased significantly from 52.4% (172/316) in the control group to 76.9% (227/295) in the intervention group (X2 = 33.2, df=1, p<0.0001). In manuscripts containing human subjects, compliance also rose from 20.8% (5/24) to 51.35% (19/37) in the intervention group for this subitem, although we were limited by small sample sizes and this change was not found to be significant (X2= 4.47, df=1, p=1).
Exploratory Outcomes
Compliance in True Intervention Group
Despite allocation to the intervention group, a small subset (n=31/332) of authors did not comply with the request to submit a completed checklist and therefore 31 manuscripts were in the intervention group without a completed ARRIVE checklist. We sought to determine compliance with each of the 38 subitems in the “true” intervention group (those submitted with a completed checklist), compared to the control group. The pattern of compliance is similar to that of the full intervention group compared to controls, suggesting that these instances of non-compliance did not impact on results. Summary statistics are presented in Table 3.
Landis Item Individual Compliance
In the ARRIVE guidelines, randomisation and blinding are part of the same subitem therefore a paper should report both randomisation and blinding to be compliant. However, to determine if there had been any changes in individual Landis items we investigated randomisation, blinding, and the other two Landis criteria items (reporting of a sample size calculation, reporting of exclusions) separately (Figure 4). Although we did not analyse these comparisons statistically, there appears to be some improvements in reporting of randomisation and sample size calculations. 29.1% (91/313) of manuscripts in the control group reported whether or not random assignment occurred, compared to 41.5% (125/301) in the intervention group (Cohen’s H effect size = 0.26). While 3.5% (12/40) of control manuscripts reported sample size calculations compared to 7.6% (25/330) in the intervention group (Cohen’s H effect size = 0.18). For the reporting of animal exclusions, 12.6% (43/340) of manuscripts complied in the control group versus 14.5% (48/332) in the intervention group. Finally, 18.8% (63/334) and 19.2% (62/323) of manuscripts reported blinded outcome assessment in the control and intervention groups respectively.
Discussion
Requesting completion of an ARRIVE checklist at submission did not increase full adherence with the ARRIVE guidelines. Compliance with the operationalised ARRIVE checklist was poor overall, with no papers in either group even approaching full compliance; the median compliance was less than 40%, equivalent to around 15 of 38 items; and the intervention only increased compliance with one item, reporting of animal husbandry conditions. There is considerable room for improvement, and this study shows that an editorial policy of making ARRIVE checklist completion “mandatory” without compliance checks has little or no impact.
It may be that simply requesting that authors complete checklist, without any additional editorial checks to determine whether the checklist is truly indicative of compliance, may not be enough to improve adherence to the ARRIVE guidelines. Adherence to the reporting guidelines within in the clinical literature such as CONSORT and STROBE have been widely assessed and may inform interventions to improve compliance with preclinical guidelines. Journal endorsement of these guidelines appear to have improved reporting quality (Prady et al., 2008, Turner et al., 2012), however, it is often unclear what actions journals take to promote adherence (Stevens et al., 2014) and the extent of editorial involvement is likely to have an impact. Prior reports indicate that assessing compliance with reporting guidelines at the stage of peer review leads to a significant improvement of reporting quality (Cobo et al., 2011). Other approaches (e.g. actions on the part of funders or institutions) may also be beneficial, but a successful strategy is likely to be multi-dimensional. Further, the findings reported here and the limited agreement between outcome assessors both in this study and in the recent investigation of study quality following the introduction of a new editorial policy at Nature journals (Macleod and The NPQIP Collaborative Group, 2017) suggests that an important part of guideline development should be refinement of the content; the number of items (with fewer generally being better) and the agreement between assessors.
Our findings are in line with prior reports that endorsement by editors and reviewers has not significantly improved reporting of ARRIVE quality items (Gulin et al., 2015b, Baker et al., 2014). We need therefore a better understanding of the barriers to implementing quality checklists for animal experiments. It has been suggested that requesting checklist adherence at the submission stage may be too late, given the observed correlation between reporting at the planning application stage and at the publication stage (Vogt et al., 2016). The PREPARE (Planning Research and Experimental Procedures on Animals: Recommendations for Excellence) guidelines (Adrian et al., 2017) were published recently and may be a useful tool, in combination with the ARRIVE checklist, to promote a greater focus on experimental rigour at all stages of the research cycle.
Our results contrast with recent reports of improvement in quality following mandated checklist completion following a change in editorial policy at Nature journals (Han et al., 2017, Macleod and The NPQIP Collaborative Group, 2017). However, in both reports study quality was retrospectively assessed in publications published prior to and after the introduction of the Nature quality checklist, which was established in 2015 as part of an organisation wide approach with substantial editorial involvement. In contrast, the current trial investigated an intervention targeted at selected manuscripts, without further editorial involvement. Perhaps unsurprisingly, due to the additional time required for ARRIVE checklist requests, both the number of days manuscripts spent in the PLOS editorial office and the number of days from manuscript submission to AE assignment were found to be significantly longer in the intervention group. The editorial resource required to ensure that all accepted publications meet the requirements of the ARRIVE checklist is likely to be considerable, given that PLOS ONE is a high-volume publisher, with around 44,000 submissions per year. The most feasible and effective way to encourage compliance to the ARRIVE guidelines, or indeed any reporting guideline, remains to be determined, but an ongoing review of interventions to improve adherence to reporting guidelines may shed some light on this issue and direct future investigations (Blanco et al., 2017).
Another consideration is the perceived clarity of the checklist to authors and reviewers. Although reviewer agreement was generally high, a few questions were less well understood by our outcome assessors which suggests the current guidelines may require clearer dissemination among the research community.
Limitations
Due to modest sample sizes we were unable to investigate whether the intervention was more successful in countries with high awareness and adoption of the ARRIVE guidelines such as the United Kingdom, where the ARRIVE guidelines were developed and where many institutions have endorsed them. Furthermore, we did not perform a power calculation for outcomes beyond our primary and main secondary outcomes and it is possible that this, coupled with stringent adjustments for multiplicity of testing in some instances, may have prevented us from detecting any significant differences.
Furthermore, our intervention only involved requests for authors to complete an ARRIVE checklist. PLOS ONE did not fully mandate checklist completion, as manuscripts without a checklist were still allowed to proceed through the trial. Furthermore, PLOS ONE did not evaluate the accuracy of the completed checklists against each manuscript. It is possible that further emphasis on evaluation and checklist adherence may result in an enhancement of study quality.
Our interpretation of compliance was also influenced by our operationalisation of the ARRIVE checklist used for outcome assessment. It was often difficult to determine how many of the details provided in the ARRIVE guidelines were sufficient for full compliance to that ARRIVE item.
There were unforeseen difficulties in attaining data for some outcomes, which meant that we could not assess all outcomes presented in our study protocol. This was most apparent for feasibility outcomes, where there were substantial deviations from our protocol. Furthermore, the project was subject to research waste due to overpowering our primary and secondary outcome measures. As manuscripts submitted to PLOS ONE as part of the study were treated differently, the existence of the study could have leaked to external sources however, to the best of our knowledge, this was not the case.
Since the ARRIVE guidelines were developed by the NC3Rs, no individual employed by the NC3Rs was permitted to conduct any outcome assessment on the IICARus platform.
Conclusions
Research must be described in sufficient detail to allow research users critically to appraise experimental design, to allow them to assess the validity of the findings presented. Replication studies require, for their design, full details of what was done. Transparency in the reporting of research is paramount. Manuscripts must therefore be described in enough detail for readers to understand the research methodology and make informed judgement of quality and risk of bias. At present, reporting quality is, on average, disappointingly poor. However, our findings show that simply requesting that researchers improve reporting is not effective. It may be that a more formal adoption of research improvement strategies, with an original focus on a smaller number of items judged by a stakeholder to be of greatest importance, will allow an incremental approach to enabling and measuring improvement. Furthermore, editorial checks of compliance and further measures to mandate checklist completion may be required to see improvements in quality.
Funding
This study was funded by a joint grant from the National Centre for Reduction, Refinement, and Replacement (NC3Rs), The Wellcome Trust, The Medical Research Council (MRC) and the Biotechnology and Biological Sciences Research Council (BBSRC). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests
ES and MM are in receipt of competitive research grants from the NC3Rs who developed the ARRIVE guidelines. SL was funded by an NC3Rs PhD studentship. ES, MM, DH, and NK are members of an NC3Rs working group to review the ARRIVE guidelines (https://www.nc3rs.org.uk/revision-arrive-guidelines). CM, GM, AC, GA and MD were all editors at PLOS ONE throughout the duration of the study. ES is Editor in Chief at BMJ Open Science. All other authors have no other competing interests to declare.