Abstract
Objectives To test a simplified evaluation of fellowship proposals by analyzing the agreement of funding decisions with the official evaluation, and to examine the use of a lottery-based decision for proposals of similar quality.
Design The study involved 134 junior fellowship proposals (Postdoc.Mobility). The official method used two panel reviewers who independently scored the application, followed by triage and discussion of selected applications in a panel. Very competitive/uncompetitive proposals were directly funded/rejected without discussion. The simplified procedure used the scores of the two panel members, with or without the score of an additional, third expert. Both methods could further use a lottery to decide on applications of similar quality close to the funding threshold. The same funding rate was applied, and the agreement between the two methods was analyzed.
Setting Swiss National Science Foundation (SNSF).
Participants Postdoc.Mobility panel reviewers and additional expert reviewers.
Primary outcome measure Per cent agreement between the simplified and official evaluation method with 95% confidence intervals (95% CI).
Results The simplified procedure based on three reviews agreed with the official funding outcome for 80.6% of applications (95% CI 73.9-87.3). The agreement was 86.6% (95% CI 80.8-92.4) when using the two reviews of the panel members. The agreement between the two methods was lower for the group of applications discussed in the panel (64.2% and 73.1%, respectively), and higher for directly funded/rejected applications (range 96.7% to 100%). The lottery was used in eight (6.0%) of 134 applications (official method), 19 (14.2%) applications (simplified, three reviewers) and 23 (17.2%) applications (simplified, two reviewers). With the simplified procedure, evaluation costs could have been halved and 31 hours of meeting time saved for the two 2019 calls.
Conclusion Agreement between the two methods was high. The simplified procedure could represent a viable evaluation method for the Postdoc.Mobility early career instrument at the SNSF.
Strengths and limitations of this study
▪ The study compared the funding outcomes of a simplified and the official evaluation procedure for junior fellowship applications across different research disciplines.
▪ The study discussed the agreement between the two evaluation methods in the context of the general uncertainty around peer review and estimated the costs that could have been saved with the simplified evaluation procedure.
▪ It is the first study to provide insight into lottery-based decisions in the context of the evaluation of junior fellowship applications.
▪ The study lacks statistical power because the sample size of applications was relatively small.
▪ The study addressed the specific context and evaluation of the SNSF Postdoc.Mobility funding scheme; results may thus not be generalizable to other funding programs.
Introduction
Peer review of grant proposals is costly and time-consuming. The burden on the scientific system is increasing, affecting funders, reviewers, and applicants [1,2]. In response, researchers have studied the review process and examined simplifications. For example, Snell [3] studied the number of reviewers and consistency of decisions and found that five evaluators represented an optimal tradeoff. Graves et al. [4] assessed the reliability of decisions made by evaluation panels of different sizes. They concluded that reliability was greatest with about ten panel members. Herbert et al. [5] compared smaller panels and shorter research proposals with the standard review procedure. The agreement was about 75% between simplified and standard procedures. As an alternative to face-to-face (FTF) panels, the use of virtual, online meetings has also been examined. Bohannon [6] reported that at the National Science Foundation (NSF) and National Institutes of Health (NIH), virtual meetings could reduce costs by one-third. Gallo et al. [7] compared teleconferencing with FTF meetings and found only a few differences in the scoring of the applications. Later studies also found that virtual and FTF panels produce comparable outcomes [8–10].
With virtual formats, panel members still need to attend time-consuming meetings. Using the reviewers’ written assessments without FTF or virtual panel discussions would simplify the process further. Fogelholm et al. [11] reported that results were similar when using panel consensus or the mean of reviewer scores. Obrecht et al. [12] noted that panel review changed the funding outcome of only 11% of applications. Similarly, Carpenter et al. [8] found that the impact of discussions was small, affecting the funding outcome of about 10% of applications. Pina et al. [13] studied Marie Curie Actions applications and concluded that ranking applications based on reviewer scores might work for some but not all disciplines. In the humanities, social and economic sciences, an exchange between reviewers may be particularly relevant. The triaging of applications has also been examined: after an initial screening, noncompetitive and very competitive proposals are either directly rejected or funded. Vener et al. [14] validated the triage model of the NIH and found that the likelihood of erroneously discarding a competitive proposal was very small. Bornmann et al.’s [15] findings on a multi-stage fellowship selection process also supported the use of a triage.
Mandated by the government, the Swiss National Science Foundation (SNSF) is Switzerland’s foremost funding agency, supporting scientific research in all disciplines. Following innovations in career funding, the SNSF will experience a significant increase in applications for the junior “Postdoc.Mobility” fellowship scheme, which offers postdoctoral researchers a stay at a research institution abroad for up to 24 months. The aim of this work was to compare the evaluation of applications by expert review, triage, and discussion in an evaluation panel with an evaluation by expert reviews only.
Methods
Sample
We included applications submitted for the August 2019 Postdoc.Mobility fellowship call. We also included applications by Postdoc.Mobility fellows for a return grant to facilitate their return to Switzerland. Both fellowship and return grant applications were evaluated according to the same criteria by one of five panels: Humanities; Social Sciences; Science, Technology, Engineering, Mathematics (STEM); Biology; or Medicine.
Study design
We compared funding outcomes based on the official, legally binding evaluation with a simulated, hypothetical evaluation. The official evaluation consisted of a triage of applications based on expert reviews, followed by a discussion of the meritorious applications in an FTF panel: the Triage-Panel Meeting (TPM) format (Figure 1). In a first step, each proposal was independently reviewed and scored by two panel members. For the assessment, the evaluation criteria defined in the Postdoc.Mobility regulations [16] were applied. The criteria address different aspects of the applicant, the proposed research project, and the designated research location. Panel members used a 6-point scale: outstanding=6 points, excellent=5 points, very good=4 points, good=3 points, mediocre=2 points, poor=1 point. Applications were then allocated to three groups based on the ranking of the mean scores given to each proposal: Fund without further discussion (F in Figure 1), Discuss in panel meeting (D), and Reject (R). Panel members could request that applications in the F or R group be reallocated to D and discussed. In a second step, the D proposals were discussed in the FTF panel meeting, ranked, and funded or rejected. Random Selection (RS in Figure 1) could be used to fund or reject proposals of similar quality close to the funding threshold if the panel could not reach a decision. The actual funding decisions were based on this standard two-stage TPM method.
The simulated alternative procedure consisted only of the first step, i.e., it relied entirely on the ranking of proposals by their expert review scores: the Expert Review-Based (ERB) evaluation. In addition to the two panel members, a third expert reviewer who was not a member of the panel assessed the proposal. The same 6-point scale was used. The proposals were then allocated to one of three groups based on the mean scores (F, RS, and R in Figure 1). Random selection was used whenever the funding line went through a group of two or more applications with identical scores. The funding rate of the TPM was applied to the simulated ERB method.
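The ERB allocation rule described above can be illustrated with a minimal sketch. This is not the code used in the study: the function name `erb_allocate`, the data layout, and the example scores are hypothetical, and the sketch assumes that ties are defined by exactly equal mean scores.

```python
import random
from statistics import mean

def erb_allocate(scores, n_funded, rng=None):
    """Allocate proposals to Fund (F), Random Selection (RS) or Reject (R).

    scores:   dict mapping a proposal id to its list of reviewer scores (1-6)
    n_funded: number of proposals to fund (derived from the TPM funding rate)
    Returns (groups, funded): groups maps id -> "F"/"RS"/"R", funded is a set.
    """
    rng = rng or random.Random()
    means = {pid: mean(vals) for pid, vals in scores.items()}
    ranked = sorted(means, key=means.get, reverse=True)
    n_funded = max(0, min(n_funded, len(ranked)))

    # Default: the funding line falls cleanly between two different scores.
    groups = {pid: ("F" if i < n_funded else "R") for i, pid in enumerate(ranked)}
    funded = set(ranked[:n_funded])

    if 0 < n_funded < len(ranked):
        cutoff = means[ranked[n_funded - 1]]
        if means[ranked[n_funded]] == cutoff:
            # The funding line goes through a group of identical mean scores:
            # the whole group enters the lottery for the remaining slots.
            tied = [pid for pid in ranked if means[pid] == cutoff]
            above = [pid for pid in ranked if means[pid] > cutoff]
            for pid in tied:
                groups[pid] = "RS"
            funded = set(above) | set(rng.sample(tied, n_funded - len(above)))
    return groups, funded

# Hypothetical example: four proposals, three reviews each, two can be funded.
example = {"A": [6, 6, 5], "B": [5, 5, 5], "C": [5, 5, 5], "D": [3, 4, 3]}
groups, funded = erb_allocate(example, n_funded=2, rng=random.Random(1))
```

In this example, proposal A is funded directly, D is rejected, and B and C share the same mean score at the funding line, so one of them is drawn at random.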
Data analysis
To determine the agreement between the two evaluation methods, we used 2×2 contingency tables. We calculated the simple agreement with 95% Wald confidence intervals (CI) for proportions. We also examined the agreement between the TPM and the ERB approach using only the assessments from the two panel members, thus excluding the assessment from the third reviewer. We calculated discipline- and gender-specific levels of agreement and tested for differences in agreement between disciplines and genders using chi-squared tests for categorical data.
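As an illustration of the agreement measure, the following sketch computes simple agreement from paired funding decisions and a 95% Wald interval for a proportion. The function names are hypothetical. Truncating the interval at 0 and 100% is an assumption, made because some of the intervals reported in the Results end at exactly 100. The count of 108 agreeing decisions is not stated in the text; it is back-calculated here from the reported 80.6% of 134 applications.

```python
from math import sqrt

def simple_agreement(official, simulated):
    """Count paired decisions (True = funded) on which two methods agree.

    Agreement is the sum of the diagonal of the 2x2 contingency table:
    funded by both methods plus rejected by both methods.
    """
    if len(official) != len(simulated):
        raise ValueError("decision lists must have the same length")
    agree = sum(o == s for o, s in zip(official, simulated))
    return agree, len(official)

def wald_ci(k, n, z=1.96):
    """Proportion k/n with a Wald confidence interval, truncated to [0, 1]."""
    p = k / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Assumed count: 108 of 134 agreeing decisions (inferred from 80.6%).
p, low, high = wald_ci(108, 134)
print(f"{100 * p:.1f}% (95% CI {100 * low:.1f}-{100 * high:.1f})")
```

With these inferred counts, the sketch yields values consistent with the overall agreement reported for the ERB evaluation with three reviewers.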
Costs
We determined the costs related to the evaluation. The costs comprised expenses related to the scientific assessment of the individual applications and the panel meetings. The SNSF compensates panel reviewers with USD 275 per scientific assessment. Panel reviewers further receive a meeting allowance of up to USD 550 depending on the duration of the meeting. Further, the SNSF reimburses travel expenses and accommodation costs. The five panels included 96 members and met twice in 2019.
Ethics approval
The Ethics Committee of the Canton of Berne confirmed that the study does not fall under the Federal Act on Research involving Human Beings. No reviewer, applicant or application can be identified from this study.
Results
Study sample and success rates
The sample consisted of 134 applications, including 124 fellowship applications and ten requests for a return grant. The mean age of applicants was 32.7 years (SD 3.2 years) in men and 33.5 years (SD 2.8 years) in women. Each reviewer received a mean of 2.5 (SD 1.4) applications to evaluate.
Table 1 shows the distribution of applications and success rates across disciplines, genders and the three evaluation methods: the legally binding TPM format and the simulated ERB evaluations with three or two reviewers. Most applications came from biology, followed by the STEM disciplines and the social sciences. Almost two-thirds of applications came from men. With TPM, success rates were slightly higher in women (60.4%) than in men (50.0%). This was driven by the middle group of applications that were discussed in the panels, where the success rate of women was 66.7% (24 of the 67 applicants in this group were women). Success rates were similar across disciplines, ranging from 56.2% in the humanities to 52.2% in the social sciences. By design, overall success rates were the same with the ERB evaluations; however, the difference between genders was smaller with ERB than with TPM (Table 1).
Agreement between evaluation by ERB or TPM
Comparing the ERB evaluation based on three reviewers with the standard TPM format, the agreement overall was 80.6% (95% CI 73.9-87.3). The agreement was highest in the Medicine panel (90.0%; CI 76.9-100), and lowest in the Social Sciences panel (73.9%; CI 56.0-91.8). However, the statistical evidence for differences in agreement between panels was weak (P=0.58, Table 2). As expected, the agreement was higher when comparing the ERB evaluation based on the two panel members with TPM. Overall, for two reviews, the agreement was 86.6% (95% CI 80.8-92.4). It ranged from 75.0% (CI 53.8-96.2) in the Humanities panel to 91.3% (CI 79.8-100) in the Social Sciences panel (P=0.51). For ERB evaluation with both three and two reviewers, the agreement was slightly higher for women than for men (P>0.70, Table 3).
In Table 4, we calculated agreement separately for the triage categories: Fund (F), Discuss (D), Reject (R). With the ERB evaluation based on three reviewers, agreements for F and R were close to 100% (97.3% and 96.7%, respectively) but considerably lower for D: 64.2% (95% CI 52.7-75.7), with P<0.001 for differences in agreement across categories from the chi-squared test. For ERB evaluation with two reviewers (the two panel members), the agreement was 100% for F and R, but 73.1% (95% CI 62.5-83.7) for D, with P<0.001 for differences in agreement.
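The comparison across triage categories can be sketched as a Pearson chi-squared test on a 3×2 table of agreeing and disagreeing decisions. The cell counts below are assumptions: they are back-calculated from the reported percentages for three reviewers (36 of 37 for F, 43 of 67 for D, 29 of 30 for R) and are not taken from Table 4. The sketch uses no continuity correction, which may differ from the test settings used in the study, and the function name is hypothetical. For 2 degrees of freedom, the chi-squared tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
from math import exp

def chi2_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df

# Rows: F, D, R. Columns: agree, disagree. Counts are inferred, see lead-in.
inferred_counts = [[36, 1], [43, 24], [29, 1]]
stat, df = chi2_independence(inferred_counts)

# Closed-form upper tail of the chi-squared distribution, valid only for df == 2.
p_value = exp(-stat / 2) if df == 2 else None
print(f"chi2 = {stat:.1f}, df = {df}, P < 0.001: {p_value < 0.001}")
```

Under these assumed counts the test gives a P value well below 0.001, in line with the result reported above.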
Random selection in TPM and ERB evaluation
With the standard TPM evaluation, only eight applicants were entered into a lottery, corresponding to 11.9% of the 67 applicants in the D group or 6.0% of all 134 applicants; four of them were funded. With the simulated ERB evaluation based on three reviewers, 19 (14.2%) of the 134 applicants would have entered the lottery, and with the ERB evaluation based on two reviewers, 23 (17.2%) applications would have been subjected to random selection.
Cost savings
We determined the resources that could be saved with the use of an ERB evaluation compared to the TPM. Compared with the currently valid TPM evaluation procedure for Postdoc.Mobility, we calculated that about USD 91,000 in meeting-related costs could have been saved if an ERB evaluation had been used for the two Postdoc.Mobility calls in 2019. This saving corresponds to 55% of total costs. Moreover, all panel sessions held in 2019 amounted to a total duration of 31 hours. This represents a significant workload that could have been eliminated with the use of the ERB approach.
Discussion
In this comparative study of the evaluation of early-career funding applications, we found that the simulated funding outcomes of a simplified, expert review-based (ERB) approach agreed well with the official funding outcomes based on the standard, time-tested triage and panel meeting (TPM) format. Applications for fellowships covered a wide range of disciplines, from the humanities and social sciences to STEM, biology and medicine. The agreement was very high for proposals which, in the TPM evaluation, were either allocated to the Fund or Reject categories, but lower in the middle category of proposals that were discussed by the panels. More applicants entered the lottery with the simplified ERB approach than with TPM evaluation. Finally, the simplified ERB evaluation approach was associated with a substantial reduction in costs. Overall, our results support the notion that a sound evaluation of early-career funding applications is possible with an ERB approach.
Although panel review is considered the de facto standard, the consistency of decisions from panels has been shown to be limited. For example, previous work by Cole [17], Hodgson [18], Fogelholm [11] and Clarke [19] found an agreement of 65% to 83% between two independent panels evaluating the same set of applications. Thus, in these studies, the funding outcome also depended on the panel that evaluated the application, and not only on the scientific content. Against this background, the agreement of over 80% between ERB and TPM in this study is remarkable. Among the different discipline-specific review panels, our results showed a slightly lower agreement in the humanities and social sciences compared to life sciences and medicine. These differences did not reach conventional levels of statistical significance but were in line with previous findings reported by Pina et al. [13].
In the middle group of applications based on the triage step of TPM, the agreement was lower: 64% with three reviewers and 73% with two reviewers. This is not surprising considering the results from previous studies that suggest that peer review has difficulties in discriminating between applications that are neither clearly competitive nor noncompetitive [20–22]. Agreement between ERB and TPM was also generally lower with ERB using three reviewers than with ERB using two reviewers. An additional reviewer may introduce a different viewpoint. Also, the third reviewer was not a member of the corresponding panel, and not involved in previous panel discussions, which have led to some degree of calibration between assessments of panel members. Such calibration is more difficult to achieve with a remote, ERB approach. However, information and briefing sessions could be held to compensate for the lack of FTF panel meetings. Of note, previous studies reported that reviewers appreciated the social aspects and the camaraderie in FTF settings and that physical meetings are important for building trust among the evaluators [8,9].
We found that the panel discussions in the TPM format resulted in higher success rates for women compared to the ERB format. Gender equality is a key concern at the SNSF, which is committed to promoting women in research. The panels will have been aware of the under-representation of female researchers in certain areas, for example, the STEM disciplines, and the SNSF’s agenda to promote women. It is, therefore, possible that during the panel deliberations and for funding decisions, the gender of applicants was taken into account in addition to the quality of the proposal.
We estimated that about USD 91,000 could have been saved for the two Postdoc.Mobility calls in 2019 if they had been evaluated by ERB rather than by TPM. The meeting costs represented about 55% of the total evaluation costs. In other words, the ERB evaluation based on the two panel reviewers would have cut the expenses by more than half. The experience described here with the junior Postdoc.Mobility fellowship scheme indicates that substantial cost savings could also result from simplifications in the evaluation of other funding instruments at the SNSF. However, any such changes need to be considered carefully. The quality of the evaluation should not be allowed to be compromised because costs may be reduced.
To the best of our knowledge, the Health Research Council of New Zealand (HRC-NZ) [23], the Volkswagen Foundation [24], and recently the Austrian Research Fund FWF [25] are the only funders that have used or examined the use of a random selection element in the evaluation process of funding instruments, with a focus on transformative research or unconventional research ideas. The random selection for decisions on applications close to the funding threshold could avoid bias if evaluation criteria do not allow any further differentiation for a small set of similarly qualified applications [22,26]. The applicants were informed about the possible random selection and the evaluation process thus complied with the San Francisco Declaration on Research Assessment (DORA) [27], which states that funders must be explicit about assessment criteria. Some panel members had reservations about the random selection approach, but acceptance grew over time. Of note, panels applied the random selection only in a few cases, in eight (6.0%) of 134 applications. In the context of the Explorer Grant scheme of the HRC-NZ, Liu et al. [28] recently reported that most applicants agreed with the use of a random selection. In this study, no negative or positive reactions to the use of random selection were received from applicants.
Our study has several limitations. It addressed the specific context of the SNSF Postdoc.Mobility funding scheme and results may not be generalizable to other funding instruments. The sample size was relatively small, and the study lacked statistical power, for example, to examine differences in agreement between TPM and ERB evaluation across disciplines. The two evaluation methods were not independent, since the two assessments of the panel reviewers were used for both methods. We relied on reviewer evaluation scores, which might not always perfectly reflect the quality of the proposed project, might be biased, and might depend on the reviewers’ previous experience with grant evaluation. However, our study design allowed us to investigate the impact of panel meetings on funding outcomes compared to an ERB approach. This study provides further insights into peer review and a modified lottery selection approach in the context of the evaluation of fellowship applications. More research on the limitations inherent in peer review and grant evaluation is urgently needed. Funders should be creative when investigating the merit of different evaluation strategies [29].
Conclusions
In conclusion, we simulated an ERB approach in the evaluation of the junior Postdoc.Mobility funding scheme at the SNSF and compared the funding outcome to the standard TPM format, which has been in use for many years. We found an overall high agreement between the two methods. Discrepancies were mainly observed in the middle group of applications that were discussed in the panel meetings. Based on the evidence that peer review has difficulties in making fine-grained differentiations between meritorious applications [20–22], we are unsure which method performs better. Our findings indicate that the ERB approach represents a viable evaluation method for the Postdoc.Mobility selection process that could save cost and time which could be invested in science and research.
Contributors
Conceived and designed the experiments: MB KR ME. Performed the experiments: MB KR. Analyzed the data: KR RH. Contributed reagents/materials/analysis tools: MB KR ME RH. Wrote the initial draft: MB. Contributed to writing: KR ME RH.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Acknowledgements
We thank the Management of the SNSF Administrative Offices for approving additional resources for conducting this study. We also thank the SNSF Postdoc.Mobility staff of the Administrative Offices for their excellent support in implementing the additional reviewers used for the study.