Abstract
Background Linking nucleotide sequence data (NSD) to scientific publication citations can enhance understanding of NSD provenance, scientific use, and re-use in the community. By connecting publications with NSD records, NSD geographical provenance information, and author geographical information, it becomes possible to assess the contribution of NSD to scientific knowledge gain and to infer trends at the global level.
Findings For this data note, we extracted and linked records from the European Nucleotide Archive to citations in open-access publications aggregated at Europe PubMed Central. A total of 8,464,292 ENA accessions with geographical provenance information were associated with publications. We conducted a data quality review to uncover potential issues in publication citation information extraction and author affiliation tagging, and we developed and implemented best-practice recommendations for citation extraction. Flat data tables and a data warehouse with an interactive web application were constructed to enable ad hoc exploration of NSD use and summary statistics.
Conclusions The extraction and linking of NSD with associated publication citations enables transparency. The quality review contributes to enhanced text mining methods for identifier extraction and use. Furthermore, the global provision and use of NSD enables scientists around the world to join literature and sequence databases in a multidimensional fashion. As a concrete use case, statistics of country clusters were visualized with respect to NSD access in the context of discussions around digital sequence information under the United Nations Convention on Biological Diversity.
Competing Interest Statement
The authors have declared no competing interest.
List of abbreviations
- CBD
- Convention on Biological Diversity
- ITPGRFA
- International Treaty on Plant Genetic Resources for Food and Agriculture
- DOI
- Digital Object Identifier
- EMBL
- European Molecular Biology Laboratory
- ENA
- European Nucleotide Archive
- ePMC
- Europe PubMed Central
- DSI
- Digital Sequence Information - synonym for nucleotide sequence data in international policy circles
- GR
- Genetic Resources
- INSDC
- International Nucleotide Sequence Database Collaboration
- NSD
- Nucleotide Sequence Data - synonym for DSI in a technical and database context
- ORCID
- Open Researcher and Contributor ID
- PGR
- Plant Genetic Resources
- WiLDSI
- German: “Wissenschaftsbasierte Lösungsansätze für digitale Sequenzinformation”; English translation: Science-based Approaches for Digital Sequence Information