Abstract
The Wheat@URGI portal (https://wheat-urgi.versailles.inra.fr) has been developed to provide the international community of researchers and breeders with access to the bread wheat reference genome sequence produced by the International Wheat Genome Sequencing Consortium. Genome browsers, BLAST, and InterMine tools have been established for in depth exploration of the genome sequence together with additional linked datasets including physical maps, sequence variations, gene expression, and genetic and phenomic data from other international collaborative projects already stored in the GnpIS information system. The portal provides enhanced search and browser features that will facilitate the deployment of the latest genomics resources in wheat improvement.
Background
The International Wheat Genome Sequencing Consortium (IWGSC) [1] is an international collaborative group of growers, academic scientists, and public and private breeders that was established to generate a high quality reference genome sequence of the hexaploid bread wheat, and to provide breeders with state-of-the-art tools for wheat improvement. The vision of the consortium is that the high quality, annotated ordered genome sequence integrated with physical maps will serve as a foundation for the accelerated development of improved varieties and will empower all aspects of basic and applied wheat science to address the important challenge of food security. A first analysis of the reference sequence produced by the consortium (IWGSC RefSeq v1.0) was recently published [2].
To ensure that wheat breeding and research programs can make the most of this extensive genomic resource, the IWGSC endorsed the establishment of a data repository at URGI (Unité de Recherche Génomique Info / research unit in genomics and bioinformatics) from INRA (Institut National de la Recherche Agronomique / French national institute for agricultural research) to develop databases and browsers with relevant links to public data available worldwide. The IWGSC data repository is thus hosted by URGI to support public and private parties in data management as well as analysis and usage of the sequence data. Wheat functional genomics (expression, methylation, etc.), genetic, and phenomic data has increased concurrently, requiring the development of additional tools and resources to integrate different data for biologists and breeders. To manage this escalation of data, URGI have built this data repository for the wheat community with the following specific aims: (i) store resources for which no public archive exists (e.g. physical maps, phenotype information); (ii) enable pre-publication access to specific datasets (e.g. sequence assemblies and annotations, physical maps, markers); and (iii) rapid release of integrated resources upon publication. The repository has been designed in accordance with the “FAIR” principles [3] to ensure that the data are Findable, Accessible, Interoperable and Reusable. To address the challenge of integrating diverse data types from multiple sources, URGI employs solutions that provide enhanced features for data exploration, mining and visualisation using the GnpIS information system [4] combined with a high level of data interoperability.
Here we describe the data and tools currently available through the Wheat@URGI portal [5], the primary resource for the reference sequence of the bread wheat genome (IWGSC RefSeq v1.0) and other IWGSC wheat genomic data. The links to functional genomics, genetic and phenomic data from many other large wheat projects are also described.
A large wealth of data is available throughout the Wheat@URGI portal
The data hosted by the Wheat@URGI portal are available through flat files stored in the IWGSC data repository and through the GnpIS information system [4]. GnpIS encompasses a set of integrated databases to manage genomic data using well-known tools such as BLAST, JBrowse, GBrowse and InterMine, and an in-house database called GnpIS-coreDB developed by URGI to manage genetic and phenomic data.
IWGSC data
Through its concerted efforts to achieve a high quality, functionally annotated reference wheat genome sequence, the IWGSC has developed a variety of resources for the bread wheat (Triticum aestivum L.) accession Chinese Spring. The IWGSC data hosted in the Wheat@URGI portal within the IWGSC data repository are shown in Table 1. They fall into four broad categories: (i) physical maps, (ii) sequence assemblies and annotations, (iii) gene expression, and (iv) variation data.
Physical maps
physical maps assembled by IWGSC scientists for the 21 bread wheat chromosomes, based on high information content fluorescence fingerprinting (HICF) [6] or whole genome profiling (WGP™) [7] of flow-sorted chromosome or chromosome-arm specific BAC libraries, are stored and displayed. The positions of individual BAC clones, markers, and deletion bins are mapped onto physical contigs. The database maintains all released versions of each physical map with the software used to produce the BAC clone assemblies (FPC [8] or LTC [9]), information from the group that produced the map and a link to order the BAC clones from the French plant genomic resource centre [10].
Sequence assemblies and annotations
the IWGSC wheat genome sequence assemblies available for download, BLAST [11], and display in genome browsers include the draft survey sequence assemblies released in 2014 (IWGSC Chromosome Survey Sequence (CSS) v1) and two improved versions (CSS v2 and v3) [12], and the chromosome 3B reference sequence (the first reference quality chromosome sequence obtained by the consortium) [13]. Associated with these assemblies are the virtual gene order map generated for the CSS (Genome Zipper), the POPSEQ data used to order sequence contigs on chromosomes [14] and mapped marker sets. The reference sequence of the bread wheat genome (IWGSC RefSeq v1.0, 14.5 Gb assembly with super scaffold N50 of 22.8Mb) was obtained by integrating whole genome shotgun Illumina short reads assembled with NRGene’s DeNovoMAGIC™ software with the wealth of IWGSC map and sequence resources [2]. The IWGSC RefSeq v1.0 is available for download, BLAST, and browser display. Users can access the whole genome, pseudomolecules of individual chromosomes or chromosome arms, and scaffolds with the structural and functional annotation of genes, transposable elements, and non-coding RNAs generated by the IWGSC. In addition, mapped markers as well as alignments of nucleic acid and protein evidence supporting the annotation are available. Updated versions of the annotation for genes belonging to specific gene families or regions of specific chromosomes that have been manually annotated (ca. 3685 genes) can be found in the IWGSC RefSeq v1.1 annotation.
In addition to the bread wheat sequence, the IWGSC also assembled seven diploid and tetraploid wheat related species: Triticum durum cv. Cappelli, Triticum durum cv. Strongfield, Triticum durum cv Svevo, Triticum monococcum, Triticum urartu, Aegilops speltoides, Aegilops sharonensis [12]. Download and BLAST is available for these data.
Expression data
RNA-Seq expression data are available as reads counts and transcripts per kilobase million (TPM) for the IWGSC RefSeq v1.1 annotation. It is a transcriptome atlas developed from 850 RNA-Seq datasets representing a diverse range of tissues, development stages and environmental condition [15].
Wheat gene pool
As well as IWGSC resources, URGI also hosts other open access wheat sequence data to facilitate research into the wheat gene pool. Sequence assemblies available for download and BLAST include the bread wheat whole genome sequence assembly Triticum aestivum TGACv1 [18] and the diploid progenitor of Aegilops tauschii [19].
Genetic and phenomic data
In addition to sequence data, the Wheat@URGI portal hosts, within GnpIS-coreDB, several sets of genetic and phenomic wheat data [20] that have been produced from French, European, and international projects since 2000 [21]. A significant amount of these data is available without restriction. However, access to restricted data can be obtained through a material transfer or intellectual property agreement. Table 2 presents the types and number of genetic and phenomic data hosted in the GnpIS-coreDB database.
Genetic information corresponds to genetically mapped markers, quantitative trait loci (QTLs), genetic resources (germplasms), and genetic studies (genome wide association studies - GWAS). Genomic information consists of variation from SNP discovery experiments, genotyping, comparative genomics (synteny) and expression data (microarray, RNA-Seq). Phenomic data are available as whole trials including phenotypic and environmental observations recorded using ontologies controlled variables with MIAPPE [22] compliant metadata.
Germplasm data were mainly provided by the French small grain cereals genebank maintained by INRA at Clermont-Ferrand [23] but also by partners of several EU projects. They were linked together with related genotyping or phenotyping characterization data. Generally, genetic and phenomic data have been produced by INRA and its partners in large collaborative projects.
Browsing and searching a large variety of integrated data
Data can be easily accessed through the Wheat@URGI portal [5] using (i) tabs at the top of the pages allowing access in one click to the data, tools, and projects descriptions as well as the IWGSC data repository, (ii) direct links on the home page to the different data types (e.g. clicking on “Physical maps” opens the physical maps browser), and (iii) data discovery and InterMine [24] tools on the home page.
The IWGSC data repository [25] allows accessing consortium data by (i) clicking on a chromosome to open a pop-up menu with all related data (e.g. 3A, 3B, etc.), (ii) using the tabs on the left to access the data by type (e.g. Assemblies, Annotations, etc.), or useful links to the news, the BLAST tool, the FAQ, the access status of the data (e.g. open access), etc.
Physical maps browser
A GBrowse [26] displays the physical maps generated by the IWGSC members [27]. A clickable image on the top of the browser gives access to all versions of the physical map for each chromosome. The browser displays physical contigs, BACs, deletions bins, and markers. From the BACs track, it is possible to order BAC clones directly at the INRA French plant genomic resource centre [10]. From the BACs and markers tracks, one can go directly to the corresponding region in the IWGSC RefSeq v1.0 browser.
Genome browser and BLAST
The IWGSC RefSeq v1.0 is displayed in a dedicated JBrowse [28], [29]. The “markers track” provides links to additional genetic information stored in GnpIS-coreDB which includes access to the position of the marker in cM on genetic maps and to the overlapping QTLs. The most popular tool of the IWGSC data repository is the BLAST search tool (476,000 BLAST searches launched in 2017). All of the wheat sequences available on the Wheat@URGI portal are indexed for BLAST search (see [30] for the complete list). A set of databanks can be selected: e.g. IWGSC RefSeq v1.0 and IWGSC CSS v3 for a given chromosome. The result is presented in a classical tabular format with (i) links to download the data (matching contigs and high scoring pairs - HSP), (ii) links on the genome browsers directly zooming in on the matching region and (iii) external links to EnsemblPlants [31].
Genetic and phenomic data in GnpIS-coreDB
The IWGSC sequence data are linked to genetic and phenomic data within the GnpIS information system [4]. This integration is organized around key data, also called “pivot data” as they are pivotal objects which allow integration between data types. The key objects used to link genomic resources to genetic data are markers and QTLs. Markers are mapped on the genome sequences and provides information on neighbour genes and their function. They also have links to GnpIS-coreDB genetic maps, QTLs, genotyping and GWAS data. Additional information on the marker itself can be found regarding the marker type (e.g. SSR, DArT), the primers sequence for PCR amplification, and SNP details (including the flanking sequences) when relevant. QTLs link the genetic data to the phenomic data in GnpIS-coreDB and to synteny data displayed by the PlantSyntenyViewer tool [32], [33].
The accession (i.e. germplasm) and the variables (i.e. observed trait) described with dedicated ontologies are another important key data for genetic studies as they allow linking phenotype data to genetic associations or QTLs through traits and to genotype diversity data. The genetic resources stored in GnpIS-coreDB displays the unambiguous identification of the accession used (with digital object identifier - DOI) and a rich set of associated data following the MCPD (multi-crop passport descriptors, [34]) standard: a picture, synonyms, descriptors, geolocation of the sites (origin, collecting and evaluation), the collections or panels it belongs to, the holding stock centre with a link to order the accession when possible. The phenotype data includes traceability on trials with timing, like year and temporal series, location, and environment including soil and cultural practices. The phenotype and environment variables follow the crop ontologies format [35] that includes unique identifiers for each variable which are composed of a trait description (e.g. grain yield, plant height top, spike per area), a unit and a method. All these data are displayed in the GnpIS-coreDB web interface and can be downloaded in different file formats, all compliant with the MIAPPE standard [22].
Mining and data discovery tools
To complete this already rich integrated datasets, a gene centric data warehouse, the WheatMine, has been set-up using the well-established InterMine tool [24]. The gene card displays gene function, gene ontology terms, and overlapping genomic features. WheatMine [36] provides access to IWGSC RefSeq v1.0 annotation data (genes, mRNA, polypeptides, transposable elements), polymorphisms (markers) and, through pivotal objects, to genetic data (QTL, metaQTL). It is also possible to navigate from a gene card to its position on the wheat genome browser or to relevant marker details in GnpIS-coreDB.
Figure 1 presents the concept and the tools to navigate through the key data in GnpIS.
Finally, to facilitate data search and access to this wealth of data, we developed a data discovery tool, which, similar to a google search, allows the user to enter keywords or terms to find all the matching information in the various data warehouses. The results are presented in a table with details on the matches (database source, type, species, description) and a direct link to the feature (e.g. a gene in a browser, a marker page in GnpIS-coreDB, etc.).
A practical use case describing how to use the Wheat@URGI portal to go from a gene sequence to find the genetic studies related is detailed in the Supplementary data.
Conclusion and future directions
The Wheat@URGI portal hosts and gives access to essential, high quality wheat data from the IWGSC, European, and international projects. Furthermore, its added value is that it integrates different data type altogether (genomics, genetics and phenomics) and provides dedicated tools to explore them.
As new wheat resources such as GWAS, genomic selection, and pan-genome data are generated in the frame of ongoing projects, GnpIS will allow their management and integration with other data already available in the information system, linking new upcoming data to this central IWGSC genomic resource.
At a wider scale, an expert working group (EWG) of the international Wheat Initiative build an international wheat information system, called WheatIS, with the aim of providing a single-access web-based system to all available wheat data resources and bioinformatics tools [37]. The Wheat@URGI portal is a major node of the WheatIS federation that expose genomic, genetic and phenomic integrated data to the community. The WheatIS data discovery tool allows a one-stop search in GnpIS [4] (including IWGSC browsers, InterMine and GnpIS-coreDB; URGI), EnsemblPlants (EMBL-EBI) [31], CrowsNest [38] (PGSB), CR-EST [39], GBIS [40] and MetaCrop [41] (IPK), The Triticeae Toolbox (Triticeae CAP), CIMMYT DSpace and Dataverse (CIMMYT), Gramene [42] (CSH, OSU, EMBL-EBI), Cropnet (IPGPAS), WheatPan [43] (UWA) and GrainGenes [44] (USDA).
The Figure 2 presents the WheatIS ecosystem.
Data integration is fundamental for researchers and breeders that want to use genomic information to improve wheat varieties. However, the diversity of data type and the concomitant lack of data harmonisation and standards hamper cross-referencing and meta-analysis. A joint action between the WheatIS EWG and a group of linked data scientists created the Wheat Data Interoperability Working Group under the Research Data Alliance (RDA) umbrella [45] to help tackle this difficult issue [46]. The Wheat@URGI portal continuously evolves its repository to follow the standard recommendations [47].
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
The open access data (including all the IWGSC data) are available through the Wheat@URGI portal: https://wheat-urgi.versailles.inra.fr.
Competiting interest
The authors declare that they have no competing interests.
Funding
The development of the information system and the integration of wheat data was supported by INRA and several projects: BreedWheat (ANR-10-BTBR-03, France Agrimer, FSOV), Whealbi (EU FP7-613556), TriticeaeGenome (EU FP7-KBBE-212019), 3BSEQ (ANR-09-GENM-025, FranceAgrimer), TransPLANT (EU FP7-283496).
Authors’ contributions
MA, JR, TL, FA, KE designed, developed and filled the IWGSC data repository.
MA, TL, RF, FA, CP, NM, SD, EK, CM, CG, MLo, MLa, DS, AFAB, HQ designed, developed and filled the GnpIS information system.
FC, HR, PL, NG, JS, CF, IWGSC, EP generated, submitted the data and give feedback on the tools.
MA, JR, EP, KE, AFAB, HQ draft the manuscript.
All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank for their help or advices at various stages of the project, the following people from INRA-URGI: Véronique Jamilloux, Joëlle Amselem, Dorothée Charruaud, Guillaume Cornut, Laura Burlot, Florian Philippe, Nicolas Francillonne, Loïc Couderc, Daphné Verdelet, Baptiste Brault, Kirsley Chennen; from INRA-GDEC: Jacques Le Gouis, Gilles Charmet, Fran#x00E7;ois Balfourier, Pierre Sourdille, Catherine Ravel, Fran#x00E7;ois-Xavier Oury, Audrey Didier; from INRA-DIST: Esther Dzale, Sophie Aubin, Odile Hologne; and from INRA-Agronomie: Arnaud Gauffreteau.
Thanks to Isabelle Caugant (IWGSC), Héléne Lucas (Wheat Initiative), the International Wheat Genome Sequencing Consortium and its sponsors, the WheatIS expert working group, the URGI platform, and all the data submitters.
Abbreviations
- IWGSC
- international wheat genome sequencing consortium
- INRA
- institut national de la recherche agronomique / French national institute for agricultural research
- URGI
- unité de recherche génomique info / research unit in genomics and bioinformatics
- FAIR
- findable, accessible, interoperable, reusable
- BLAST
- basic local alignment search tool
- HICF
- high-information-content fingerprinting
- WGP™
- whole genome profiling
- BAC
- bacterial artificial chromosome
- FPC
- fingerprinted contig
- LTC
- linear topological contig
- CSS
- chromosome survey sequence
- POPSEQ
- population sequencing
- RNA
- ribonucleic acid
- TPM
- transcripts per kilobase million
- VCF
- variant call format
- SSR
- simple sequence repeats
- SNP
- single nucleotide polymorphism
- DArT
- d iversity arrays technology
- QTL
- quantitative trait loci
- GWAS
- genome-wide association study
- cM
- centimorgan
- HSP
- high scoring pairs
- PCR
- polymerase chain reaction
- DOI
- digital object identifier
- MCPD
- multi-crop passport descriptors
- MIAPPE
- minimum information about a plant phenotyping experiment
- EWG
- expert working group
- EMBL-EBI
- European bioinformatics institute
- PGSB
- plant genome and systems biology group
- IPK
- Leibniz institute of plant genetics and crop plant research
- CIMMYT
- international maize and wheat improvement center
- CSH
- Cold Spring Harbor laboratory
- OSU
- Ohio State University
- IPGPAS
- institute of plant genetics of the Polish academy of science
- UWA
- University of Western Australia
- USDA
- U.S. department of agriculture
- EWG
- expert working group
- RDA
- research data alliance