ABSTRACT
Background Rhesus macaques are widely used in biomedical research, but the application of genomic information in this species to better understand human disease remains underdeveloped. Whole-genome sequence (WGS) data in pedigreed macaque colonies could provide substantial experimental power, but collecting WGS data in large cohorts remains a formidable expense. Here, we describe a cost-effective approach that selects the most informative macaques in a pedigree for whole-genome sequencing, and imputes these dense marker data into all remaining individuals, who have only sparse marker data obtained using Genotyping-By-Sequencing (GBS).
Results We developed GBS for the macaque genome using a single digest with PstI, followed by sequencing to 30X coverage. From GBS sequence data collected on all individuals in a 16-member pedigree, we characterized an optimal set of 22,455 sparse markers spaced ~125 kb apart. To characterize dense markers for imputation, we performed WGS at 30X coverage on 9 of the 16 individuals, yielding ~10.2 million high-confidence variants. Using the “Genotype Imputation Given Inheritance” (GIGI) approach, we imputed alleles at an optimized dense set of 4,920 variants on chromosome 19, using 490 sparse markers from GBS. We assessed changes in the accuracy of imputed alleles 1) across 3 different strategies for selecting individuals for WGS, i.e., a) using “GIGI-Pick” to select informative individuals, b) sequencing the most recent generation, or c) sequencing founders only; and 2) when using from 1 to 9 WGS individuals for imputation. Accuracy of imputed alleles was highest with the GIGI-Pick selection strategy (median 92%) and improved very little when >4 individuals with WGS were used for imputation. We used this ratio of 4 WGS to 12 GBS individuals to impute an expanded set of ~14.4 million variants across all 20 macaque autosomes, achieving ~85-88% accuracy per chromosome.
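To make the notion of "accuracy of imputed alleles" concrete, the following is a minimal sketch of one common way to score imputation: the fraction of imputed genotypes that agree with genotypes observed by WGS at the same sites. The abstract does not define the exact metric used, so the function name, the "A/G" genotype encoding, and the rule of skipping missing calls are all assumptions made here for illustration.

```python
from typing import Optional, Sequence


def imputation_accuracy(
    imputed: Sequence[Optional[str]],
    observed: Sequence[Optional[str]],
) -> float:
    """Return the fraction of comparable sites where the imputed genotype
    matches the observed genotype.

    Assumptions (hypothetical, not taken from the paper):
    - genotypes are unphased strings such as "A/G";
    - allele order is ignored, so "A/G" equals "G/A";
    - sites where either call is missing (None) are skipped.
    """
    if len(imputed) != len(observed):
        raise ValueError("imputed and observed must have the same length")

    compared = 0
    matches = 0
    for imp, obs in zip(imputed, observed):
        if imp is None or obs is None:
            continue  # no comparison possible at this site
        compared += 1
        # Sort alleles so that unphased genotypes compare equal.
        if sorted(imp.split("/")) == sorted(obs.split("/")):
            matches += 1

    if compared == 0:
        raise ValueError("no comparable genotypes")
    return matches / compared


if __name__ == "__main__":
    # Toy data only; these are not genotypes from the study.
    toy_imputed = ["A/G", "G/G", "A/A", None]
    toy_observed = ["G/A", "G/G", "A/G", "A/A"]
    print(imputation_accuracy(toy_imputed, toy_observed))
```

Scoring against WGS genotypes that were held out of the imputation is what allows selection strategies to be compared on equal terms.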
Conclusions We conclude that an optimal tradeoff exists at the ratio of 1 individual selected for WGS using the GIGI-Pick algorithm, per 3-5 relatives selected for GBS, a cost savings of ~67-83% over WGS of all individuals. This approach makes feasible the collection of accurate, dense genome-wide sequence data in large pedigreed macaque cohorts without the need for expensive WGS data on all individuals.
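The cost tradeoff can be illustrated with simple arithmetic. The sketch below computes the fractional saving of sequencing 1 individual by WGS per k relatives genotyped by GBS, relative to WGS of everyone. The per-sample cost ratio `gbs_to_wgs_cost_ratio` is a hypothetical parameter: actual per-sample costs are not given in the abstract, so the outputs illustrate the shape of the tradeoff and do not reproduce the reported ~67-83% figure.

```python
def fractional_savings(gbs_per_wgs: int, gbs_to_wgs_cost_ratio: float) -> float:
    """Fractional cost saving of a mixed WGS/GBS design versus all-WGS.

    gbs_per_wgs: number of relatives genotyped by GBS per WGS individual
        (3 to 5 in the ratio described above).
    gbs_to_wgs_cost_ratio: cost of one GBS sample divided by the cost of
        one WGS sample (hypothetical; not stated in the abstract).

    Costs are expressed in units of one WGS sample, and the WGS individual
    is assumed not to need separate GBS (an assumption of this sketch).
    """
    if gbs_per_wgs < 0:
        raise ValueError("gbs_per_wgs must be non-negative")
    if gbs_to_wgs_cost_ratio < 0:
        raise ValueError("gbs_to_wgs_cost_ratio must be non-negative")

    group_size = 1 + gbs_per_wgs
    all_wgs_cost = float(group_size)
    mixed_cost = 1.0 + gbs_per_wgs * gbs_to_wgs_cost_ratio
    return 1.0 - mixed_cost / all_wgs_cost


if __name__ == "__main__":
    # Illustrative cost ratios only.
    for k in (3, 4, 5):
        for ratio in (0.0, 0.05, 0.1):
            saving = fractional_savings(k, ratio)
            print(f"1 WGS : {k} GBS, cost ratio {ratio:.2f} -> saving {saving:.1%}")
```

If GBS cost were negligible, the saving would be k/(k+1), i.e. 75% at 1:3 and about 83% at 1:5. Any nonzero GBS cost lowers these values.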
Footnotes
Author email addresses: Ben Bimber: bimber{at}ohsu.edu, Michael Raboin: raboin{at}ohsu.edu, John Letaw: letaw{at}ohsu.edu, Kimberly Nevonen: nevonen{at}ohsu.edu, Jennifer Spindel: jes46{at}cornell.edu, Susan McCouch: srm4{at}cornell.edu, Rita Cervera-Juanes: cerveraj{at}ohsu.edu, Eliot Spindel: spindele{at}ohsu.edu, Lucia Carbone: carbone{at}ohsu.edu, Betsy Ferguson: fergusob{at}ohsu.edu, Amanda Vinson: vinsona{at}ohsu.edu
List of abbreviations
- WGS: whole-genome sequencing
- GBS: Genotyping-By-Sequencing
- SNV: single-nucleotide variant
- ONPRC: Oregon National Primate Research Center
- MAF: minor allele frequency
- GIGI: Genotype Imputation Given Inheritance
- CNV: copy number variant
- BWA: Burrows-Wheeler Aligner
- VCF: variant call format
- GATK: Genome Analysis Toolkit
- MCMC: Markov Chain Monte Carlo
- ML: most likely genotype calling method
- THR: threshold genotype calling method