Abstract
Background: Analysis of population structure and genomic ancestry remains an important topic in human genetics and bioinformatics. Commonly used methods require high-quality genotype data to ensure accurate inference. In practice, however, laboratory artifacts and outliers are often present in the data. Moreover, existing methods are typically affected by the presence of related individuals in the dataset.
Results: In this work, we propose a novel hybrid method, called SAE-IBS, which combines the strengths of traditional matrix decomposition-based (e.g., principal component analysis) and more recent neural network-based (e.g., autoencoders) solutions. Specifically, it yields an orthogonal latent space, which facilitates dimensionality selection, while learning non-linear transformations. The proposed approach achieves higher accuracy than existing methods when projecting poor-quality target samples (with genotyping errors and missing data) onto a reference ancestry space, and it generates a robust ancestry space in the presence of relatedness.
Conclusion: We introduce a new approach and an accompanying open-source program for robust ancestry inference in the presence of missing data, genotyping errors, and relatedness. The obtained ancestry space allows for non-linear projections and exhibits orthogonality with clearly separable population groups.
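The orthogonality property described in the abstract can be illustrated with a minimal NumPy sketch. This is not the SAE-IBS implementation: the genotype matrix is synthetic, the "encoder" is a single hypothetical tanh layer with random weights standing in for a trained network, and the IBS generalization and training procedure are omitted. It only shows that applying a singular value decomposition to a non-linear latent representation yields mutually orthogonal axes ordered by singular value, which is what makes dimensionality selection straightforward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic genotype matrix: 50 individuals x 200 SNPs, coded 0/1/2.
genotypes = rng.integers(0, 3, size=(50, 200)).astype(float)

# Hypothetical stand-in for a trained encoder: one random tanh layer.
weights = rng.normal(scale=0.1, size=(200, 8))
latent = np.tanh(genotypes @ weights)  # non-linear embedding, shape (50, 8)

# SVD of the latent matrix; the columns of U * S are the orthogonal scores.
u, s, vt = np.linalg.svd(latent, full_matrices=False)
scores = u * s  # shape (50, 8)

# scores^T scores is diagonal, so the latent axes are mutually orthogonal.
gram = scores.T @ scores
off_diag = gram - np.diag(np.diag(gram))
max_off_diag = np.abs(off_diag).max()

# Singular values are sorted in descending order, which ranks the axes.
explained = s**2 / np.sum(s**2)
print("largest off-diagonal entry:", max_off_diag)
print("share of variation per axis:", np.round(explained, 3))
```

In this toy setting the ranking by singular value plays the role that explained variance plays in PCA: leading axes can be kept and trailing ones dropped.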
Competing Interest Statement
The authors have declared no competing interest.
Abbreviations
- LD
- Linkage Disequilibrium
- GWAS
- Genome-Wide Association Study
- SNP
- Single Nucleotide Polymorphism
- PCA
- Principal Component Analysis
- UPCA
- Unnormalized Principal Component Analysis
- SUGIBS
- Spectral decomposition of an Unnormalized Genomic relationship matrix generalized by an Identity-by-State similarity matrix between unseen individuals and individuals in the reference dataset
- SVM
- Support Vector Machine
- SVD
- Singular Value Decomposition
- AE
- AutoEncoder
- VAE
- Variational AutoEncoder
- DAE
- Denoising AutoEncoder
- DAE-L
- Denoising AutoEncoder with modified Loss
- SAE-IBS
- Singular AutoEncoder generalized by IBS similarity matrix
- D-SAE-IBS
- Denoising Singular AutoEncoder generalized by IBS similarity matrix
- D-SAE-IBS-L
- Denoising Singular AutoEncoder generalized by IBS similarity matrix with modified Loss
- 1KGP
- 1000 Genomes Project
- HGDP
- Human Genome Diversity Project
- ABCD
- Adolescent Brain Cognitive Development
- KNN
- K-Nearest Neighbors
- MSE
- Mean Squared Error
- MAE
- Mean Absolute Error
- RMSD
- Root Mean Square Deviation
- NRMSD
- Normalized Root Mean Square Deviation
- EUR
- Europe
- AFR
- Africa
- EAS
- East Asia
- SAS
- South Asia
- AMR
- Americas
- ASI
- Asian
- REL
- Relatives
- MMD
- Mahalanobis distance
- CAE
- Contractive AutoEncoder
- CHB
- Han Chinese in Beijing, China
- GWD
- Gambian in Western Division, The Gambia
- YRI
- Yoruba in Ibadan, Nigeria
- KHV
- Kinh in Ho Chi Minh City, Vietnam
- GAN
- Generative Adversarial Network
- DEC
- Deep Embedded Clustering