TY  - JOUR
T1  - Quality control analysis of the 1000 Genomes Project Omni2.5 genotypes
JF  - bioRxiv
DO  - 10.1101/078600
SP  - 078600
AU  - Nicole M. Roslin
AU  - Li Weili
AU  - Andrew D. Paterson
AU  - Lisa J. Strug
Y1  - 2016/01/01
UR  - http://biorxiv.org/content/early/2016/09/30/078600.abstract
N2  - Citation: For any use of the 1000 Genomes Project data, please use the citation as noted here: http://www.1000genomes.org/faq/how-do-i-cite-1000-genomes-project. To cite this report or the lists described here, please use the following: Roslin NM, Li W, Paterson AD, Strug LJ. Quality control analysis of the 1000 Genomes Project Omni2.5 genotypes (Abstract/Program #576/F). Presented at the 66th Annual Meeting of The American Society of Human Genetics, October 18-22, 2016, Vancouver, Canada. Data Summary: Chips: Illumina HumanOmni2.5-4v1_B and Illumina HumanOmni25M-8v1-1_B; Initial number of SNPs: 2 458 861; Initial number of samples: 2318; Number of SNPs passing QC: 1 989 184 (80.9%); Number of samples passing QC: 2318 (100%); Number of quasi-unrelated samples with consistent ethnicity and well inferred sex: 1736. Abstract: The 1000 Genomes Project genotyped 2318 individuals (48.1% male) from 19 populations in 5 continental groups on the Illumina Omni2.5 platform. The data are publicly available, and will prove a valuable resource for obtaining ethnic-specific allele frequencies and for exploring population histories through principal components analysis (PCA), estimation of inbreeding coefficients, and admixture analysis. As in any study, the data should be cleaned prior to analysis to remove individuals or markers of questionable quality. Furthermore, a thorough understanding of the relationships between individuals must be established. Here we report our findings after comprehensive examination of the data for quality control. The basic quality of the genotypes was assessed using standard procedures.
KING version 1.4 was used to confirm the relationships in the provided pedigrees, and also to detect undeclared relationships. PCA was used to examine the similarities and differences between individuals within and between population groups. In general, the data were found to be of high quality. No samples were removed due to low call rate (<97%) or excess heterozygosity. Sex chromosome genotypes showed two individuals with discrepancies between reported and inferred sex, and sex could not be determined for an additional 20 individuals; the sex for these was changed to unknown. Relationship checking found discrepancies between first-degree relationships in the provided pedigrees and the genotypes in 9 families, including one instance where a reported parent/child pair was unrelated, two instances where full sibs were unrelated, and one set of three individuals who formed a newly defined trio. A set of 1756 individuals who were inferred to be more distant than 3rd degree relatives was extracted and used in PCA. These individuals clustered in a pattern that is consistent with other published reports of global populations. We identified 4 individuals whose genotypes clustered more closely with a different geographic region than the one in the provided data. Although the genotype data are of high quality, errors exist in the publicly available dataset that require attention prior to using the genotypes. PLINK-format files including SNPs with good quality metrics and revised pedigree structures are available at http://tcag.ca. Files with distantly related or unrelated individuals, with sex inference consistent with provided gender, and with PCA consistent with continental group are also available.
ER  - 