Abstract
The reduced costs of sequencing have led to the availability of whole genome sequences for a large number of microorganisms, enabling the application of microbial genome-wide association studies (GWAS). Given the successes of human GWAS in understanding disease aetiology and identifying potential drug targets, microbial GWAS are likely to further advance our understanding of infectious diseases. By building on the success of GWAS, microbial GWAS have the potential to rapidly provide important insights into pressing global health problems, such as antibiotic resistance and disease transmission. In this review, we outline the methodologies of GWAS, the state of the field of microbial GWAS today, and how lessons from GWAS can direct the future of the field.
Glossary
- Beta
- The standardized regression coefficient, derived from linear regressions in GWAS of continuous traits. It is reported as an estimate of the effect size of a SNP, and reflects the change in phenotype expected from carrying a copy of the reference allele of the SNP.
- Clonal
- Where reproduction produces genetically identical organisms, and so does not introduce novel variants or recombination.
- Effect size
- The proportion of variance in a phenotype predicted by a variant.
- Epistatic interaction
- Interactions between variants at different locations in the genome.
- False Positive
- A variant, or any predictor, that is identified as significantly associated with a phenotype but is not causal. In the case of GWAS this is usually due to confounding from population structure or insufficient quality control.
- Genome-wide association study
- A hypothesis-free method that tests hundreds of thousands of variants across the genome to identify alleles that are associated with a phenotype.
- Genome-wide significance
- The p-value cut-off for declaring a variant significantly associated with a phenotype, accounting for the number of variants tested and the correlations between them.
- Heritability
- The proportion of phenotypic variance that is due to inherited genetic variation.
- k-mers
- A specific sequence of bases of length k that, in microbial GWAS, can be used as the genetic variant tested for association with the phenotype.
- Linkage disequilibrium (LD)
- Correlations between variants due to co-inheritance. LD is usually higher between variants that are closer together, and is broken down by recombination.
- Main effect
- The effect of a variant on the phenotype without accounting for any possible interactions with other variants or environmental factors.
- Odds Ratio
- The odds ratio, often abbreviated to OR, is the typical means of reporting the effect size of a SNP in a case-control (or other binary phenotype) GWAS. It is derived from a logistic regression, and represents the odds of the phenotype when carrying the reference allele, compared to the odds of the phenotype in the absence of the reference allele.
- Panmictic
- A population where all organisms are potential partners with each other.
- Phenotype
- A trait or disease that is the outcome of interest in an analysis of genetic variants.
- Phred scores
- A measure of the quality of sequencing at a given locus, specifically the confidence in the calling of alleles at that locus.
- Pleiotropic
- Pleiotropic variants are those that have an effect on multiple distinct phenotypes.
- Polygenic methods
- Statistical approaches that focus on the combined effects of many genetic variants rather than the effect of any individual variant.
- Power
- The probability that an analysis will reject the null hypothesis when the alternative hypothesis is true. Power is influenced by numerous factors, such as the effect size and sample size.
- Single nucleotide polymorphism (SNP)
- A base position where two alleles exist with a frequency of >1% in the population.
- Superinfection
- When an individual is infected with multiple strains of the same microorganism.
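
Several of the glossary terms above reduce to short calculations, sketched below in Python. This is an illustration only and is not taken from the review. All function names are hypothetical. The Phred conversion assumes the standard definition Q = -10 log10(P), where P is the probability that a call is wrong. The odds ratio is computed from a simple 2x2 table of counts rather than from a logistic regression. The significance cut-off uses a plain Bonferroni correction, which treats every test as independent and so is more conservative than a threshold that accounts for the correlations between variants.

```python
def kmers(sequence, k):
    """Return every overlapping substring of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def phred_to_error_probability(q):
    """Convert a Phred score to the probability that the call is wrong.

    Assumes the standard definition Q = -10 * log10(P).
    """
    return 10 ** (-q / 10)


def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Odds ratio from a 2x2 table of allele carriage by phenotype status.

    Compares the odds of being a case among carriers of the allele with the
    odds of being a case among non-carriers. This is the unadjusted estimate,
    with no covariates and no correction for population structure.
    """
    odds_in_carriers = cases_with / controls_with
    odds_in_non_carriers = cases_without / controls_without
    return odds_in_carriers / odds_in_non_carriers


def bonferroni_threshold(alpha, n_tests):
    """Per-variant p-value cut-off under a Bonferroni correction.

    Assumes independent tests, so it ignores linkage disequilibrium.
    """
    return alpha / n_tests


if __name__ == "__main__":
    # Toy inputs chosen for illustration; they are not data from any study.
    print(kmers("ACGTAC", 3))
    print(phred_to_error_probability(30))
    print(odds_ratio(30, 70, 10, 90))
    print(bonferroni_threshold(0.05, 1_000_000))
```

In practice a microbial GWAS would estimate odds ratios within a regression framework that also models population structure, so the 2x2 calculation here serves only to make the definition concrete.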