TY - JOUR
T1 - New method to reconstruct phylogenetic and transmission trees with sequence data from infectious disease outbreaks
JF - bioRxiv
DO - 10.1101/069195
SP - 069195
AU - Don Klinkenberg
AU - Jantien Backer
AU - Xavier Didelot
AU - Caroline Colijn
AU - Jacco Wallinga
Y1 - 2016/01/01
UR - http://biorxiv.org/content/early/2016/08/12/069195.abstract
N2 - Whole-genome sequencing (WGS) of pathogens from host samples becomes more and more routine during infectious disease outbreaks. These data provide information on possible transmission events which can be used for further epidemiologic analyses, such as identification of risk factors for infectivity and transmission. However, the relationship between transmission events and WGS data is obscured by uncertainty arising from four largely unobserved processes: transmission, case observation, within-host pathogen dynamics and mutation. To properly resolve transmission events, these processes need to be taken into account. Recent years have seen much progress in theory and method development, but applications are tailored to specific datasets with matching model assumptions and code, or otherwise make simplifying assumptions that break up the dependency between the four processes. To obtain a method with wider applicability, we have developed a novel approach to reconstruct transmission trees with WGS data. Our approach combines elementary models for transmission, case observation, within-host pathogen dynamics, and mutation. We use Bayesian inference with MCMC for which we have designed novel proposal steps to efficiently traverse the posterior distribution, taking account of all unobserved processes at once. This allows for efficient sampling of transmission trees from the posterior distribution, and robust estimation of consensus transmission trees. We implemented the proposed method in a new R package phybreak. The method performs well in tests of both new and published simulated data.
We apply the model to five datasets on densely sampled infectious disease outbreaks, covering a wide range of epidemiological settings. Using only sampling times and sequences as data, our analyses confirmed the original results or improved on them: the more realistic infection times place more confidence in the inferred transmission trees. Author Summary: It is becoming easier and cheaper to obtain whole genome sequences of pathogen samples during outbreaks of infectious diseases. If all hosts during an outbreak are sampled, and these samples are sequenced, the small differences between the sequences (single nucleotide polymorphisms, SNPs) give information on the transmission tree, i.e. who infected whom, and when. However, correctly inferring this tree is not straightforward, because SNPs arise from unobserved processes including infection events, as well as pathogen growth and mutation within the hosts. Several methods have been developed in recent years, but none so generic and easily accessible that it can easily be applied to new settings and datasets. We have developed a new model and method to infer transmission trees without putting prior limiting constraints on the order of unobserved events. The method is easily accessible in an R package implementation. We show that the method performs well on new and previously published simulated data. We illustrate applicability to a wide range of infectious diseases and settings by analysing five published datasets on densely sampled infectious disease outbreaks, confirming or improving the original results.
ER -