TY - JOUR T1 - Assessing Pathogens for Natural versus Laboratory Origins Using Genomic Data and Machine Learning JF - bioRxiv DO - 10.1101/079541 SP - 079541 AU - Tonia Korves AU - Christopher Garay AU - Heather A. Carleton AU - Ashley Sabol AU - Eija Trees AU - Matthew W. Peterson Y1 - 2016/01/01 UR - http://biorxiv.org/content/early/2016/10/06/079541.abstract N2 - Pathogen genomic data is increasingly important in investigations of infectious disease outbreaks. The objective of this study is to develop methods for using large-scale genomic data to determine the type of the environment an outbreak pathogen came from. Specifically, this study focuses on assessing whether an outbreak strain came from a natural environment or experienced substantial laboratory culturing. The approach uses phylogenetic analyses and machine learning to identify DNA changes that are characteristic of laboratory culturing. The analysis methods include parallelized sequence read alignment, variant identification, phylogenetic tree construction, ancestral state reconstruction, semi-supervised classification, and random forests. These methods were applied to 902 Salmonella enterica serovar Typhimurium genomes from the NCBI Sequence Read Archive database. The analyses identified candidate signatures of laboratory culturing that are highly consistent with genes identified in published laboratory passage studies. In particular, the analysis identified mutations in rpoS, hfq, rfb genes, acrB, and rbsR as strong signatures of laboratory culturing. In leave-one-out cross-validation, the classifier had an area under the receiver operating characteristic (ROC) curve of 0.89 for strains from two laboratory reference sets collected in the 1940’s and 1980’s. The classifier was also used to assess laboratory culturing in foodborne and laboratory acquired outbreak strains closely related to laboratory reference strain serovar Typhimurium 14028. The classifier detected some evidence of laboratory culturing on the phylogeny branch leading to this clade, suggesting all of these strains may have a common ancestor that experienced laboratory culturing. Together, these results suggest that phylogenetic analysis and machine learning could be used to assess whether pathogens collected from patients are naturally occurring or have been extensively cultured in laboratories. The data analysis methods can be applied to any bacterial pathogen species, and could be adapted to assess viral pathogens and other types of source environments. ER -