Abstract
Background The heterogeneous phenotype and complex genetic architecture of Autism Spectrum Disorder (ASD) have thus far limited our understanding of genotype-phenotype correlations, hindering early diagnosis and patient prognosis. Copy Number Variants (CNVs) targeting a diversity of genes have been implicated in ASD; however, their correlations with clinical patterns remain unclear.
Methods In this study, we developed a novel integrative machine learning approach that seeks to delineate associations between ASD clinical profiles and disrupted biological processes inferred from CNVs spanning brain-expressed genes.
Results Clustering analysis of relevant clinical measures from 2446 ASD cases, retrieved from the Autism Genome Project (AGP) database, identified two distinct phenotypic subgroups, one with a milder and one with a more severe phenotype. Patients in the two clusters differed significantly in verbal status, ADOS-defined severity, adaptive behaviour profiles and intellectual ability, with verbal status contributing the most to cluster stability and cohesion. In the clustered ASD cases, functional enrichment analysis of brain-expressed genes disrupted by rare CNVs identified 15 statistically significant biological processes. These biological processes included cell adhesion, nervous system development, cognition and protein polyubiquitination, and were in line with previous ASD findings. Random Forest feature importance analysis showed a positive contribution of all disrupted biological processes to the classification of ASD cases in the identified clusters. A Naive Bayes classifier was generated to predict the ASD phenotype from the identified disrupted biological processes. For a subset of patients with higher Information Content scores calculated for the disrupted biological processes, the classifier achieved high precision but low recall (Precision: 0.82, Recall: 0.39).
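As an illustration of how a classifier can combine high precision with low recall, the following minimal sketch computes both metrics from the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). The function name `precision_recall` and the toy labels are hypothetical and are not drawn from the AGP data or from the classifier described here.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels (1 = positive class, 0 = negative class).
# The classifier is conservative: it rarely predicts the positive class,
# so its positive calls are reliable but many true positives are missed.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every positive prediction is correct, yet most true positives go undetected, which is the same qualitative pattern as a precision of 0.82 with a recall of 0.39.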
Conclusions This study highlights the usefulness of machine learning approaches in reducing clinical heterogeneity by taking advantage of multidimensional clinical measures. Furthermore, it shows that genotype-phenotype correlations can be established in ASD, and that milder and more severe clinical presentations have distinct underlying biological mechanisms. However, precise predictions of the phenotype from genetic data were only achieved for the subset of patients with higher biological information content. These findings are therefore a first step towards the translation of genetic information into clinically useful applications, while emphasizing the need for larger datasets with complete clinical and biological information.