SUMMARY
Unsupervised machine learning methods provide a promising means to analyze and interpret large datasets. However, most gene expression datasets generated by individual researchers remain too small to fully benefit from these methods. In the case of rare diseases, there may be too few cases available, even when multiple studies are combined. We trained a Pathway Level Information ExtractoR (PLIER) model on a large public data compendium comprising multiple experiments, tissues, and biological conditions. We then transferred the model to small rare disease datasets in an approach we term MultiPLIER. Models constructed from large, diverse public data i) included features that aligned well to important biological factors; ii) were more comprehensive than those constructed from individual datasets or conditions; and iii) transferred to rare disease datasets, where they described biological processes related to disease severity more effectively than models trained specifically on those datasets.