PT - JOURNAL ARTICLE
AU - Prasad Patil
AU - Pierre-Olivier Bachant-Winner
AU - Benjamin Haibe-Kains
AU - Jeffrey T. Leek
TI - Avoiding test set bias with rank-based prediction
AID - 10.1101/005983
DP - 2014 Jan 01
TA - bioRxiv
PG - 005983
4099 - http://biorxiv.org/content/early/2014/06/06/005983.short
4100 - http://biorxiv.org/content/early/2014/06/06/005983.full
AB - Background: Prior to applying genomic predictors to clinical samples, the genomic data must be properly normalized. The most effective normalization methods depend on the data from multiple patients. From a biomedical perspective this implies that predictions for a single patient may change depending on which other patient samples they are normalized with. This test set bias will occur when any cross-sample normalization is used before clinical prediction. Methods: We developed a new prediction modeling framework based on the relative ranks of features within a sample to remove the need for cross-sample normalization, thereby avoiding test set bias. We used the previously published Top-Scoring Pairs (TSPs) methodology to build the rank-based predictors. We further investigated the robustness of the rank-based models on heterogeneous datasets generated with diverse microarray technologies. Results: We demonstrated that results from existing genetic signatures which rely on normalizing test data may be unreproducible when the patient population changes composition or size. Using pairwise comparisons of features, we produced a ten-gene, platform-robust, and interpretable alternative to the PAM50 subtyping signature and evaluated the robustness of our signature across 6,297 patient samples from 28 curated breast cancer microarray datasets spanning 15 different platforms. Conclusion: We propose a new approach to developing genomic signatures that avoids test set bias through the robustness of rank-based features.
Our small, interpretable alternative to PAM50 produces predictions and patient survival differentiation comparable to those of the original signature. Additionally, we can ensure that the same patient is classified the same way in every context.
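
The contrast described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the gene names (`G1`, `G2`), the expression values, and both helper functions are hypothetical. A single within-sample pair comparison stands in for a Top-Scoring Pairs predictor, and mean-centering across a cohort stands in for cross-sample normalization in general.

```python
from statistics import mean


def pair_rule_predict(sample, gene_a, gene_b):
    """Rank-based rule: call 1 if gene_a is expressed above gene_b
    within this one sample. No other samples are consulted."""
    return 1 if sample[gene_a] > sample[gene_b] else 0


def centered_predict(sample, cohort, gene):
    """Cohort-dependent rule (assumed stand-in for cross-sample
    normalization): mean-center the gene across the cohort, then
    call 1 if the centered value is positive."""
    cohort_mean = mean(s[gene] for s in cohort)
    return 1 if sample[gene] - cohort_mean > 0 else 0


# One hypothetical patient, placed in two different hypothetical cohorts.
patient = {"G1": 5.0, "G2": 3.0}
cohort_low = [patient, {"G1": 2.0, "G2": 4.0}, {"G1": 3.0, "G2": 6.0}]
cohort_high = [patient, {"G1": 8.0, "G2": 4.0}, {"G1": 9.0, "G2": 6.0}]

# The pair rule never looks at the cohort, so the call cannot change.
rank_calls = [pair_rule_predict(patient, "G1", "G2")
              for _ in (cohort_low, cohort_high)]

# The centered rule depends on who else was normalized with the patient.
centered_calls = [centered_predict(patient, cohort, "G1")
                  for cohort in (cohort_low, cohort_high)]

print("rank-based calls:", rank_calls)
print("cohort-centered calls:", centered_calls)
```

In this toy setup the pair rule returns the same call for the patient in both cohorts, while the cohort-centered rule flips, which is the test set bias the abstract describes.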