Abstract
Clinicians’ experts in mechanical ventilation are not continuously at each patient’s bedside in an intensive care unit to adjust mechanical ventilation settings and to analyze the impact of ventilator settings adjustments on gas exchange. The development of clinical decision support systems analyzing patients’ data in real time offers an opportunity to fill this gap. The objective of this study was to determine whether a machine learning predictive model could be trained on a set of clinical data and used to predict hemoglobin oxygen saturation 5 min after a ventilator setting change. Data of mechanically ventilated children admitted between May 2015 and April 2017 were included and extracted from a high-resolution research database. More than 7.105 rows of data were obtained from 610 patients, discretized into 3 class labels. Due to data imbalance, four different data balancing process were applied and two machine learning models (artificial neural network and Bootstrap aggregation of complex decision trees) were trained and tested on these four different balanced datasets. The best model predicted SpO2 with accuracies of 76%, 62% and 96% for the SpO2 class “< 84%”, “85 to 91%” and “> 92%”, respectively. This pilot study using machine learning predictive model resulted in an algorithm with good accuracy. To obtain a robust algorithm, more data are needed, suggesting the need of multicenter pediatric intensive care high resolution databases.
Introduction
In case of respiratory failure, mechanical ventilation supports the oxygen (O2) diffusion into the lungs and the carbon dioxide (CO2) body removal. As an expert in mechanical ventilation cannot reasonably be expected to be continuously present at the patient’s bedside, specific medical devices aimed to help in ventilator settings adjustments may help to improve the quality of care. Such devices are developed using either algorithms based on respiratory physiology/medical knowledge that adapt ventilator settings in real time based on patients’ characteristics but are not accurate enough to be used widely in clinical practice, especially in children [1, 2]; or physiologic models that simulate cardiorespiratory responses to mechanical ventilation settings modifications but none was validated for this indication [3]. The above-mentioned models all share the limitation of not being suited to learn from ever-growing sets of clinical research data, and potentially improve their performances. To overcome this drawback, another avenue is the development of algorithms using artificial Intelligence to provide caregivers with support in their decision-making tasks. In this study, we assessed machine learning methods to predict transcutaneous hemoglobin saturation oxygen (SpO2) of mechanically ventilated children after a ventilator setting change using a high resolution research database.
Materials and Methods
This study was conducted at Sainte-Justine Hospital and included the data collected prospectively between May 2015 and April 2017 of all the children, age under 18 years old, admitted to the Pediatric Intensive Care Unit (PICU) who were mechanically ventilated with an endotracheal tube. Patients’ data were excluded if the patient was hemodynamically unstable defined as 2 or more vasoactive drugs delivered at the same time (ie., epinephrine, norepinephrine, dopamine or vasopressin) or with an uncorrected cyanotic heart disease defined by no SpO2 > 97% during all PICU stay. All the respiratory data from included patients were extracted from the PICU research database [4], after study approval by the ethical review board of Sainte-Justine hospital (number 2017 1480).
Data extraction
To determine the data that will be extracted for each child, an item generation was conducted by three physicians (PJ, MS, DB). The resulting items are presented in Fig 1 within their sources, means of extraction and a schematic of the main components of the study. The predictive SpO2 value was the SpO2 5 minutes after a change of a ventilator setting. The delay of 5 min corresponded to the shortest period of time to reach a steady state after modification of a ventilator setting [5].
Data formatting
The data extracted from the research database needed: (1) to remove erroneous data due to disconnection of the patient from the ventilator or the monitor, or due to transient interventions such as suctioning; (2) to remove the rows at which no ventilator setting variables was modified; (3) to adapt data format for classifier training. The methodology to format the data is described in S1 file.
Data categorization
SpO2 levels at 5min were classified into three categories (Table 1). The thresholds were selected according to clinical value: a SpO2 < 92% is a target to increase oxygenation in mechanically ventilated children [6]. The critical level of 85% SpO2 is used as an alarm of severe hypoxemia in intensive care [7].
Data balancing
The data analysis showed a severe imbalance with most SpO2 at 5min above 92%. This is logical as caregivers want to maintain SpO2 in normal range during child PICU stay. In such condition, the classifier learns the majority class label (class 3) (Table 1) but doesn’t learn the minority class labels (class 1 and 2) [8]. The data balancing process aims to allow the classifier to learn from all class equally. The data balancing process used in this study included a combination of down-sampling and up-sampling techniques: to balance the three classes of the data involved, a down-sampling of the SpO2 class 3 using TOMEK algorithm [9] and an over-sampling of SpO2 class 1 and 2 using Synthetic Minority Oversampling Technique (SMOTE) [10] were performed.
The creation of synthetic data points by SMOTE can be formulated as follows:
In equation (2), xsyn represents the synthetic data point. The variables xi and xknn are respectively the original instance, and the nearest neighbor data point which is randomly picked among the k nearest neighbors. The random number δ is generated in [0,1] to determine the position of the created synthetic data point along a straight line joining the original data point xi and its chosen nearest neighbor xknn.
To study which data balancing method provided the more accurate algorithm, four datasets were produced via four different balancing procedures, involving different combinations of data balancing techniques (Fig 2).
Predicted SpO2 Classification
To identify the best machine learning classification method, we tested two classification models: artificial neural network and bagged complex decision trees, on the four balanced datasets.
Artificial Neural Network (ANN)
Once the data has been pre-processed, a machine learning predictive model was trained on a sub-set of labeled training data. The model is then used to predict the target variable values on a testing subset where the class labels are hidden. We used Artificial Neural Networks (ANN) to make predictions of the SpO2 variable, based on the values of other variables of interest. Through the function approximation that the ANN performs, it is possible to make predictions of SpO2 variable, based on the input data.
The ANN is learned from training data, using the backpropagation algorithm [11] and is tested on a test set made of the remaining rows of data to validate the generalization of the model. The learning algorithm runs through all the rows of data in the training data set and compares the predicted outputs with the target outputs found in the training data set. The weights are adjusted via supervised learning, in a manner to minimize the error of predicted SpO2 vs target SpO2. The process is repeated until the error is minimized.
The ANN classifier was implemented through cycles of forward propagation followed by backward propagation through the network’s layers. The backpropagation algorithm is used for performance optimization.
For a given number of classes K > 2, the cross-entropy error can be formulated as shown in eq. 3, where (Wi)i is the matrix of weights between the neuron layers, ri is the target value. yi is the value generated by the ANN, ie., its output.
The outputs of the ANN are:
Using stochastic gradient-descent (SGD) for error minimization, the update rule for the ANN weights is:
In equation 5, η is the learning rate, which, when SGD is used, decreases as the error is minimized. During ANN training, each observation, comprised of an input vector and a target output, is denoted (xt, rt), with rt ϵ (“1”, “2”, “3”). The reason why the cross-entropy (eq. 3) is used instead of the Least Square Error (LSE) is to avoid long periods of training, due to the ANN going through stages of slow error reduction.
Bootstrap aggregation of complex decision trees
Bootstrap aggregating (acronym: bagging) was proposed by L Breiman in 1994 to improve classification by combining classifications of randomly generated training sets [12]. Bagging allows for the creation of an aggregated predictor via the use of multiple training sub-sets taken from the same training set. Let (Ti) denote the replicate training sub-sets bootstrapped from the training set T. These replicate sub-sets each contain N observations, drawn at random and with replacement from T. For each of these sub-sets of N observations, a prediction model, or classifier, is created. The computational model we used for bagging was complex decision trees. This means that, for each bootstrapped sub-set of training data, a complex decision tree is trained and thus a classifier is created. If i = 1, …, n, then n classifiers are created through the bagging process.
A decision tree is a flowchart computational model which can be used for both regression, as well as classification problems. Paths from the root of the tree to its various leaf nodes go through decision nodes in which decision rules are applied in a recursive manner, based on values of input variables. Each path represents an observation (X, y) = (x1, x2, x3, …, xn, y), where the label assigned to the target y is given in the leaf node, at the end of the path, ie., classification [13].
In the aim of maximizing the model’s generalization capability during the training process, the Bagged Complex Trees’ performance is tested via k-fold cross-validation. A value k = 10, which is common practice, was used in this study. The training using k-fold cross-validation is carried out as described in Fig 3.
The mathworks Matlab R2016b Machine Learning toolbox was used for the creation of the ensemble of Bagged complex trees model.
Assessment of the performances of the classifiers
We evaluated the performances of the classifiers based on the metrics including testing confusion matrix, average accuracy, precision, recall and F score [14] with a 5minSpO2 prediction expected above 0.9 for each class.
Precision
The Precision (eq. 6) is the ratio of all correct classifications for class i to all instances labeled as class label i by the model. In a non-normalized confusion matrix, this would mean dividing the number of instances classified in class label i by the total of instances in column i.
Recall
Recall is the ratio of the number of instances classified in class label i to the number of true class i labels. In a non-normalized matrix, this would require dividing the number of instances classified in class label i by the total of row i
F-score
The F-score provides a single measure of classification performance of the model used.
Results and discussion
We developed and assessed the performances of two machine learning classifiers on four different balanced datasets to predict SpO2 at 5 min after a ventilator setting change (ie FiO2, PEEP, Vt/Pressure), in 610 mechanically ventilated children. In Fig 4 and Table 2, we report the performances of these two classifiers. Using the classification performance metrics, the bagged trees classifier trained on dataset #3 (see Fig 2) has yielded the best classification performance on the test sets (Table 2). The confusion matrix of the whole bagged trees shows that SpO2 at 5 min could correctly predict in 76% of class “1” data, 62% of class “2”, and 96% of class “3” (Fig 4). This huge variation in classification performances of the three class labels can be explained by the large variation in the numbers of observations available for each of the class labels in the initial dataset that has limited the machine learning (Table 1).
For the artificial neural network, the variation of the number of hidden layers and number of neurons per hidden layer did not seem to have a significant effect on the model’s classification performance (Table 3). As for the Bagged complex trees, the variation of the number of complex trees did not yield significant changes in classification performance (Table 4).
In agreement with previous studies regarding bagging being a better method for medical data classification, tree Bagging fared better than the artificial neural network used in this study [12]. It is noteworthy however that the gaps in performance results between the training and testing confusion matrices are relatively higher in the case of bagged trees model than in that of the artificial neural network (Fig 5). This seems to indicate that, although the bagged trees model was capable of learning very well from the data, there’s still room for improvement in the generalization. The SMOTE algorithm is designed in such a way that should theoretically not affect the generalization of the trained model. In cases of extreme data imbalance, however, as is the case in this study, the over-sampling within the data space of a given minority class label, used for increasing the cardinality of the class label’s set, is also likely to be extreme. This may render the data space of this class relatively dense with respect to the rest of the data, made up of real data points of the studied patient sub-population. This may potentially explain the classification model’s relatively poor generalization for 5minSpO2 class “1” and “2” with respect to the generalization for 5minSpO2 class “3”. Also, since SMOTE generates synthetic data points by interpolating between existing minority class instances, it can obviously increase the risk of over-fitting when classifying minority class labels, since it may duplicate minority class instances. The fact that the training confusion matrix shows extremely high classification performances for the minority 5minSpO2 class “1” and “2”, as opposed to those shown in the testing confusion matrix, suggests that the over-sampling of the minority 5minSpO2 class using SMOTE could have caused some overfitting for these classes, but this would have to be further investigated.
The strengths of this study include a large clinical database of mechanically ventilated children used with more than 7.105 rows. In a recent similar study in PICU, 200 patients were included with 1.15.103 rows [15]. However, the volume of data is clearly insufficient. To use such machine learning predictive models, the pediatric intensive care community needs to combine multicenter high resolution database. In addition, children data could be pooled to neonatal and adult intensive care data, when possible, such as MIMIC III database [16]. The other strength is the process used to transform the data into a usable format and to correct a variety of artifacts present (S1 file). In health care, there is a significant interest in using clinical databases including dynamic and patient-specific information into clinical decision support algorithms. The ubiquitous monitoring of critical care units’ patients has generated a wealth of data which presents many opportunities in this domain. However, when developing algorithms domains, such as transport or finance, data are specifically collected for research purposes. This is not the case in healthcare where the primary objective of data collection systems is to document clinical activity, resulting in several issues to address in data collection, data validation and complex data analysis [17]. As detailed in S1 file, a significant amount of effort is needed, when data have been successfully archived and retrieved, to transform the data into a usable format for research.
This study has several limitations. The limited row number reduced the SpO2 classification for machine learning predictive model to three clinically relevant classes. SpO2 is a continuous variable and the use of three class is probably insufficient, especially when high SpO2 range is suggested as potentially harmful [18, 19]. Instead of the classification model, the next step could be to test regression models’ performance. SpO2 was predicted at 5min after ventilator setting change, a clinically relevant delay. However, the delay between ventilator setting change and oxygenation steady state is not well defined and vary from 1 to 71 minutes according to the parameter set (FiO2, PEEP or other parameters that change mean airway pressure) and clinical conditions studied [15, 20, 21]. This needs further research and probably more sophisticated clinical decision support systems using machine learning predictive models should consider these factors. Finally, we excluded hemodynamic unstable patients using a treatment criteria (≥ 2 vasoactive drugs infused) because this condition decreases pulse oximeter reliability [22, 23]. The validation and electronic availability of reliable markers of hemodynamic instability in children such as plethysmographic variability indices could be helpful [24].
Conclusion
This pilot study using machine learning predictive model resulted in an algorithm with good accuracy. To obtain a robust algorithm with such a method, more data rows are needed, suggesting the need of multicenter pediatric intensive care high resolution databases.
Supporting information
S1 File: Data formatting process
Acknowledgments
We would like to thank Mr. Redha Eltaani for his support in all tasks related to data access at Ste-Justine Hospital. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), by the Institut de Valorisation des Données (IVADO), by grants from the “Fonds de Recherche du Québec – Santé (FRQS)”, the Quebec Ministry of Health and Sainte Justine Hospital.