Abstract
A central challenge of medical imaging studies is to extract quantitative biomarkers that characterize pathology or predict disease outcomes. In high-resolution, high-quality magnetic resonance images (MRI), state-of-the-art approaches have performed well. However, such methods may not translate to low resolution, lower quality images acquired on MRI scanners with lower magnetic field strength. Therefore, in low-resource settings where low field scanners are more common and there is a shortage of available radiologists to manually interpret MRI scans, it is essential to develop automated methods that can accommodate lower quality images and augment or replace manual interpretation. Motivated by a project in which a cohort of children with cerebral malaria were imaged using 0.35 Tesla MRI to evaluate the degree of diffuse brain swelling, we introduce a fully automated framework to translate radiological diagnostic criteria into image-based biomarkers. We integrate multi-atlas label fusion, which leverages high-resolution images from another sample as prior spatial information, with parametric Gaussian hidden Markov models based on image intensities, to create a robust method for determining ventricular cerebrospinal fluid volume. We further propose normalized image intensity and texture measurements to determine the loss of gray-to-white matter tissue differentiation and sulcal effacement. These integrated biomarkers are found to have excellent classification performance for determining severe cerebral edema due to cerebral malaria.
1. Introduction
Magnetic resonance imaging (MRI) is a non-invasive technique that uses powerful electromagnetic fields to visualize brain structures and assess both disease diagnosis and prognosis. In recent years, cutting-edge technology using magnetic field strengths of up to 7 Tesla has allowed for extremely high-resolution human brain images to be captured. Alongside this improvement in technology, there has been a concurrent explosion of sophisticated automated methods ranging from lesion segmentation in multiple sclerosis (Shiee et al., 2010; Valcarcel et al., 2018; Valverde et al., 2017) or brain tumor (Gordillo, Montseny and Sobrevilla, 2013), to the prediction of clinical outcomes in patients with psychosis (Nieuwenhuis et al., 2017).
However, access to advanced MRI technology is not consistent across the globe (Marques, Simonis and Webb, 2019). In low-resource settings, challenges related to cost, infrastructure, and unreliable power sources may limit the availability of high field strength MRI (Latourette et al., 2010), sparking interest in lower field strength (< 0.5T) MRI alternatives. Even in areas where 1.5-3T MRI is in use, low field MRI is witnessing a resurgence in popularity: stand-alone and portable MRI systems have demonstrated potential for use in emergency rooms, critical care units, ambulances, and public areas (Campbell-Washburn et al., 2019; Sarracanie et al., 2015; Sheth et al., 2020). However, this technology is often limited by low signal to noise ratio, which is strongly dependent on the main magnet field strength. The challenge is to translate modern image analysis pipelines—which were developed on higher resolution images—to low-resolution and lower quality scans.
In response, we propose methods to extract information from images acquired on a 0.35T scanner to assess the severity of cerebral edema in children with cerebral malaria (CM). Malaria is a parasitic infection that results in more than 400,000 deaths annually, with the great majority occurring in children living in sub-Saharan Africa, and continues to be a public health priority (World Health Organization, 2018). CM is a serious complication of malaria infection characterized by impaired consciousness and ultimately coma (Idro et al., 2010). In children, CM is a leading cause of malarial death and has a case fatality rate of 15-20% despite optimal treatment (Dondorp et al., 2010). The pathophysiological mechanisms behind CM are not completely understood, though diffuse brain swelling, intracranial hypertension, and higher brain weight-for-age are commonly seen in fatal cases (Seydel et al., 2015; Idro et al., 2010). It is thus hypothesized that brain swelling, in conjunction with non-central nervous systemic factors, plays a critical role in disease outcome.
MRI has been proposed to study the pathogenesis of CM (Looareesuwan et al., 2009) as well as assess participants’ eligibility for enrollment in clinical trials (Kampondeni et al., 2018). In the latter case, the severity of brain swelling (cerebral edema) must be evaluated rapidly with MRI, ideally by a trained on-call neuroradiologist. However, these visual determinations involve a degree of subjectivity, and there is a shortage of trained neuroradiologists in resource-limited settings, which are additionally impacted by the limited availability of high field strength machines. We are therefore interested in automating this assessment of brain swelling in order to increase the utility of low field MRI in low-resource settings.
As a first approach to this problem, we asked if standard image processing tools could be applied to images from low field strength scanners and evaluated existing pipelines to extract measurements of brain volume and structure. We experimented with a common pre-processing task in brain MRI analysis referred to as brain segmentation, where voxels (3-dimensional pixels) of the brain image are identified and isolated, and voxels from the skull, eyes, and other non-brain tissues are removed. Popular surface-based methods such as the Brain Extraction Tool (BET; Smith, 2002) perform well on 3T images; however, BET performed poorly when applied to the 0.35T images in our sample of patients with cerebral malaria (Figure 1). Parameter tuning did not resolve these issues.
To address this, we developed a novel integrative framework for assessing brain edema severity in children with CM by adapting image processing pipelines originally designed for high field strength scanners to images from low field strength scanners. We leveraged existing high-quality brain imaging data to identify cerebral tissue and remove all non-brain voxels from images in our study. We adapted currently available image processing techniques to extract volume-, intensity-, and curvature-based biomarkers from low resolution MRI scans, assessing each of the key metrics used by neuroradiologists to assess disease severity. Finally, we incorporated these biomarkers into a logistic regression model that parsimoniously characterizes how radiologists score brain edema on MRI. Our model exhibits high prediction accuracy (measured by area under the curve) and is validated by its classification performance on a separate testing set, and through Monte Carlo sampling.
2. Data
2.1. Magnetic Resonance Imaging
Participants in this study were children (aged 6 months to 14 years old) admitted to the Blantyre Malaria Project, a long-term study of CM pathogenesis located in Blantyre, Malawi. All children had a Blantyre Coma Score of 2 or less, malaria parasitemia on peripheral blood smear, and no other known cause of coma. After clinical stabilization and beginning of intravenous antimalarial medications, participants were imaged with a General Electric 0.35T Signa Ovation MRI system (General Electric Healthcare, Chicago, Illinois). We considered two pulse sequences to highlight diverse tissue structures in the brain: a typical T1-weighted image exhibits brightest signal for fat, brighter signal for white matter than gray matter, and darkest signal for cerebrospinal fluid (CSF); in a T2-weighted image, CSF and fat are both bright and white matter is relatively dark. During the enrollment period, 100 participants were imaged. Five children were excluded from all analyses as both T1 and T2 sequences were unavailable for assessment.
MRI acquisition parameters were not uniform across subjects, nor were they uniform across modalities within a subject. For instance, while most subjects had T1 and T2 scans that had high in-plane resolution in the axial dimension, many had a mixture of T1 and T2 scans that had high in-plane resolution in axial, coronal and/or sagittal planes. A further challenge was that, in almost all images, the top of the brain was outside of the field of view; this was due to time constraints with respect to image acquisition. Finally, some images contained banding and other artifacts due to subject motion and technical factors.
The images were partitioned into training (n = 46) and testing (n = 49) sets at the outset. In the training set, subjects were scored by 3 radiologists, while in the testing set, subjects were scored by up to 8 radiologists. Exploratory analyses and biomarker identification were performed on the training data, with the better-validated testing data reserved to assess prediction performance.
2.2. Brain Volume Severity Score
Brain volume (BV) scores were obtained from 8 radiologists who had been trained in assessing cerebral edema in the context of CM. MRI images were assigned a brain volume score ranging from 1 to 8 according to several neuroradiological criteria (Table 1) (Kampondeni et al., 2018; Potchen et al., 2012; Seydel et al., 2015). Scores of 7 or 8 indicate patients with severe brain swelling who are at high mortality risk. Scores were assigned based on all available MRI images, including the T1 and T2 sequences as well as occasionally acquired diffusion weighted images; our automated method involved only the T1 and T2 sequences. As each subject’s BV score was assigned by several radiologists, the overall BV score for that subject was calculated as the median of these ratings.
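As a minimal sketch of the scoring rule just described, the following computes a subject's overall BV score as the median of the available ratings and flags severe cases. The function names and the example ratings are hypothetical. How half-integer medians (which can arise with an even number of raters) were handled is not stated in the text, so the sketch simply compares the median to the threshold.

```python
from statistics import median


def overall_bv_score(ratings):
    """Median of the radiologists' BV ratings (each from 1 to 8) for one subject."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return median(ratings)


def is_severe(ratings, threshold=7):
    """True when the median rating reaches the severity threshold (assumed >=)."""
    return overall_bv_score(ratings) >= threshold


# Hypothetical ratings for two subjects (not taken from the study data).
example = {"subject_a": [7, 8, 7], "subject_b": [5, 6, 6, 7]}
for name, ratings in example.items():
    print(name, overall_bv_score(ratings), is_severe(ratings))
```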
3. Methods
The goal of our analyses was to develop an automated approach using statistical modeling to predict the BV score given the acquired images. In the following sections, we first describe a pre-processing procedure for the T1 and T2 scans to reduce the effect of image artifacts and to identify brain tissue. We then motivate and develop biomarkers of the three primary assessments used to determine severity of brain swelling. Finally, we integrate these metrics into a logistic regression model and assess the classification performance of that model using two validation schemas based on the initial training and testing data. All code can be provided upon request, and we intend to make it available on GitHub.
3.1. Pre-processing
3.1.1. Notation
For subject i and modality τ ∈ {1, 2} corresponding to T1 and T2 scans, a brain image consists of the voxel vector xi = {1,…, Vi} where Vi is the total number of voxels in that image. At any voxel x ∈ {1,…, Vi}, the intensity viτ (x) defines a function from the integers to the real numbers. By evaluating viτ (x) at all voxels, we obtain the vector viτ, which is collectively referred to as the image.
3.1.2. Bias Correction
A common artifact of the MRI acquisition process is intensity inhomogeneity or bias, wherein the intensities vary in a gradient over the entire image (Vovk, Pernus and Likar, 2007). Because this can affect the quality of subsequent analyses where tissues are identified based on the observed image intensities, bias correction is a common pre-processing step in neuroimaging studies (Sled, Zijdenbos and Evans, 1998). All images in our sample were corrected using N4 bias correction (Tustison et al., 2010), which assumes a multiplicative bias model for the observed image for subject i and modality τ:
$$ v_{i\tau}(x) = u_{i\tau}(x)\, h_{i\tau}(x) + \varepsilon_{i\tau}(x), $$
where uiτ is the true image, hiτ is a smooth bias field, and εiτ is Gaussian noise that is independent of uiτ.
3.1.3. Brain Segmentation
Because the skull and other non-brain tissues contain noisy and irrelevant information, it is necessary to perform tissue segmentation, where voxels corresponding to a tissue of interest are identified. In the case of brain segmentation, we define the class assignment for voxel x and subject i to be
$$ b_i(x) = \begin{cases} 1 & \text{if voxel } x \text{ is brain tissue,} \\ 0 & \text{otherwise.} \end{cases} $$
The voxelwise evaluation of bi(x) at x ∈ {1,…, Vi} yields the subject-specific brain mask bi, a binary vector of length Vi whose xth entry gives the classification of voxel x as either brain or non-brain.
Popular surface-based brain segmentation tools such as the Brain Extraction Tool (Smith, 2002) did not perform well on our images, as the low resolution precluded a clear separation between brain and skull (Figure 1). Therefore, we appealed to a class of methods that borrow strength from existing “gold standard” segmentations on atlases, which consist of a high-resolution brain image together with its highly-validated brain mask. Our atlas set comprises a sample of 12 subjects imaged at 3T as part of the study-specific atlas in the Philadelphia Neurodevelopmental Cohort (Satterthwaite et al., 2016). For these subjects, the whole brain was automatically segmented and the masks were manually corrected slice by slice, a time-intensive process. The 12 youngest subjects in this group were selected to reduce age-related biases that may result from atlases developed from images of patients who are older than the subjects in our sample.
Atlas-based segmentation is typically a two-stage process: first, the atlas is continuously deformed to match the target image (that is, the image to be segmented); this deformation is referred to as the registration function. We used symmetric image normalization (SyN) to estimate non-linear registration functions (Avants et al., 2008). The registration function is then applied to the atlas’s brain mask, yielding a mask warped into the target image coordinate space indicating where various tissues are located in the target image. To address heterogeneity across subjects under study, it is common practice to repeat this process using multiple atlases and brain masks; such methods are referred to as multi-atlas methods (Rohlfing et al., 2004).
The second step is to produce a consensus segmentation of the target in a process called label fusion. We employed a majority voting consensus algorithm (Artaechevarria, Munoz-Barrutia and de Solorzano, 2009; Kittler, 1998): at each voxel, the final designation of brain versus non-brain was decided by the majority of warped atlas brain masks at that voxel. Although majority voting has been criticized as overly simplistic, we found it to perform well in our data and further noted that more advanced label fusion methods (Wang et al., 2013) failed in our dataset, likely due to lower image quality.
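The label fusion step can be illustrated with a small sketch of majority voting over warped atlas masks. Masks are represented here as flat lists of 0/1 values of equal length, which is a simplification of real 3-dimensional image arrays. The tie-breaking rule (ties go to non-brain, relevant when the number of atlases is even) is an assumption of this sketch and is not specified in the text.

```python
def majority_vote(warped_masks):
    """Fuse binary masks voxel by voxel: a voxel is brain (1) when a strict
    majority of the warped atlas masks label it as brain."""
    if not warped_masks:
        raise ValueError("at least one warped mask is required")
    n_atlases = len(warped_masks)
    n_voxels = len(warped_masks[0])
    if any(len(mask) != n_voxels for mask in warped_masks):
        raise ValueError("all masks must have the same number of voxels")

    fused = []
    for votes in zip(*warped_masks):
        # Strict majority; ties are assigned to non-brain (sketch assumption).
        fused.append(1 if 2 * sum(votes) > n_atlases else 0)
    return fused


# Three hypothetical warped atlas masks over four voxels.
masks = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
print(majority_vote(masks))  # [1, 0, 1, 1]
```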
3.2. Biomarkers of Severe Edema
Based on the radiological criteria for BV scores of 7 and 8 (Table 1), we developed three image-based multi-modal biomarkers to quantify a) ventricular volume, b) gray and white matter delineation, and c) sulcal effacement.
3.2.1. Ventricular CSF Volume
We hypothesized that severely increased brain volume would be associated with a smaller ventricular volume relative to the whole brain (Figure 2). As such, a measure of ventricular CSF requires the identification of ventricular and CSF voxels in the image.
To identify CSF regions, we leveraged a model of the observed intensities to partition voxels into classes. We used FSL FAST (Zhang, Brady and Smith, 2001), a popular approach that assumes that intensities and tissue classes can be modeled by a Gaussian Hidden Markov Random Field (GHMRF). Within the image for subject i, a voxel x can be classified as either gray matter, white matter, or CSF. This assignment can be summarized by a tissue class segmentation function wi(x):
$$ w_i(x) \in \{\text{gray matter},\ \text{white matter},\ \text{CSF}\}. $$
The collective tissue class assignments obtained by evaluating wi(x) at all voxels are denoted wi and must be estimated. In the GHMRF, both the observed voxel intensities viτ and the true tissue classes wi are considered to be random, and the goal is to find the class assignment maximizing their joint likelihood
$$ \hat{w}_i = \arg\max_{w_i} P(v_{i\tau} \mid w_i)\, P(w_i), $$
where the conditional distribution of viτ (x) ∣ wi(x) is assumed to be Gaussian, and the tissue classes wi are realizations of a Markov random field, and hence follow a Gibbs distribution.
However, the whole-brain CSF volume measurements have high variability among subjects, especially at the brain boundary; this is likely due to the brain segmentation performed in the previous step. Therefore, we limited the measure to ventricular CSF volume, which yielded a more stable estimate and demonstrated better identification of subjects with highly increased BV than either whole-brain CSF volume or ventricular volume alone. To segment the ventricles, we obtained adult ventricle atlases from the publicly available OASIS cross-sectional data set (Marcus et al., 2007). Using the same procedure of SyN registration and majority-voting label fusion as in the brain segmentation step, we obtained a ventricle mask for each subject, and calculated the ventricular CSF mask as the intersection of the CSF mask from FSL FAST and the OASIS ventricle mask.
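For voxel x and subject i, we define the ventricular CSF mask function
$$ c_i(x) = \begin{cases} 1 & \text{if voxel } x \text{ is classified as CSF by FSL FAST and lies within the OASIS ventricle mask,} \\ 0 & \text{otherwise.} \end{cases} $$
The first image-based biomarker (Figure 3) is the brain parenchymal fraction (BPF) of the T2 scan, or the proportion of brain voxels that are not ventricular CSF:
$$ \gamma_{i1} = 1 - \frac{\sum_{x=1}^{V_i} c_i(x)}{\sum_{x=1}^{V_i} b_i(x)}. $$
Higher values of BPF represent a lower proportion of ventricular CSF volume and thus higher levels of brain swelling.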
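To make the BPF calculation concrete, here is a minimal sketch that takes three binary masks (brain, CSF from the tissue segmentation, and the atlas-derived ventricle mask) as flat lists and returns one minus the fraction of brain voxels that are ventricular CSF. The function name and the choice to count ventricular CSF only inside the brain mask are assumptions of this sketch.

```python
def brain_parenchymal_fraction(brain_mask, csf_mask, ventricle_mask):
    """BPF = 1 - (# ventricular CSF voxels) / (# brain voxels).

    A voxel counts as ventricular CSF when it is inside the brain mask,
    labeled CSF, and inside the ventricle mask (intersection of the masks).
    """
    if not (len(brain_mask) == len(csf_mask) == len(ventricle_mask)):
        raise ValueError("all masks must have the same number of voxels")
    n_brain = sum(brain_mask)
    if n_brain == 0:
        raise ValueError("brain mask is empty")
    n_ventricular_csf = sum(
        1
        for b, c, v in zip(brain_mask, csf_mask, ventricle_mask)
        if b and c and v
    )
    return 1.0 - n_ventricular_csf / n_brain


# Hypothetical five-voxel image: four brain voxels, one of them ventricular CSF.
brain = [1, 1, 1, 1, 0]
csf = [1, 1, 0, 0, 1]
ventricles = [1, 0, 0, 1, 1]
print(brain_parenchymal_fraction(brain, csf, ventricles))  # 0.75
```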
3.2.2. Gray-to-White Matter Differentiation
To translate the loss of gray and white matter delineation into a function of the observed image, it is necessary to normalize voxel intensities (within modalities) so that subjects can be compared. Therefore, for each subject, we applied a linear scaling of the image intensities based on normal-appearing white matter (NAWM) using the WhiteStripe technique (Shinohara et al. 2014).
For subject i and modality τ, the observed intensities viτ are assumed to follow a mixture distribution with K components. That is, the probability density f: ℝ → ℝ is a function of intensity value v that decomposes as
$$ f(v) = \sum_{k=1}^{K} y_{ik}\, f_{ik}(v), $$
where the fik: ℝ → ℝ are subject-specific probability density functions and the weights yik sum to 1. It is assumed that there exists a transformation fik(v) → gk(v) so that the component distributions are not subject-specific:
$$ g(v) = \sum_{k=1}^{K} y_{ik}\, g_{k}(v). $$
The white stripe of NAWM is found by smoothing the empirical intensity histogram with a penalized spline (Ruppert, Wand and Carroll, 2003) and identifying the peak corresponding to white matter (in T1 scans this is the rightmost or highest-intensity peak; in T2 scans this is the overall mode). The interval around this peak, whose width may be adjusted by tuning parameters, is the white stripe. Every voxel intensity within the brain is then linearly scaled by the mode μiτ and trimmed standard deviation σiτ of intensities within the white stripe:
$$ v^{*}_{i\tau}(x) = \frac{v_{i\tau}(x) - \mu_{i\tau}}{\sigma_{i\tau}}. $$
Letting Bi equal the total number of brain voxels for subject i, we calculated the second and third biomarkers (Figure 3) using the T1 and T2 scans, respectively, by taking the average normalized intensity after WhiteStripe within the brain voxels only (as determined by the brain segmentation mask in the previous section):
$$ \gamma_{i2} = \frac{1}{B_i} \sum_{x:\, b_i(x) = 1} v^{*}_{i1}(x), \qquad \gamma_{i3} = \frac{1}{B_i} \sum_{x:\, b_i(x) = 1} v^{*}_{i2}(x). $$
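The normalization and averaging steps can be sketched as follows. This is a deliberately simplified stand-in for WhiteStripe, not the published algorithm. It assumes the white matter peak has already been located, defines the stripe by a hypothetical intensity half-width rather than by quantiles, and uses a plain (untrimmed) standard deviation. All names are illustrative.

```python
from statistics import mean, pstdev


def whitestripe_like_normalize(intensities, peak, half_width):
    """Center intensities on the white matter peak and scale by the standard
    deviation of the intensities that fall inside the stripe around the peak."""
    stripe = [v for v in intensities if abs(v - peak) <= half_width]
    if len(stripe) < 2:
        raise ValueError("white stripe contains too few voxels")
    sigma = pstdev(stripe)  # untrimmed SD: a simplification of the method
    if sigma == 0:
        raise ValueError("white stripe has zero variance")
    return [(v - peak) / sigma for v in intensities]


def mean_normalized_intensity(normalized, brain_mask):
    """Average normalized intensity over brain voxels only."""
    inside = [v for v, b in zip(normalized, brain_mask) if b]
    if not inside:
        raise ValueError("brain mask is empty")
    return mean(inside)


# Hypothetical intensities; the first three voxels are brain.
intensities = [90.0, 100.0, 110.0, 60.0, 20.0]
brain = [1, 1, 1, 0, 0]
normalized = whitestripe_like_normalize(intensities, peak=100.0, half_width=10.0)
print(mean_normalized_intensity(normalized, brain))
```

Because the scaling is linear within each subject, the resulting averages are in units of white-stripe standard deviations, which is what allows them to be compared across subjects.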
3.2.3. Sulcal Effacement
Finally, we considered that sulcal effacement, which is associated with BV scores of 6 to 8, might be extracted from the MRI by detecting gyri, or ridges, of the cerebral cortex (Figure 4). To do so, we used a filter on the Hessian matrix of each MRI image (Frangi et al., 1998).
In a 3-dimensional image, the Hessian matrix Hiτ(x) at voxel x contains information about the local curvature around x for subject i and modality τ. Typically, Hiτ(x) is calculated by convolving a neighborhood of x with derivatives of a Gaussian kernel. The three eigenvalues of Hiτ(x), ordered by magnitude so that |λ1| ≤ |λ2| ≤ |λ3|, have a geometric interpretation: gyral or planar structures correspond to small magnitudes of λ1 and λ2, and a large magnitude of λ3 (Frangi et al., 1998). Tubular and spherical structures, on the other hand, are associated with different patterns in these eigenvalues, so the following dissimilarity measures (Frangi et al., 1998) are required to identify gyral features:
$$ R_A = \frac{|\lambda_2|}{|\lambda_3|}, \qquad R_B = \frac{|\lambda_1|}{\sqrt{|\lambda_2 \lambda_3|}}, \qquad S = \sqrt{\lambda_1^2 + \lambda_2^2 + \lambda_3^2}. $$
The vesselness image is a function of these dissimilarity measures and other tuning parameters. By calculating its value at each voxel, we produced a probability-like map highlighting gyral features.
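As an illustration of how Hessian eigenvalues are summarized, the sketch below computes the three dissimilarity measures of Frangi et al. (1998) for a single voxel: R_A = |λ2|/|λ3|, R_B = |λ1|/sqrt(|λ2 λ3|), and S = sqrt(λ1² + λ2² + λ3²), with eigenvalues ordered by increasing magnitude. It is assumed here that these are the measures referred to above. The handling of zero eigenvalues is a convention of this sketch, and the example eigenvalues are hypothetical.

```python
import math


def frangi_measures(eigenvalues):
    """Return (R_A, R_B, S) for the three Hessian eigenvalues at one voxel."""
    if len(eigenvalues) != 3:
        raise ValueError("expected exactly three eigenvalues")
    # Order by magnitude so that |l1| <= |l2| <= |l3|.
    l1, l2, l3 = sorted(eigenvalues, key=abs)

    # Ratios are set to 0.0 when their denominator vanishes (sketch convention).
    r_a = abs(l2) / abs(l3) if l3 != 0 else 0.0
    denom = math.sqrt(abs(l2 * l3))
    r_b = abs(l1) / denom if denom > 0 else 0.0
    s = math.sqrt(l1 ** 2 + l2 ** 2 + l3 ** 2)
    return r_a, r_b, s


# Plate-like pattern: two small eigenvalues and one of large magnitude.
r_a, r_b, s = frangi_measures((0.1, -0.2, -4.0))
print(r_a, r_b, s)
```

A plate-like voxel yields a small R_A, whereas a tube-like voxel (two large eigenvalues of similar magnitude) yields R_A close to 1, which is what lets the filter separate the two patterns.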
We used the Hessian filter implemented in ITK-SNAP’s Convert3D tool (Yushkevich et al., 2006) on the T2 images, as they showed the best contrast between CSF and brain tissue. Due to limited field of view and the quality of brain segmentation around the top and bottom of the brain, the Hessian filter was only calculated in MRI slices taken from central portions of the cerebral hemispheres. The middle portion was defined by removing the top and bottom 3 slices (of the axial sequence) of each T2 image, as well as any voxels neighboring extracranial tissue (Figure 4).
We defined the final biomarker (Figure 3) as the median of the observed Hessian filter intensities in the T2 image limited to the brain voxels defined by the erosion method above:
$$ \gamma_{i4} = \operatorname{median}\{\, h_i(x) : x \in E_i \,\}, $$
where hi(x) is the Hessian filter intensity at voxel x of the T2 image and Ei is the set of brain voxels retained after erosion. The median was chosen because the distribution of the Hessian filter intensities within each image was highly right skewed; this resulted in a more conservative and robust characterization of the difference between severe and non-severe cases.
4. Analysis and Prediction
4.1. Model
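To predict highly increased brain volume (a median BV score of 7 or higher), we defined the binary outcome variable for subject i and median BV score threshold k as the indicator variable Yik = 1(BVi ≥ k), where BVi is the median BV score for subject i. The main outcome of interest in our study was Yi7. Together with the biomarkers introduced in the previous section, we formed the multivariate logistic regression model
$$ \operatorname{logit} P(Y_{i7} = 1) = \beta_0 + \beta_1 \gamma_{i1} + \beta_2 \gamma_{i2} + \beta_3 \gamma_{i3} + \beta_4 \gamma_{i4}. \tag{1} $$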
The estimated coefficients fit on the training set are shown in Table 2. Subjects with a BV score of 7-8 were associated with a lower median Hessian filter (p < 0.01), signifying fewer prominent sulci in the brain. This was the only coefficient found to be significantly different from zero, consistent with radiologists’ reports that sulcal effacement was the most important factor in determining a higher BV score. Both the intercept term and the coefficient of BPF had high standard errors, suggesting that there may have been near-complete data separation. For this reason, we did not interpret those coefficients.
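For readers less familiar with logistic regression, the sketch below shows how a fitted model of this form turns four biomarker values into a predicted probability of severe brain swelling. The coefficients used in the example are hypothetical placeholders chosen only to show the direction of the sulcal effect; they are not the estimates reported in Table 2.

```python
import math


def predicted_probability(biomarkers, intercept, coefficients):
    """Inverse-logit of the linear predictor intercept + sum_j beta_j * gamma_j."""
    if len(biomarkers) != len(coefficients):
        raise ValueError("one coefficient is required per biomarker")
    eta = intercept + sum(b * g for b, g in zip(coefficients, biomarkers))
    # Numerically stable sigmoid that avoids overflow for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    exp_eta = math.exp(eta)
    return exp_eta / (1.0 + exp_eta)


# Hypothetical coefficients: only the Hessian median (4th biomarker) matters,
# with a negative sign so that a lower median implies a higher probability.
coefficients = [0.0, 0.0, 0.0, -5.0]
low_hessian = predicted_probability([0.9, -1.0, 1.0, 0.1], 0.0, coefficients)
high_hessian = predicted_probability([0.9, -1.0, 1.0, 0.5], 0.0, coefficients)
print(low_hessian > high_hessian)  # True
```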
4.2. Prediction Accuracy
In addition to model (1), we performed two sets of sensitivity analyses. The biomarkers γ·j for j ∈ {1,…, 4} were developed to identify images with BV ≥ 7; we considered models for two additional outcomes, in which the threshold k for determining severely increased brain volume was relaxed. Because γi4, the measure of sulcal effacement, was both the most statistically and clinically important predictor of severe cases, we also considered models with γi4 alone (“Sulcal Only”) as opposed to the “Full Set” of covariates. In total, we examined 5 alternate models in addition to the main model, which are summarized in Table 3.
Prediction accuracy was assessed using area under the Receiver Operating Characteristic curve (AUC), which considers the sensitivity and specificity across different thresholds of the predicted outcome. For all models, the AUC was high, ranging from 0.81 to 0.96 in the training set, and 0.90 to 0.97 in the testing set. Models with outcome received a lower AUC than those with outcome and , although the 95% confidence intervals for all models intersect. This finding suggests—in corroboration with clinical observations—that patients with a BV score of 6 represent borderline cases and are therefore graded with the most uncertainty.
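The AUC has an equivalent rank-based definition that is easy to compute directly: it is the proportion of (severe, non-severe) pairs in which the severe case receives the higher predicted score, with ties counted as one half. The sketch below implements that definition; the function name and example values are illustrative, and confidence intervals are not computed.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for severe cases, 0 for non-severe cases.
    scores: predicted probabilities (or any score where higher means severe).
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # ties count as half a concordant pair
    return concordant / (len(positives) * len(negatives))


# Four hypothetical subjects: three of the four case/control pairs are ordered correctly.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```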
We compared models with γi4 as the sole covariate to the full model using a likelihood ratio test, and found that the reduced model performed similarly when the outcome was (p > 0.3 in both training and test sets), moderately when the outcome was (p < 0.01 in the testing set only), and poorly when the outcome was (p < 0.01 in both training and test sets). In other words, the sulcal effacement biomarker was sufficient to classify cases scoring a 7 to 8, but all 3 biomarkers were needed to accurately classify cases with BV scores of 6 to 8. This suggests that the measures of ventricular CSF and gray-to-white matter differentiation, while less relevant for predicting cases scoring 7 or higher, may be useful in differentiating cases that were assigned a score of 5 versus 6.
Since there were systematic differences in the number of raters between the training and test sets, and to confirm that our results were not dependent on the initial split, we re-sampled the full dataset 100 times to form 100 training and test sets. In each re-sampling iteration, we fit the six aforementioned models using the training set and calculated the AUC (and 95% confidence interval) on the training and test set. Then, the average AUC (and 95% confidence interval endpoints) was calculated over the 100 iterations (Table 4).
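The resampling scheme can be sketched as repeated random splits with the test-set AUC averaged over iterations. To stay short, this sketch does not refit a model in each iteration. It relies on the fact that AUC depends only on the ranking of predictions, so a single-covariate logistic model gives the same test AUC as the covariate itself whenever the fitted slope has the expected sign. Reproducing the multi-covariate models would require fitting on each training split. The split fraction, seed, and all names are assumptions.

```python
import random


def auc(labels, scores):
    """Pairwise (rank-based) AUC; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return total / (len(pos) * len(neg))


def resampled_test_auc(labels, scores, n_iter=100, test_fraction=0.5, seed=0):
    """Mean test-set AUC over n_iter random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    n_test = max(1, round(test_fraction * len(indices)))
    aucs = []
    # Cap the number of attempts so the loop always terminates.
    for _ in range(100 * n_iter):
        if len(aucs) == n_iter:
            break
        rng.shuffle(indices)
        test = indices[:n_test]
        y_test = [labels[i] for i in test]
        # Skip splits whose test set lacks one of the two classes.
        if 0 < sum(y_test) < len(y_test):
            aucs.append(auc(y_test, [scores[i] for i in test]))
    if not aucs:
        raise ValueError("no split contained both classes")
    return sum(aucs) / len(aucs)


# Hypothetical, perfectly separated data: every valid split has AUC 1.
labels = [0] * 10 + [1] * 10
scores = list(range(20))
print(resampled_test_auc(labels, scores))  # 1.0
```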
In the resampling schema, we observed the same pattern in prediction performance: overall, classification accuracy was high for all models in both the training (mean AUC = 0.85-0.97) and test sets (mean AUC = 0.86-0.96). We again found that the Sulcal Only model performed as well as the full model for and , but notably worse than the full model for . Together, these results show that our derived biomarkers and model are able to accurately and robustly differentiate subjects with highly increased brain volume (BV scores of 7 or 8) from those without. The measurement of sulcal effacement γi4 was found to be the most important factor to identify cases with BV ≥ 7, while measures of ventricular CSF volume and gray and white matter differentiation γi1, γi2, γi3 provided more value in identifying “borderline” cases.
5. Discussion
We developed a method for statistical image analysis of low resolution, noisy brain MRI after determining that standard methods developed for images obtained on a high-field MRI scanner did not perform well on lower resolution images. Our method involved creating and validating a multi-atlas, integrative framework to automate the radiological assessment of brain volume, a biomarker strongly associated with death in children with CM. An attractive feature of our pipeline is that it requires only T1- and T2-weighted MRI sequences for analysis. Our logistic classification model is parsimonious, aligns with clinical observations, and has high predictive accuracy. We hope that the implementation of these findings in low-resource environments can help to address both the shortage of radiologists for manual MRI interpretation, as well as the challenge of interpreting images from low field MRI.
Our results provide insight into how radiologists score brain edema on MRI, generally supporting the stated importance of sulcal effacement over ventricular size and gray-to-white delineation. The superior performance of the sulcal effacement biomarker suggests that higher scores of brain edema were predominantly derived from this MRI feature, even as other features such as loss of gray-to-white matter delineation are also prescribed features for images assigned brain volume scores of 7 and 8. Future efforts to provide Hessian filter images or sulcal effacement scores might assist individual radiologists in producing more consistent gradings of brain volume.
A limitation of our approach is that logistic regression only accommodates binary outcomes (severe and non-severe brain volume scores); we did not predict the brain volume score itself. In exploratory analyses, models which predicted BV score directly had high misclassification error and mean squared error (results omitted), suggesting that more information or more sophisticated models may be needed to predict the ordinal score.
Future analyses could apply our pipeline to predict disease outcome: Kampondeni et al. (2018) found that global CSF volume was the best predictor of prognosis in patients with CM. However, the accuracy of global CSF measurements in our current pipeline is limited by the quality of segmentation at the edge of the brain. Recent developments in deep learning methods for brain segmentation (Ronneberger, Fischer and Brox, 2015) could address this issue, although such procedures may require larger sample sizes with manual delineations of brain tissue than are currently available.
In summary, we introduced and validated a biologically and statistically principled method of biomarker development using images from low field strength MRIs, even in images with additional artifacts. We note that these strategies, which involve borrowing strength from publicly available high-resolution data whenever possible, and considering aggregate statistics that are more robust to extreme values, can be applied to any study of low-resolution brain images. The principles behind the tools introduced in this study are also broadly applicable to the design of new techniques that automate existing, clinically validated tasks.
Acknowledgements
The authors were supported by NIH Grants R01 MH112847, R01 NS112274 and R01 NS060910.