Abstract
The majority of conclusions and interpretations in quantitative sciences such as neuroscience are based on statistical tests. However, statistical inferences commonly rely on p-values rather than on more expressive measures such as posterior probabilities, false discovery rates (FDR) and statistical power (1 − β). The aim of this report is to make these statistical measures more accessible in single and multiple statistical testing. For multiple testing, Empirical Bayesian Inference (Efron et al., 2001; Efron, 2007) was implemented using non-parametric test statistics (the Area Under the Receiver Operating Characteristic curve or Spearman's rank correlation) and Gaussian Mixture Model estimation of the probability density functions of the original and bootstrapped data. For single statistical tests, the same test statistics are used to construct and estimate the null and non-null probability density functions using bootstrapping under null and non-null grouping assumptions. Simulations were used to test the reliability of the results under a wide range of conditions. The results conform to the real truth in the simulated conditions, and this conformity holds under the various conditions imposed on the simulation data. The open-source MATLAB code is provided, and the utility of the approach is discussed for real-world electroencephalographic signals. This implementation of Empirical Bayesian Inference and informed selection of statistical thresholds is expected to facilitate more realistic scientific deductions in versatile fields, especially in neuroscience, neural signal analysis and neuroimaging.
1 Introduction
The majority, if not all, of the conclusions and interpretations in quantitative sciences, especially in neuroscience and neuroimaging, are based on statistical tests. While traditional hypothesis tests based on p-values are still dominant, there have been legitimate remarks on the need for more reliable and thorough statistical procedures and practices (Nuzzo, 2014). It is therefore vital that more meaningful statistical measures be accessible for statistical inference, namely: posterior probabilities, false discovery rates (FDR), and statistical power (1 − β). These measures are especially useful in statistical inferences involving high-dimensional data in neural signal connectivity or imaging.
Empirical Bayesian Inference (EBI) has shown promise in large-scale between-group comparisons (Efron, 2007b, 2004), especially in genomics (Efron, Tibshirani, Storey, and Tusher, 2001) and to some extent in applications of neuroelectric signal and connectivity analysis (Singh, Asoh, Takeda, and Phillips, 2015). In EBI, constant prior probabilities are estimated from the data in large-scale multi-variable inferences or hypothesis testing, and these priors are subsequently used to find the posterior probabilities using the estimated probability density functions of the pooled test statistics and the null distribution. It is possible to relate the posterior probabilities to frequentist concepts such as the False Discovery Rate, FDR (Benjamini and Hochberg, 1995; Benjamini, Krieger, and Yekutieli, 2006), as well as power. While the theoretical framework is adequately established, the existing numerical implementations (Efron, 2007b) are optimised and suitable only for specialised applications (i.e. statistical genetics, where only a small fraction of the tests are real findings), and some of the essential measures such as FDR and power are not immediately available for informed threshold selection. From a practical viewpoint, the existing software package locfdr in R (R Core Team, 2016) may require selection of several parameters and is not immediately available for neuro-electro-magnetic signal and connectivity analysis in packages such as FieldTrip (Oostenveld et al., 2010), or for neuroimaging analysis in packages such as SPM. Consequently, there is a need for new implementations that facilitate the application of EBI in a wider range of situations (e.g. small or large proportions of test variables belonging to the affected group), more explicitly relate the posterior probabilities to FDR and power (allowing informed decisions on threshold selection), and intrinsically account for data with non-normal distributions.
Such informed selection of statistical thresholds is also challenging in complex statistical inferences (e.g. with non-normal data distributions) involving single or only a few comparisons or inferences. It would be desirable to similarly select a threshold value for the test statistic that corresponds to a known combination of Type I (α) and Type II (β) errors in a single comparison.
Here, we address these needs with an implementation of EBI based on non-parametric test statistics, Gaussian Mixture Models and null bootstrapping. This implementation readily handles one-sample, two-sample (between-group comparison) and correlation problems in multi-dimensional data with arbitrary distributions, making it usable for a wide range of applications. Furthermore, for threshold selection in univariate testing (in the absence of prior probabilities), the non-null distribution is estimated using non-null bootstrapping. This approach approximates the non-null probability density function in order to enable threshold selection for a desired combination of α and β values, regardless of the distribution of the data.
2 Methods
2.1 Empirical Bayesian Inference (EBI) for Multiple Inferences
2.1.1 EBI framework
EBI, initially used in genomic applications (Efron et al., 2001), was subsequently expanded theoretically and in terms of computational implementation (Efron, 2004, 2007a). Here, the fundamentals are briefly explained.
Suppose Xij, i = 1…m, j = 1…N, represents N variables or N-dimensional data sampled from m observations/subjects. The grouping information of the data is represented by gi, which is a binary label (0 or 1) for a two-sample (2-group) comparison. Statistical testing, performed independently on each variable according to the grouping information (e.g. between-group comparisons) using the test statistic zj, yields N values. The probability density function of zj, i.e. the probability of the data given the hypotheses, is denoted by:

$$f(z) = p_0 f_0(z) + p_1 f_1(z) \tag{1}$$

where p0 and p1 are the prior probabilities of the null and non-null hypotheses (p1 = 1 − p0), and f0(z) and f1(z) are the probability density functions of z under the null and non-null (grouping) assumptions, respectively. The posterior probabilities, i.e. the probabilities of the hypotheses given the data, are subsequently given by:

$$P_0(z) = \frac{p_0 f_0(z)}{f(z)} \tag{2}$$

$$P_1(z) = 1 - P_0(z) = \frac{p_1 f_1(z)}{f(z)} \tag{3}$$
Comparison of the posterior probability of the non-null hypothesis P1(zj) against a threshold Pcrit provides a Bayesian inference, as well as subsequent frequentist quantities such as the local false discovery rate, fdrloc(z) = P0(z), Type I error α, Type II error β, and the FDR value pertaining to the chosen Pcrit. The classic EBI includes several stages for estimating the posterior probabilities: first, using a measure of between-group difference (e.g. Student's t-statistic or p-values) and transforming the values to normal (e.g. by the inverse normal cumulative distribution function); second, estimating f(z) from the zj histogram; third, estimating f0(z) by theoretical assumptions on the distribution of zj or by bootstrapping; fourth, estimating p0, usually through the assumption that f1(argmaxz f0(z)) = 0; and finally, finding P1(z) by (2)-(3). Here, we explain the details of each stage for the new implementation of the EBI.
2.1.2 Test Statistic
Instead of using Student's t-statistic or a p-value, which reflect the difference of the means of two groups, a non-parametric measure was used as the test statistic here. The Area Under the Receiver Operating Characteristic curve (AUROC), A, is closely related to the Mann-Whitney U statistic. AUROC is the probability of data in one group being larger (or smaller) than in the other group, Pr(X_{g=0} < X_{g=1}); hence it is considerably independent of the distribution of the original data, as well as of any measure (e.g. mean or median) used for comparison. This has been thoroughly discussed elsewhere (Zhou, McClish, and Obuchowski, 2009). AUROC was therefore taken as the test statistic for comparing the m data points in the two groups for each comparison of the N variables.
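As a minimal sketch (assuming MATLAB's tiedrank from the Statistics and Machine Learning Toolbox; function and variable names are illustrative, not necessarily those of the EBI Toolbox), A can be computed from its relation to the Mann-Whitney U statistic:

```matlab
function A = auroc(x0, x1)
% AUROC via its relation to the Mann-Whitney U statistic:
% A estimates Pr(X1 > X0) from the ranks of the pooled samples.
n0 = numel(x0); n1 = numel(x1);
r  = tiedrank([x0(:); x1(:)]);        % midranks handle ties
U  = sum(r(n0+1:end)) - n1*(n1+1)/2;  % U statistic for the second group
A  = U / (n0*n1);                     % AUROC, bounded in [0, 1]
end
```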
It is noteworthy that while AUROC is independent of the underlying distribution of the data, the data in all N variables should come from the same null and alternative distributions. The AUROC distribution depends on the numbers of data points in the first and second groups, as well as on the distribution of the original data. Therefore, the number of data points in each group should ideally be the same for all variables, and all of them should come from the same arbitrary distribution (e.g. normal, Beta, Gamma, uniform). This is especially relevant as the curve fitting for the null and mixed density functions, as well as the bootstrapping (for construction of null data), rely on pooling data from all variables.
While it is not essential to transform the AUROC values to normal, it is computationally beneficial to do so. The transformation to normal allows the use of more robust estimation methods, such as Gaussian kernel methods, that work best in an unbounded domain rather than in the bounded [0, 1] domain. As the AUROC distribution is bounded between 0 and 1, with an expected value of 0.5 under the null, a mapping of 2(·) − 1 combined with Fisher's Z-transform (Fisher, 1915; Zhou et al., 2009) can approximately map the data to normal and stabilise the variance σ²(·) (Zhou et al., 2009; Qin and Hotilovac, 2007):

$$z_j = \frac{1}{2}\log_e\left(1 + r_j + \epsilon\right) - \frac{1}{2}\log_e\left(1 - r_j + \epsilon\right), \qquad r_j = 2A_j - 1 \tag{4}$$

The tuning parameter ϵ, which is added here to the classic definition, serves to limit the extreme z values, facilitating numerical integration in later steps. To limit the z values to [−10, 10], ϵ = 4.55e−5 was adopted here. To avoid sharp distributions where AUC = 1 and z = 10 (AUC = 0 and z = −10), the values larger than 9 (smaller than −9) were redistributed to a truncated normal distribution. The redistribution of the h extreme values larger than 9 assigned the ith value to $\mathrm{iCDF}_{\mathcal{N}_t}(i/(h+1))$, where iCDF is the inverse cumulative distribution function and $\mathcal{N}_t$ is the normal distribution with mean 9 and standard deviation 0.25, truncated to [5, 13]. The values smaller than −9 were similarly reassigned.
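A sketch of the transformation in (4) as reconstructed above (the exact regularised form in the toolbox may differ; the extreme-value redistribution is omitted for brevity):

```matlab
function z = auc2z(A, epsilon)
% Map AUROC in [0,1] to approximately normal z-values: the 2A-1 mapping
% followed by the epsilon-regularised Fisher Z-transform of Eq. (4).
if nargin < 2, epsilon = 4.55e-5; end % value quoted in the text
r = 2.*A - 1;                         % map [0,1] -> [-1,1]
z = 0.5*log(1 + r + epsilon) - 0.5*log(1 - r + epsilon);
end
```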
2.1.3 Estimating the f(z) Histogram
Gaussian Mixture Model (GMM) distributions (McLachlan and Peel, 2004) were used to estimate the probability density f(z), using the pool of zj, j = 1…N. Using maximum likelihood estimation of the GMM parameters, models with an increasing number of Gaussian kernels were fitted to the zj values. The model with the minimum Akaike Information Criterion (AIC) (Akaike, 1974) was eventually considered the preferred fit. This was concluded when the increasing number of kernels yielded 3 consecutive increases in the AIC. A similar approach has been previously used (Le, Pan, and Lin, 2003) in statistical genetics applications, but not in the context of EBI.
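A minimal sketch of the AIC-guided selection, assuming MATLAB's fitgmdist (the toolbox's exact stopping and regularisation settings may differ):

```matlab
% Fit GMMs with increasing numbers of kernels; keep the minimum-AIC model,
% stopping after 3 consecutive AIC increases.
z = z(:);                                  % pooled z-values, one column
bestAIC = inf; bestGM = []; nRises = 0;
for k = 1:20
    gm = fitgmdist(z, k, 'RegularizationValue', 1e-6, 'Replicates', 3);
    if gm.AIC < bestAIC
        bestAIC = gm.AIC; bestGM = gm; nRises = 0;
    else
        nRises = nRises + 1;
        if nRises >= 3, break; end         % 3 consecutive increases: stop
    end
end
fz = @(x) pdf(bestGM, x(:));               % estimated mixture density f(z)
```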
2.1.4 Estimating the Null Distribution f0(z)
For robust estimation of the null distribution, the data labels gi were resampled (with replacement) B0 times; for each set of the obtained labels, the above-mentioned procedures used for the original data and labels were applied to yield the Aj and subsequently the zj values. The data from all B0 bootstraps and all N tested variables were pooled to estimate the null distribution. For computational efficiency, it may be helpful to downsample the pooled null data. In this implementation, when the number of null data points exceeded 20,000, the null data were sorted and only every sk-th data value was kept for GMM estimation (sk: the integer multiple of 20,000 in the number of null data values). Using a similar GMM estimation as for f(z), the null distribution f0(z) was estimated.
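A sketch of the null pooling, using the helper sketches above and a label permutation in place of label resampling (the toolbox's exact resampling scheme may differ):

```matlab
% Build the pooled null z-values from B0 label shuffles; X is m-by-N,
% g is the m-by-1 vector of 0/1 group labels.
B0 = 100; N = size(X, 2);
zNull = zeros(B0, N);
for b = 1:B0
    gp = g(randperm(numel(g)));            % break the label-data link
    for j = 1:N
        zNull(b, j) = auc2z(auroc(X(gp==0, j), X(gp==1, j)));
    end
end
zNull = zNull(:);                          % pool across bootstraps/variables
% gmNull: the same AIC-guided GMM fit applied to zNull estimates f0(z).
```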
2.1.5 Estimating the prior p0
The approach used by EBI for estimation of the prior p0 relies on the key assumption that at the maximum (peak) value of f0(z), the value of f1(z) is zero. Due to the smooth and reliable estimation of f0(z) and f(z) by the AIC-guided GMM fits, we may directly use the values of the estimated probability density functions to find the prior p0:

$$\hat{p}_0 = \frac{f\!\left(\mathrm{iCDF}_{f_0(z)}(0.5)\right)}{f_0\!\left(\mathrm{iCDF}_{f_0(z)}(0.5)\right)} \tag{5}$$

where iCDF_{f0(z)}(0.5) is the z value at which the Cumulative Density Function (CDF) of f0(z) is 0.5, i.e. the inverse CDF of 0.5, or the median of the null data.
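In code, with fz and a GMM fit gmNull to the pooled null data from the sketches above, (5) reduces to a density ratio at the null median:

```matlab
% Estimate the prior p0 at the null median (Eq. 5), capped at 1.
zMed = median(zNull);                     % proxy for iCDF_{f0}(0.5)
p0   = min(fz(zMed) / pdf(gmNull, zMed), 1);
p1   = 1 - p0;
```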
2.1.6 Estimating the Posterior P0(z)
Given the estimates of p0, f0(z) and f(z), the calculation of the posterior probabilities from (2) and (3) is straightforward. The posteriors were bounded between 0 and 1 to protect against numerical instability at very small probability values.
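A minimal sketch, reusing p0, fz and gmNull from above:

```matlab
% Posteriors from Eqs. (2)-(3), clipped to [0, 1] for numerical safety.
P0 = @(z) min(max(p0 .* pdf(gmNull, z(:)) ./ max(fz(z), eps), 0), 1);
P1 = @(z) 1 - P0(z);
```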
2.1.7 Estimating FDR and Power (1 − β)
Following the calculation of p0, p1, f0(z), f1(z), f(z), P0(z) and P1(z), the Type I error α, Type II error β, power (1 − β) and FDR (q) can be found by numerical integration. This is achieved by using a decision threshold value (see Section 2.1.8 for how this threshold is decided on) on either of these measures to infer which variables do or do not show an interesting effect. For a given decision threshold, a criterion on the posterior, Pcr, with the detection region

$$Z_{cr} = \{z : P_1(z) \geq P_{cr}\} \tag{6}$$

we may write:

$$\alpha = \int_{Z_{cr}} f_0(z)\,dz \tag{7}$$

$$\beta = 1 - \int_{Z_{cr}} f_1(z)\,dz \tag{8}$$

$$q = \frac{p_0\,\alpha}{p_0\,\alpha + p_1\,(1-\beta)} \tag{9}$$

As the parameter ϵ in Section 2.1.2 limits the values of z, it suffices to perform the integration over the range [−20, 20]. Additionally, the global values αg and βg show the separability of the probability density distributions regardless of a chosen Pcr.
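A sketch of the integration for a given Pcr, with names carried over from the sketches above (f1 is recovered from the mixture as (f − p0 f0)/p1):

```matlab
% alpha, beta and FDR (q) for a posterior threshold Pcr (Eqs. 7-9),
% by numerical integration over [-20, 20].
Pcr = 0.9;
zi  = linspace(-20, 20, 4001)';  dz = zi(2) - zi(1);
f0i = pdf(gmNull, zi);
f1i = max(fz(zi) - p0*f0i, 0) / max(p1, eps);  % f1 = (f - p0 f0)/p1
inR = P1(zi) >= Pcr;                           % detection region, Eq. (6)
alpha = sum(f0i(inR)) * dz;                    % Eq. (7)
beta  = 1 - sum(f1i(inR)) * dz;                % Eq. (8)
q     = p0*alpha / max(p0*alpha + p1*(1 - beta), eps);  % Eq. (9)
```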
2.1.8 Threshold Selection
The threshold selection and subsequent inference are driven by setting a Pcr value and comparing the values of P1(z) against this specific Pcr value, as required in the specific context of application. Alternatively, the availability of the computed values of α, β, q, and the detection ratio #{P1(zj) ≥ Pcr}/N as functions of P1(z) allows setting the Pcr value that corresponds to a specific value of any of these measures.
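For instance, wrapping the integration sketch above in a hypothetical helper ebiAlphaBetaQ(Pcr) (an assumed name, not a toolbox function), one could scan for the smallest Pcr meeting a target FDR:

```matlab
% Pick the smallest posterior threshold attaining a target FDR of 0.05.
PcrGrid = 0.50:0.001:0.999;
qGrid   = arrayfun(@(p) ebiAlphaBetaQ(p), PcrGrid);  % q at each threshold
PcrSel  = PcrGrid(find(qGrid <= 0.05, 1));
```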
2.2 Non-null Bootstrapping for Single Inference
The above-mentioned procedure is applicable to large-scale multiple testing, as this enables the estimation of the empirical mixed density f(z), the priors p0 and p1, and eventually the posteriors P0(z) and P1(z). For single statistical testing (N = 1), similar stages can be followed to calculate the test statistic, estimate the histograms or probability density functions, and calculate the null distribution. However, it is not possible to estimate the priors; hence, the posteriors are not available. Notwithstanding, it is possible to estimate f1(z) by a different approach, namely non-null bootstrapping, which makes it possible to estimate α and β for a specific threshold, expressed as a criterion on z (rather than on P1(z)). This approach is explained below.
2.2.1 Non-Null Bootstrapping & Estimating f1(z)
To estimate the non-null distribution f1(z), we may rely on bootstrapping in which the grouping information of the data gi is respected (in contrast to the commonly practised null bootstrapping, which aims to find the null distribution and in which the grouping information/labels are not respected). For this purpose, the following re-sampling algorithm was used:
1. If mb0 is the number of observations in group g = 0, take mb0 samples (with replacement) from {Xi· | gi = 0} to build Ξb0.
2. If mb1 is the number of observations in group g = 1 (mb0 + mb1 = m), take mb1 samples (with replacement) from {Xi· | gi = 1} to build Ξb1.
3. Find A (the AUROC) and consequently the z value according to (4), using the obtained sets of bootstrapped data Ξb0 and Ξb1.
4. Repeat steps 1-3 B1 times to obtain the needed samples zb, b = 1…B1.
The AIC-guided GMM fit was then used to estimate the distribution of the zb values, which estimates f1(z).
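A sketch of steps 1-4, reusing the auroc and auc2z helpers from above (x0, x1 are the two groups' data vectors):

```matlab
% Non-null bootstrap: resample within each group (labels respected),
% then estimate f1(z) from the zb pool by the AIC-guided GMM fit.
B1 = 2000; zb = zeros(B1, 1);
for b = 1:B1
    xb0 = x0(randi(numel(x0), numel(x0), 1));  % resample with replacement
    xb1 = x1(randi(numel(x1), numel(x1), 1));
    zb(b) = auc2z(auroc(xb0, xb1));            % z via Eq. (4)
end
```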
2.2.2 Estimating α and Power (1 − β)
Similar to the calculation of α and β for a given Pcr value in (7) and (8), it is possible to use numerical integration to find the relationship between α and β. If F0(z) is the cumulative density function corresponding to f0(z), and F1(z) the one corresponding to f1(z), then for a given two-tail decision region Z(α) = {z : z ≤ F0⁻¹(α/2) or z ≥ F0⁻¹(1 − α/2)}, it is possible to describe β as a function of α:

$$\beta(\alpha) = F_1\!\left(F_0^{-1}(1 - \alpha/2)\right) - F_1\!\left(F_0^{-1}(\alpha/2)\right) \tag{13}$$
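A sketch of the α-β curve via numerical CDFs, assuming gmNull and gmNonNull are the GMM fits to the null and non-null bootstrap pools (assumed names):

```matlab
% beta as a function of alpha (Eq. 13) for two-tailed thresholds.
alphas = linspace(1e-4, 0.2, 200); betas = zeros(size(alphas));
zi = linspace(-20, 20, 20001)'; dz = zi(2) - zi(1);
F0 = cumsum(pdf(gmNull, zi)) * dz;        % numerical CDF of f0
F1 = cumsum(pdf(gmNonNull, zi)) * dz;     % numerical CDF of f1
for k = 1:numel(alphas)
    zLo = zi(find(F0 >= alphas(k)/2, 1));       % lower critical value
    zHi = zi(find(F0 >= 1 - alphas(k)/2, 1));   % upper critical value
    betas(k) = interp1(zi, F1, zHi) - interp1(zi, F1, zLo);  % Eq. (13)
end
plot(alphas, 1 - betas); xlabel('\alpha'); ylabel('Power (1 - \beta)');
```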
2.2.3 Threshold Selection
The threshold selection in single testing is straightforward, given the known relationship between α and β in (13).
2.3 Numerical Implementation & Simulations
The numerical programming for the proposed EBI implementation was performed in MATLAB (versions 2016b-2018a, Mathworks Inc., Natick, MA, USA). The Empirical Bayesian Inference Toolbox for MATLAB is publicly available at https://github.com/NeuroMotor-org/EBI and is licensed under the BSD 3-Clause "New" or "Revised" License. To demonstrate the utility of the proposed implementation and to test its validity, it was applied to simulated data. Simulated data allow comparison of the performance measures to the real truth, which is not available in real-life applications. All simulations were performed in MATLAB. Different simulations were carried out, as detailed below.
2.3.1 A Demonstrative Example
An example similar to applications in neural signal analysis and neuroimaging was considered. The simulation included Nvar = 2000 variables, each with m0 = 20 observations/subjects in the first sample (e.g. controls) and m1 = 60 observations/subjects in the second sample (e.g. patients), totalling m = 80 observations/subjects. In the control observations, all variables had the same normal distribution, while in the other group the first 1600 variables had the same (null) distribution, and the next 300 and the last 100 variables came from normal distributions with two different mean shifts.
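A sketch of such a data set (the two mean shifts, 0.5 and 1.0, are illustrative assumptions, not necessarily the original parameters):

```matlab
% Simulated two-group data: 2000 variables, 1600 null, 400 affected.
Nvar = 2000; m0 = 20; m1 = 60;
X0 = randn(m0, Nvar);                        % controls: standard normal
X1 = randn(m1, Nvar);                        % patients: first 1600 null
X1(:, 1601:1900) = X1(:, 1601:1900) + 0.5;   % 300 shifted variables
X1(:, 1901:2000) = X1(:, 1901:2000) + 1.0;   % 100 more strongly shifted
X = [X0; X1];  g = [zeros(m0, 1); ones(m1, 1)];  % data and group labels
```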
2.3.2 Comparison Against Real Truth
Using the simulated data from the previous section, the β and FDR (or α) values were calculated as a function of the posterior threshold Pcr and were compared to the true values of β and FDR (α). The real values were found by using the original labels of the variables in the simulations. By comparing them to the labels detected by EBI, the true positive (TP), false positive (FP), false negative (FN) and true negative (TN) rates were calculated at each threshold, from which the real β and FDR were found. Additionally, the same data underwent EBI analysis with a previous implementation of EBI (Efron, 2007b) in R (R Core Team, 2016), used with the default parameter values.
2.3.3 Performance Under Different Conditions
Several simulations were performed to test the performance of the framework in a broader range of conditions. This controlled variation of the simulation conditions can test and inform of the performance in real-life applications, which is not easy with typical experimental data due to the lack of a gold standard or real truth. The parameters for generating the simulated data and applying the new EBI implementation included: Nvar = 200, 2000; p0 = 0.25, 0.75; m0 = 25, 100; m1 = 25, 100; normal (σ = 1) vs. Beta (a = 2, b = 10) distribution types; and difference or effect size (Cohen's d = 0.2, 0.9 for normal distributions, or shift d = 0.05, 0.09 for Beta distributions). The estimated p1, the real FDR at the expected value of 0.05, and the real β at the expected value of 0.2 were compared across all 64 simulation conditions. Each simulation condition was repeated 3 times to account for the non-deterministic nature of the implemented bootstrapping and estimation procedures.
2.3.4 Simulation of a Uni-Variate Example
To demonstrate the derivation of the α-β curve, a simple simulation with m0 = 15 and m1 = 25, random data with normal distributions for both groups (σ = 1), and a shift value (Cohen's d) of 0.5 was considered.
3 Results
3.1 Example of Multiple Testing with EBI
Figure 1 exemplifies the generated results and report for a typical simulated case as described in Section 2.3.1. Notice how the two probability density functions f0(z) and f1(z), as well as the prior p1, are the essential components giving rise to the posterior distributions P0(z) and P1(z). In addition, note how the choice of different threshold levels (colour-coded based on the criteria) enables an informed statistical threshold for inference, based on the levels of FDR and power they afford.
3.2 Comparison of Typical Behaviour Against Truth
Figure 2 compares the estimated FDR and β as functions of the threshold value (P1) against the real truth, using the simulation labels as described in Section 2.3.2. Notice the similarity of the real-truth curves and the estimates by the new implementation. In addition, as the simulated condition had a low prior p1, the results from the previous locfdr implementation in R showed good conformity to the real truth and to the new implementation.
3.3 Performance of the EBI under Various Simulation Conditions
Figure 3 compares the estimated prior p1 against the real values, as well as the real FDR and β values when estimated at nominal values of 0.05 and 0.2 (described in Section 2.3.3). In the majority of conditions, the estimated measures were very close to the real values and there was negligible difference between the 5 different iterations of the simulation. The exception is at low effect sizes combined with low numbers of observations/subjects and extreme prior values, where the estimation errors increase (possibly due to dissociations between the affected and non-affected variables and density functions). In the majority of the simulation cases, the locfdr R package did not converge; therefore, those results were not included.
3.4 Example Single Testing
Figure 4 shows the correspondence of different α values to β values for exemplary simulated data (Section 2.3.4).
4 Discussion
While EBI (Efron et al., 2001; Efron, 2007b) provided a comprehensive theoretical framework for multivariate high-dimensional inference, the previous numerical implementation of EBI provided valid results only in limited conditions, namely low prior p1 values and rather high threshold values. Importantly, the previous implementation required numerous adjustments and parameter selections. The new implementation eliminates the need for parameter tuning (especially by using the AIC for GMM fitting) and allows the method to be used in a broader range of conditions. Importantly, the statistical power is explicitly estimated and made available for inference.
4.1 Applications
The new approach suits applications involving neural signal analysis, such as electromyography (EMG) and electroencephalography (EEG), as well as neuroimaging, e.g. Magnetic Resonance Imaging (MRI). More specifically, spectral, time-frequency, as well as functional and effective connectivity analyses can benefit most from the new statistical implementation. In applications such as fMRI, the need for improved statistical inference has been explicitly emphasised by highlighting the limitations of existing techniques that lead to high false discovery rates (Eklund, Nichols, & Knutsson, 2016). The existing attempts to improve statistical inference in EEG connectivity analysis (Singh, Asoh, & Phillips, 2011; Singh et al., 2015) have yielded only partial success to date. Here, we used simulations to compare the EBI reports against the real truth. Importantly, in two recent studies, we showed that EBI is reasonably cross-validated against traditional frequentist methods. EBI was cross-validated against the correction of the significance level α according to the number of principal components in the data (Iyer et al., 2017) when testing the significance of EEG time series. Moreover, the inference of EBI was cross-validated against adaptive False Discovery Rates (aFDR) (Benjamini & Hochberg, 1995; Benjamini et al., 2006) in comparing the average EEG connectivity patterns between healthy individuals and patient groups (Nasseroleslami et al., 2017).
A unique advantage of EBI is its ability to implicitly account for potential positive and negative correlations that may be present in the data. It is therefore a suitable candidate for situations where positive or negative correlations exist in multi-dimensional data (e.g. EEG/MEG network connectivity analysis, or structural or functional MR imaging). This is afforded by the way the individual z-values pertaining to each variable are aggregated (i.e. the independent calculation of the test scores) and by the chosen approach for the calculation of a null distribution that similarly corresponds to the same data with or without correlation structures (Efron, 2007a). The flexible estimation of the null distribution from permuted data by GMM supports inference in rather broad conditions.
In applications where only simple statistical testing is required, the calculated FDR is an accurate estimate, equivalent to that of pooled multivariate permutation tests, and can be used without reference to Bayesian inference.
4.2 Limitations
The practical range for the number of variables is between 100 and 10,000; performance degrades beyond this range. This limitation originates from the EBI framework rather than from a specific numerical implementation.
Too few variables lead to inaccurate probability density estimations, where a few isolated data points are not adequately represented by continuous distributions. In this situation, extreme values of the prior probabilities would correspond to fewer data points with a real effect; hence, the probability densities fitted to these values will not be very representative or accurate.
On the other hand, too many variables lead to unwanted spread of the null distribution, to the extent that inference at low FDR values does not yield significant results. This situation, however, can be partly remedied by applying the EBI as several independent batches of analyses on mutually exclusive chunks of data, each containing different variables. This is permissible as quantities such as FDR and posterior probability (and to a reasonable extent, power) are not affected by multiple testing (as is the case for p-values).
As the complete procedure for EBI relies on permutations for building the null distribution, the procedure depends on random number generation, with some variability in each run. Additionally, the numerical procedures for estimating the GMM fits to the distributions are subject to minor variability in each run. These two factors make the inference a non-deterministic procedure, which is subject to some variability. While it is important to take this into consideration, the results in Figure 3 indicate that this variability does not change the nature of the results.
Future studies are expected to focus on the factors that lead to inaccurate numerical estimations, on further extending the range of operating conditions, as well as on theoretical developments for robust estimation of the prior when extreme data and conditions are processed.
5 Conclusion
Implementations of statistical inference such as EBI that can inform of the posterior probabilities and statistical power need to become common practice. This implementation of threshold selection for EBI and single testing has the potential to add value to neural signal analysis and neuroimaging studies by enabling realistic inference on high-dimensional multivariate data.
Acknowledgement
The author would like to thank the students and staff in the Academic Unit of Neurology, at Trinity College Dublin, the University of Dublin for facilitating and supporting this work. The study was supported by Irish Research Council (Government of Ireland Postdoctoral Research Fellowship GOIPD/2015/213 to the author) and by Science Foundation Ireland (SFI/16/ERCD/3854).
Appendix A: Extension of the Tests from Comparison to Correlation Coefficients and Beyond
The original EBI has been primarily used for two-sample one-dimensional location problems (between-group comparisons), such as gene discovery by comparing a control group to a treatment or affected group, or similarly the comparison of healthy individuals against patients, as intended in neuro-electro-magnetic signal analysis. Notwithstanding, the framework can be similarly used for virtually any statistical test, such as one-sample location problems (where comparison of data against zero or paired comparison of data is intended), as well as correlation analysis. These options have been implemented in the EBI Toolbox for MATLAB.
A.1 One-Sample Inference
For a one-sample 1-dimensional location problem, including n data points xi, the Wilcoxon signed-rank test statistic is defined as $W = \sum_{i:\,x_i > 0} R(i)$, where R(i) is the rank of {|xi| : xi ≠ 0}. The normalised test statistic may be defined as Wn = W/(ΣR(i)), which is bounded between 0 and 1. Therefore, Wn can be transformed to the z space using (4), as in the case of the AUC, and the subsequent procedures are similar to those of the two-sample problem. The bootstrapping procedure for building the null distribution is carried out by performing random sign flips (multiplying the data by random −1 or 1 values) and recalculating the test statistic for the number of bootstrapping cycles.
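A sketch of the normalised statistic (with a simple treatment of zeros and ties; the toolbox may differ):

```matlab
function Wn = wilcoxonWn(x)
% Normalised Wilcoxon signed-rank statistic, bounded in [0, 1].
x  = x(x ~= 0);                 % discard zeros, as in the definition
R  = tiedrank(abs(x));          % ranks of |x_i|
Wn = sum(R(x > 0)) / sum(R);    % W / (sum of all ranks)
end
```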
A.2 Correlation Coefficient
To use the same framework for the analysis of correlation coefficients, Spearman's correlation coefficient ρ can be mapped to the [0, 1] range (as for AUROC, A) by the transformation (ρ + 1)/2. In this case, the grouping information will have equal numbers of paired zeros and ones. The null permutation of the data consists of separate re-sampling (with replacement) from the first and second groups of observations for the same data points as the original data, which disregards their pairing information.
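A minimal sketch of the mapping for paired vectors x and y, reusing the auc2z helper from Section 2.1.2:

```matlab
% Map Spearman's rho to [0, 1] and then to z-space via Eq. (4).
rho = corr(x(:), y(:), 'Type', 'Spearman');
A   = (rho + 1) / 2;            % [-1, 1] -> [0, 1], as for AUROC
z   = auc2z(A);
```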