Abstract
Integration of unstructured and very diverse data is often required for a deeper understanding of complex biological systems. In order to uncover communalities between heterogeneous data, the data is often harmonized by constructing a kernel and numerical integration is performed. In this study we propose a method for data integration in the framework of an undirected graphical model, where the nodes represent individual data sources of varying nature in terms of complexity and underlying distribution, and where the edges represent the partial correlation between two blocks of data. We propose a modified GLASSO for estimation of the graph, with a combination of cross-validation and extended Bayes Information Criterion for sparsity tuning. Furthermore, hierarchical clustering on the weighted consensus kernels from a fixed network is used to partitioning the samples into different classes. Simulations show increasing ability to uncover true edges with increasing sample size and signal to noise. Likewise, identification of non existing edges towards disconnected nodes is feasible. The framework is demonstrated for integration of longitudinal symptom burden data from the 2nd and 3rd year of life with 21 diseases precursors as well as the development of asthma and eczema at the age of 6 years from 403 children from the COPSAC2010 mother-child cohort, suggesting that maternal predisposition as well as being born preterm indirectly lead to higher risk of asthma via increased respiratory symptom burden.
1 | INTRODUCTION
Understanding complex biological systems often requires the need to integrate data of very diverse nature, such as data with variable dimensionality and distribution, with a mixture of longitudinal or cross sectional features. Given such a set of distributed sources of information pertaining to the same set of samples is often tackled by harmonizing the data heterogeneity by construction of so-called kernels followed by a numerical integration aiming at uncovering commonality [1]. Kernel transformation resembles the concept of characterizing some information entity from n individuals into a symmetric kernel , where can be anything from univariate random variables, multivariate, tables of counts, time-to-event, etc. [2, 3]. After kernel-harmonization , these matrices can be integrated by factorization such as INDSCAL [4] to reveal common structure between the M data layers, as well as uncovering the multivariate distribution of the n samples. While such factorization models are powerful in estimating patterns of correlation, they do not infer the structural topology related to the conditional dependency between layers of data. Aben et al 2018 developed a framework for inferring the conditional dependency between data layers based on the kernel-to-kernel (matrix) correlations, referred to as the RV coefficient [5, 6], and used this for structure revealing from seven data layers covering genotype to phenotype information from a cancer study [1].
In this study we propose a method for data integration in the framework of an undirected graphical model. Let be a representation of a graph, where V and E are the vertices (nodes) and edges between M sources of information. In particular for this study, the nodes represent individual data sources of varying nature in terms of complexity and underlying distribution, and the edges represent the partial correlation between two blocks of data (conditioning on all other data blocks). We propose a modified graphical lasso for estimation of the graph, with a combination of cross-validation and extended Bayes Information Criterion [7] for sparsity tuning. Naturally, the choice of functions fm for kernel transformation, inherently affects the upstream results, and depending on the nature of the data there is a vast source of methods for such transformations (see e.g. [8] for details). For this purpose, we utilize data from a clinical childhood cohort with information on recurrent periods of specific diagnoses and disease remission (amoung other sources), we propose kernel transformations for such data. Having established a graph between the layers of data, we propose a clustering method based on weighted consensus kernels and hierachical clustering for partitioning of the samples into different classes. The method is demonstrated on simulated as well as real data pertaining to symptom burden of common childhood diseases across the 2nd and 3rd year of life in relation to exposures as well as the development of asthma at age 6 years from the COPSAC2010 mother-child cohort [9, 8].
The paper is organized with a theoretical derivation of kernel transformation for various data types (in 2.1), graphical network modelling (in 2.2) followed by results on simulated (in 3.1) as well as real life data (in 3.2).
2 | THEORY
In this paper, an unspecified source of information is highlighted as , matrices are in capital bold (X), vectors in lower case bold (x), and scalars as italic lower case (x). Indices of matrices and vectors are highlighted as subscripts.
2.1 | Kernel representation
Kernels are useful methods for representing the population distribution as sample similarities. Let K be a symmetric n by n matrix indicating similarities between any pair of samples from an information source from a total of n samples. In the case of a matrix , then the individual scalar elements of K can be defined as: Kij = f(xi, xj) where f is a transformation function of the input vectors for sample i (xi) and j (xj). If a multivariate dataset of continous variables are assumed Gaussian, then the so called radial basis function kernelization (rbf-kernel) is a natural choice of transformation:
Where ||xi – xj|| is the Euclidean distance and σ2 is the bandwidth of the kernel. For more information on kernel consult [2].
2.1.1 | Data on survival-and recurrent episodes
Longitudinal information such as survival outcomes are characterised by a time-to-disease (t ≥ 0) and a disease status (zero if the disease has not occurred and one if it has occurred) within the time period of interest e.g. follow-up time from intervention or study period. However, in the case of data on recurrent episodes, a survival representation, merely reflects time to the first event, and thereby omits relevant information on remission and later occurring episodes. Be aware, that this representation is used for episodes of chronic conditions, such as asthma, where a diagnosis span from a few years to a lifetime. This is contrast to episodes of shorter duration such as common childhood symptoms, which is described in section 3.2 and further detailed in [8].
We propose a flexible kernel representation of longitudinal data defined as the sum of weighted kernels:
For classical survival data, where the disease status is binary, the similarities represented by the binary kernel: KT=t at time T = t is defined as KT=t(i, j) = 1 if patienti œ patientj (both have disease or both healthy) and KT=t(i, j) = 0 if not. This results in a sequence of binary kernels: KT=tstart, …, KT=tend
The weights of each time kernel (αT ≥ 0) were defined as the change in values between two kernels at two consecutive time points, defined as the element wise change between two consecutive kernels: ϵ is a small number giving all periods a weight above zero and the first element of the weight vector is set to ϵ (α1 = ϵ). Further, the weights were scaled to a simplex (∑ αt = 1).
Traditionally survival analyses have been used for outcomes that are not reversible such as death. Over the years survival analyses have been applied to other types of data such as time to chronical diseases, time to relapse eg. Some patients go into remission after getting diagnosed with a disease. In order to capture this we introduce a third state to the model, the state of remission defined as healthy after a period of disease. Now we have three states: healthy, ill and remission and we define the similarity between two patients as:
Where r is the rate of remission and KT=t ∈ {0, r, 1 – r, 1}. The weights and the final kernel were defined as in equation 3 and 2 respectively.
2.1.2 | Repeated measurements of continuous data
The kernelization of data that does not follow data on time to event or recurrent episodes (as in section 2.1.1), but when applied to longitudinal continuous information such as repeated measurements of height or weight, the framework in equation 2 and 3 can be extended, by using a relevant measure of similarity for each time point (T = (t1, t2, …, tend)) such as:
Where g(xi, xj) calculates a measure of similarity between xi, and xj: E.g. g(xi, xj) = exp(–(xi – xj)2/s) (with s being a relevant scaling constant).
The weights of the time kernels KT=t1, KT=t2, …, KT=tend are defined as in equation 3 and they are combined to Kas in equation 2.
2.2 | Graphical network based on partial correlations
2.2.1 | Correlation coefficients
Kernel to Kernel correlations is defined via the so-called RV coefficient [5, 6], which simply amounts to a scalar correlation coefficient between two vectorized matrices. That is: The M individual kernels were unfolded, from Km ~ (n × n) to vectors where the diagnoal elements are removed km ~ (n2 – n × 1). Each vectorized kernel was standardized to mean zero and standard deviation one. The correlation between two kernels can be expressed as in equation 6, which under centering and scaling ( and σki = σkj) reduces to an inner product. For detail on scaling of kernels and the definition of RV coefficient see Aben et al 2018 [1].
The correlations can then be combined in a correlation matrix S = ST ~ (M × M).
2.2.2 | Sparse Graphical LASSO
The matrix of unfolded standardized kernels is combined into a data matrix X = [k1, k2… kM] ~ (n2 – n × M), which is considered a manifestation of a multivariate distribution on what we term a similarity manifold. We use the GLASSO proposed by Friedmann et al 2009 [10] to estimate the partial correlations as the inverse covariance matrix on these standardized kernels, given the assumption that such a data representation can be subjected to a regularized maximum likelihood correlation estimation procedures as in [10].
The objective of the graphical lasso is.
Where S = ST ~ (p, p) is the observed covariance matrix, λ ≥ 0 is a tuning parameter and Θ = ΘT ~ (p, p) is the inverse covariance matrix (or precision matrix).
In principal two algorithmic approaches have been suggested for solving this objective, either based on sequential updating of the covariance matrix or on the inverse of the covariance matrix (Θ) [11]. For the present work we utilize a dual optimization strategy for updating columns/rows of directly on Θ referred to as DP-GLASSO by [11]. Beyond superior convergence stability compared to GLASSO, this algorithm focuses directly on Θ which allows for an easy extension where certain parts of Θ are considered fixed (see section 2.2.3). For details on the derivation of the optimality criterion and the algorithm see Appendix A.
2.2.3 | Introducing new variables to a fixed inner model
To be able to observe correlations between highly correlated biological data e.g. between different symptoms of illness (from here on the inner network) and not as highly correlated data such as environmental exposures or later disease without having to interpret to many ‘middle’ edges within the inner network, we propose a two-step approach. This method where the less influential biological correlations are removed offers simplicity in order to understand the complex biology. First, fit an inner network (Θinner) of the biological data (here between different symptoms), followed by a model for the outer network conditioning on the inner correlation network, where the graphical structure both within the outer sources (Θouter) of information as well as between the inner-and outer layers of data (Θinner2outer) is estimated (see Figure 1 for a schematic representation).
This is achieved by a modification of the algorithm in A, such that the initialization of (Θ) is exchanged with the estimates from the inner network. Otherwise, the algorithmic procedure follows 1, with the only modification of only cycling through the columns associated with the new information.
2.3 | From graphical network to observation clustering/similarities
For the graphical network between with M nodes (|V| = M) the edge-set (E) is set by Θ. This network partitions information (here the M kernels) into clusters of correlations, however, without any explicit segmentation of the individual samples. Here, we propose a post-hoc procedure to derive a partition over the samples obeying the correlation structure in .
First, for each of the M nodes, a network weight (ωm) is defined as the sum of the weights of its adjacent edges. That is:
Hence ωm reflect how strongly node m is connected with the rest of the network. These weights are stored in a vector ω = (ω1, ω2, …, ωM). Using these weights, a consensus kernel (KK) can be derived as:
By definition, such a consensus kernel will reflect the population structure for the most connected nodes in the network, while disconnected nodes (ω = 0) will have no influence. By using hierarchical clustering on this consensus kernel a partition of the n samples, reflecting the dominating correlation patterns in the data, will be derived.
2.3.1 | Model selection
Penalty sequence
Depending on the problem, the dynamic sequence of penalties λmin, …, λmax is not trivial to set, which essentially results in screening sequences of models for penalty values with no practical change in parameter estimates. For this reason the dynamic penalty sequence is set such that λmin results in a single off-diagonal element of Θ being 0, and similarly λmax returns only a single active (≠ 0) off-diagonal element.
Model section criterion
Cross-validation evaluating the non-regularized maximum-likelihood based on a trained precision matrix in comparison with an independent (left-out) observed covariance matrix is generally good for overall control of estimation flexibility. However, this procedure suffers from being unable to correctly identifying 0 elements in the precision-matrix. This is due to shrinking these parameters to a small non-zero neglect able entity, that have no practical influence on the cross validation fit, but leads to a false discovery. This phenomena is previously described by Weijie et al 2017 [12]. For this reason a criterion explicitly weighting the amount of non-zero elements has been proposed via the extended Bayes Information Criterion [7], which penalizes dense graphs with increasing γ, following: where n and M is the number of samples and number of graph nodes (kernels) respectively, |E|0 is the number of edges. I.e. the number of non-zero of diagonal elements of Θ, and γ is a tuning parameter putting emphasize on finding the zero elements of Θ.
3 | RESULTS
3.1 | Simulations
For this part we pertain to the individual data being low rank multivariate data, simulated as bilinear matrices. The block-wise dependency is constructed by a linear mapping of sample related component scores in one block to another block. On top of the systematic bilinear structure there is added either structured noise or white noise at varying levels.
For this setup we construct five blocks (X1, X2,… X5) of each n samples and p1, p2,… p5 variables following the structure shown in figure 2.
We define the scores for the five blocks as in equation 11 and 12:
The matrices Qi,j constitutes the mapping between block i and j, and Fi, constitutes the structured noise, where the levels is set by ki. The matrices Z1 ~ (n, c1), Z2 ~ (n, c2), Z3 ~ (n, c3), Q1,4 ~ (c1, c4), Q2,4 ~ (c2, c4), Q25 ~ (c2, c5), Q45 ~ (c4, c5) and F1 ~ (n, c1),…, F5 ~ (n, c5) all have entries drawn from a Gaussian distribution . The rank (ci) of the individual data blocks are drawn from a uniform distribution .
The five data sets are now given by: where Bi, ~ (ci, pi) is the right component loading matrix drawn from a Gaussian distribution . The number of variables (pi) for each data block is drawn from a uniform distribution . The noise level scalars ki, and li, are set to fulfill signal to noise levels for i) structured noise defined as: as well as for ii) unstructured white noise defined as:
The data sets X1, X2, …, X5 are kernelized such that Ki, ~ (n, n) is the kernelization of Xi ~ (n, pi) using the exponential of the negative euclidean distance between pairs of observations.
We investigate the effect of number of samples (n ∈ (20,50,100,200,500)), type of noise (White or Structured), signal to noise (s2n ∈ (1/8,1/4,1/2,1, 2,4, 8)) on the ability to recover the underlying structure represented as sensitivity (true positive rate), specificity (true negative rate) and the combination of the two as the area under the receiver operator characteristics curve (AUC). For all simulationsmodel selection is done using eBIC.
For each of the 700 design combinations 100 simulations are produced, and model selection is done by eBIC with γ ∈ (0,0.5,1).
Figure 3 shows the performance in terms of true positive- and true negative rate as well as the agglomerated area under the receiver operator curve (AUC). These quality metrics are shown for models selected using the eBIC with γ ∈ (0,0.5,1). Further, included is also the oracle model, which is the model that most optimally resembles the underlying edge set (figure 2) in terms of AUC. The behaviour of the true positive rate follows general statistical estimation paradigm with higher n and signal to noise yielding more accurate results. However, this is at the cost of the true negative rate which declines with these data characteristics. In general, the difference between the oracle model and the model selected using eBIC is that λoracle - λeBIC is biased positively for low signal-to-noise and low sample size, turning into being biased negatively with high signal-to-noise, reflecting that eBIC is too conservative when dealing with low-noise data, and too optimistic in the case of small-and noisy data. (see Appendix B for details)
Figure 4 investigates from which parts of the graph the uncertainty derives. Here, node X3 is the single isolated / non-connected vertex of the graph, and it seems relatively easy for the model to also correctly exclude edges to this node. The indirect connection between X1 and X2 and X5 (via X4) is causing the observed sub optimal true negative rate. For the identification of the true connections, generally similar results is observed. There seem to be no discrepancy in the estimation power depending on whether the edge is in a fully connected part of the network (between X2, X4 and X5), or towards the leaf vertex X1 from X4.
3.2 | COPSAC symptom burden
From a total of n = 403 children, with at least 50% of diary registration each month, the symptom burden was prospectively registered daily the first three years of life, where the following common childhood symptoms were registered: cough, breathlessness, wheeze, cold, pneumonia, inflammation of the throat, ear infection, fever, gastric infection and eczema. For details on these data including seasonal variation see [8]. For this analysis, the symptoms of the 2nd and 3rd year were investigated, and further were the symptoms breathlessness and wheeze combined in to one symptom due to their similar nature. This gives nine symptoms over two years resulting in 18 symptom kernels.
Additionally, 21 exposures obtained during or prior to the first year as well as the asthma-and eczema status up to age 6 years were incorporated. Of particular interest is to answer Q1: How early life exposures leads to later development of asthma? and here; Q2: How the early life symptom burden can be considered as a path towards later disease.
3.2.1 | Kernelization of symptom diaries
To reduce noise, the daily diary registrations were summarised (fractions of days with the symptom out of registered days) for each calendar month the 2nd and 3rd year of life (January to December, year 2 to year 3), resulting in 24 registrations per symptoms per child. The distance between any two children was constructed using the Euclidean distance between frequency vectors within each year. By explicitly factoring in calendar month in the construction of the pairwise differences, results in kernels that allow one to disentangle seasonal symptoms patterns. For instance, two children with an overall equal prevalence but with distinct seasonal profiles, will be rather dissimilar in the kernel. The distance measures were transformed to measures of similarities by multiplying by minus one. For details on the choices related to kernel transformation see Appendix C a well as [8].
3.2.2 | Graphical inner-model
The 18 kernels of symptoms were unfolded, the diagonals were removed and they were standardized with mean zero and standard deviation one. Model selection were performed using the eBIC with γ = 0 based on a 10 times repeated 5-fold cross-validation. Figure 5 shows the cross-validation results, where a penalty of λ = 0.076 was chosen resulting in 118 edges (out of 153 possible - 77%).
Table 1 shows the number of connections for each symptom-year combination. The node with the most connections (ten) is fever 2nd year, while eczema 3rd year was only connected to eczema-and cough 2nd year.
3.2.3 | Graphical outer-model
Data on exposure (Cat at birth, Dog at birth, Daycare type, Daycare age, Gestational age, Maternal asthma, Maternal BMI, Maternal rhinitis, Paternal asthma, Paternal rhinitis, Preeclampsia, Premature delivery and Mode of delivery) was kernelized and treated in the same way as the symptom kernels.
The asthma diagnoses were kernelized as proposed in section 2.1.1. 85 (21 %) of the children got a diagnosis of asthma before the age of six years and 61 (72%) went into remission before the age of six years.
123 (31%) of the children got a diagnosis of eczema before the age of 6 years. Of these, 77% (95 children) went into a state of remission at the age of 6 years, and had thereby lost this diagnosis. Due to the early debut of eczema peaking around age 1-2 years, the eczema kernel were constructed based on those having the diagnosis cross sectional at age 6 years. This in order to avoid correlations by construction as symptoms of eczema is also included in the diary registrations.
These information sources, in the form of kernels, were added to the symptom burden data, to establish a model between risk factors, early life symptoms and chronic diseases. The partial correlations between symptoms found in section 3.2.2 were fixed in this model.
The outer partial correlation network was found with the customized conditional GLASSO, as proposed in section 2.2.3. The outer model were selected at λ = 0.1 based on 10 times repeated 5-fold cross validation using eBIC with γ = 0 as criterion. The corresponding network consisted of additional 22 edges of which 5 were between symptoms and exposures, 8 between symptoms and disease and 9 within exposures (see Table 1 for summary details of the graph). The resulting model is shown in figure 6.
The airway symptoms: cough, breathlessness-wheeze and cold are clustered together in the center of the graph with connections to a cluster of eczema (year two, three and cross sectional at age six years) and a cluster of the rest of the symptoms (ear and throat infections, fever, pneumonia and gastrointestinal infections) on the other side. These symptoms are generally thought to be driven by bacterial or viral infections [13]. Maternal asthma is connected to cough and breathlessness-wheeze the second year of life. Pregnancy-and birth circumstances represented by gestational age and prematurity is connected to cough the third year, and interestingly the well known risk factor Cesarean section for a wide range of diseases [14], is only related to early life symptom burden through early delivery. Furthermore, maternal smoking during third trimester of pregnancy is connected to breathlessness-wheeze third year of life in line with previous findings [15]. These connections suggest an environmental effect on the respiratory health in early life, where as for the symptoms grossly driven by infections, the connections with risk factors are indirect.
3.2.4 | Clustering children based on the inner-model
The proposed graphical model estimates clusters of associated information. However, it can be fortunate to also group the children into certain groups based on their symptom burden pattern as a phenotypic characterization. Based on the inner network between nine symptoms across year 1 and 2 a consensus kernel is calculated using the sum edge-weights for each node (see table 1). Hierarchical clustering results in a dendrogram as shown in figure 7, which is cutted to partition the children into 3 clusters with a distribution of ((nred = 12, ngreen = 26, nblue = 365)), where the small red cluster corresponds to the most diseased children followed by the green cluster, leaving the large blue cluster being the overall most healthy.
4 | DISCUSSION
We propose a flexible framework to do explorative analyses of heterogeneous data encompassing longitudinal symptom observation, survival type disease data and various data on exposure. The method is based on kernel transformation for data harmonization followed by graphical network modelling by the graphical LASSO to uncover direct and in direct correlations.
Kernels are flexible and can capture differences shifted in time, on different scales and of distributional nature. However, how the original data is kernelized will have a large impact on which patterns that will be captured. In particular for this work we have chosen to factor in calendar month in the transformation of data, due to the fact that there is a large seasonal component in symptom burden where for instance eczema is more prevalent during the changing seasons (autumn and spring) and respiratory symptoms are more prevalent during the winter months (for details on symptom burden seasonal patterns see [8]). In detail, consider two children (A and B) with the same overall symptom burden prevalence, but where child A have episodes during winter, where as child B have the episodes evenly spread across the year. By using the proposed kernel transformation, such differences will be conserved, and possibly affect the GLASSO model.
A dual representation of the well-known GLASSO were used for graph estimation. This dual formulation works directly on the entity of interest; the precision matrix, as opposed to the standard GLASSO algorithmicly working on the covariance matrix. Beyond numerical stability, this allows to fix certain parts of the precision matrix via a very simple modification of the algorithm.
The framework is validated on both simulated and biological data. The simulations show that capturing true edges follows a normal statistical paradigm with better recovery for larger n and signal-to-noise, and further that unstructured white noise interfere less with the model estimation than structured noise. However, capturing non-existing edges behaves oppositely, with a higher false negative rate with larger n and s2. This phenomenon derives from indirect relations, and not from spurious edges towards disconnected nodes (edges towards X3 in Figure 4). Overall though, the error-rate in terms of AUC increases with n and s2n.
The integration of the symptom burden data during 2nd and 3rd year of childhood showed that the respiratory symptoms are highly correlated, and that the ear infection and the inflammation of the throat are connected which corresponds with the bacteria moving from the throat to the ears in line with the coherent airway epithelium in these communicating organs [13].
Introducing precursors of disease as well as asthma and eczema by age 6 years to the symptom network highlights that maternal asthma is connected to respiratory symptoms in the 2nd year of life, while maternal rhinitis only associates to respiratory symptoms indirectly through maternal asthma. Further, maternal asthma is suggested to be a risk factor for the development of asthma by age 6 years indirectly through an elevated level of respiratory symptoms in early life. Paternal asthma, rhinitis and eczema do not have any connections to any of the symptoms. This corresponds with previous research showing maternal asthma increases the risk of childhood asthma to a higher degree than paternal asthma [16,17].
Networks behave different based on their correlation structures. When only investigating the correlations between the symptoms, we observe a network with many well-connected nodes. When exposures/environmental factors are included in this part of the network has a weaker correlations between symptoms and exposures with fairly many (7) disconnected nodes. Furthermore, we see that few exposures (4) associate directly with the symptoms. However, the pertinent challenge here, is to distinguish between associations of symptoms which are timely matched and are having overlapping phenotype presentation with those of exposures or disease, which are disparate in time and a-priory of much less strength. By using the two step estimation procedure with an inner model on symptom data only, followed by an outer model on all data conditioning on the inner estimates allows for using a higher penalty on the inner model (λinner) to only include strong connections, while allowing the penalty of the outer model (λouter) to include everything that generalizes in terms of cross-validation.
5 | CONCLUSIONS
A combination of kernel transformation and graphical LASSO is proposed for integration of heterogeneous data layers. The method is demonstrated on both simulated-and real life data pertaining to early life symptom burden, precursors and development of disease at age 6 years.
6 | SOFTWARE
All algorithms have been coded in Matlab ®R2019 and is available through github: (https://github.com/mortenarendt/KerGLASSO), the code depends on the quadprog() function from the optimization toolbox(http://www.mathworks.com/products/optimization/). Plotting is conducted using R(ver 3.5.1) with ggplot2, tidygraph [18], ggraph [19] and igraph [20] for graph handling and plotting.
A | DERIVATION OF DP-GLASSO
Let be the objective from equation 7. And further, let W = Θ−1 be the population covariance- and precision matrix respectively, and S be the observed covariance matrix. All symmetric and of size (p, p).
The sub-gradient of f with respect to Θ can be formalized as:
Where Γ is the sign matrix of the individual elements of Θ. Let: be a representation of Θ where Θ11 ~ (p - 1, p – 1), Θ12 ~ (p – 1,1) and Θ22 ~ (1,1) (with a similar representation for S and W). Using calculus for matrix inversions reveals the block wise stationary point of equation 15 to be: where β = Θ12/Θ22, and γ12 is a sign vector of Θ12. This linear system of equations 17 is the score (or normal-) equations associated with the LASSO with the following quadratic objective:
The dual representation of this problem (equation 18) is derived by defining a new vector: x = β and extending the objective to: where z is a vector of (non negative) Lagrangian multipliers.
The stationary points of objective 19 with respect to β, assuming W11 ≻ 0, is: and with respect to x
Combining (19), (20) and (21) leads to a dual quadratic minimization problem with a box constraint on the Lagrangian multipliers:
The primal estimates follows from equation 23
Without loss of generality, the entire solution can be achieved by rearranging the columns and rows of S, W and Θ to cycle through all columns of the covariance-and precision matrix with sequential conditional updates. See algorithm 1 for the entire procedure.
B | MODEL SELECTION - SIMULATION STUDY
The behaviour in the classification accuracy in Figure 3 and 4, especially when contrasting model selection with the oracle is due to a biased hyper parameter selection (λ) depending on the sample size and level of noise.
Figure 8 show this behavior when comparing the hyper parameter for the oracle model with that selected by eBIC.
C | KERNEL TRANSFORMATION OF SYMPTOM BURDEN DATA
The raw diary symptom burden data represents binary (whether or not) the child had symptoms. This, for each day across a total of nine symptoms.
In order to investigate the effect of 1) time resolution, 2) distance metric and 3) distance to similarity function, a design of experiment were established.
The factors were:
Biological
Age of child (#2) second year or third year
Symptom (#9) breathlessness or wheeze, cold, cough, ear infection, eczema, fever, gastric, inflammation and pneumonia
Technical
Time-resolution (#4) The symptom burden is agglomerated over calendar-– day (1,… 365)
– week (1, …, 52)
– month (1, …, 12)
– year (total prevalence)
into the number of days with symptoms within each bin.
Distance metric (#4) Euclidean, City block or Manhattan, Hamming and Hamming on presence/absence data
Distance to similarity transformation (#2) as , where s and d are measures of similarity and distance respectively, and is the average distance over the entire (n, n) kernel of distances: .
This result in a 32 different options for each symptom and year combination, leading to a total of 576 kernel realizations. These were compressed by tucker3. Figure 9 shows the first two component scores. Within the upper 8 panels (and the lower 8 panels) only technical choices influence the distribution. In principal, The largest discrepancy between scores were due biological differences (symptoms and child age) while the choice of distance metric (reflected by color) and the distance to similarity function only contribute to minor differences. Agglomeration over day, week and month reveals similar patterns, while truncation of the information into a single yearly prevalence makes the results less coherent in terms clustering of symptoms.
Acknowledgements
We thank the children and families of the COPSAC cohort study for all their support and commitment. We acknowledge and appreciate the unique efforts of the COPSAC research team.
Footnotes
Funding information, COPSAC is funded by private and public research funds all listed on www.copsac.com. The Lundbeck Foundation; The Danish Ministry of Health; Danish Council for Strategic Research and The Capital Region Research Foundation have provided core support for COPSAC