Abstract
Information criteria (ICs) based on penalized likelihood, such as Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), and sample-size-adjusted versions of them, are widely used for model selection in health and biological research. However, different criteria sometimes support different models, leading to debate about which is the most trustworthy. Some researchers and fields of study habitually use one or the other, often without a clearly stated justification, and may not realize that the criteria can disagree. Others try to compare models using multiple criteria but encounter ambiguity when different criteria lead to substantively different answers, prompting questions about which criterion is best. In this paper we present an alternative perspective on these criteria that can help in interpreting their practical implications. Specifically, in some cases the comparison of two models using ICs can be viewed as equivalent to a likelihood ratio test, with the different criteria representing different alpha levels and BIC being a more conservative test than AIC. This perspective may lead to insights about how to interpret the ICs in more complex situations. For example, AIC or BIC could be preferable, depending on the relative importance one assigns to sensitivity versus specificity. Understanding the differences and similarities among the ICs can make it easier to compare their results and to use them to make informed decisions.
1 Introduction
Many model selection techniques have been proposed for different settings (for reviews see Miller, 2002; Pitt and Myung, 2002; Zucchini, 2000; Johnson and Omland, 2004). Among other considerations, researchers must balance sensitivity (suggesting enough parameters to accurately model the patterns, processes, or relationships in the data) with specificity (not suggesting nonexistent patterns, processes, or relationships). Several of the simplest and most popular model selection criteria can be discussed in a unified way as log-likelihood functions with simple penalties. These include Akaike’s Information Criterion (Akaike, 1973, AIC), the Bayesian Information Criterion (Schwarz, 1978, BIC), the sample-size-adjusted AIC or AICc of Hurvich and Tsai (1989), the “consistent AIC” (CAIC) of Bozdogan (1987), and the sample-size-adjusted BIC (ABIC) of Sclove (1987) (see Table 1). Each of these ICs consists of a goodness-of-fit term plus a penalty to reduce the risk of overfitting, and each provides a standardized way to balance sensitivity and specificity.
Applying an IC involves choosing the model with the best penalized log-likelihood: that is, the highest value of ℓ − (An/2)p, where ℓ is the log-likelihood, An is a constant or a function of the sample size n, and p is the number of parameters in the model. For historical reasons, instead of finding the highest value of ℓ minus a penalty, this is often expressed as finding the lowest value of −2ℓ plus a penalty,

−2ℓ + An p,     (1)

and we follow that convention here. Expression (1) is what Atkinson (1980) called the generalized information criterion; in this paper we simply refer to (1) as an IC. Expression (1) is sometimes replaced in practice by the practically equivalent G2 + An p, where G2 is the deviance, defined as twice the difference in log-likelihood between the current model and the saturated model, that is, the model with the most parameters which is still identifiable (e.g., Collins and Lanza, 2010).
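As a small illustrative sketch (the log-likelihood values, sample size, and helper name below are hypothetical, not taken from the paper), Expression (1) can be applied directly to compare two fitted models:

```python
import math

def ic(loglik, n_params, penalty_weight):
    """Generalized information criterion of Expression (1): -2*loglik + A_n * p."""
    return -2.0 * loglik + penalty_weight * n_params

# Hypothetical fitted log-likelihoods for a smaller and a larger candidate model.
loglik_small, p_small = -1234.5, 5
loglik_large, p_large = -1230.1, 8

n = 200
aic_small = ic(loglik_small, p_small, 2)            # AIC: A_n = 2
aic_large = ic(loglik_large, p_large, 2)
bic_small = ic(loglik_small, p_small, math.log(n))  # BIC: A_n = ln(n)
bic_large = ic(loglik_large, p_large, math.log(n))

# With these numbers AIC prefers the larger model, while BIC's heavier
# penalty makes it prefer the smaller one.
```

The same pair of fitted likelihoods can thus lead different ICs to different choices, purely because of the different An.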
In practice, Expression (1) cannot be used directly without first choosing An. Specific choices of An make (1) equivalent to AIC, BIC, ABIC or CAIC. Thus, although motivated by different theories and goals, algebraically these criteria are only different values of An in (1), corresponding to different relative degrees of emphasis on parsimony, that is, on the number of free parameters in the selected model (Claeskens and Hjort, 2008; Lin and Dayton, 1997; Vrieze, 2012). Because the different ICs often do not agree, the question often arises as to which is best to use in practice. In this paper we examine this question by focusing on the similarities and differences among AIC, BIC, CAIC, and ABIC, especially in view of an analogy between their different complexity penalty weights An and the α levels of hypothesis tests. We especially focus on AIC and BIC, which have been extensively studied theoretically (Pötscher, 1991; Atkinson, 1980; Kuha, 2004; Zhang, 1993; Ding et al., 2018; Shao, 1997; Kadane and Lazar, 2004; Vrieze, 2012), and which are often used not only in their own right but as tuning criteria to improve the performance of more complex model selection techniques (e.g., in high-dimensional regression variable selection Wu and Ma, 2015; Narisetty and He, 2014; Wang et al., 2007). The AIC and BIC are widely used in many important applications in bioinformatics, including in molecular phylogenetics (Posada, 2008; Darriba et al., 2012; Jayaswal et al., 2014; Kalyaanamoorthy et al., 2017; Lefort et al., 2017).
In the following section we review the motivation and theoretical properties of these ICs. We then discuss their application to a common application of model selection in medical, health and social scientific applications: that of choosing the number of classes in a latent class analysis (e.g., Collins and Lanza, 2010). Finally, we propose practical recommendations for using ICs to extract valuable insights from data while acknowledging their differing emphases.
Common Penalized-Likelihood Information Criteria
In this section we review some commonly used ICs. Their formulas, as well as some of their properties which we describe later in the paper, are summarized for convenience in Table 1.
Akaike’s Information Criterion (AIC)
First, the AIC (Akaike, 1973) sets An = 2 in (1). It estimates the relative Kullback-Leibler (KL) divergence (a nonparametric measure of difference between distributions) of the likelihood function specified by a fitted candidate model, from the likelihood function governing the unknown true process that generated the data. The fitted model closest to the truth in the KL sense would not necessarily be the model that best fits the observed sample, since the observed sample can often be fit arbitrarily well by making the model more and more complex. Rather, the best KL model is the model that most accurately describes the population distribution or the process that produced the data. Such a model would not necessarily have the lowest error in fitting the data already observed (also known as the training sample) but would be expected to have the lowest error in predicting future data taken from the same population or process (also known as the test sample). This is an example of a bias-variance tradeoff (see, e.g., Hastie et al., 2001).
Technically, the KL divergence can be written as Et(ℓt(y)) − Et(ℓ(y)), where Et is the expected value under the unknown true distribution function, ℓ is the log-likelihood of the data under the fitted model being considered, and ℓt is the log-likelihood of the data under the unknown true distribution. This is intuitively understood as the expected difference in log-likelihood between the true and the fitted distribution. Et(ℓt(y)) will be the same for all models being considered, so the KL divergence is minimized by choosing the model with highest Et(ℓ(y)). The ℓ(y) from the fitted model is a biased measure of Et(ℓ(y)), especially if p is large, because a model with many parameters can generally be fine-tuned to appear to fit a small dataset well, even if its structure is such that it cannot generalize to describe the process that generated the data. Intuitively, this means that if there are many parameters, the fit of the model to the originally obtained data (training sample) will seem good regardless of whether the model is correct or not, simply because the model is so flexible. In other words, once a particular dataset is used to estimate the parameters of a model, the fit of the model on that sample is no longer an independent evaluation of the quality of the model. The most straightforward way to address this fit inflation would be by cross-validation on a new sample, but AIC and similar criteria attempt to achieve something similar when there is no other sample (see Shao, 1993, 1997).
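As a toy numeric illustration of this decomposition (the distributions below are hypothetical, and the log-likelihood is reduced to the log-probability of a single categorical observation), the KL divergence is exactly the difference of the two expected log-likelihoods described above:

```python
import math

# A hypothetical true categorical distribution t and a model distribution m.
t = [0.5, 0.3, 0.2]
m = [0.4, 0.4, 0.2]

# KL(t || m) written as E_t[log t] - E_t[log m], matching the text's
# decomposition E_t(l_t(y)) - E_t(l(y)) for one observation.
e_log_t = sum(ti * math.log(ti) for ti in t)
e_log_m = sum(ti * math.log(mi) for ti, mi in zip(t, m))
kl = e_log_t - e_log_m  # nonnegative; zero only when m matches t
```

Because e_log_t is the same no matter which model m is considered, ranking models by KL divergence is the same as ranking them by e_log_m, which is what AIC attempts to estimate without bias.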
Akaike (1973) showed that an approximately unbiased estimate of Et(ℓ(y)) would be a constant plus ℓ − tr(Ĵ⁻¹K̂) (where Ĵ and K̂ are two p × p matrices, described below, and tr(·) is the trace, or sum of diagonal elements). Ĵ⁻¹ is an estimator for the covariance matrix of the parameters, based on the matrix of second derivatives of ℓ in each of the parameters, and K̂ is an estimator based on the cross-products of the first derivatives (see Claeskens and Hjort, 2008, pp. 26-7). Akaike showed that Ĵ and K̂ are asymptotically equal for the true model, so that the trace tr(Ĵ⁻¹K̂) becomes approximately p, the number of parameters in the model. For models that are far from the truth, the approximation may not be as good. However, poor models presumably have poor values of ℓ, so the precise size of the penalty is less important (Burnham and Anderson, 2002). The resulting expression ℓ − p suggests using An = 2 in (1) and concluding that fitted models with low values of (1) will be likely to provide a likelihood function closer to the truth. AIC is discussed further by Burnham and Anderson (2002, 2004) and Kuha (2004).
Criteria Related to AIC. When n is small or p is large, the crucial AIC approximation is too optimistic and the resulting penalty for model complexity is too weak (Tibshirani and Knight, 1999; Hastie et al., 2001). In the context of regression and time series models, several researchers (e.g., Sugiura, 1978; Hurvich and Tsai, 1989; Burnham and Anderson, 2004) have suggested using a corrected version, AICc, which applies a slightly heavier penalty that depends on p and n; it gives results very close to those of AIC when n is large relative to p. For small n, Hurvich and Tsai (1989) showed that AICc sometimes performs better than AIC. Theoretical discussions of model selection often focus on the advantages and disadvantages of AIC versus BIC, and AICc gets little attention because it is asymptotically equivalent to AIC. However, this equivalence is subject to the assumption that p is fixed and n becomes very large. Because in many situations p is comparable to n or larger, AICc may deserve more attention in future work.
Some other selection criteria are asymptotically equivalent to AIC, at least for linear regression. These include Mallows’ Cp (see George, 2000), leave-one-out cross-validation (Shao, 1997; Stone, 1977), and the generalized cross-validation (GCV) statistic (see Golub et al., 1979; Hastie et al., 2001). Leave-one-out cross-validation involves fitting the candidate model on many subsamples of the data, each excluding one subject (i.e., participant or specimen), and observing the average squared error in predicting the extra response. Each approach is intended to correct a fit estimate for the artificial inflation in observed performance caused by fitting a model and evaluating it with the same data, and to find a good balance between bias caused by too restrictive a model and excessive variance caused by too rich a model (Hastie et al., 2001). Model parsimony is not a motivating goal in its own right, but is a means to reduce unnecessary sampling error caused by having to estimate too many parameters relative to n. Thus, especially for large n, sensitivity is likely to be treated as more important than specificity. If parsimonious interpretation is of interest in its own right, another criterion such as BIC, described in the next section, might be more appropriate.
The Deviance Information Criterion used in Bayesian analyses (Spiegelhalter et al., 2002; Gibson et al., 2018) is beyond the scope of this paper because it cannot be expressed as a value of An in Expression (1). However, it has some relationship to AIC and has an analogous purpose (Claeskens and Hjort, 2008; Ando, 2013).
Other ICs are named after AIC but do not derive from the same theoretical framework, except that they share the form (1). For example, some researchers (Andrews and Currim, 2003; Fonseca and Cardoso, 2007; Yang and Yang, 2007) have suggested using An = 3 in expression (1) instead of 2. The use of An = 3 is sometimes called “AIC3.” There is no statistical theory to motivate AIC3, such as minimizing KL divergence or any other theoretical construct, but on an ad hoc basis it has fairly good simulation performance in some settings, being stricter than AIC but not as strict as BIC. Also, the CAIC, the “corrected” or “consistent” AIC proposed by Bozdogan (1987), uses An = ln(n) + 1. (It should not be confused with the AICc discussed above.) This penalty tends to result in a more parsimonious model and more underfitting than AIC or BIC, and it is therefore not very similar to AIC. This value of An was chosen somewhat arbitrarily as an example of an An that would provide model selection consistency, a property described below in the section for BIC. However, any An proportional to ln(n) provides model selection consistency, so CAIC has no real advantage over the better-known and better-studied BIC (see below), which also has this property.
Schwarz’s Bayesian Information Criterion (BIC)
In Bayesian model selection, a prior probability is set for each model Mi, and prior distributions (often uninformative priors for simplicity) are also set for the nonzero coefficients in each model. If we assume that one and only one model, along with its associated priors, is true, we can use Bayes’ theorem to find the posterior probability of each model given the data. Let Pr(Mi) be the prior probability set by the researcher, and let Pr(y|Mi) be the probability density of the data given Mi, calculated as the expected value of the likelihood function of y given the model and parameters, over the prior distribution of the parameters. According to Bayes’ theorem, the posterior probability Pr(Mi|y) of a model is proportional to Pr(Mi) Pr(y|Mi). The degree to which the data support Mi over another model Mj is given by the ratio of the posterior odds to the prior odds:

(Pr(Mi|y) / Pr(Mj|y)) / (Pr(Mi) / Pr(Mj)).
If we assume equal prior probabilities for each model, this simplifies to the “Bayes factor” (see Kass and Raftery, 1995),

Bij = Pr(y|Mi) / Pr(y|Mj),

so that the model with the higher Bayes factor also has the higher posterior probability. Schwarz (1978) and Kass and Wasserman (1995) showed that, for many kinds of models, Bij can be roughly approximated by exp((BICj − BICi)/2), where BIC equals Expression (1) with An = ln(n), especially if a certain “unit information” prior is used for the coefficients. The use of Bayes factors has been argued to be more interpretable than that of significance tests in some practical settings (Raftery, 1996; Goodman, 2008; Beard et al., 2016) although with some caveats (see Gigerenzer and Marewski, 2015; Murtaugh, 2014). Thus the model with the highest posterior probability is likely to be the one with lowest BIC. BIC is described further in Raftery (1995) and Wasserman (2000), but critiqued by Gelman and Rubin (1995) and Weakliem (1999), who find it to be an oversimplification of Bayesian methods. BIC can also be called the Schwarz criterion.
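Under the Schwarz approximation, a BIC difference translates directly into an approximate Bayes factor. The sketch below (with hypothetical BIC values and a helper name of our own) assumes only Bij ≈ exp((BICj − BICi)/2):

```python
import math

def approx_bayes_factor(bic_i, bic_j):
    """Rough Schwarz approximation to the Bayes factor B_ij from two BIC values."""
    return math.exp((bic_j - bic_i) / 2.0)

# Hypothetical BIC values: model i is better (lower) by 6 on the BIC scale.
bic_i, bic_j = 3101.4, 3107.4
bf = approx_bayes_factor(bic_i, bic_j)  # exp(3), roughly 20: data favor model i
```

The sign convention matters: the model with the lower BIC gets the Bayes factor above 1, consistent with choosing the model with the lowest BIC.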
BIC is sometimes preferred over AIC because BIC is “consistent” (e.g., Nylund et al., 2007). Assuming that a fixed number of models are available and that one of them is the true model, a consistent selector is one that selects the true model with probability approaching 100% as n → ∞ (see Rao and Wu, 1989; Zhang, 1993; Shao, 1997; Yang, 2005; Claeskens and Hjort, 2008). The existence of a true model here is not as unrealistically dogmatic as it sounds (Burnham and Anderson, 2004; Kuha, 2004). Rather, the true model can be defined as the smallest adequate model, that is, the single model that minimizes KL divergence, or the smallest such model if there is more than one (Claeskens and Hjort, 2008). There may be more than one such model because if a given model has a given KL divergence from the truth, any more general model containing it will have no greater distance from the truth. This is because there is some set of parameters for which the larger model becomes the model nested within it. However, the theoretical properties of BIC are better in situations in which a model with a finite number of parameters can be treated as “true” (Shao, 1997).
AIC is not consistent because it has a non-vanishing chance of choosing an unnecessarily complex model as n becomes large. The unnecessarily complex model would still closely approximate the true distribution but would use more parameters than necessary to do so. However, selection consistency involves some performance tradeoffs when n is modest, specifically, an elevated risk of poor performance caused by underfitting (see Shibata, 1986; Shao, 1997; Pötscher, 1991; Vrieze, 2012). In general, the strengths of AIC and BIC cannot be combined by any single choice of An (Leeb, 2008; Yang, 2005). However, in some cases it is possible to construct a more complicated model selection approach that uses aspects of both (see Ding et al., 2018).
Criteria Related to BIC. Sclove (1987) suggested a sample-size-adjusted BIC, variously abbreviated as ABIC, SABIC, or BIC∗, based on the work of Rissanen (1978) and Boekee and Buss (1981). It uses An = ln((n + 2)/24) instead of An = ln(n). This penalty will be much lighter than that of BIC, and may be lighter or heavier than that of AIC, depending on n. The unusual expression for An comes from Rissanen’s work on model selection for autoregressive time series models from a minimum description length perspective (see Stine, 2004). It is not clear whether or not the same adjustment is still theoretically appropriate in different contexts, but in practice it is sometimes used in latent class modeling and seems to work fairly well (see Nylund et al., 2007; Tein et al., 2013).
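To see how these penalty weights compare numerically, the following sketch (a helper of our own, using the An definitions summarized in Table 1) evaluates An for several criteria across sample sizes. Note that ABIC’s weight crosses AIC’s fixed An = 2 near n ≈ 175, since ln((n + 2)/24) = 2 when n = 24e² − 2 ≈ 175.4:

```python
import math

def penalty_weight(n, criterion):
    """A_n in Expression (1) for several ICs (definitions as in Table 1)."""
    if criterion == "AIC":
        return 2.0
    if criterion == "BIC":
        return math.log(n)
    if criterion == "ABIC":
        return math.log((n + 2) / 24.0)
    if criterion == "CAIC":
        return math.log(n) + 1.0
    raise ValueError("unknown criterion: " + criterion)

# Tabulate A_n across sample sizes: ABIC is always lighter than BIC,
# and is lighter than AIC for small n but heavier for large n.
table = {n: {c: round(penalty_weight(n, c), 2)
             for c in ("AIC", "ABIC", "BIC", "CAIC")}
         for n in (50, 175, 1000)}
```

This makes concrete the statement above: ABIC’s penalty is much lighter than BIC’s at any n, and may be lighter or heavier than AIC’s depending on n.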
2 Information Criteria in Simple Cases
The above shows that AIC and BIC differ in theoretical basis. They also often disagree in practice, generally with AIC indicating models with more parameters and BIC models with fewer. This has led many researchers to question whether and when a particular value of the “magic number” An (Bozdogan, 1987) can be chosen as most appropriate. Two special cases – comparing equally sized models and comparing nested models – each provide some insight into this question.
First, when comparing different models of the same size (i.e., number of parameters to be estimated), all ICs of the form (1) will always agree on which model is best. For example, in regression variable subset selection, suppose two models each use five covariates. In this case, any IC will select whichever model has the highest likelihood (the best fit to the observed sample) after estimating the parameters. This is because only the first term in Expression (1) will differ across the candidate models, so An does not matter. Thus, although the ICs differ in theoretical framework, they only disagree when they make different tradeoffs between fit and model size.
Second, for comparing a nested pair of models, different ICs act like different α levels on a likelihood ratio test (LRT). For comparing models of different sizes, when one model is a restricted case of the other, the larger model will typically offer better fit to the observed data at the cost of needing to estimate more parameters. The ICs will differ only in how they make this bias-variance tradeoff (Lin and Dayton, 1997; Sclove, 1987). Thus, an IC will act like a hypothesis test with a particular α level (Söderström, 1977; Teräsvirta and Mellin, 1986; Pötscher, 1991; Claeskens and Hjort, 2008; Foster and George, 1994; Stoica et al., 2004; van der Hoeven, 2005; Vrieze, 2012; Murtaugh, 2014).
Suppose a researcher will choose whichever of M0 and M1 has the better (lower) value of an IC of the form (1). This means that M1 will be chosen if and only if −2ℓ1 + An p1 < −2ℓ0 + An p0, where ℓ1 and ℓ0 are the fitted maximized log-likelihoods for each model. Although the comparison of models is interpreted differently in the theoretical frameworks used to justify AIC and BIC (Aho et al., 2014; Kuha, 2004), algebraically this comparison is the same as a LRT (Söderström, 1977; Stoica et al., 2004; Pötscher, 1991). That is, M0 is rejected if and only if

−2(ℓ0 − ℓ1) > An(p1 − p0).
The left-hand side is the LRT test statistic (since a logarithm of a ratio of quantities is the difference in the logarithms of the quantities). Thus, in the case of nested models an IC comparison is mathematically an LRT with a different interpretation. The α level is specified indirectly through the critical value: it is the proportion of the null-hypothesis distribution of the LRT statistic that exceeds An(p1 − p0).
For many kinds of models, the null-hypothesis distribution of −2(ℓ0 − ℓ1) is asymptotically χ2 with degrees of freedom (df) equal to p1 − p0. Consulting a χ2 table and assuming p1 − p0 = 1, AIC (An = 2) becomes equivalent to a LRT at an α level of about .16 (i.e., the probability of a χ2 deviate with 1 df being greater than 2). In the same situation, BIC (with An = ln(n)) has an α level that depends on n. If n = 10 then An = ln(n) = 2.30 so α = .13. If n = 100 then An = 4.60 so α = .032. If n = 1000 then An = 6.91 so α = .0086, and so on. Thus when p1 − p0 = 1, significance testing at the customary level of α = .05 is often an intermediate choice between AIC and BIC, corresponding to An = 1.96² ≈ 3.84. However, as p1 − p0 becomes larger, all ICs become more conservative, in order to avoid adding many unnecessary parameters unless they are needed. Table 2 shows different effective α values for two values of p1 − p0, obtained using the R (R Development Core Team, 2010) code 1-pchisq(q=An*df,df=df,lower.tail=TRUE), where An is the An value and df is p1 − p0. AICc is not shown in the table because its penalty weight depends both on p0 and on p1 in a slightly more complicated way, but it behaves similarly to AIC for large n and modest p0.
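These effective α values can also be reproduced without a χ2 table. The sketch below, in Python rather than R and with a function name of our own, uses the closed-form χ2 upper tails for 1 and 2 df, namely P(χ²₁ > x) = erfc(√(x/2)) and P(χ²₂ > x) = exp(−x/2):

```python
import math

def alpha_for_penalty(An, df):
    """Effective alpha level P(chi-square_df > An * df), for df = 1 or 2.

    Uses closed forms of the chi-square upper tail so that no external
    statistics library is needed.
    """
    x = An * df
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("closed form implemented only for df = 1 or 2")

alpha_aic = alpha_for_penalty(2.0, 1)                # about .157 for AIC, df = 1
alpha_bic_100 = alpha_for_penalty(math.log(100), 1)  # about .032 for BIC, n = 100
```

For general df one would use a chi-square survival function (e.g., R’s pchisq with lower.tail=FALSE), but the two closed forms above already cover the cases discussed in this section.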
The property of selection consistency can be intuitively understood from this perspective. For AIC, as for hypothesis tests, the power of a test increases with n. Thus, rejecting any given false null hypothesis is practically guaranteed for sufficiently large n even if the effect size is tiny. However, the Type I error rate is constant and never approaches zero. On the other hand, BIC becomes a more stringent test (has a decreasing Type I error rate) as n increases. The power increases more slowly (i.e., the Type II error rate decreases more slowly) than for AIC or for fixed-α hypothesis tests because the test is becoming more stringent, but now the Type I error rate is also decreasing. Thus, nonzero but practically negligible departures from a model are less likely to lead to rejecting the model for BIC than for AIC (Raftery, 1995). Fortunately, even for BIC, the decrease in α as n increases is slow; thus power still increases as n increases, although more slowly than it would for AIC. Thus, for BIC, both the Type I and Type II error rates decline slowly as n increases, while for AIC (and for classical significance testing) the Type II error rate declines more quickly but the Type I error rate does not decline at all. This is intuitively why a criterion with constant An cannot be asymptotically consistent even though it may be more powerful for a given n (see Claeskens and Hjort, 2008; Yang, 2005; Kieseppä, 2003).
Nylund et al. (2007) seem to interpret the lack of selection consistency as a flaw in AIC (Nylund et al., 2007, p. 556). However, the real situation is more complicated; AIC is not a defective BIC, nor vice versa (see Kieseppä, 2003; Shibata, 1981, 1986; Pötscher, 1991; Vrieze, 2012). Likewise, the other ICs mentioned here are neither right nor wrong, but are simply choices (perhaps thoughtful and perhaps arbitrary, but still technically valid choices). Since choosing An for a model comparison is closely related to choosing an α level for a significance test, the universally “best” IC cannot be defined any more than the “best” α; there will always be a tradeoff. Thus, debates about whether AIC is generally superior to BIC or vice versa, will be fruitless.
For non-nested models of different sizes, neither of the above simple cases holds; furthermore, these more complex comparisons are often those in which ICs are most important, because a LRT cannot be performed. However, it remains the case that An has a powerful effect on the tradeoff between the likelihood term and the penalty on the number of parameters, and hence between good fit to the observed data and parsimony.
Almost by definition, there is no universal best way to decide how to make a tradeoff. Sometimes the relative importance of sensitivity or specificity depends on the decisions to be made based on model predictions. For example, in theoretical research Type I error is considered to be more serious because it is a false statement rather than simply a failure to reject a null hypothesis. However, in some environmental or epidemiological decision-making contexts, the decision corresponding to Type II error might be much more harmful to public health than that which would correspond to a Type I error, requiring increased attention to uncertainty about the adequacy of the null hypothesis (Peterman, 1990; Andorno, 2004). In this way, one could characterize the comparison of models by analogy to a medical diagnostic test (see, e.g., Altman and Bland, 1994), replacing “Type I error” with “false positive” and “Type II error” with “false negative.” AIC and BIC use the same data but apply different cutoffs for whether to “diagnose” the smaller model as being inadequate. AIC is more sensitive (lower false-negative rate), but BIC is more specific (lower false-positive rate). The utility of each cutoff is determined by the consequences of a false positive or false negative and by one’s beliefs about the base rates of positives and negatives. Thus, AIC and BIC could be seen as representing different sets of prior beliefs in a Bayesian sense (see Burnham and Anderson, 2004; Kadane and Lazar, 2004) or, at least, different judgments about the importance of parsimony. For example, although AIC has favorable theoretical properties for choosing the number of parameters needed to approximate the shape of a nonparametric growth curve in general (Shao, 1997), in a particular application with such data Dziak et al. (2015) argued that BIC would give more interpretable results.
They argued this because the curves in that context were believed likely to have a smooth and simple shape, as they represented averaged trajectories of an intensively measured variable on individuals with diverse individual experiences and because deviations from the trajectory could be modeled using other aspects of the model.
As a caveat, if a researcher wishes to consider the practical consequences of decisions based on model choices directly, it may be much more satisfactory to explicitly use Bayesian decision theory rather than simply choosing a value of An in Expression (1) (see, e.g., Claxton et al., 2000; Gelman and Rubin, 1995). Also, in practice it is often difficult to determine the α value that a particular criterion really represents, for two reasons. First, even for regular situations in which a LRT is known to work well, the χ2 distribution for the test statistic is asymptotic and will not apply well to small n. Second, in some situations the rationale for using an IC is, ironically, the failure of the assumptions needed for a LRT. That is, the test emulated by the IC will itself not be valid at its nominal α level anyway. Therefore, although the comparison of An to an α level is helpful for getting a sense of the similarities and differences among the ICs, simulations are required to describe exactly how they behave. In the section below we review simulation results from a common application of ICs, namely the selection of the number of latent classes (empirically derived clusters) in a dataset.
3 The Special Case of Latent Class Analysis
A common use of ICs is in selecting the number of components for a latent class analysis (LCA). LCA is a kind of finite mixture model (essentially, a model-based cluster analysis; McLachlan and Peel, 2000; Lazarsfeld and Henry, 1968; Collins and Lanza, 2010). LCA assumes that the population is a “mixture” of multiple classes of a categorical latent variable. Each class has different parameters that define the distributions of observed items, and the goal is to account for the relationships among items by defining classes appropriately. In this section we consider LCA as described in Collins and Lanza (2010), although ICs are also used for more complex mixture models and clustering applications (e.g., Wang et al., 2012; Ye et al., 2015). LCA is very similar to cluster analysis, but is based on maximizing an explicitly stated likelihood function rather than on a heuristic computational algorithm like k-means. Also, some authors use the term LCA only when the observed variables are also categorical, and use the term “latent profile analysis” for numerical observed variables, but we ignore this distinction here. LCA is also closely related to latent transition analysis (LTA) models (see Collins and Lanza, 2010), an application of hidden Markov models (see, e.g., Eddy, 2004) that allows changes in latent class membership, conceptualized as transitions in an unobserved Markov chain. LCA models are sometimes used in combination with other models, such as in predicting class membership from genotypic or demographic variables, or predicting medical or behavioral phenotypes from class membership (e.g., Lubke et al., 2012; Dziak et al., 2016; Bray et al., 2018). To fit an LCA model or any of its cousins, an algorithm such as EM (Dempster et al., 1977; McLachlan and Peel, 2000; Gupta and Chen, 2010) is often used to alternately estimate class-specific parameters and predict subjects’ class membership given those parameters.
The user must specify the number of classes in a model, but the true number of classes is generally unknown (Nylund et al., 2007; Tein et al., 2013). Sometimes one might have a strong theoretical reason to specify the number of classes, but often this must be done using data-driven model selection.
A naïve approach would be to use likelihood ratio (LR) or deviance (G2) tests sequentially to choose the number of classes and to conclude that the k-class model is large enough if and only if the (k + 1)-class model does not fit the data significantly better. The selected number of classes would be the smallest k that is not rejected when compared to the (k + 1)-class model. However, the assumptions for the supposed asymptotic χ2 distribution in a LRT are not met in the setting of LCA, so the p-values from those tests are not valid (see Lin and Dayton, 1997; McLachlan and Peel, 2000). The reason is that the k-class model (H0) is not nested in a regular way within the (k + 1)-class model (H1), since a k-class model is obtained from a (k + 1)-class model either by constraining any one of the class sizes to a boundary value of zero or by setting the class-specific item-response probabilities equal between any two classes. That is, a meaningful k-class model is not obtained simply by setting a parameter to zero in a (k + 1)-class model in the way that, for example, a more parsimonious regression model is obtained by constraining certain coefficients in a richer model to zero. Ironically, the lack of regular nesting structure that makes it impossible to decide on the number of classes with an LRT has also been shown to invalidate the mathematical approximations used in the AIC and BIC derivations in the same way (McLachlan and Peel, 2000, pp. 202-212). Nonetheless, ICs are widely used in LCA and other mixture models, partly because of their ease of use even without a firm theoretical basis. Fortunately, there is at least an asymptotic theoretical result showing that, when the true model is well-identified, BIC (and hence also AIC and ABIC) will have a probability of underestimating the true number of classes that approaches 0 as sample size tends to infinity (Leroux, 1992; McLachlan and Peel, 2000, p. 209).
Lin and Dayton (1997) conducted an early simulation study comparing the performance of AIC, BIC, and CAIC for choosing which assumptions to make in constructing constrained LCA models, a model selection task that is somewhat, but not fully, analogous to choosing the number of classes. When a very simple model was used as the true model, BIC and CAIC were more likely than AIC to choose the true model, while AIC tended to choose an unnecessarily complicated one. When a more complex model was used to generate the data and measurement quality was poor, AIC was more likely to choose the true model than BIC or CAIC, which were likely to choose an overly simplistic one. They noted that this pattern is intuitive given the criteria's differing degrees of emphasis on parsimony. Interpreting these results, Dayton (1998) suggested that AIC tended to be a better choice than BIC in LCA, but recommended computing and comparing both.
Other simulations have explored the ability of the ICs to determine the correct number of classes. In Dias (2006), AIC had the lowest rate of underfitting but often overfit, while BIC and CAIC practically never overfit but often underfit. AIC3 was in between and did well in general. The danger of underfitting increased when the classes did not have very different response profiles and were therefore easy to mistakenly lump together; in these cases BIC and CAIC almost always underfit. Yang (2006) reported that ABIC performed better in general than AIC (whose model selection accuracy never reached 100%, regardless of n) or BIC and CAIC (which underfit too often and required large n to be accurate). Fonseca and Cardoso (2007) similarly suggested AIC3 as the preferred selection criterion for categorical LCA models.
Yang and Yang (2007) compared AIC, BIC, AIC3, ABIC, and CAIC. When the true number of classes was large and n was small, CAIC and BIC seriously underfit, but AIC3 and ABIC performed better. Nylund et al. (2007) presented simulations on the performance of various ICs and tests for selecting the number of classes in LCA, as well as in factor mixture models and growth mixture models. Overall, in their simulations, BIC performed much better than AIC, which tended to overfit, or CAIC, which tended to underfit (Nylund et al., 2007, p. 559). However, this does not mean that BIC was the best in every situation. In most of the scenarios considered by Nylund et al. (2007), BIC and CAIC almost always selected the correct model size, while AIC had much lower accuracy in these scenarios because of its tendency to overfit. In those scenarios, n was large enough that the lower sensitivity of BIC was not a problem. However, in a more challenging scenario with a small sample and unequally sized classes (Nylund et al., 2007, p. 557), BIC essentially never chose the larger correct model and usually chose one that was much too small. Thus, as Lin and Dayton (1997) found, BIC may select too few classes when the true population structure is complex but subtle (for example, a small but nonzero difference between the parameters of a pair of classes) and n is small. Wu (2009) compared the performance of AIC, BIC, ABIC, CAIC, naïve tests, and the bootstrap LRT in hundreds of simulated scenarios. Performance was heavily dependent on the scenario, but the method that worked adequately in the greatest variety of situations was the bootstrap LRT, followed by ABIC and classic BIC. Wu (2009) argued that BIC seemed to outperform ABIC in the most optimal situations because of its parsimony, but that ABIC seemed to do better in situations with smaller n or more unequal class sizes. Dziak et al. (2014) also concluded that BIC could seriously underfit relative to AIC for small sample sizes or other challenging situations. In latent profile analysis, Tein et al. (2013) found that BIC and ABIC did well for large sample sizes and easily distinguishable classes, but AIC chose too many classes, and no method performed well in especially challenging scenarios. In a more distantly related mixture modeling framework, involving modeling evolutionary rates at different genomic sites, Kalyaanamoorthy et al. (2017) found that AIC, AICc, and BIC all worked well but that BIC worked best.
Despite all these findings, it is not possible to say which IC is universally best, even in the idealized world of simulations. Rather, the true parameter values and the n used when generating simulated data determine the relative performance of the ICs. For small n, the most likely error in a simulation is underfitting, so the criteria with lower underfitting rates, such as AIC, often seem better. For larger n, the most likely error is overfitting, so more parsimonious criteria, such as BIC, often seem better. Unfortunately, the point at which n becomes "large" depends on numerous aspects of the situation. Furthermore, all of these findings have limited usefulness in real data, where the truth is unknown. It may be more helpful to think about which aspects of performance (e.g., sensitivity or specificity) are most important in a given situation.
If the goal of having a sufficiently rich model to describe the heterogeneity in the population is more important than parsimony, or if some classes are expected to be small or similar to other classes but distinguishing among them is still considered important for theoretical reasons, then perhaps AIC, AIC3, or ABIC should be used instead of BIC or CAIC. If obtaining a few large and distinctly interpretable classes is more important, then BIC is more appropriate. Sometimes the AIC-favored model might be so large as to be difficult to use or understand; in such cases the BIC-favored model is clearly the better practical choice. For example, in Chan et al. (2007), BIC favored a mixture model with 5 classes and AIC favored at least 10; the authors felt that a 10-class model would be too hard to interpret. In fact, it may be necessary for theoretical or practical reasons to choose a number of classes even smaller than that suggested by BIC, because the number of classes should be chosen partly on the basis of theoretical interpretability, and models with so many classes that they fail to converge to a clear maximum-likelihood solution should be excluded (see Collins and Lanza, 2010; Pohle et al., 2017; Bray and Dziak, 2018).
An alternative to ICs in latent class analysis and cluster analysis is the use of a bootstrap test (see McLachlan and Peel, 2000). Unlike the naïve LRT, Nylund et al. (2007) showed empirically that the bootstrap LRT with a given α level does generally provide a Type I error rate at or below that specified level. Both Nylund et al. (2007) and Wu (2009) found that this bootstrap test seemed to perform somewhat better than the ICs in various situations. The bootstrap LRT is beyond the scope of this paper, as are more computationally intensive versions of AIC and BIC involving bootstrapping, cross-validation, or posterior simulation (see McLachlan and Peel, 2000, pp. 204-212). Also beyond the scope of this paper are mixture-specific selection criteria such as the normalized entropy criterion (Biernacki et al., 1999) or the integrated completed likelihood (Biernacki and Celeux, 2000; Rau and Maugis, 2018). However, the basic ideas in this article will still be helpful in interpreting the implications of some of these other selection methods. For example, like any test or criterion, the bootstrap LRT still requires the choice of a tradeoff between sensitivity and specificity (i.e., by selecting an α level).
4 Discussion
If BIC indicates that a model is too small, it may well be too small (or else fit poorly for some other reason). If AIC indicates that a model is too large, it may well be too large for the data to warrant. Beyond this, theory and judgment are needed. If BIC selects the largest and most general model considered, it is worth thinking about whether to expand the space of models considered (since an even more general model might fit even better), and similarly if AIC chooses the most parsimonious.
AIC and BIC each have distinct theoretical advantages. However, a researcher may judge that one or the other has a practical advantage in a given situation. For example, as mentioned earlier, in choosing the number of classes in a mixture model, the true number of classes required to satisfy all model assumptions is sometimes quite large, too large to be of practical use or even to allow coefficients to be reliably estimated. In that case, BIC would be a better choice than AIC. Additionally, in practice, one may wish to rely on substantive theory or parsimony of interpretation in choosing a relatively simple model. In such cases, the researcher may decide that even BIC has indicated a model that is too complex in a practical sense, and may instead choose a smaller model that is more theoretically meaningful or practically interpretable (Pohle et al., 2017; Bray and Dziak, 2018). This does not mean that BIC overfit. Rather, in these situations the model desired is sometimes not the literally true model but simply the most useful model, a concept which cannot be identified using fit statistics alone but requires subjective judgment. Depending on the situation, the number of classes in a mixture model may be interpreted either as a true quantity to be objectively estimated, or as a level of approximation to be chosen for convenience, like the scale of a map. Still, in either case the question of which patterns or features are generalizable beyond the given sample remains relevant (cf. Li and Marron, 2005).
A larger question is whether to use ICs at all. If ICs indeed reduce to LRTs in simple cases, one might wonder why ICs are needed at all and why researchers cannot simply do LRTs. A possible answer is flexibility. Both AIC and BIC can be used to compare many models concurrently, not just a pair at a time, or to weight the estimates obtained from different models for a common quantity of interest. These weighting approaches use either AIC or BIC but not both, because AIC and BIC are essentially treated as different Bayesian priors. While we currently know of no mathematical framework for explicitly combining AIC and BIC into a single weighting scheme, a sensitivity analysis could be performed by comparing the results from both. AIC and BIC can also be used to choose a few well-fitting models, rather than selecting a single model from among many and assuming it to be the truth (Kuha, 2004). Researchers have also proposed benchmarks for judging whether the size of a difference in AIC or BIC between models is practically significant (see Burnham and Anderson, 2004; Raftery, 1995; Murtaugh, 2014); for example, an AIC or BIC difference between two models of less than 2 provides little evidence for one over the other, while a difference of 10 or more is strong evidence. These benchmarks should not be used as rigid cutoffs (Murtaugh, 2014), but as input to decision making and interpretation. Kadane and Lazar (2004) suggested that ICs might be used to "deselect" very poor models (p. 279), leaving a few good ones for further study, rather than indicating a single best model. One could also use the ICs to suggest a range of model sizes to consider for future study; for example, in some cases one might use the BIC-preferred model as a minimum size and the AIC-preferred model as a maximum.
AIC and BIC can also both be used for model averaging, that is, estimating quantities of interest by combining estimates from more than one model, weighted by the models' plausibility (see Posada and Crandall, 2001; Posada and Buckley, 2004; Claeskens and Hjort, 2008; Johnson and Omland, 2004; Gelman and Rubin, 1995; Hoeting et al., 1999; Burnham and Anderson, 2004). Despite these many worthwhile options, it is still important to remember that an automatic and uncritical use of an IC is no more insightful than an automatic and uncritical use of a p-value.
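The weighting idea can be made concrete with the familiar formula for IC-based model weights, w_i = exp(−Δ_i/2) / Σ_j exp(−Δ_j/2), where Δ_i = IC_i − min_j IC_j (Burnham and Anderson, 2004). The sketch below applies this formula to hypothetical IC values chosen purely for illustration:

```python
import math

def ic_weights(ic_values):
    """Convert IC values (AIC or BIC) for a set of candidate models
    into weights w_i = exp(-delta_i/2) / sum_j exp(-delta_j/2),
    where delta_i = IC_i - min(IC)."""
    best = min(ic_values)
    raw = [math.exp(-(v - best) / 2) for v in ic_values]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical AIC values for three candidate models; the middle
# model has the smallest AIC and therefore receives almost all
# of the weight, since the other two differ from it by about 10.
weights = ic_weights([6633.0, 6623.8, 6634.0])
```

Such weights can then be used to average model-specific estimates of a common quantity of interest, rather than conditioning all inference on a single selected model.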
Lastly, both AIC and BIC were developed in situations in which n was assumed to be much larger than p. None of the ICs discussed here were specifically developed for situations such as those found in many genome-wide association studies predicting disease outcomes, in which the number of participants (n) is often smaller than the number of potential genes (p), even when n is in the tens of thousands. The ICs can still be practically useful in this setting (e.g., Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013). However, sometimes they might need to be adapted (see, e.g., Chen and Chen, 2008; Pan et al., 2016; Mestres et al., 2018). More research in this area would be worthwhile.
Funding
This research was supported by NIH grant P50 DA039838 from the National Institute on Drug Abuse. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Drug Abuse or the National Institutes of Health.
Acknowledgements
The authors thank Linda M. Collins for very valuable suggestions and insights which helped in the development of this paper. We also thank Michael Cleveland for his careful review and recommendations on an earlier version of this paper. John Dziak thanks Frank Sabach for encouragement in the early stages. Amanda Applegate is also thanked for her editorial assistance. Lars Jermiin thanks the University College Dublin for its generous hospitality.
A previous version of this report has been disseminated as Methodology Center Technical Report 12-119, June 27, 2012, and as a preprint at https://peerj.com/preprints/1103/. The earlier version of the paper contains simulations to illustrate the points made. Simulation code is available at http://www.runmycode.org/companion/view/1306 and results at https://methodology.psu.edu/media/techreports/12-119.pdf.