Abstract
Information criteria (ICs) based on penalized likelihood, such as Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), and their sample-size-adjusted variants, are widely used for model selection in health and biological research. However, different criteria sometimes support different models, leading to debate about which criterion is the most trustworthy. Some researchers and fields of study habitually use one criterion or another, often without a clearly stated justification, and may not realize that the criteria can disagree. Others compare models using multiple criteria but face ambiguity when the criteria give substantively different answers, raising the question of which criterion is best. In this paper we present an alternative perspective on these criteria that can help in interpreting their practical implications. Specifically, in some cases the comparison of two models using ICs can be viewed as equivalent to a likelihood ratio test, with the different criteria representing different alpha levels and BIC being a more conservative test than AIC. This perspective may lead to insights about how to interpret the ICs in more complex situations. For example, AIC or BIC could be preferable, depending on the relative importance one assigns to sensitivity versus specificity. Understanding the differences and similarities among the ICs can make it easier to compare their results and to use them to make informed decisions.
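To make the claimed equivalence concrete, here is a brief sketch of the algebra for two nested models, with notation introduced only for this illustration: let $L_0$ and $L_1$ denote the maximized likelihoods of the smaller and larger model, with $k_0 < k_1$ parameters. AIC prefers the larger model exactly when
\[
-2\log L_1 + 2k_1 < -2\log L_0 + 2k_0
\quad\Longleftrightarrow\quad
2(\log L_1 - \log L_0) > 2(k_1 - k_0),
\]
that is, when the likelihood ratio test statistic exceeds the critical value $2(k_1 - k_0)$. Under the usual asymptotics, for one extra parameter this corresponds to $\alpha = P(\chi^2_1 > 2) \approx 0.157$. BIC replaces the penalty 2 with $\log n$, giving critical value $(\log n)(k_1 - k_0)$ and hence a smaller alpha that shrinks as $n$ grows (for $n = 100$, $\alpha = P(\chi^2_1 > \log 100) \approx 0.032$), which is what makes BIC the more conservative test.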
Key Points
Information criteria such as AIC and BIC are motivated by different theoretical frameworks.
However, when comparing pairs of nested models, they reduce algebraically to likelihood ratio tests with differing alpha levels; a numerical sketch of these implied alpha levels follows these points.
This perspective makes it easier to understand their different emphases on sensitivity versus specificity, and why BIC but not AIC possesses model selection consistency.
This perspective is useful for comparing the criteria, but it does not mean that information criteria are merely likelihood ratio tests; they can also be used in ways for which such tests are less well suited, such as model averaging.
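As a minimal numerical sketch of the second point above, the following Python snippet (an illustration assuming scipy is available; the helper name implied_alpha is hypothetical) computes the alpha level implicitly used when an IC with a given per-parameter penalty chooses between two nested models.

    from math import log

    from scipy.stats import chi2

    def implied_alpha(penalty_per_param: float, df: int) -> float:
        """Alpha of the likelihood ratio test implied by a penalized-likelihood IC.

        Choosing the larger of two nested models by IC = -2*log L + penalty*k
        amounts to rejecting the smaller model when the LRT statistic
        2*(log L1 - log L0) exceeds penalty*df, where df = k1 - k0 extra
        parameters. Under the usual asymptotics that statistic is
        chi-square(df) when the smaller model is adequate, so the implied
        alpha is the upper tail probability at the cutoff.
        """
        return chi2.sf(penalty_per_param * df, df)

    df = 1    # larger model has one extra parameter (illustrative)
    n = 100   # sample size (illustrative)

    print(f"AIC (penalty 2):       alpha = {implied_alpha(2.0, df):.3f}")     # ~0.157
    print(f"BIC (penalty log {n}): alpha = {implied_alpha(log(n), df):.3f}")  # ~0.032

Note how BIC’s implied alpha shrinks toward zero as n grows while AIC’s stays fixed near 0.157; this is one way to see why BIC attains model selection consistency and why AIC retains greater sensitivity at the cost of specificity.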