Parametric inference in the large data limit using maximally informative models

Justin B. Kinney; Gurinder S. Atwal

doi:10.1101/001396

Abstract

Motivated by data-rich experiments in transcriptional regulation and sensory neuroscience, we consider the following general problem in statistical inference. When exposed to a high-dimensional signal S, a system of interest computes a representation R of that signal which is then observed through a noisy measurement M. From a large number of signals and measurements, we wish to infer the “filter” that maps S to R. However, the standard method for solving such problems, likelihood-based inference, requires perfect a priori knowledge of the “noise function” mapping R to M. In practice such noise functions are usually known only approximately, if at all, and using an incorrect noise function will typically bias the inferred filter. Here we show that, in the large data limit, this need for a pre-characterized noise function can be circumvented by searching for filters that instead maximize the mutual information I[M; R] between observed measurements and predicted representations. Moreover, if the correct filter lies within the space of filters being explored, maximizing mutual information becomes equivalent to simultaneously maximizing every dependence measure that satisfies the Data Processing Inequality. Maximizing mutual information will, however, typically leave a small number of directions in parameter space unconstrained. We term these directions “diffeomorphic modes” and present an equation that allows these modes to be derived systematically. The presence of diffeomorphic modes reflects a fundamental and nontrivial substructure within parameter space, one that is obscured by standard likelihood-based inference.

1 Introduction

This paper discusses a familiar problem in statistical inference, but focuses on an understudied limit that is becoming increasingly relevant in the era of large data sets. Consider an experiment having the following form:

$$ S \xrightarrow{\ \theta\ } R \xrightarrow{\ \pi\ } M $$

When presented with a signal S, a system of interest applies a deterministic filter θ, thereby producing an internal representation R of that signal. For each representation R, a noisy measurement M is then generated. The conditional probability distribution π(M|R) from which M is drawn is called the “noise function” of the system. From data consisting of N signal-measurement pairs, {(S_n, M_n)} with n = 1, …, N, we wish to reconstruct the filter θ. This paper focuses on how to infer θ properly in the N → ∞ limit when the noise function π is unknown a priori.

All statistical regression problems have this “SRM” form (Bishop, 2006), but we will focus on two biological applications for which this problem is particularly relevant. In neuroscience, SRM experiments are commonly used to characterize the response of neurons to stimuli (Schwartz et al., 2006). For instance, S may be an image to which a retina is exposed, while M is a binary variable (‘spike’ or ‘no spike’) indicating the response of a single retinal ganglion cell. It is often assumed that the spiking probability depends on a linear projection R of S. The specific probability of a spike given R is determined by the noise function π.

More recently, analogous experiments have been used to characterize the biophysical mechanisms of transcriptional regulation. In the context of work by Kinney et al. (2010), S is the DNA sequence of a transcriptional regulatory region, R is the rate of mRNA transcription produced by this sequence, and M is a (noisy) measurement of the resulting level of gene expression. The filter θ is a function of DNA sequence that reflects the underlying molecular mechanisms of transcript initiation. The noise function π accounts for both biological noise¹ and instrument noise.

The standard approach for solving inference problems like these is to adopt a specific noise function π, then search a space Θ of possible filters for the one filter θ that maximizes the likelihood p({M_n} | {S_n}, θ, π) = e^{N L(θ, π)}, where,

$$ L(\theta, \pi) = \frac{1}{N} \sum_{n=1}^{N} \log \pi\bigl(M_n \mid \theta(S_n)\bigr) \tag{2} $$

is the per-datum log likelihood. For instance, the method of least squares regression corresponds to maximum likelihood inference assuming a homogeneous Gaussian noise function π (Bishop, 2006).
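As a worked illustration of the least-squares correspondence, suppose the noise function is Gaussian with a variance σ² that does not depend on R (the symbol σ is introduced here for illustration only), and take the per-datum log likelihood to be the average of log π(M_n|R_n) over the N data points. Substituting the Gaussian form gives:

```latex
\pi(M|R) = \frac{1}{\sqrt{2\pi\sigma^2}}
           \exp\!\left[-\frac{(M-R)^2}{2\sigma^2}\right]
\quad\Longrightarrow\quad
L(\theta,\pi) = -\frac{1}{2\sigma^2}\,\frac{1}{N}\sum_{n=1}^{N}
                \bigl(M_n - \theta(S_n)\bigr)^2
                - \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr)
```

For fixed σ, maximizing L over θ is therefore the same as minimizing the mean squared residual between measurements and predictions.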

Although the correct filter θ does indeed maximize L(θ, π) when the correct noise function π is used, full a priori knowledge of this noise function is rare in practice. Often π is chosen primarily for computational convenience, as is standard with least-squares regression. This can be problematic because using an incorrect π will typically produce bias in the inferred filter θ, bias that does not disappear in the N → ∞ limit. The reason for this is illustrated in Fig. 1.

Figure 1:

Maximizing likelihood with an incorrect noise function will generally bias the inferred filter. The per-datum log likelihood L(θ, π) will typically depend on both the filter θ and the noise function π in a correlated manner (left panel). Values of a schematic L(θ, π) are illustrated in gray, with darker shades indicating larger likelihood. If the correct noise function π^* is assumed (solid line), maximizing L(θ, π^*) will yield the correct filter θ^* (filled dot). However, if an incorrect noise function π′ is assumed (dashed line), maximizing L(θ, π′) will typically lead to an incorrect filter θ′ (open dot).

Sometimes this problem can be partially alleviated by performing a separate “calibration experiment” in which the noise function π(M|R) is measured directly. For instance, one might be able to make repeated measurements M for a select number of known representations R. However, there will always be residual measurement error in π that will propagate to θ in a manner that is not properly accounted for by simply plugging π into likelihood calculations via Eq. 2.

An alternative inference procedure (Sharpee et al., 2004; Paninski, 2003; Kinney et al., 2007) that circumvents the need for an assumed noise function is to maximize the mutual information (Cover & Thomas, 1991),

$$ I(\theta) = \int dR\, dM\; p(R, M) \log \frac{p(R, M)}{p(R)\, p(M)}, \tag{3} $$

between predictions R and measurements M.² Here, p(R, M) is the empirical joint distribution between predictions and measurements, and thus depends implicitly on θ. This method has been proposed, studied, and applied in the specific contexts of receptive field inference (Sharpee et al., 2004; Paninski, 2003; Sharpee et al., 2006; Pillow & Simoncelli, 2006) and transcriptional regulation (Kinney et al., 2007; Elemento et al., 2007; Kinney, 2008; Kinney et al., 2010; Melnikov et al., 2012). However, this alternative approach can be applied to a much wider range of statistical regression problems, and a general discussion of how maximizing mutual information relates to maximizing likelihood for arbitrary SRM systems has yet to be presented.
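The idea of selecting a filter by maximizing mutual information can be sketched on a toy problem. The example below is only an illustration under stated assumptions, not the procedure used in any of the cited works: the two-dimensional Gaussian signals, the logistic spiking rule, the quantile-binned plug-in estimator, and all function names are hypothetical choices made here. A one-parameter family of linear filters (indexed by an angle) is scanned, and the angle with the largest estimated I[R; M] is reported.

```python
import math
import random
from collections import Counter


def mutual_information(r_values, m_values, n_bins=10):
    """Plug-in estimate (in bits) of I[R; M] for scalar R and discrete M.

    R is discretized into equal-occupancy (quantile) bins, so the estimate
    depends only on the rank order of the predictions.
    """
    n = len(r_values)
    order = sorted(range(n), key=lambda k: r_values[k])
    r_bin = [0] * n
    for rank, k in enumerate(order):
        r_bin[k] = rank * n_bins // n
    joint = Counter(zip(r_bin, m_values))
    count_r = Counter(r_bin)
    count_m = Counter(m_values)
    info = 0.0
    for (rb, m), count in joint.items():
        info += (count / n) * math.log2(count * n / (count_r[rb] * count_m[m]))
    return info


def simulate(n, true_angle, rng):
    """Toy SRM experiment: 2-D Gaussian signals, binary noisy measurements."""
    signals, measurements = [], []
    for _ in range(n):
        s = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        r_true = math.cos(true_angle) * s[0] + math.sin(true_angle) * s[1]
        p_spike = 1.0 / (1.0 + math.exp(-3.0 * r_true))  # assumed noise function
        signals.append(s)
        measurements.append(1 if rng.random() < p_spike else 0)
    return signals, measurements


def scan_angles(signals, measurements, n_angles=36):
    """Return the (angle, information) pair with the largest estimated I[R; M]."""
    best = None
    for j in range(n_angles):
        angle = math.pi * j / n_angles
        predictions = [math.cos(angle) * s[0] + math.sin(angle) * s[1]
                       for s in signals]
        info = mutual_information(predictions, measurements)
        if best is None or info > best[1]:
            best = (angle, info)
    return best


rng = random.Random(0)
signals, measurements = simulate(4000, 0.6, rng)
best_angle, best_info = scan_angles(signals, measurements)
print(f"most informative angle: {best_angle:.3f} rad, I = {best_info:.3f} bits")
```

The scan never uses the logistic rule that generated the data. Angles are scanned only over half a circle because flipping the sign of R is an invertible transformation and leaves the estimate unchanged.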

We begin by pointing out that, in the N → ∞ limit, maximizing mutual information over θ alone is equivalent to maximizing likelihood over both θ and π. We then prove that when the correct filter θ lies within the class of filters being considered, maximizing mutual information is also equivalent to simultaneously maximizing every dependence measure that satisfies the Data Processing Inequality (DPI). However, in the absence of a known noise function π, SRM experiments are fundamentally incapable of constraining certain directions in the parameter space of θ; we call these directions “diffeomorphic modes.” An equation for diffeomorphic modes is described and then applied to filters having various functional forms. In particular, our analysis of a linear-nonlinear filter used by Kinney et al. (2010) to model transcriptional regulation demonstrates how model nonlinearities can eliminate diffeomorphic modes in useful and non-obvious ways. This has important consequences for biophysical studies of transcriptional regulation that use recently developed DNA-sequencing-based assays (Kinney et al., 2010; Melnikov et al., 2012).

Throughout this manuscript, R is used to implicitly denote the representation predicted by the filter θ for signal S, i.e. R ≡ θ(S). 𝒟 is used to denote any DPI-satisfying dependence measure. Representations R are assumed to be multidimensional with components R^μ, and ∂_μ ≡ ∂/∂R^μ. θ is used to denote both a filter and the parameters governing that filter. Θ represents both an abstract space of filters, as well as the space of parameters for filters assumed to have a specific functional form. In the latter case, θⁱ denotes coordinates in parameter space, and ∂_i ≡ ∂/∂θⁱ.

2 Mutual information and likelihood

We begin by discussing the connection between likelihood and mutual information in the N → ∞ limit. In this limit, the per-datum log likelihood (Eq. 2) can be rewritten as,

$$ L(\theta, \pi) = \int dR\, dM\; p(R, M) \log \pi(M|R) = I(\theta) - D(\theta, \pi) - H[M]. \tag{4} $$

The first term, I(θ), is the mutual information between R and M (Eq. 3) and is independent of the noise function π. The second term,

$$ D(\theta, \pi) = \int dR\; p(R) \int dM\; p(M|R) \log \frac{p(M|R)}{\pi(M|R)}, $$

is the Kullback-Leibler (KL) divergence between the empirical distribution p(M|R), which results from the choice of θ, and the assumed noise function π(M|R). The last term, H[M] = −∫ dM p(M) log p(M), is the entropy of the measurements M. H[M] is independent of both θ and π and can thus be ignored in the optimization problem.

The key point is that finding maximally informative filters θ is equivalent to solving the maximum likelihood problem over both filters θ and noise functions π. This is because if θ maximizes I(θ), simply choosing a noise function that matches the empirical noise function, i.e. setting π(M|R) = p(M|R), will minimize D(θ, π) and thus maximize L(θ, π).
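The relationship between likelihood, mutual information, KL divergence and entropy can be checked numerically on a small discrete example. The sketch below assumes the large-N per-datum log likelihood is the average of log π(M|R) under the empirical joint distribution p(R, M), and that D is the KL divergence between p(M|R) and π(M|R) averaged over p(R). The probability tables are hypothetical numbers chosen only for illustration.

```python
import math

# Hypothetical empirical joint distribution p(R, M):
# three representation bins (rows) and two measurement values (columns).
p_joint = [[0.30, 0.10],
           [0.10, 0.20],
           [0.05, 0.25]]

# Hypothetical (incorrect) assumed noise function pi(M | R); rows sum to 1.
pi_assumed = [[0.6, 0.4],
              [0.5, 0.5],
              [0.3, 0.7]]

cells = [(r, m) for r in range(3) for m in range(2)]
p_r = [sum(row) for row in p_joint]
p_m = [sum(p_joint[r][m] for r in range(3)) for m in range(2)]

# Empirical noise function p(M | R) implied by the joint distribution.
pi_empirical = [[p_joint[r][m] / p_r[r] for m in range(2)] for r in range(3)]


def log_likelihood(pi):
    """Large-N per-datum log likelihood: average of log pi(M|R) under p(R, M)."""
    return sum(p_joint[r][m] * math.log(pi[r][m]) for r, m in cells)


def kl_term(pi):
    """KL divergence between p(M|R) and pi(M|R), averaged over p(R)."""
    return sum(p_joint[r][m] * math.log(pi_empirical[r][m] / pi[r][m])
               for r, m in cells)


mutual_info = sum(p_joint[r][m] * math.log(p_joint[r][m] / (p_r[r] * p_m[m]))
                  for r, m in cells)
entropy_m = -sum(p * math.log(p) for p in p_m)

lhs = log_likelihood(pi_assumed)
rhs = mutual_info - kl_term(pi_assumed) - entropy_m
print(f"L = {lhs:.6f}, I - D - H = {rhs:.6f}")
print(f"L with matched noise function = {log_likelihood(pi_empirical):.6f}")
```

Choosing the noise function equal to the empirical conditional distribution makes the KL term vanish, so the likelihood reaches its largest value for this filter, namely I − H[M].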

If one can formalize prior assumptions about the noise function π using a Bayesian prior p(π), the relevant objective function becomes the per-datum marginal likelihood,

$$ L_m(\theta) = \frac{1}{N} \log \int d\pi\; p(\pi)\, e^{N L(\theta, \pi)}. \tag{7} $$

This is analogous to Eq. 4 computed after all possible noise functions have been integrated out. As has been shown in previous work (Kinney et al., 2007; Rajan et al., 2013), maximizing marginal likelihood and maximizing mutual information are essentially equivalent in the N → ∞ limit. This can be seen by decomposing L_m(θ) as,

$$ L_m(\theta) = I(\theta) - \Delta(\theta) - H[M], \tag{8} $$

where,

$$ \Delta(\theta) = -\frac{1}{N} \log \int d\pi\; p(\pi)\, e^{-N D(\theta, \pi)}. $$

Under weak assumptions about the prior p(π),³ Δ → 0 as N → ∞ (see Appendix A).
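The vanishing of Δ can be illustrated with a deliberately simplified prior. The sketch below assumes Δ(θ) = −(1/N) log ∫ dπ p(π) e^{−N D(θ, π)} and replaces the integral over noise functions by a sum over three hypothetical candidates, one of which matches the empirical noise function exactly (D = 0). The KL values and prior weights are made-up numbers. With a continuous prior the decay would also involve the volume of noise-function space near the optimum, which this discrete toy does not capture.

```python
import math


def delta(n, kl_values, prior_weights):
    """Delta = -(1/N) log sum_k p_k exp(-N D_k), using log-sum-exp for stability."""
    exponents = [math.log(p) - n * d for p, d in zip(prior_weights, kl_values)]
    top = max(exponents)
    log_sum = top + math.log(sum(math.exp(e - top) for e in exponents))
    return -log_sum / n


# Hypothetical candidates: D = 0 means the candidate equals p(M|R).
kl_values = [0.0, 0.05, 0.3]
prior_weights = [0.2, 0.5, 0.3]

sample_sizes = [10, 100, 1000, 10**6]
deltas = [delta(n, kl_values, prior_weights) for n in sample_sizes]
for n, d in zip(sample_sizes, deltas):
    print(f"N = {n:>7d}   Delta = {d:.3e}   N * Delta = {n * d:.4f}")
```

In this toy setting N·Δ approaches −log of the prior weight placed on the matching noise function, so Δ itself decays as 1/N provided that weight is nonzero.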

3 DPI-optimal filters

Mutual information is just one measure among many that satisfy DPI (see Appendix B). In this section, we discuss the importance of DPI for the SRM inference problem and introduce the notion of “DPI-optimal” filters.

Paninski (2003) has argued as follows for using DPI-satisfying dependence measures as objective functions for inferring filters. If θ^* is the correct filter in an SRM experiment, then for every filter θ, R ← S → R^* → M is a Markov chain, where R^* ≡ θ^*(S). This implies 𝒟[R; M] ≤ 𝒟[R^*; M] for every DPI-satisfying measure 𝒟. If θ^* resides within the space Θ of filters being explored, it must therefore fall within the subset Θ_𝒟 ⊆ Θ on which 𝒟 is maximized. As a simple extension of this argument, we point out that, because θ^* maximizes all DPI-satisfying measures, θ^* must actually lie within the intersection of all such sets, i.e.,

$$ \theta^* \in \Theta_{\mathrm{DPI}} \equiv \bigcap_{\mathcal{D}} \Theta_{\mathcal{D}}. $$

Filters in Θ_DPI can properly be said to be “DPI-optimal.”

This raises an important question: would optimizing a variety of different measures 𝒟, not just mutual information, narrow the search for θ^*? Here we show the answer is ‘no’; when θ^* ∈ Θ, maximizing mutual information is equivalent to simultaneously maximizing every DPI-satisfying measure, i.e.,

$$ \Theta_I = \Theta_{\mathrm{DPI}}, $$

where Θ_I denotes the set of filters in Θ that maximize mutual information.

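To prove this, we first define on the space of all possible filters a weak and strong partial ordering, as well as an equivalence relation. These mathematical structures are a natural consequence of DPI. For any two filters θ₁ and θ₂,⁴ we write,

$$ \theta_1 \le \theta_2 \quad \text{if} \quad \mathcal{D}[R_1; M] \le \mathcal{D}[R_2; M] \ \text{for all } \mathcal{D}, \tag{13} $$

$$ \theta_1 < \theta_2 \quad \text{if} \quad \theta_1 \le \theta_2 \ \text{and} \ \theta_2 \not\le \theta_1, \tag{14} $$

$$ \theta_1 \simeq \theta_2 \quad \text{if} \quad \theta_1 \le \theta_2 \ \text{and} \ \theta_2 \le \theta_1. \tag{15} $$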

Note that θ₁ ≤ θ₂ if R₁ ← R₂ → M is a Markov chain. The set Θ_DPI of DPI-optimal filters is the supremum of Θ under this partial ordering. The equivalence Θ_I = Θ_DPI, which occurs when θ^* ∈ Θ, follows directly from the fact, proven in Appendix C, that θ < θ^* implies I(θ) < I(θ^*). We note that this is not true for all DPI-satisfying measures. For instance, the trivial measure 𝒟 = 0 satisfies DPI but reveals no information about whether a given θ resides in Θ_DPI. These results are illustrated in Fig. 2.

Figure 2:

Venn diagram illustrating filter sets maximizing different DPI-satisfying measures. In general, different DPI-satisfying dependence measures, e.g. mutual information I and some other measure 𝒟, will be maximized by different sets of filters, respectively represented here by Θ_I and Θ_𝒟. Θ_DPI is the intersection of the optimal sets of all such DPI-satisfying measures. Mutual information has the important property that Θ_I = Θ_DPI whenever θ^* ∈ Θ; this is not true of all DPI-satisfying measures.

4 Diffeomorphic modes

Whether or not two filters θ₁ and θ₂ satisfy the above equivalence relation (Eq. 15) can depend on the true filter θ^* and on the specific noise function π^* of the SRM experiment. However, certain pairs of filters will satisfy θ₁ ⋍ θ₂ under all SRM experiments. We will refer to such pairs of filters as being “information equivalent.” In Appendix D we prove that two filters are information equivalent if and only if their predicted representations are related by an invertible transformation.
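A small discrete example may help make information equivalence concrete. The sketch below uses a hypothetical joint probability table for four representation values and two measurement values. Relabeling the representation values (an invertible transformation) leaves I[R; M] unchanged, while merging two values with different measurement statistics (a non-invertible transformation) lowers it. The table entries and the particular relabeling are arbitrary illustrative choices.

```python
import math


def mutual_information(p_joint):
    """I[R; M] in bits for a joint probability table indexed as p_joint[r][m]."""
    p_r = [sum(row) for row in p_joint]
    p_m = [sum(col) for col in zip(*p_joint)]
    return sum(p * math.log2(p / (p_r[i] * p_m[j]))
               for i, row in enumerate(p_joint)
               for j, p in enumerate(row) if p > 0)


# Hypothetical joint distribution over 4 representation values and 2 measurements.
p_joint = [[0.20, 0.05],
           [0.10, 0.15],
           [0.05, 0.20],
           [0.02, 0.23]]

# Invertible transformation of R: permute the labels of the representation.
relabeled = [p_joint[k] for k in (2, 0, 3, 1)]

# Non-invertible transformation of R: merge the first two representation values.
merged = [[a + b for a, b in zip(p_joint[0], p_joint[1])],
          p_joint[2],
          p_joint[3]]

mi_original = mutual_information(p_joint)
mi_relabeled = mutual_information(relabeled)
mi_merged = mutual_information(merged)
print(f"original  : {mi_original:.4f} bits")
print(f"relabeled : {mi_relabeled:.4f} bits")
print(f"merged    : {mi_merged:.4f} bits")
```

Two filters whose predictions differ only by such a relabeling cannot be told apart by mutual information, whatever the noise function is.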

As an objective function, mutual information is inherently incapable of distinguishing between information equivalent filters. In practice this means that selecting maximally informative filters from a parametrized set of filters can leave some directions in parameter space unconstrained. Here we term these directions “diffeomorphic modes.”

The diffeomorphic modes of linear filters have an important and well-recognized consequence in neuroscience: the technique of maximally informative dimensions can identify only the relevant subspace of signal space, not a specific basis within that subspace (Sharpee et al., 2004; Paninski, 2003; Pillow & Simoncelli, 2006). However, an interesting twist occurs in applications to transcriptional regulation. Here, linear filters are often used to model the sequence-dependent binding energies of proteins to DNA (Stormo, 2013). Any mechanistic hypothesis about how DNA-bound proteins interact with one another predicts that the transcription rate will depend on these binding energies in a specific nonlinear manner (Bintu et al., 2005; Stormo, 2013). Such up-front knowledge about the nonlinearities of linear-nonlinear filters can eliminate diffeomorphic modes of the underlying linear filters in useful and non-obvious ways (Kinney, 2008; Kinney et al., 2010).

4.1 An equation for diffeomorphic modes

Consider a filter θ, representing a point in Θ, whose parameters θⁱ are infinitesimally transported along a vector field having components gⁱ(θ). This yields a new filter θ′ with components θ′ⁱ = θⁱ + εgⁱ(θ). If the representation R predicted by θ for a specified signal S has components R^μ in representation space, these will be transformed to

$$ R^\mu \to R'^\mu = R^\mu + \epsilon\, g^i \partial_i R^\mu, $$

with summation over the repeated index i implied.

If the vector field gⁱ(θ) represents a diffeomorphic mode of Θ, this transformation must be invertible, meaning the values gⁱ∂_iR^μ cannot depend on S except through the value of R. This is a nontrivial condition because ∂_iR^μ can depend on the underlying signal S in an arbitrary manner. However, if gⁱ∂_iR^μ does indeed depend only on the value of R then,

$$ g^i \partial_i R^\mu = h^\mu(R, \theta) \tag{16} $$

for some vector function h^μ(R, θ). This is the equation that any diffeomorphic mode gⁱ(θ) must satisfy.

4.2 General linear filters

We now use Eq. 16 to derive the diffeomorphic modes of general linear filters. By definition, a linear filter θ yields a representation R that is a linear combination of signal “features” f_i^μ(S), i.e.,

$$ R^\mu = \sum_i \theta^i f_i^\mu(S). \tag{17} $$

As is standard with regression problems (Bishop, 2006), the term “linear” describes how R depends on the parameters θⁱ; the features f_i^μ need not be linear functions of S.

To find the diffeomorphic modes of these filters, we apply the operator gⁱ∂_i to both sides of Eq. 17. Using Eq. 16 we then find gⁱ f_i^μ(S) = h^μ(R, θ). The left-hand side is linear in signal features, so unless something unusual happens,⁵ h^μ(R, θ) must also be a linear function of R, i.e. have the form,

$$ h^\mu(R, \theta) = a^\mu + \sum_\nu b^\mu{}_\nu R^\nu. \tag{18} $$

The number of diffeomorphic modes is bounded above by the number of independent parameters on which h^μ depends (at each θ).⁶ For a general linear filter we see that there can be no more than dim(R)[dim(R) + 1] diffeomorphic modes, which is the number of parameters a^μ and b^μ_ν in Eq. 18. This bound is independent of the number of signal features, i.e. the dimensionality of S. In particular, if R is a scalar, then h = a + bR. In this case we observe two diffeomorphic modes, corresponding to additive and multiplicative transformations of R.
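The scalar case can be checked numerically. The sketch below assumes a scalar linear filter R = θ⁰ + θ¹s₁ + θ²s₂ with a constant feature and two Gaussian signal coordinates as features, all hypothetical choices. For a direction g in parameter space, the induced change in R is g·f(S). The function reports how badly that change fails to be a linear function a + bR of the prediction. This tests only the linear form of h discussed above and is not a general test for invertibility.

```python
import random


def mode_residual(g, theta, features):
    """RMS residual after fitting g.f(S) by a + b*R, where R = theta.f(S)."""
    r = [sum(t * f for t, f in zip(theta, fs)) for fs in features]
    d = [sum(gi * f for gi, f in zip(g, fs)) for fs in features]
    n = len(r)
    mean_r = sum(r) / n
    mean_d = sum(d) / n
    var_r = sum((x - mean_r) ** 2 for x in r)
    cov = sum((x - mean_r) * (y - mean_d) for x, y in zip(r, d))
    b = cov / var_r
    a = mean_d - b * mean_r
    return (sum((y - a - b * x) ** 2 for x, y in zip(r, d)) / n) ** 0.5


rng = random.Random(1)
# Feature vectors (1, s1, s2): a constant feature plus two signal coordinates.
features = [(1.0, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(2000)]
theta = (0.5, 1.0, -2.0)  # hypothetical filter parameters

shift_mode = (1.0, 0.0, 0.0)   # changes only the additive constant: h = 1
scale_mode = theta             # rescales all parameters: h = R
generic_dir = (0.0, 1.0, 0.0)  # changes the weight on s1 alone

res_shift = mode_residual(shift_mode, theta, features)
res_scale = mode_residual(scale_mode, theta, features)
res_generic = mode_residual(generic_dir, theta, features)
print(f"shift: {res_shift:.2e}  scale: {res_scale:.2e}  generic: {res_generic:.2e}")
```

The additive and multiplicative directions leave no residual, consistent with h = a + bR, while a generic direction changes R in a way that depends on S beyond its dependence on R.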

4.3 A linear-nonlinear filter

Kinney et al. (2010) performed experiments probing the biophysical mechanism of transcriptional regulation at the Escherichia coli lac promoter (Fig. 3A). These experiments are of the SRM form where S is the DNA sequence of a mutated lac promoter, M is a measurement of the resulting gene expression, and the mRNA transcription rate T is the internal representation of the system. Linear filters were used to model the binding energies Q and P of the two proteins CRP and RNAP. The specific parametric form used for these filters was,

$$ Q = \theta_Q^0 + \sum_{b,l} \theta_Q^{bl} S_{bl}, \qquad P = \theta_P^0 + \sum_{b,l} \theta_P^{bl} S_{bl}, \tag{19} $$

where b indexes the four possible bases (A,C,G,T), l indexes nucleotide positions within the 75 bp promoter DNA region, S_bl = 1 if base b occurs at position l and S_bl = 0 otherwise.⁷

Figure 3:

A linear-nonlinear filter modeling the biophysics of transcriptional regulation at the Escherichia coli lac promoter. (A) The biophysical model inferred by Kinney et al. (2010) from Sort-Seq data. Each signal S is a 75 bp DNA sequence differing from the wildtype lac promoter by ~ 9 randomly scattered substitution mutations. Q and P denote the sequence-dependent binding energies of the proteins CRP and RNAP to their respective sites on this sequence S; both Q and P were modeled as linear filters of S. γ is a sequence-independent interaction energy between CRP and RNAP. The resulting transcription rate T, of which the Sort-Seq assay produces noisy measurements M, is assumed to depend on Q, P, and γ in a specific nonlinear manner dictated by the hypothesized biophysical mechanism (Eq. 20; all energies are in units of k_BT). (B) The linear filter Q is defined by parameters θ_Q^bl and θ_Q^0 via Eq. 19. Inferring these parameters by maximizing the mutual information I[Q; M] determines θ_Q^bl up to an unknown scale and leaves θ_Q^0 undetermined. (C) Analogous results are obtained for the parameters θ_P^bl and θ_P^0 when I[P; M] is maximized. (D) Because of the inherent nonlinearity in Eq. 20 (right-hand side), maximizing I[T; M] breaks diffeomorphic modes, fixing the values of θ_Q^bl, θ_P^bl, and θ_Q^0 in units of k_BT. The parameter θ_P^0 remains undetermined.

Measurements M were taken for ∼ 5 × 10⁴ mutant lac promoters S. These data were then used to fit a model for the sequence-dependent binding energy of CRP. This was done by maximizing I[Q; M]. Because of the diffeomorphic modes of Q, the matrix parameters θ_Q^bl were inferred up to an unknown scale and the additive constant θ_Q^0 was left undetermined. This is shown in Fig. 3B. Analogous results were obtained for RNAP (Fig. 3C).

Next, a full thermodynamic model of transcriptional regulation was proposed and fit to the data. Based on the hypothesized biophysical mechanism, the transcription rate T was assumed to depend on S via,

$$ T \propto \frac{R}{1 + R}, \qquad R = e^{-P}\, \frac{1 + e^{-Q - \gamma}}{1 + e^{-Q}}. \tag{20} $$

This quantity R is called the “regulation factor” of the promoter (Bintu et al., 2005). Because R is an invertible function of T, it serves equally well as the representation of the SRM system. In the following analysis we work with R instead of T due to its simpler functional form.

When the parameters of the linear filters P and Q were simultaneously fit to data by maximizing I[T; M] (or equivalently, maximizing I[R; M]), three of the four diffeomorphic modes described above were eliminated (Fig. 3D). Specifically, the overall scale of the parameters θ_Q^bl and θ_P^bl were fixed, allowing binding energy predictions for CRP and RNAP in physically meaningful units of k_BT. The parameter θ_Q^0, corresponding to the intracellular concentration of CRP, was also fixed by the data. The only diffeomorphic mode left unbroken was θ_P^0, corresponding to the intracellular concentration of RNAP.

We now show how the nonlinearity in R was able to break three of the four diffeomorphic modes of P and Q. First observe that any diffeomorphic mode of a linear-nonlinear filter must also be a diffeomorphic mode of each individual linear filter if, as here, the linear filters are independent functions of S. This means any diffeomorphic mode gⁱ of the full thermodynamic model for R must satisfy,

$$ g^i \partial_i R = (a_P + b_P P)\, \frac{\partial R}{\partial P} + (a_Q + b_Q Q)\, \frac{\partial R}{\partial Q} + a_\gamma\, \frac{\partial R}{\partial \gamma}, $$

for coefficients a_P, b_P, a_Q, b_Q, a_γ which do not depend on S. Evaluating the right-hand side derivatives and substituting for P in terms of Q and R we find,

$$ g^i \partial_i R = R \left\{ -a_P + b_P \left[ \log R - \log \frac{1 + e^{-Q-\gamma}}{1 + e^{-Q}} \right] + (a_Q + b_Q Q) \left[ \frac{e^{-Q}}{1 + e^{-Q}} - \frac{e^{-Q-\gamma}}{1 + e^{-Q-\gamma}} \right] - a_\gamma\, \frac{e^{-Q-\gamma}}{1 + e^{-Q-\gamma}} \right\}. $$

For gⁱ to be a diffeomorphic mode, the right-hand side must be independent of S for fixed R. The terms dependent on Q must therefore vanish, rendering b_P = a_Q = b_Q = a_γ = 0.⁸ Any diffeomorphic modes gⁱ must therefore satisfy gⁱ∂_iR = −a_P R. Thus only one mode remains, corresponding to an additive shift in the binding energy P.
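The surviving mode can be checked with a few lines of arithmetic. The sketch below assumes the regulation factor has the thermodynamic form R = e^{−P}(1 + e^{−Q−γ})/(1 + e^{−Q}), which is an assumption made here for illustration, as are the numerical values of γ, Q and the shift size. Three hypothetical promoters are constructed with different CRP energies Q but identical R. Shifting P changes all three values of R identically, so the new R is still a function of the old R alone. Shifting Q does not.

```python
import math


def regulation_factor(p, q, gamma):
    """Assumed thermodynamic form of the regulation factor (energies in k_B T)."""
    return math.exp(-p) * (1.0 + math.exp(-q - gamma)) / (1.0 + math.exp(-q))


def energy_p_for(r_target, q, gamma):
    """RNAP energy P that yields regulation factor r_target at CRP energy q."""
    return (-math.log(r_target)
            + math.log((1.0 + math.exp(-q - gamma)) / (1.0 + math.exp(-q))))


gamma = -3.0      # hypothetical CRP-RNAP interaction energy
r_target = 0.5    # common regulation factor for all three promoters
q_values = [-2.0, 0.0, 2.0]
pairs = [(energy_p_for(r_target, q, gamma), q) for q in q_values]

shift = 0.1
after_p_shift = [regulation_factor(p + shift, q, gamma) for p, q in pairs]
after_q_shift = [regulation_factor(p, q + shift, gamma) for p, q in pairs]

spread_p = max(after_p_shift) - min(after_p_shift)
spread_q = max(after_q_shift) - min(after_q_shift)
print(f"spread after shifting P: {spread_p:.2e}")
print(f"spread after shifting Q: {spread_q:.2e}")
```

With γ = 0 the assumed form reduces to R = e^{−P}, so Q drops out entirely and the argument above no longer applies, in line with the requirement that CRP and RNAP actually interact.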

5 Discussion

Likelihood-based inference masks the fundamentally different ways in which data constrain the parameters that lie along diffeomorphic modes versus those that lie along nondiffeomorphic modes. Standard likelihood inference constrains all model parameters, including both diffeomorphic and nondiffeomorphic modes, with error bars that scale as N^−1/2.⁹ These constraints will be consistent with the correct underlying filter θ^* when the correct noise function is used (Fig. 4A). However, use of an incorrect noise function will typically cause θ^* to fall outside the error bars inferred along both diffeomorphic and nondiffeomorphic modes (Fig. 4B).

Figure 4:

Schematic illustration of constraints placed on diffeomorphic and non-diffeomorphic modes by different objective functions. The dot in each panel represents the correct filter θ^*; shades of gray represent the posterior distribution p(θ|data). (A,B) Likelihood (Eq. 2) places tight constraints (scaling as N^−1/2 as N → ∞) along both diffeomorphic and nondiffeomorphic modes. (A) θ^* will typically lie within error bars if the correct noise function π^* is used. (B) However, if an incorrect noise function π′ is used, θ^* will generally violate inferred constraints along both diffeomorphic and nondiffeomorphic modes. (C) Marginal likelihood (Eq. 7) computed using a sufficiently weak prior p(π) will place tight constraints on nondiffeomorphic modes and weak constraints (scaling as N⁰ as N → ∞) along diffeomorphic modes. (D) Mutual information (Eq. 3) places tight constraints on nondiffeomorphic modes but provides no constraints whatsoever on diffeomorphic modes.

This problem is rectified if we use a prior p(π) that reflects our uncertainty about what the true noise function is. From Eq. 8 it can be seen that using the resulting marginal likelihood to compute a posterior distribution on θ will constrain diffeomorphic and nondiffeomorphic modes in fundamentally different ways (Fig. 4C). Nondiffeomorphic modes will be constrained by I(θ), which remains finite in the large N limit. This produces error bars on nondiffeomorphic modes comparable to those produced by likelihood when the correct noise function π^* is used. However, constraints along diffeomorphic modes will come only from Δ. Because Δ vanishes as N⁻¹,¹⁰ diffeomorphic constraints become independent of N once N is sufficiently large.

Fortunately, one does not need to posit a specific prior probability over all possible noise functions in order to confidently infer filters from SRM data. Using mutual information as an objective function instead of likelihood, i.e. sampling filters according to p(θ|data) ∝ e^{N I(θ)}, will constrain nondiffeomorphic modes the same way that marginal likelihood does while putting no constraints along diffeomorphic modes (Fig. 4D).

One might worry that a large fraction of filter parameters will be diffeomorphic, and that the analysis of SRM experiments will require an assumed noise function in order to obtain useful results even if doing so yields unreliable error bars. Such situations are conceivable, but in practice this is often not the case. We have shown that for linear filters, the number of diffeomorphic modes will typically not exceed dim(R)[dim(R) + 1] regardless of how large dim(S) is. Some of these diffeomorphic modes may also be eliminated if these linear filters are combined using a nonlinearity of known functional form. Indeed, of the 204 independent parameters comprising the biophysical model of transcriptional regulation inferred by Kinney et al. (2010), only one was diffeomorphic.

A bigger concern, perhaps, is the practical difficulty of using mutual information as an objective function. Specifically, it remains unclear how to compute I(θ) rapidly and reliably enough to confidently sample from p(θ|data) ∝ e^{N I(θ)}. Still, various methods for estimating mutual information are available (Khan et al., 2007; Panzeri et al., 2007), and the information optimization problem has been successfully implemented using a variety of techniques (Sharpee et al., 2004; Sharpee et al., 2006; Kinney et al., 2007, 2010; Melnikov et al., 2012). We believe the exciting applications of mutual-information-based inference provide compelling motivation for making progress on these practical issues.

6 Appendix A: marginal likelihood

In certain cases Δ(θ) can be computed explicitly and thereby be shown to vanish (Kinney et al., 2007). More generally, when π is taken to be finite-dimensional, a saddle-point computation (valid for large N) gives

$$ \Delta(\theta) \approx \frac{1}{2N} \log \det \mathcal{H}(\theta) - \frac{1}{N} \log p(\pi) + \mathrm{const}. $$

Here, 𝓗(θ) is the π-space Hessian of D(θ, π), and both 𝓗 and p(π) are computed using π(M|R) = p(M|R). If log p(π) and its derivatives are bounded, then the θ-dependent part of Δ(θ) decays as N⁻¹. If π is infinite dimensional, this saddle-point computation becomes a semiclassical computation in field theory akin to the density estimation problem studied by Bialek et al. (1996). If this field theory is properly formulated through an appropriate choice of p(π), then Δ(θ) may exhibit different decay behavior, but will still vanish as N → ∞. See also Rajan et al. (2013).

7 Appendix B: DPI-satisfying measures

DPI is satisfied by all measures of the F-information form (Csiszár & Shields, 2004; Kinney & Atwal, 2013),

$$ I_F[R; M] = \int dR\, dM\; p(R)\, p(M)\, F\!\left( \frac{p(R, M)}{p(R)\, p(M)} \right), $$

where F(x) is a convex function for x ≥ 0. Mutual information corresponds to F(x) = x log x whereas F(x) = (x^α − 1)/(α − 1) yields a more general “Rényi information” measure (Rényi, 1961) that reduces to mutual information in the limit α → 1. DPI-satisfying measures other than mutual information have been used for filter inference in a number of works, including Paninski (2003) and Kouh & Sharpee (2009). A discussion of the differences between DPI-satisfying measures and some non-DPI-satisfying measures can be found in Kinney & Atwal (2013).
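The Data Processing Inequality for F-information can be illustrated on a discrete example. The sketch below assumes the F-information is the sum over (R, M) of p(R) p(M) F(p(R, M) / [p(R) p(M)]), i.e. the standard f-divergence between the joint distribution and the product of its marginals. The joint table and the noisy channel that degrades R into R′ are hypothetical numbers. Since M → R → R′ is then a Markov chain, every convex F should give a smaller value for R′ than for R.

```python
import math


def f_information(p_joint, f):
    """F-information between R and M for a joint table p_joint[r][m]."""
    p_r = [sum(row) for row in p_joint]
    p_m = [sum(col) for col in zip(*p_joint)]
    return sum(p_r[i] * p_m[j] * f(p / (p_r[i] * p_m[j]))
               for i, row in enumerate(p_joint)
               for j, p in enumerate(row))


def f_mutual(x):
    """F(x) = x log x, which gives mutual information in nats."""
    return x * math.log(x) if x > 0 else 0.0


def f_renyi(alpha):
    """F(x) = (x^alpha - 1) / (alpha - 1), convex for alpha > 0."""
    return lambda x: (x ** alpha - 1.0) / (alpha - 1.0)


def degrade(p_joint, channel):
    """Joint table of (R', M) when R' is drawn from channel[r][r_prime]."""
    n_in, n_out, n_m = len(p_joint), len(channel[0]), len(p_joint[0])
    return [[sum(p_joint[r][m] * channel[r][k] for r in range(n_in))
             for m in range(n_m)]
            for k in range(n_out)]


p_joint = [[0.20, 0.05],
           [0.10, 0.15],
           [0.05, 0.20],
           [0.02, 0.23]]
# Hypothetical noisy channel: keep the value with probability 0.7,
# otherwise move to one of the other three values with equal probability.
channel = [[0.7 if i == k else 0.1 for k in range(4)] for i in range(4)]
p_degraded = degrade(p_joint, channel)

measures = {"mutual information": f_mutual,
            "Renyi, alpha = 2": f_renyi(2.0),
            "Renyi, alpha = 0.5": f_renyi(0.5)}
results = {name: (f_information(p_joint, f), f_information(p_degraded, f))
           for name, f in measures.items()}
for name, (before, after) in results.items():
    print(f"{name:20s} before = {before:.4f}   after = {after:.4f}")

near_one = f_information(p_joint, f_renyi(1.001))
```

The last line evaluates the Rényi-type measure just above α = 1, where it should be close to the mutual information in nats.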

8 Appendix C: DPI-optimality

Assume θ < θ^* by Eq. 14. Because R ← R^* → M is a Markov chain, the KL divergence between p(R^*|R, M) and p(R^*|R), averaged over R and M, can be decomposed as D(p(R^*|R, M)||p(R^*|R)) = I[R^*; M] − I[R; M]. If this quantity is zero, then R^* ← R → M is also a Markov chain, implying θ^* ≤ θ, a contradiction. This KL divergence must therefore be positive, i.e. I(θ) < I(θ^*). So if θ^* ∈ Θ_DPI, then θ ∈ Θ_DPI for every θ ∈ Θ_I as well. This proves Θ_I = Θ_DPI.

9 Appendix D: information equivalence

First we observe that if θ₁ and θ₂ make isomorphic predictions then they are information equivalent. This is readily shown from the fact that 𝒟[R; M] is invariant under arbitrary invertible transformations of R (Kinney & Atwal, 2013). Next we show the converse: if θ₁ and θ₂ are information equivalent, the predictions R₁ and R₂ must be isomorphic. Here is the proof. If θ₁ ≃ θ₂, then 𝒟[R₁; M] = 𝒟[R₂; M] for all 𝒟, and in particular I[R₁; M] = I[R₂; M]. In Appendix C we showed that I[R; M] = I[R^*; M] implies R^* ← R → M is a Markov chain. Imagining an SRM experiment in which θ^* = θ₁ and π(M|R) = δ(M − R), we find that R₁ ← R₂ → M is a Markov chain. This implies the mapping R₂ → R₁ is one-to-one. Similarly, R₁ → R₂ is one-to-one. R₁ and R₂ are therefore related by a bijection.

Acknowledgments

We thank William Bialek, Curtis Callan, Bud Mishra, Swagatam Mukhopadhyay, Anand Murugan, Michael Schatz, Bruce Stillman, and Gašper Tkačik for helpful conversations. Support for this project was provided by the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory.

Footnotes

¹ Such as stochastic gene expression (Elowitz et al., 2002).
² The notation I(θ) and I[R; M] will be used interchangeably.
³ E.g. p(π) does not vanish at the true noise function π^*.
⁴ The subscripts 1 and 2 label two different filters, not two parameters of a single filter.
⁵ E.g. if the various features f_i^μ(S) exhibit complicated interdependencies, either because of their functional form or because signals S are restricted to a particular subspace. We ignore such possibilities here.
⁶ Technically the number of diffeomorphic modes is the number of independent vector fields gⁱ that correspond to such transformations. However, here we consider only proper diffeomorphic modes, not gauge transformations; as in physics, we define gauge transformations to be vector fields gⁱ along which transformation of θ leaves all predicted representations invariant.
⁷ To fix the gauge freedoms of these filters, Kinney et al. (2010) adopted the convention that for all positions l.
⁸ This assumes γ ≠ 0, i.e. that CRP actually interacts with RNAP, which is the case.
⁹ In this discussion we ignore gauge parameters, which do not alter model predictions and are therefore non-identifiable.
¹⁰ More precisely, given any direction i in filter space, ∂_iΔ ∝ N⁻¹ for N large enough.

References

Bialek, W., Callan, C., & Strong, S. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77(23), 4693–4697.
Bintu, L., Buchler, N., Garcia, H., Gerland, U., Hwa, T., Kondev, J., & Phillips, R. (2005). Transcriptional regulation by the numbers: models. Curr. Opin. Genet. Dev., 15(2), 116–124.
Bishop, C. (2006). Pattern recognition and machine learning. Springer, New York.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Comput., 12(7), 1531–1552.
Cover, T. & Thomas, J. (1991). Elements of information theory. John Wiley & Sons, New York.
Csiszár, I. & Shields, P. C. (2004). Information theory and statistics: a tutorial. now Publishers, Hanover MA.
Elemento, O., Slonim, N., & Tavazoie, S. (2007). A universal framework for regulatory element discovery across all genomes and data types. Mol. Cell, 28(2), 337–350.
Elowitz, M. B., Levine, A. J., Siggia, E. D., & Swain, P. S. (2002). Stochastic gene expression in a single cell. Science, 297(5584), 1183–1186.
Khan, S., Bandyopadhyay, S., Ganguly, A., Saigal, S., Erickson III, D., Protopopescu, V., & Ostrouchov, G. (2007). Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E, 76(2), 026209.
Kinney, J. B., Tkačik, G., & Callan, C. G. (2007). Precise physical models of protein-DNA interaction from high-throughput data. Proc. Natl. Acad. Sci. USA, 104(2), 501–506.
Kinney, J. B. (2008). Biophysical models of transcriptional regulation from sequence data. PhD thesis, Princeton University.
Kinney, J. B., Murugan, A., Callan, C. G., & Cox, E. (2010). Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl. Acad. Sci. USA, 107(20), 9158–9163.
Kinney, J. B. & Atwal, G. S. (2013). Equitability, mutual information, and the maximal information coefficient. arXiv:1301.7745 [q-bio.QM].
Kouh, M. & Sharpee, T. (2009). Estimating linear-nonlinear models using Rényi divergences. Network, 20(2), 49–68.
Melnikov, A., Murugan, A., Zhang, X., Tesileanu, T., Wang, L., Rogov, P., Feizi, S., Gnirke, A., Callan, C. G., Kinney, J. B., Kellis, M., Lander, E. S., & Mikkelsen, T. S. (2012). Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat. Biotechnol., 30(3), 271–277.
Paninski, L. (2003). Convergence properties of three spike-triggered analysis techniques. Network, 14(3), 437–464.
Panzeri, S., Senatore, R., Montemurro, M. A., & Petersen, R. S. (2007). Correcting for the sampling bias problem in spike train information measures. J. Neurophysiol., 98(3), 1064–1072.
Pillow, J. W. & Simoncelli, E. P. (2006). Dimensionality reduction in neural models: an information-theoretic generalization of spike-triggered average and covariance analysis. J. Vis., 6(4), 414–428.
Rajan, K., Marre, O., & Tkačik, G. (2013). Learning quadratic receptive fields from neural responses to natural stimuli. Neural Comput., 25(7), 1661–1692.
Rényi, A. (1961). On measures of entropy and information. Proc. 4th Berkeley Symp. Math. Statist. and Prob. (Univ. of Calif. Press), 1, 547–561.
Schwartz, O., Pillow, J., Rust, N., & Simoncelli, E. (2006). Spike-triggered neural characterization. J. Vis., 6(4), 484–507.
Sharpee, T., Rust, N., & Bialek, W. (2004). Analyzing neural responses to natural signals: maximally informative dimensions. Neural Comput., 16(2), 223–250.
Sharpee, T., Sugihara, H., Kurgansky, A., Rebrik, S., Stryker, M., & Miller, K. (2006). Adaptive filtering enhances information transmission in visual cortex. Nature, 439(7079), 936–942.
Stormo, G. D. (2013). Introduction to protein-DNA interactions: structure, thermodynamics, and bioinformatics. Cold Spring Harbor Laboratory Press, New York.