Githegi's Pen

Essays and articles about things that interest me

Quadratic Mutual Information (in brief)

Information theory is great – in theory. In practice, many of the quantities are tricky to compute. Most of them revolve around entropy

\[ H(X) = -\sum_{i=1}^N p_i \log p_i, \]

which is fine for discrete systems for which we know all possible states and the distribution over those states. Most often this is far from the case, so we have to be clever about how we use these measures.

Thankfully, the work of past clever people cascades down to the rest of us, and we can reap the efforts of their cleverness. In 1994, Jagat Narain Kapur published a treatise, Measures of Information and their Application [1], in which he formulated alternative forms of mutual information with varying uses. Of note to us is what later developers of the theory called Quadratic Mutual Information (QMI) [2]. The idea is that replacing the Kullback-Leibler (KL) divergence in MI with a ‘quadratic divergence’ results in a form that’s much easier to estimate in a differentiable way. The trade-off is that the raw QMI quantity isn’t meaningful the same way MI is, but here’s the kicker: the same distributions that maximise QMI also maximise MI. So we can optimise QMI, a differentiable and computationally-cheap-to-estimate quantity, as a proxy for MI. How great!

Deriving QMI

Both Jose Principe’s book Information Theoretic Learning [2] and Kari Torkkola’s paper [3] have well-expounded derivations of QMI and its estimator. Here I’ll re-elaborate on what has already been done there.

Mutual information between two random variables \(X\) and \(Y\) can be thought of as the KL-divergence between their joint distribution and the product of their marginals, \[ I(X; Y) = D_{KL}(P_{X,Y}\|P_XP_Y) = \iint p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\:dx\,dy. \]

With QMI we take the KL-divergence and replace it with something that looks like a squared Euclidean distance,

\[ I_Q(X; Y) = \iint(p(x,y) - p(x)p(y))^2\:dxdy. \]

Expanding this we get a three-term quadratic-looking expression,

\begin{align*} I_Q(X; Y) = \iint p(x,y)^2 - 2p(x,y)p(x)p(y) + p(x)^2p(y)^2\:dx\,dy.\tag{1} \end{align*}

And so there it is: quadratic mutual information. Yet it’s still not so useful in practice. We’re still stuck with those gnarly integrals and density functions we have to somehow estimate. In comes the second step: the QMI estimator.

The QMI estimator

First, we use Kernel Density Estimation (KDE) to approximate the density functions. Given a dataset of \(N\) samples \(\{(x_i,y_i)\,|\,i=1,\dots,N\}\), we estimate the densities using a Gaussian kernel with bandwidth \(\sigma\):

\begin{align*} \hat{p}_{X,Y}(x, y) &= \frac{1}{N}\sum_{i=1}^N G_\sigma(x-x_i)G_\sigma(y-y_i)\tag{2}\\
\hat{p}_X(x) &= \frac{1}{N}\sum_{i=1}^NG_\sigma(x-x_i),\tag{3}\\
\hat{p}_Y(y) &= \frac{1}{N}\sum_{i=1}^NG_\sigma(y-y_i),\tag{4} \end{align*}

where \(G_\sigma(x)\) is the Gaussian density function with mean \(\mu=0\) and standard deviation \(\sigma\). Before we plug the KDE estimates into (1), we note the following property of convolved Gaussian functions:

\[ \int G_\sigma(z-x')G_\sigma(z-y')\:dz = G_{\sigma\sqrt{2}}(x' - y'). \]
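To convince ourselves of this identity, here’s a quick numerical check (a throwaway sketch; the grid, test values, and helper `G` are all illustrative):

```python
import numpy as np

def G(d, s):
    """Univariate Gaussian pdf with mean 0 and standard deviation s."""
    return np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

sigma, xp, yp = 0.7, 0.3, -0.5       # arbitrary bandwidth and offsets
z = np.linspace(-12, 12, 200001)     # fine grid for the integral over z
f = G(z - xp, sigma) * G(z - yp, sigma)

lhs = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z))  # trapezoidal rule
rhs = G(xp - yp, sigma * np.sqrt(2))

print(abs(lhs - rhs))  # should be vanishingly small
```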

This neat simplification (the reason we use the Gaussian kernel in particular) vanquishes the scary integrals of \((1)\), leaving us with pleasant sums-of-sums that we can compute exactly. In this manner, each of the three terms in \((1)\) becomes an information potential (IP): the joint-, marginal-, and cross-information potentials respectively, \[ \hat{I}_Q(X; Y) = \hat{V}_J + \hat{V}_M - 2\hat{V}_C, \]

where the IPs work out to:

\begin{align*} \hat{V}_J &= \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N G_{\sigma\sqrt2}(x_i - x_j)G_{\sigma\sqrt2}(y_i - y_j)\tag{5}\\
\hat{V}_M &= \bigg(\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N G_{\sigma\sqrt2}(x_i-x_j)\bigg)\bigg(\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N G_{\sigma\sqrt2}(y_i-y_j)\bigg)\tag{6}\\
\hat{V}_C &= \frac{1}{N}\sum_{i=1}^N \bigg( \frac{1}{N}\sum_{j=1}^N G_{\sigma\sqrt{2}}(x_i - x_j) \bigg) \bigg( \frac{1}{N}\sum_{j=1}^N G_{\sigma\sqrt{2}}(y_i - y_j) \bigg).\tag{7} \end{align*}


If you squint a little you’ll see that \((5)-(7)\) revolve around inter-sample differences \(x_i-x_j\) and \(y_i-y_j\). This means we can compute all of these once and just re-use them for each IP, bringing the time complexity of QMI into the \(\mathcal{O}(N^2)\) range. Let \(X \in \mathbb{R}^{N\times N}\) be the kernelised difference matrix for the \(x\)-samples, such that \(X_{ij} = G_{\sigma\sqrt{2}}(x_i - x_j)\), and similarly for \(Y\). We can then rewrite the IPs with Einstein notation:

\begin{align*} \hat{V}_J &= N^{-2} X_{ij}Y_{ij}\\
\hat{V}_M &= N^{-4} (\mathbf{1}_{ij}X_{ij})(\mathbf{1}_{kl}Y_{kl}) \\
\hat{V}_C &= N^{-3} (\mathbf{1}_jX_{ij})(\mathbf{1}_kY_{ik}), \end{align*}

where summation over the indices is implied and \(\mathbf{1}\) is the vector/matrix of all ones (either/or implied by the number of indices); \(\mathbf{1}_{ij}A_{ij}\) is shorthand for summing over all the elements in \(A_{ij}\). The notation might be a little obtuse at first but it serves to show how simple the calculations really are.
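As a sanity check on the notation, the three contractions map one-to-one onto numpy.einsum. Here’s a sketch (the random data and bandwidth are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)
N, s = len(x), 1.0 * np.sqrt(2)  # sigma = 1; kernel bandwidth sigma*sqrt(2)

G = lambda d: np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
X, Y = G(x[:, None] - x), G(y[:, None] - y)  # kernelised difference matrices

Vj = np.einsum('ij,ij->', X, Y) / N**2                   # joint IP
Vm = np.einsum('ij->', X) * np.einsum('ij->', Y) / N**4  # marginal IP
Vc = np.einsum('ij,ik->', X, Y) / N**3                   # cross IP

qmi = Vj + Vm - 2 * Vc  # always >= 0: it's an integral of a square
```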

The above translates to Python quite smoothly:

import numpy as np

def qmi(x, y, sigma=1.0):
    """Compute QMI.

    x and y: length-N ndarrays (samples from random variables X and Y).
    sigma: KDE kernel bandwidth.
    """
    # Elementwise univariate Gaussian pdf with bandwidth sigma*sqrt(2)
    s = sigma * np.sqrt(2)
    G = lambda d: np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

    # Compute difference matrices
    X = G(x[:, None] - x)
    Y = G(y[:, None] - y)

    # Compute information potentials
    Vj = (X * Y).mean()
    Vm = X.mean() * Y.mean()
    Vc = (X.mean(axis=1) * Y.mean(axis=1)).mean()

    # QMI
    return Vj + Vm - 2 * Vc
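As a quick sanity check, QMI should come out larger for dependent samples than for independent ones. A self-contained sketch (the compact `qmi` here just re-implements the function above so the snippet runs on its own; the data is illustrative):

```python
import numpy as np

def qmi(x, y, sigma=1.0):
    # Compact QMI estimator: joint + marginal - 2 * cross potentials
    s = sigma * np.sqrt(2)
    G = lambda d: np.exp(-d**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    X, Y = G(x[:, None] - x), G(y[:, None] - y)
    return (X * Y).mean() + X.mean() * Y.mean() \
        - 2 * (X.mean(axis=1) * Y.mean(axis=1)).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=500)
dependent = qmi(x, x + 0.1 * rng.normal(size=500))  # y is mostly x
independent = qmi(x, rng.normal(size=500))          # y unrelated to x

print(dependent > independent)  # dependence should raise QMI
```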


Here we’ve gone over the QMI formulation where both \(x\) and \(y\) are continuous-valued. This form is the most general and, fortunately, the most straightforward. In his 2002 paper, Torkkola [3] formulated QMI for the discrete-continuous case. There, one variable was class labels and the other a continuous (possibly multivariate) value. In that context QMI becomes a way to do feature extraction: maximising the QMI between a transformation \(z = f_\theta(x)\) and the class labels, w.r.t. parameters \(\theta\), puts \(z\) in a space where same-class samples cluster and differing-class ones are far apart – effectively, classification.

QMI has been used this way in neuroscience [4,5]. The binary class labels become the presence/absence of a spike in a single neuron, \(x\) is whichever stimulus was used, and the transformation (often linear) projects the stimulus into a space where its samples are separated by whether they evoked a spike or not. Thus, the parameters of the transformation are the neuron’s receptive field: the stimulus that it’s sensitive to.

Though useful so far, QMI, as far as I can tell, seems to be relatively untapped. I mean, this is an easy way to optimise mutual information; we could use it in loads of places. (Caveat: the particular way QMI is optimised is important, particularly how the bandwidth \(\sigma\) is adjusted. This is worth talking about in a future article.) For instance, we could apply the same receptive-field mapping procedure to neurons/layers in a deep neural net and do something like feature visualisation. It would be great to see this used more.


[1] Kapur, J.N. (1994). Measures of information and their applications.

[2] Xu, D., Erdogmus, D. (2010). Renyi’s Entropy, Divergence and Their Nonparametric Estimators. In: Information Theoretic Learning. Information Science and Statistics.

[3] Torkkola, K. (2002). On feature extraction by mutual information maximization.

[4] Mu, Z., Nikolic, K., Schultz, S.R. (2021). Quadratic Mutual Information estimation of mouse dLGN receptive fields reveals asymmetry between ON and OFF visual pathways.

[5] Katz, M., Viney, T., Nikolic, K. (2016). Receptive Field Vectors of Genetically-Identified Retinal Ganglion Cells Reveal Cell-Type-Dependent Visual Functions.