Here I show a simple proof of how minimizing KL-divergence
maximizes mutual information.
This is a quick note to show, in clear notation, how the mutual
information within the model converges to that of the data as a consequence of
minimizing cross-entropy.
We are given the following:
\begin{aligned}
& A \overset{\theta}{\leftrightsquigarrow} X \overset{N}{\leftrightsquigarrow} Y \\[2ex]
\mathcal{I} &\equiv \text{domain of the "input"} \\
\mathcal{O} &\equiv \text{domain of the "output"} \\
X : \mathcal{I} &\equiv \text{input (part of the dataset)} \\
Y : \mathcal{O} &\equiv \text{label for } X \text{ (part of the dataset)} \\
A : \mathcal{O} &\equiv \text{predicted label for }X \\
\end{aligned}
The X \leftrightsquigarrow Y relationship is constant,
determined by Nature "N". The X \leftrightsquigarrow A
relationship is controlled by model parameters \theta and
changes during training. For clarity, I will notate all quantities
dependent on \theta using a subscript.
Since this is a supervised learning context, I(X;Y) = H(Y) -
H(Y|X) is usually "close to" H(Y), but the
results here apply regardless of the value of I(X;Y).
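As a quick sanity check of this decomposition (my own toy numbers, not part of the argument), consider a small joint distribution in which X nearly determines Y; then H(Y|X) is small and I(X;Y) comes out close to H(Y):

```python
import numpy as np

# Hypothetical joint distribution p(x, y): 3 inputs, 2 labels, X nearly determines Y.
p_xy = np.array([
    [0.32, 0.01],   # x = 0
    [0.01, 0.32],   # x = 1
    [0.33, 0.01],   # x = 2
])

p_x = p_xy.sum(axis=1)   # marginal p(X)
p_y = p_xy.sum(axis=0)   # marginal p(Y)

H_y = -np.sum(p_y * np.log(p_y))                            # H(Y)
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))   # H(Y|X)

print(f"H(Y)   = {H_y:.3f} nats")
print(f"H(Y|X) = {H_y_given_x:.3f} nats")
print(f"I(X;Y) = {H_y - H_y_given_x:.3f} nats")   # I(X;Y) = H(Y) - H(Y|X)
```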
Notation: One random variable represents only one distribution
I use a notation in which one random variable represents exactly one
distribution, but random variables may have the same domain. In
particular, A and Y share the same domain
\mathcal{O}, but represent distinct distributions.
Another common style is to make random variables domain-centric, that
is, to use one random variable for each domain that exists in the problem
statement. Then, one must use different operator letters to signify
different distributions.
For example, if we did not use A to represent model
predictions but overloaded Y instead, then
p(Y) would be ambiguous: does it mean the distribution of
labels in the dataset, or the distribution of model predictions? To
disambiguate these, different operator names are sometimes chosen, such as
p(Y) and q(Y). However, this method does
not allow combining different random variables in expressions
unambiguously, so I favor the first style.
With this "One RV, One Distribution" style, we have for example:
\begin{aligned}
\\
\mathrm{p}(Y) &\equiv \text{distribution of labels in the dataset} \\[1ex]
\mathrm{p}_\theta(A) &\equiv \text{distribution of labels assigned by the model} \\[1ex]
\mathrm{p}(Y|X) &\equiv \text{distribution of label } Y \text{ assigned to } X \\[1ex]
\mathrm{p}_{\theta}(A|X) &\equiv \text{distribution of label } A \text{ predicted by the model for } X \\
\end{aligned}
A perfectly trained model \mathrm{p}_{\theta}(A|X)
should attain the same shape as \mathrm{p}(Y|X) for every
x \in \mathcal{I}, so we can minimize either
D_{\theta}(Y\|A|X) or D_\theta(A \| Y |
X) to achieve that. Since we only have an empirical
sample from \mathrm{p}(X, Y), and can therefore only take expectations
under it, we minimize D_{\theta}(Y\|A|X):
\begin{aligned}
\\[1ex]
D_{\theta}(Y\|A|X) &= E_X E_{Y|X} \log \dfrac{\mathrm{p}(Y|X)}{\mathrm{p}_{\theta}(A\mathord{=}Y|X)} \\[1ex]
&= E_X E_{Y|X} [-\log \mathrm{p}_{\theta}(A\mathord{=}Y|X)] - E_X E_{Y|X} [-\log \mathrm{p}(Y|X)] \\[1ex]
&= H_{\theta}(Y \bullet A | X) - H(Y|X) \\
\end{aligned}
H_{\theta}(Y \bullet A | X) denotes the conditional cross-entropy (the \bullet distinguishes it from
the joint entropy H(Y, A | X)). Since H(Y|X) is a constant that does not depend on \theta, we can
minimize D_{\theta}(Y\|A|X) by minimizing H_{\theta}(Y \bullet A | X).
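This identity is easy to verify numerically. A small sketch of my own (arbitrary random conditionals, not the note's setup) that checks D_{\theta}(Y\|A|X) = H_{\theta}(Y \bullet A | X) - H(Y|X):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 4, 3   # |I| = 4 inputs, |O| = 3 labels

p_x = rng.dirichlet(np.ones(n_x))                       # p(X)
p_y_given_x = rng.dirichlet(np.ones(n_y), size=n_x)     # p(Y|X), one row per x
q_a_given_x = rng.dirichlet(np.ones(n_y), size=n_x)     # p_theta(A|X), the "model"

# D_theta(Y||A|X) = E_X E_{Y|X} log p(Y|X) / p_theta(A=Y|X)
kl = np.sum(p_x[:, None] * p_y_given_x * np.log(p_y_given_x / q_a_given_x))

# Conditional cross-entropy H_theta(Y . A|X) and conditional entropy H(Y|X)
cross_ent = -np.sum(p_x[:, None] * p_y_given_x * np.log(q_a_given_x))
cond_ent = -np.sum(p_x[:, None] * p_y_given_x * np.log(p_y_given_x))

assert np.isclose(kl, cross_ent - cond_ent)
print(kl, cross_ent - cond_ent)
```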
The empirical estimate of conditional cross entropy is derived as:
\begin{aligned}
\\[1ex]
H_{\theta}(Y \bullet A | X) &= E_X E_{Y|X} [-\log \mathrm{p}_{\theta}(A\mathord{=}Y|X)] \\[1ex]
\widetilde{H}_\theta(Y \bullet A|X) &= \dfrac{1}{N} \sum_{x_i \sim \mathcal{D}}^N { E_{Y|X\mathord{=}x_i} [-\log \mathrm{p}_\theta(A\mathord{=}Y|x_i)] } \\[1ex]
\end{aligned}
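In a real dataset \mathcal{D} we typically observe only one sampled label y_i per input x_i, so the inner expectation E_{Y|X\mathord{=}x_i} is itself estimated by that single sample and the estimator reduces to the familiar average negative log-likelihood. A minimal sketch (my own function and variable names):

```python
import numpy as np

def empirical_cross_entropy(log_p_a_given_x: np.ndarray, y: np.ndarray) -> float:
    """log_p_a_given_x: (N, |O|) model log-probabilities log p_theta(A|x_i), one row per x_i;
    y: (N,) sampled labels y_i.  Returns (1/N) * sum_i -log p_theta(A = y_i | x_i)."""
    n = len(y)
    return float(-log_p_a_given_x[np.arange(n), y].mean())

# Usage with made-up numbers: 5 samples, 3 labels.
logits = np.random.default_rng(1).normal(size=(5, 3))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))   # log-softmax
labels = np.array([0, 2, 1, 1, 0])
print(empirical_cross_entropy(log_probs, labels))
```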
Intuitively, it seems like the following chain of implications should hold
as the cross-entropy approaches its lower bound H(Y|X), that is, as
D_{\theta}(Y\|A|X) \rightarrow 0:
\begin{aligned}
\\[1ex]
H_{\theta}(Y \bullet A | X) \rightarrow H(Y|X) &\implies \mathrm{p}_\theta(A|X) \rightarrow \mathrm{p}(Y|X) \\[1ex]
&\implies H_\theta(A|X) \rightarrow H(Y|X) \\[1ex]
&\implies I_\theta(A;X) \rightarrow I(Y;X)
\end{aligned}
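To see the chain in action, here is a toy simulation of my own (not part of the proof): fit \mathrm{p}_\theta(A|X) on a small discrete \mathrm{p}(X, Y) by gradient descent on the conditional cross-entropy, and watch I_\theta(A;X) approach I(Y;X).

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 5, 3

p_x = rng.dirichlet(np.ones(n_x))                            # p(X)
p_y_given_x = rng.dirichlet(np.ones(n_y) * 0.3, size=n_x)    # fairly peaked p(Y|X)
p_xy = p_x[:, None] * p_y_given_x

def mutual_info(p_x, cond):
    """I = E_X KL( cond(.|x) || marginal over the output domain )."""
    marg = p_x @ cond
    return float(np.sum(p_x[:, None] * cond * np.log(cond / marg)))

I_data = mutual_info(p_x, p_y_given_x)    # I(Y;X), fixed by Nature

logits = np.zeros((n_x, n_y))             # theta: one row of logits per x
for step in range(2001):
    q = np.exp(logits - logits.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)     # p_theta(A|X) via softmax
    if step % 500 == 0:
        ce = -np.sum(p_xy * np.log(q))    # conditional cross-entropy H_theta(Y . A|X)
        print(f"step {step:4d}  H(Y.A|X) = {ce:.4f}  "
              f"I_theta(A;X) = {mutual_info(p_x, q):.4f}  I(Y;X) = {I_data:.4f}")
    # Gradient of H_theta(Y . A|X) w.r.t. the logits in row x is p(x) * (q(.|x) - p(Y|x)).
    logits -= 5.0 * p_x[:, None] * (q - p_y_given_x)
```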
The reverse implication does not hold. If we simply push the mutual
information I_\theta(A;X) upward somehow, this will not
automatically lead to good predictions. The reason is that mutual
information is invariant to any one-to-one remapping of elements in a
variable's domain, while cross-entropy (and KL-divergence) is not. What
the forward implication does say is that, in order to predict well, the
model must acquire the same amount of mutual information as is present in the data.
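A concrete way to see this asymmetry (again a toy sketch of my own): permuting the model's output labels leaves I_\theta(A;X) unchanged but ruins the cross-entropy, so high mutual information alone does not imply good predictions.

```python
import numpy as np

p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],     # p(Y|X): X nearly determines Y
                        [0.1, 0.9]])

def mutual_info(p_x, cond):
    marg = p_x @ cond
    return float(np.sum(p_x[:, None] * cond * np.log(cond / marg)))

def cond_cross_entropy(p_x, p_cond, q_cond):
    return float(-np.sum(p_x[:, None] * p_cond * np.log(q_cond)))

q_matched = p_y_given_x            # a model that matches p(Y|X) exactly
q_permuted = q_matched[:, ::-1]    # the same model with the two labels swapped

print("I_theta(A;X):   matched =", mutual_info(p_x, q_matched),
      " permuted =", mutual_info(p_x, q_permuted))        # identical
print("H_theta(Y.A|X): matched =", cond_cross_entropy(p_x, p_y_given_x, q_matched),
      " permuted =", cond_cross_entropy(p_x, p_y_given_x, q_permuted))   # much worse
```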