Here I show a simple proof of how minimizing KL-divergence
maximizes mutual information.
This is a quick note to show, in clear notation, how the mutual
information within the model converges to that of the data as a consequence of
minimizing cross-entropy.
We are given the following:
\begin{aligned}
& A \overset{\theta}{\leftrightsquigarrow} X \overset{N}{\leftrightsquigarrow} Y \\[2ex]
\mathcal{I} &\equiv \text{domain of the "input"} \\
\mathcal{O} &\equiv \text{domain of the "output"} \\
X : \mathcal{I} &\equiv \text{input (part of the dataset)} \\
Y : \mathcal{O} &\equiv \text{label for } X \text{ (part of the dataset)} \\
A : \mathcal{O} &\equiv \text{predicted label for }X \\
\end{aligned}
The X \leftrightsquigarrow Y relationship is constant,
determined by Nature "N". The X \leftrightsquigarrow A
relationship is controlled by model parameters \theta and
changes during training. For clarity, I will notate all quantities
dependent on \theta using a subscript.
Since this is a supervised learning context, I(X;Y) = H(Y) -
H(Y|X) is usually "close to" H(Y), but the
results here apply regardless of the value of I(X;Y).
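As a quick sanity check of this decomposition (my own toy numbers, not part of the argument), consider a small joint distribution in which X nearly determines Y; then H(Y|X) is small and I(X;Y) comes out close to H(Y):

```python
import numpy as np

# Hypothetical joint distribution p(x, y): 3 inputs, 2 labels, X nearly determines Y.
p_xy = np.array([
    [0.32, 0.01],   # x = 0
    [0.01, 0.32],   # x = 1
    [0.33, 0.01],   # x = 2
])

p_x = p_xy.sum(axis=1)   # marginal p(X)
p_y = p_xy.sum(axis=0)   # marginal p(Y)

H_y = -np.sum(p_y * np.log(p_y))                            # H(Y)
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))   # H(Y|X)

print(f"H(Y)   = {H_y:.3f} nats")
print(f"H(Y|X) = {H_y_given_x:.3f} nats")
print(f"I(X;Y) = {H_y - H_y_given_x:.3f} nats")   # I(X;Y) = H(Y) - H(Y|X)
```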
Notation: One random variable represents only one distribution
I use a notation in which one random variable represents exactly one
distribution, but random variables may have the same domain. In
particular, A and Y share the same domain
\mathcal{O}, but represent distinct distributions.
Another common style is to make random variables domain-centric, that
is, to use one random variable for each domain that exists in the problem
statement. Then, one must use different operator letters to signify
different distributions.
For example, if we did not use A to represent model
predictions but overloaded Y instead, then
p(Y) would be ambiguous: does it mean the distribution of
labels in the dataset, or the distribution of model predictions? To
disambiguate these, different operator names are sometimes chosen, such as
p(Y) and q(Y). However, this method does
not allow combining different random variables in expressions
unambiguously, so I favor the first style.
With this "One RV, One Distribution" style, we have for example:
\begin{aligned}
\\
\mathrm{p}(Y) &\equiv \text{distribution of labels in the dataset} \\[1ex]
\mathrm{p}_\theta(A) &\equiv \text{distribution of labels assigned by the model} \\[1ex]
\mathrm{p}(Y|X) &\equiv \text{distribution of label } Y \text{ assigned to } X \\[1ex]
\mathrm{p}_{\theta}(A|X) &\equiv \text{distribution of label } A \text{ predicted by the model for } X \\
\end{aligned}
A perfectly trained model \mathrm{p}_{\theta}(A|X)
should attain the same shape as \mathrm{p}(Y|X) for every
x \in \mathcal{I}, so we can minimize either
D_{\theta}(Y\|A|X) or D_\theta(A \| Y |
X) to achieve that. Since we only have an empirical
sample from \mathrm{p}(X, Y), and can therefore only take expectations
under it, we minimize D_{\theta}(Y\|A|X):
\begin{aligned}
\\[1ex]
D_{\theta}(Y\|A|X) &= E_X E_{Y|X} \log \dfrac{\mathrm{p}(Y|X)}{\mathrm{p}_{\theta}(A\mathord{=}Y|X)} \\[1ex]
&= E_X E_{Y|X} [-\log \mathrm{p}_{\theta}(A\mathord{=}Y|X)] - E_X E_{Y|X} [-\log \mathrm{p}(Y|X)] \\[1ex]
&= H_{\theta}(Y \bullet A | X) - H(Y|X) \\
\end{aligned}
H_{\theta}(Y \bullet A | X) denotes the conditional cross-entropy (the \bullet distinguishes it from
the joint entropy H(Y, A | X)). Since H(Y|X) is a constant that does not depend on \theta, we can
minimize D_{\theta}(Y\|A|X) by minimizing H_{\theta}(Y \bullet A | X).
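This identity is easy to verify numerically. A small sketch of my own (arbitrary random conditionals, not the note's setup) that checks D_{\theta}(Y\|A|X) = H_{\theta}(Y \bullet A | X) - H(Y|X):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 4, 3   # |I| = 4 inputs, |O| = 3 labels

p_x = rng.dirichlet(np.ones(n_x))                       # p(X)
p_y_given_x = rng.dirichlet(np.ones(n_y), size=n_x)     # p(Y|X), one row per x
q_a_given_x = rng.dirichlet(np.ones(n_y), size=n_x)     # p_theta(A|X), the "model"

# D_theta(Y||A|X) = E_X E_{Y|X} log p(Y|X) / p_theta(A=Y|X)
kl = np.sum(p_x[:, None] * p_y_given_x * np.log(p_y_given_x / q_a_given_x))

# Conditional cross-entropy H_theta(Y . A|X) and conditional entropy H(Y|X)
cross_ent = -np.sum(p_x[:, None] * p_y_given_x * np.log(q_a_given_x))
cond_ent = -np.sum(p_x[:, None] * p_y_given_x * np.log(p_y_given_x))

assert np.isclose(kl, cross_ent - cond_ent)
print(kl, cross_ent - cond_ent)
```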
The empirical estimate of conditional cross entropy is derived as:
\begin{aligned}
\\[1ex]
H_{\theta}(Y \bullet A | X) &= E_X E_{Y|X} [-\log \mathrm{p}_{\theta}(A\mathord{=}Y|X)] \\[1ex]
\widetilde{H}_\theta(Y \bullet A|X) &= \dfrac{1}{N} \sum_{x_i \sim \mathcal{D}}^N { E_{Y|X\mathord{=}x_i} [-\log \mathrm{p}_\theta(A\mathord{=}Y|x_i)] } \\[1ex]
\end{aligned}
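In a real dataset \mathcal{D} we typically observe only one sampled label y_i per input x_i, so the inner expectation E_{Y|X\mathord{=}x_i} is itself estimated by that single sample and the estimator reduces to the familiar average negative log-likelihood. A minimal sketch (my own function and variable names):

```python
import numpy as np

def empirical_cross_entropy(log_p_a_given_x: np.ndarray, y: np.ndarray) -> float:
    """log_p_a_given_x: (N, |O|) model log-probabilities log p_theta(A|x_i), one row per x_i;
    y: (N,) sampled labels y_i.  Returns (1/N) * sum_i -log p_theta(A = y_i | x_i)."""
    n = len(y)
    return float(-log_p_a_given_x[np.arange(n), y].mean())

# Usage with made-up numbers: 5 samples, 3 labels.
logits = np.random.default_rng(1).normal(size=(5, 3))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))   # log-softmax
labels = np.array([0, 2, 1, 1, 0])
print(empirical_cross_entropy(log_probs, labels))
```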
Intuitively, it seems like the following chain of implications should hold
as the cross-entropy approaches its lower bound H(Y|X), that is, as
D_{\theta}(Y\|A|X) \rightarrow 0:
\begin{aligned}
\\[1ex]
H_{\theta}(Y \bullet A | X) \rightarrow H(Y|X) &\implies \mathrm{p}_\theta(A|X) \rightarrow \mathrm{p}(Y|X) \\[1ex]
&\implies H_\theta(A|X) \rightarrow H(Y|X) \\[1ex]
&\implies I_\theta(A;X) \rightarrow I(Y;X)
\end{aligned}
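To see the chain in action, here is a toy simulation of my own (not part of the proof): fit \mathrm{p}_\theta(A|X) on a small discrete \mathrm{p}(X, Y) by gradient descent on the conditional cross-entropy, and watch I_\theta(A;X) approach I(Y;X).

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 5, 3

p_x = rng.dirichlet(np.ones(n_x))                            # p(X)
p_y_given_x = rng.dirichlet(np.ones(n_y) * 0.3, size=n_x)    # fairly peaked p(Y|X)
p_xy = p_x[:, None] * p_y_given_x

def mutual_info(p_x, cond):
    """I = E_X KL( cond(.|x) || marginal over the output domain )."""
    marg = p_x @ cond
    return float(np.sum(p_x[:, None] * cond * np.log(cond / marg)))

I_data = mutual_info(p_x, p_y_given_x)    # I(Y;X), fixed by Nature

logits = np.zeros((n_x, n_y))             # theta: one row of logits per x
for step in range(2001):
    q = np.exp(logits - logits.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)     # p_theta(A|X) via softmax
    if step % 500 == 0:
        ce = -np.sum(p_xy * np.log(q))    # conditional cross-entropy H_theta(Y . A|X)
        print(f"step {step:4d}  H(Y.A|X) = {ce:.4f}  "
              f"I_theta(A;X) = {mutual_info(p_x, q):.4f}  I(Y;X) = {I_data:.4f}")
    # Gradient of H_theta(Y . A|X) w.r.t. the logits in row x is p(x) * (q(.|x) - p(Y|x)).
    logits -= 5.0 * p_x[:, None] * (q - p_y_given_x)
```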
The reverse implication does not hold. If we simply push the mutual
information I_\theta(A;X) upward somehow, this will not
automatically lead to good predictions. The reason is that mutual
information is invariant to any one-to-one remapping of elements in a
variable's domain, while cross-entropy (and KL-divergence) is not. What
the forward implication does say is that, in order to predict well, the
model must acquire the same amount of mutual information as is present in the data.
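A concrete way to see this asymmetry (again a toy sketch of my own): permuting the model's output labels leaves I_\theta(A;X) unchanged but ruins the cross-entropy, so high mutual information alone does not imply good predictions.

```python
import numpy as np

p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],     # p(Y|X): X nearly determines Y
                        [0.1, 0.9]])

def mutual_info(p_x, cond):
    marg = p_x @ cond
    return float(np.sum(p_x[:, None] * cond * np.log(cond / marg)))

def cond_cross_entropy(p_x, p_cond, q_cond):
    return float(-np.sum(p_x[:, None] * p_cond * np.log(q_cond)))

q_matched = p_y_given_x            # a model that matches p(Y|X) exactly
q_permuted = q_matched[:, ::-1]    # the same model with the two labels swapped

print("I_theta(A;X):   matched =", mutual_info(p_x, q_matched),
      " permuted =", mutual_info(p_x, q_permuted))        # identical
print("H_theta(Y.A|X): matched =", cond_cross_entropy(p_x, p_y_given_x, q_matched),
      " permuted =", cond_cross_entropy(p_x, p_y_given_x, q_permuted))   # much worse
```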