How much information does quantization or noise destroy?

Here I compare the loss of information due to quantization with the loss due to added noise.

Quantizing a latent representation and adding noise to it are two methods to "reduce information", or create a "bottleneck". This note attempts to show how that works, and how the two are comparable in terms of the mutual information between the original variable and its reduced version.

Motivation

Deep learning models sometimes employ quantization or noise injection as a means of regularization. One of the effects of such an operation is to discard information: by the data processing inequality, the Markov chain X \leftrightsquigarrow \text{reduce}(X) \leftrightsquigarrow Y can yield no higher a value of I(X;Y) than the direct chain X \leftrightsquigarrow Y. Seen against the backdrop of Tishby's Information Bottleneck principle, we can ask: what is the mutual information I(X; \text{reduce}(X)), where \text{reduce}(X) is either quantization or the addition of noise?

Although all random variables in an implemented model are effectively discrete, we consider both the continuous and the discrete case for theoretical completeness. To avoid confusion: we are considering the mutual information between a variable and its reduced transformation. One could also ask what the difference I(X; X) - I(X; \text{reduce}(X)) is, as a way of comparing a model that uses a form of information reduction against one that does not.

Setup

Assume the following (rather contrived) variables:

\begin{aligned} X \in \Bbb R &\sim U[0, 8) \\[1ex] Q \in \{0, \tfrac{1}{2}, 1, \ldots, \tfrac{15}{2}\} &= \dfrac{\lfloor 2X \rfloor}{2} & \text{"quantized": X rounded down to the nearest } \tfrac{1}{2} \\[1ex] N \in \Bbb R &= \left(X + U[0, \tfrac{1}{2})\right) \bmod 8 & \text{X with added noise in } [0, \tfrac{1}{2}) \text{, wrapped} \end{aligned}

Q rounds X down to the nearest multiple of \tfrac{1}{2}, while N adds noise in [0, \tfrac{1}{2}), wrapping around in case the sum leaves X's domain. Thus X and N are uniform on [0, 8), and Q is uniform over its 16 possible values.
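For concreteness, here is a minimal sketch of this setup in NumPy. The names sample_x, quantize, and add_noise are purely illustrative, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x(n):
    # X ~ U[0, 8)
    return rng.uniform(0.0, 8.0, size=n)

def quantize(x):
    # Q: X rounded down to the nearest 1/2
    return np.floor(x * 2.0) / 2.0

def add_noise(x):
    # N: add U[0, 1/2) noise, wrapping around the [0, 8) domain
    return (x + rng.uniform(0.0, 0.5, size=x.shape)) % 8.0

x = sample_x(1_000_000)
q = quantize(x)
n = add_noise(x)
```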

From these definitions we have the following properties. For clarity I notate quantities on discrete variables with \Sigma and quantities on continuous variables with \int. As I discuss below, these two kinds of entropy cannot be mixed; however, each mutual information below is computed as a difference of two quantities of the same type.

Quantization Mutual Information

Here is the calculation of mutual information due to quantization.

\begin{aligned} H_{\int}(X) &= 3 \\ H_{\int}(X|Q) &= -1 \\ I_{\int}(X;Q) &= H_{\int}(X) - H_{\int}(X|Q) = 4 \end{aligned}

Knowing Q in every case narrows the value of X down to an interval of length \tfrac{1}{2} with uniform probability, which has differential entropy -1, so the mutual information is 3 - (-1) = 4 bits.
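Concretely, for any particular quantized value q, the conditional distribution of X is uniform on [q, q + \tfrac{1}{2}), so

H_{\int}(X \mid Q = q) = H_{\int}\big(U[q, q + \tfrac{1}{2})\big) = \log_2 \tfrac{1}{2} = -1 \quad \text{for every } q.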

We can calculate this the other way as well:

\begin{aligned} H_\Sigma(Q) &= 4 \\ H_\Sigma(Q|X) &= 0 \\ I_\Sigma(X;Q) &= H_\Sigma(Q) - H_\Sigma(Q|X) = 4 \end{aligned}

Q has entropy 4 bits because it is uniformly distributed over 16 classes, and knowing X determines Q exactly, making the conditional distribution one-hot, so again we get a mutual information of 4 bits. Unlike the previous calculation, this one uses discrete entropies.
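As a sanity check, here is a rough Monte Carlo sketch of the discrete calculation, assuming NumPy; the printed value is what the analysis above predicts, not a measured benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 8.0, size=1_000_000)  # X ~ U[0, 8)
q = np.floor(x * 2.0) / 2.0                # Q: X rounded down to the nearest 1/2

# Q is a deterministic function of X, so H_Sigma(Q|X) = 0 and I(X;Q) = H_Sigma(Q).
# Plug-in estimate of H_Sigma(Q) from sample frequencies, in bits:
_, counts = np.unique(q, return_counts=True)
p = counts / counts.sum()
print(-(p * np.log2(p)).sum())             # should print roughly 4.0
```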

Noise Mutual Information

Adding noise has a similar effect. We again start with a differential entropy of 3 for X, and conditioning on N narrows X to a uniform interval of length \tfrac{1}{2}, just as in the quantized case.

\begin{aligned} H_{\int}(X) &= 3 \\ H_{\int}(X|N) &= -1 \\ I_{\int}(X;N) &= 4 \end{aligned}

Calculating the other way, conditioning on X narrows the noise variable N to a uniform interval of length \tfrac{1}{2}, with differential entropy -1, so we again get a mutual information of 4.

\begin{aligned} H_{\int}(N) &= 3 \\ H_{\int}(N|X) &= -1 \\ I_{\int}(X;N) &= 4 \end{aligned}
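The continuous case can be checked numerically as well, if only approximately. The sketch below (again assuming NumPy) bins X and N on a fine grid and computes the plug-in mutual information of the binned pair; binning can only discard information, so the estimate should land slightly below the exact 4 bits and approach it as the bins shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 8.0, size=2_000_000)            # X ~ U[0, 8)
n = (x + rng.uniform(0.0, 0.5, size=x.shape)) % 8.0  # N: X plus wrapped U[0, 1/2) noise

bins = np.linspace(0.0, 8.0, 257)                    # 256 bins of width 1/32
joint, _, _ = np.histogram2d(x, n, bins=[bins, bins])
pxy = joint / joint.sum()                            # joint distribution of the binned pair
px = pxy.sum(axis=1, keepdims=True)                  # marginal of binned X
py = pxy.sum(axis=0, keepdims=True)                  # marginal of binned N
nz = pxy > 0
print(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))  # roughly 3.9 bits
```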

Comparison of Quantization and Noise

So quantizing down to the nearest \tfrac{1}{2} produces a relationship with the same mutual information as adding uniform noise over an interval of the same width. In general, the smaller the quantization interval or noise interval, the higher the mutual information; in the limit of vanishing width we approach the maximal mutual information, as the formula below makes explicit.
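More generally, if X is uniform on an interval of length L, and we either quantize to bins of width \Delta (with \Delta dividing L evenly) or add wrapped uniform noise of width \Delta, the same reasoning as above gives

I(X; \text{reduce}_\Delta(X)) = \log_2 \frac{L}{\Delta} = 3 - \log_2 \Delta \quad \text{for } L = 8,

which grows without bound as \Delta \to 0.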

Varying the density of a quantization codebook, or the variance of the noise added per data point, can allow a network to learn how much mutual information to preserve on a per-sample basis.

An inconsistency between discrete and continuous ("differential") entropy formulas

As an aside, for continuous random variables this limit is infinite, which is not consistent with the identity I(X;X) = H(X): a deterministic mapping from one continuous variable to another yields infinite mutual information between them. In practice there are no such variables - everything is quantized.

Discrete entropy is formulated as:

H_\Sigma(X) \equiv - \sum_{i=1}^{N}{p(x_i) \log(p(x_i))}

It is bounded in [0, \log(N)].

Differential entropy is formulated as:

H_{\int}(X) \equiv - \int_{\mathcal{X}}{p(x) \log(p(x)) \,dx}

It is bounded in [-\infty, \log(|\mathcal{X}|)], where |\mathcal{X}| is the measure (length or volume) of the support \mathcal{X}.

Both attain their maximal value for uniform distributions. Discrete entropy attains its minimum of 0 for any one-hot distribution, while H_{\int} takes its minimum of -\infty for any distribution with support of measure zero.
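As a concrete instance of these bounds, and of the values used earlier, a uniform distribution on an interval of length a has differential entropy \log_2 a, which is negative whenever a < 1:

H_{\int}\big(U[0, a)\big) = \log_2 a, \qquad \text{e.g. } H_{\int}\big(U[0, 8)\big) = 3, \quad H_{\int}\big(U[0, \tfrac{1}{2})\big) = -1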

The Shannon coding interpretation of entropy is the minimum expected codeword length for symbols sampled i.i.d. from the distribution. Unfortunately, H_{\int} doesn't conform to this, because it would take infinitely many bits to represent an arbitrary value in a continuous domain. The notion of entropy as self-information, H(X) = I(X;X), also fails to hold, but in a serendipitous way: I(X;X) is calculated by subtracting entropies, H_{\int}(X) - H_{\int}(X|X), and since the second term is -\infty, we indeed get \infty, which is the correct value for the self-information of a continuous variable.