Autoencoder with additional Predictive Coding

I present a modified VAE that combines the prior and recognition model into a single circuit and trains it with a predictive coding loss in addition to the usual reconstruction loss through the decoder.

Kingma and Welling derive a training objective for the variational autoencoder. In my opinion, the derivation can be understood a little more easily than in their original presentation, as follows.

They start with a data set X, assumed to be distributed according to a model p_\theta(z) p_\theta(x|z), and the goal is to estimate \theta. At this point, the only useful property of learning this relationship is that the model can then be sampled to generate new draws from p_\theta(x). Nothing is claimed about any useful properties of z itself.

Although they make the point that the prior p_\theta(z) can in principle be complex, they work through an example in which it is an isotropic standard multivariate normal distribution. Because this is a smooth distribution, one can visualize how the complex output space X changes as z changes gradually. Beyond that, the practical significance is just the ability to create samples.

For the moment, let's accept that learning \theta is a worthy goal, and follow their argument about how to do so. The first step is to minimize the cross-entropy \mathbb{E}_{x \sim Data}[-\log p_\theta(x)] (equivalently, to maximize the expected log-likelihood). Calculating the required p_\theta(x) using

p_\theta(x) = \mathbb{E}_{z \sim p_\theta(z)}[p_\theta(x|z)] \approx \dfrac{1}{L} \sum_{l=1}^{L} p_\theta(x|z^{(l)}), \quad z^{(l)} \sim p_\theta(z)

is intractable because, in general, the region of z-space where p_\theta(x|z) is large will be small, and we do not know where that region is. So we can never draw enough samples of z to make the approximation accurate.
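To see why, here is a minimal NumPy sketch (my own toy example, not from the paper) with a sharply peaked 1-D likelihood; even a thousand prior samples give an estimate whose standard deviation rivals its mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D toy model: p(z) = N(0, 1), p(x|z) = N(x; 2z, 0.1^2).
# The likelihood is sharply peaked in z: only z near x/2 contributes.
def likelihood(x, z, sigma=0.1):
    return np.exp(-0.5 * ((x - 2 * z) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 4.0  # an observed point; it needs z near 2.0, two prior std-devs out
ref = likelihood(x, rng.standard_normal(10_000_000)).mean()  # brute-force reference

for L in (10, 100, 1000):
    est = [likelihood(x, rng.standard_normal(L)).mean() for _ in range(20)]
    print(f"L={L:4d}  mean={np.mean(est):.2e}  std={np.std(est):.2e}  ref={ref:.2e}")
```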

But, there is a nice identity:

p_\theta(x) \equiv \dfrac{p_\theta(x, z)}{p_\theta(z|x)}

which is true for any value of z. But now the problem is that we cannot calculate p_\theta(z|x).

That density is defined implicitly, or "induced", as I like to think of it: p_\theta(z|x) \equiv p_\theta(z) p_\theta(x|z) / \int p_\theta(z') p_\theta(x|z') \, dz'. We cannot calculate it (the normalizing integral is exactly the intractable p_\theta(x)), but it is helpful to know that it is a well-defined entity, determined by our computable expressions p_\theta(z) and p_\theta(x|z).
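In a small discrete toy model (my own illustration; all numbers here are hypothetical), the induced posterior can be computed exactly, and the identity p_\theta(x) = p_\theta(x, z) / p_\theta(z|x) can be checked for every z:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discrete toy: 4 latent states, 6 observable symbols.
p_z = rng.dirichlet(np.ones(4))             # p(z), shape (4,)
p_x_given_z = rng.dirichlet(np.ones(6), 4)  # p(x|z), shape (4, 6)

p_xz = p_z[:, None] * p_x_given_z           # joint p(x, z)
p_x = p_xz.sum(axis=0)                      # marginal p(x)
p_z_given_x = p_xz / p_x                    # the induced posterior p(z|x)

# The identity p(x) = p(x, z) / p(z|x) holds for every z, not just on average:
x = 3
for z in range(4):
    assert np.isclose(p_xz[z, x] / p_z_given_x[z, x], p_x[x])
```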

Instead, what Kingma and Welling do is introduce a distribution q_\phi(z|x) and set up an objective so that it is forced to be closer and closer to the induced distribution p_\theta(z|x). They rewrite the above as an approximation times an error factor:

p_\theta(x) \equiv \dfrac{p_\theta(x, z)}{q_\phi(z|x)} \dfrac{q_\phi(z|x)}{p_\theta(z|x)}

or

\log p_\theta(x) \equiv \log \dfrac{p_\theta(x, z)}{q_\phi(z|x)} + \log \dfrac{q_\phi(z|x)}{p_\theta(z|x)}

Note that, even though the right-hand side contains terms parameterized by \phi, the left-hand side does not depend on \phi; the effects of \phi cancel between the two factors, which is why \phi does not appear on the left.

So now we have two terms: the first is computable and serves as an approximation; the second is a correction factor, or error term. Again, the identity holds for any z, no matter where it comes from.

The next idea conveniently puts together four facts. First, after the log transformation, the product of the two factors becomes a sum of logs. Second, because the full expression is constant over z, we can take its expectation over any distribution of z.

\log p_\theta(x) \equiv \mathbb{E}_{z \sim p_{any}(z)}[\log \dfrac{p_\theta(x, z)}{q_\phi(z|x)} + \log \dfrac{q_\phi(z|x)}{p_\theta(z|x)}]

Third, if we specifically choose q_\phi(z|x) as that distribution, then the second expectation becomes a KL-divergence, which is always nonnegative.

\begin{aligned}
\log p_\theta(x) &\equiv \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \dfrac{p_\theta(x, z)}{q_\phi(z|x)} + \log \dfrac{q_\phi(z|x)}{p_\theta(z|x)}\right] && \text{Expectation of a constant = the constant}\\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \dfrac{p_\theta(x, z)}{q_\phi(z|x)}\right] + \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \dfrac{q_\phi(z|x)}{p_\theta(z|x)}\right] && \text{Separate out the non-computable term} \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \dfrac{p_\theta(x, z)}{q_\phi(z|x)}\right] + D_{KL}[q_\phi(z|x) \| p_\theta(z|x)] && \text{Recognize the KL-divergence} \\
&\ge \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \dfrac{p_\theta(x, z)}{q_\phi(z|x)}\right] && \text{Nonnegativity of the KL-divergence}
\end{aligned}
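As a concrete sketch, here is a single-sample estimator of this lower bound for the usual Gaussian-prior, Bernoulli-likelihood VAE; `encoder` and `decoder` are assumed torch modules with the interfaces described in the docstring:

```python
import torch
import torch.nn.functional as F

def elbo(x, encoder, decoder):
    """One-sample estimate of the lower bound E_q[log p(x,z) - log q(z|x)].

    Assumed interfaces: encoder(x) -> (mu, log_var) of the Gaussian
    q_phi(z|x); decoder(z) -> Bernoulli logits of p_theta(x|z); the prior
    p_theta(z) is a standard normal.
    """
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)       # reparameterization trick

    # E_q[log p_theta(x|z)], estimated with the single sample z
    recon = -F.binary_cross_entropy_with_logits(decoder(z), x, reduction="sum")

    # D_KL[q_phi(z|x) || N(0, I)] in closed form (Gaussian-Gaussian KL)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return recon - kl   # log p(x,z)/q = log p(x|z) + log p(z) - log q(z|x)
```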

Intuitively, what is happening is that each term inside the expectation is pushed up. Recall that \dfrac{d}{dx} \log x = \dfrac{1}{x}. So the smaller q_\phi(z|x) is, the steeper the gradient pushing it down; meanwhile, the smaller p_\theta(x, z) is, the steeper the gradient pushing it up.

Finally, since one of the two expectations is always nonnegative, the other is a lower bound on the full expression. Conveniently, we can compute the value of that first term. Even though we cannot compute the KL-divergence term, that does not prevent us from maximizing the full expression, since we can simply maximize a lower bound on it.

Recall that the full expression given above has the same value for any z in the domain. But for any particular choice of z, the first term (the approximation) will vary, and the second term (the correction factor) will vary in the opposite direction, so that their sum stays constant.

First, the approximation will be most accurate for z values at which both q_\phi(z|x) and p_\theta(z|x) are high, since the correction factor is then close to zero. This isn't really a mathematically rigorous statement, but intuitively, the ratio is less sensitive to absolute differences when both values are high. Second, if we calculate an expectation over samples from q_\phi(z|x):

\begin{aligned} \mathbb{E}_{z \sim q_\phi(z|x)}[ \log \dfrac{p_\theta(x, z)}{q_\phi(z|x)} + \log \dfrac{q_\phi(z|x)}{p_\theta(z|x)}] \end{aligned}

then the second term becomes the always-nonnegative KL-divergence between q and the distribution it is approximating. It is important to interpret this formula correctly. Remember that the term p_\theta(z|x) is not directly modeled, but rather "induced", and that it is not computable even from the modeled distributions it is induced from. It's not obvious at first, but the two terms are connected in such a way that maximizing the first term drives the second to zero. At first this puzzled me: the second term is the always-nonnegative KL-divergence D_{KL}(q_\phi(z|x)||p_\theta(z|x)), so it would appear that some force might try to increase it. The resolution is that the left-hand side \log p_\theta(x) is constant in \phi, so with respect to \phi, any increase in the first term is exactly offset by a decrease in the KL term; maximizing the computable term and minimizing the KL-divergence are the same operation.

With that settled, we can calculate gradients of the bound with respect to \theta and \phi. So training works, and we learn all three modeled distributions p_\theta(z), p_\theta(x|z), and q_\phi(z|x), which become consistent with a single underlying joint distribution over (x, z). And, given enough capacity and data, the learned marginal p_\theta(x) approaches the data distribution.
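Using the `elbo` sketch above, one gradient step on both parameter sets might look like this (continuing with the same assumed `encoder` and `decoder` modules):

```python
import torch

# One step: gradients flow to theta (decoder) and phi (encoder) through
# the reparameterized sample inside elbo().
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss = -elbo(x, encoder, decoder)  # maximize the bound = minimize its negative
opt.zero_grad()
loss.backward()
opt.step()
```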

Practice error

From the above, we have expressed the learning objective in terms of a code prediction error and a reconstruction error. The code prediction error corresponds to the human experience of constantly monitoring moment-to-moment expectations of what comes next, either passively or when we take action. The reconstruction error corresponds roughly to a human who perceives some data x in the sensory domain and then, based on that perception z, produces another x in the data domain. In the way we train an autoencoder, the model is forced (teacher forcing) to reproduce the same x, and we measure and optimize the density it assigns.

I'd like to introduce a new objective, called practice error, which simulates the human experience of practicing. By "practicing", I mean that the human has an idea of some desirable behavior, which he understands only through his representation of the sense perceptions resulting from that behavior. The trial-and-error process of practicing consists of:

  1. Obtain (through observation or imagination) some representation \hat{z} of a desirable observation
  2. Generate x^{(i)} \sim p_\theta(x|\hat{z}) through bodily action and the resulting interaction with environment and senses
  3. Interpret the result through your senses: z^{(j)} \sim q_\phi(z|x^{(i)})
  4. Measure the error between \hat{z} and z^{(j)}
  5. Based on the error, adjust the behavior generation p_\theta(x|z) and/or the perception q_\phi(z|x)
  6. Repeat

Mapping this process onto an autoencoder might look like an inverted reconstruction (here r_\phi denotes the gated recognition model introduced below, with the gate open):

\mathbb{E}_{x^{(i)} \sim p_\theta(x|\hat{z})}[\log r_\phi(\hat{z}|x^{(i)}, g = \text{open})]

Each x^{(i)} is a separate trial, representing a particular bodily action, an interaction with the environment, and the resulting sensory data. Once x^{(i)} is generated, in this scheme, instead of generating a z^{(j)} and measuring some distance, we simply try to maximize the expected log-density assigned to \hat{z}. This is not the same procedure as step five above, but both should produce the same result, because XXX
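A minimal sketch of this objective, assuming hypothetical interfaces `decoder.rsample` (a reparameterized draw from p_\theta(x|z), so gradients reach \theta) and `recognizer.log_prob` (log r_\phi(z|x) with the gate open):

```python
import torch

def practice_loss(z_hat, decoder, recognizer, n_trials=8):
    """Sketch of the practice objective:
    maximize E_{x ~ p_theta(x | z_hat)}[ log r_phi(z_hat | x, g=open) ].

    Assumed (hypothetical) interfaces: decoder.rsample(z) draws a
    reparameterized x ~ p_theta(x|z), so gradients reach theta;
    recognizer.log_prob(z, x) returns log r_phi(z|x) with the gate open.
    """
    total = 0.0
    for _ in range(n_trials):            # each trial is one "bodily action"
        x = decoder.rsample(z_hat)       # act, observe the resulting sensory data
        total = total + recognizer.log_prob(z_hat, x)
    return -total / n_trials             # negated, since optimizers minimize
```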

In the above, \hat{z} was obtained "through observation or imagination". If the goal is to imitate a particular observation \hat{x}, then it makes sense to run the experiment over multiple z's that arise from it. This would then be:

\mathbb{E}_{z^{(j)} \sim r_\phi(z|\hat{x}, g = \text{open}) \atop x^{(k)} \sim p_\theta(x|z^{(j)})} [ \log r_\phi(z^{(j)}|x^{(k)}, g = \text{open})]
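Continuing the sketch above, the imitation variant, with an additional assumed `recognizer.rsample(x)` that draws a reparameterized z ~ r_\phi(z|x, g=open):

```python
def imitation_loss(x_hat, decoder, recognizer, n_z=4, n_trials=4):
    """Imitation variant: average the practice loss over several encodings
    z_j ~ r_phi(z | x_hat, g=open) of the observation being imitated.
    Assumes a hypothetical recognizer.rsample(x) that draws such a sample."""
    total = 0.0
    for _ in range(n_z):
        z_j = recognizer.rsample(x_hat)
        total = total + practice_loss(z_j, decoder, recognizer, n_trials)
    return total / n_z
```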

What is a plausible model, inspired by the VAE, of how the brain processes speech?

I have been considering the VAE model in the context of unsupervised speech representation learning, using WaveNet as the decoder, where X is speech audio data and z is a time-indexed latent meant to represent something like a "phoneme". From that point of view, I want to map the various cognitive processes and experiences onto the expressions p_\theta(z), q_\phi(z|x), and p_\theta(x|z). I settled on the following:

First, I note that z and x are time-indexed, so they should be written z \equiv [z_1, z_2, ..., z_T] and x \equiv [x_1, x_2, ..., x_T]. Second, I interpret the process of silently imagining words in your head as sampling from p(z), in which the brain maintains a current state that summarizes the previously generated z_{t-i}'s. Thus it is written autoregressively as p(z_t | z_{t-k}, ..., z_{t-1}).

The process of pronouncing these imagined phonemes and words involves another circuit that takes this generated state and translates it into motor output for the tongue, larynx, vocal cords, jaw, etc. This could be modeled as p_\theta(m_t | z_t), where we also have a fixed, unparameterized function, determined by one's body itself, that translates the motor commands into actual sound: x_t = f_{body}(m_t). The full model p_\theta(x_t | z_t) would then be the distribution obtained by pushing p_\theta(m_t|z_t) through f_{body}. An important point here is that m_t is much lower-dimensional than x_t, since relatively few muscles are involved while the sound waves produced have very high temporal resolution.
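A sketch of this factorization (all module shapes and sizes are my own hypothetical choices, and `f_body` stands in for the fixed vocal apparatus):

```python
import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    """Sketch of the factorization above; all sizes are hypothetical."""
    def __init__(self, z_dim=64, m_dim=12, k=8):
        super().__init__()
        self.k = k
        # p(z_t | z_{t-k}, ..., z_{t-1}): the autoregressive "imagining" circuit
        self.prior = nn.GRU(z_dim, z_dim, batch_first=True)
        self.prior_out = nn.Linear(z_dim, 2 * z_dim)  # mean and log-variance
        # p_theta(m_t | z_t): few motor channels, so m_dim << dim(x_t)
        self.motor = nn.Linear(z_dim, m_dim)

    def imagine_step(self, z_context):
        """One step of silent imagining: parameters of p(z_t | recent z's).
        z_context: (batch, >=k, z_dim)."""
        h, _ = self.prior(z_context[:, -self.k:])
        mu, log_var = self.prior_out(h[:, -1]).chunk(2, dim=-1)
        return mu, log_var

    def pronounce(self, z_t, f_body):
        """x_t = f_body(m_t), where f_body is the fixed, unparameterized
        vocal apparatus that turns motor commands into sound."""
        m_t = self.motor(z_t)
        return f_body(m_t)
```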

The peculiar thing now is imagining the process of listening to speech. It seems very likely that the same circuits involved in speech imagining, which we modeled as p(z_t | z_{t-k}, ..., z_{t-1}), must also be involved in listening.

What about the process of listening itself? The intuition is that listening (or any experience) is a predictive activity. Based on recent past inputs, we are constantly generating predictions about what might come next, and, moment to moment, the hypothesis is that our brains compare these predictions with the encoding derived from the immediate sensory information. So, if I were to model this, it would be a recognition circuit augmented with autoregressive context, as q_\phi(z_t | z_{t-k}, ..., z_{t-1}, x_t).

But if the imagining process and the listening process both involve moment-to-moment integration of recent past state, isn't it likely they share the same circuits? The problem is that the VAE formulation doesn't allow that, because they are two separate models (I think). But I believe we can combine them using an attentional trick, as follows:

Allow for a gating mechanism, mediated by inhibitory neurons, which admits or blocks the signal from the sensory processing circuit to the recurrent circuit. The sensory processing circuit produces a representation z^{(s)}_t = f_{sense}(x_t), which is compared to the autoregressively predicted z_t to produce an error term e_t. The new setup is now:

\begin{aligned}
z &\equiv [z_1, ..., z_T] \\
p(z) &\equiv \prod_{t=k+1}^T { p_\theta(z_t | z_{t-k}, ..., z_{t-1}, e_{t-1}, g_{t-1} = \text{shut}) } \\
p(z|x) &\equiv \prod_{t=k+1}^T { p_\theta(z_t | z_{t-k}, ..., z_{t-1}, e_{t-1}, g_{t-1} = \text{open}) } \\
e_{t-1} &= f_{error}(z_{t-1}, z_{t-1}^{(s)}) && \text{error} \\
z_{t-1}^{(s)} &= f_{sense}(x_{t-1}) && \text{sense-derived encoding}
\end{aligned}

The error function could be a circuit that is always on but upstream of the gating mechanism. If the gate is open, two things happen. First, the autoregressive circuit incorporates the error signal and updates itself to try to minimize the error. Second, the error provides additional conditioning input, along with z_{t-k}, ..., z_{t-1}, which affects the prediction of the latent variable at the next timestep.

If the gate is shut, the error signal is ignored. This is akin to a person imagining words while ignoring other people talking: the sounds enter the person's ears, and some lower-level circuits have no choice but to process them, but an attention mechanism prevents them from affecting the person's train of thought. The prediction of the latent variable at the next timestep then incorporates only the previous latent variables as context, and there is no training signal.
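Here is a minimal sketch of such a gated cell, with one recurrent core serving both modes; f_{error} is taken to be a simple difference, and all sizes are hypothetical:

```python
import torch
import torch.nn as nn

class GatedPredictiveCell(nn.Module):
    """One recurrent circuit serving as both prior (gate shut) and
    recognition model (gate open); a sketch, all sizes hypothetical."""
    def __init__(self, z_dim=64, x_dim=256):
        super().__init__()
        self.sense = nn.Linear(x_dim, z_dim)      # z_s = f_sense(x)
        self.core = nn.GRUCell(2 * z_dim, z_dim)  # shared autoregressive core
        self.out = nn.Linear(z_dim, 2 * z_dim)    # mean, log-variance of z_t

    def step(self, h, z_prev, x_prev=None, gate_open=False):
        if gate_open and x_prev is not None:
            z_s = self.sense(x_prev)
            e = z_prev - z_s                  # f_error: predicted minus sensed
        else:
            e = torch.zeros_like(z_prev)      # gate shut: the error never arrives
        h = self.core(torch.cat([z_prev, e], dim=-1), h)
        mu, log_var = self.out(h).chunk(2, dim=-1)
        return h, mu, log_var                 # parameters of p(z_t | context, e, g)
```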

Crucially, the system would have to train in such a way that, for any given z-context z_{t-k}, ..., z_{t-1}, the distribution of e_t arising from the data has zero mean, so that the gate-shut prior stays consistent with the aggregate of the gate-open recognition distributions over the data:

p(z) \propto \sum_{x \in X}{p(z|x)}

We then stipulate that the circuit learns over time to make prediction errors that are zero-mean. The problem this solves is that, in the VAE formulation, the prior and recognition model are completely separate; I had wondered how one could combine them in a probabilistically consistent way, and the gated circuit above is my proposal.

To recap how I arrived at it: taking the insight from McAllester and van den Oord (ref, predictive coding), I imagined that the activity of listening to spoken speech, recognizing phonemes, words, and phrases in real time, involves the brain maintaining a current state that lets it use information from previous latent states, while another circuit processes the current x_t and generates an "actual" z_t against which the prediction is compared.

And the first thing that had occurred to me frames the other half: sampling from the prior p(z) is how I think about the mental process of imagining (but not pronouncing) words.