Predictive coding is one of the oldest techniques in signal processing for data compression.

One of the most common strategies for unsupervised learning has been to predict future, missing or contextual information.

Mutual Information

  1. data: $x$
  2. context: $c$

$$I(x; c) = \sum_{x,c} p(x, c) \log\frac{p(x|c)}{p(x)}$$
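
As a small numerical check of the definition above, the sketch below evaluates $I(x; c)$ for a discrete joint distribution. The dictionary layout `{(x, c): p(x, c)}`, the function name `mutual_information`, and the two toy distributions are illustrative assumptions, not part of the original notes. It uses the identity $p(x|c)/p(x) = p(x, c)/(p(x)p(c))$ and the natural logarithm, so the result is in nats.

```python
import math

def mutual_information(joint):
    """I(x; c) in nats for a joint distribution given as {(x, c): p(x, c)}."""
    # Marginals p(x) and p(c), obtained by summing the joint.
    p_x, p_c = {}, {}
    for (x, c), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_c[c] = p_c.get(c, 0.0) + p
    total = 0.0
    for (x, c), p in joint.items():
        if p > 0.0:  # terms with p(x, c) = 0 contribute nothing
            # p(x|c) / p(x) equals p(x, c) / (p(x) * p(c))
            total += p * math.log(p / (p_x[x] * p_c[c]))
    return total

# Toy example 1: x and c are independent fair coins.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Toy example 2: x is fully determined by c.
dependent = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

In the independent case the context tells us nothing about the data and the sum is zero. In the fully dependent case it equals $\log 2$, the entropy of a fair coin.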


Contrastive Predictive Coding (CPC) aims to learn representations that separate the target future frame $x_{i+n}$ from randomly sampled negative frames $\tilde{x}$, given a context $h_i = (x_1, x_2, …, x_i)$.
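
One way to make "separating the target from the negatives" concrete is a softmax classification loss over one positive and several negative candidates. The sketch below rests on assumptions the notes do not spell out: the context $h_i$ is taken to be already summarized as a single vector, candidates are scored with a plain dot product, and the names `contrastive_loss`, `context`, `target` and `negatives` are hypothetical. Treat it as an illustration of the idea, not as the exact CPC objective.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(context, target, negatives):
    """Cross-entropy of picking the true target among target + negatives."""
    # Score every candidate against the context; the target sits at index 0.
    scores = [dot(context, target)] + [dot(context, neg) for neg in negatives]
    # Numerically stable log-sum-exp over all candidate scores.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    # Negative log-probability assigned to the true target.
    return log_norm - scores[0]

# Hypothetical 2-dimensional vectors, chosen by hand for illustration.
context = [1.0, 0.0]
target = [2.0, 0.0]                      # well aligned with the context
negatives = [[0.0, 1.0], [-1.0, 0.5]]    # poorly aligned with the context

loss_aligned = contrastive_loss(context, target, negatives)
# Treat a poorly aligned frame as the "target" instead: the loss should rise.
loss_swapped = contrastive_loss(context, negatives[0], [target, negatives[1]])
```

The loss is small when the true future frame scores higher than the negatives and grows when a negative outscores it, which is the pressure that shapes the learned representation.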


The methodology of the APC model is largely inspired by language models (LMs) for text, each of which is typically a probability distribution over sequences of $N$ tokens $(t_1, t_2, …, t_N)$. Given such a sequence, an LM assigns a probability $P(t_1, t_2, …, t_N)$ to the whole sequence by modeling the probability of token $t_k$ given the history $(t_1, t_2, …, t_{k-1})$.
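
Written out, the factorization described above is the chain rule of probability. It is a standard identity, not something specific to APC, and for $k = 1$ the history is empty:

$$P(t_1, t_2, …, t_N) = \prod_{k=1}^{N} P(t_k \mid t_1, t_2, …, t_{k-1})$$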


It is trained by minimizing the negative log-likelihood:

$$\sum_{k=1}^{N} -\log P(t_k \mid t_1, t_2, …, t_{k-1}; \theta_t, \theta_{rnn}, \theta_s)$$

where the parameters to be optimized are $\theta_t$, $\theta_{rnn}$ and $\theta_s$. $\theta_t$ is a look-up table that maps each token into a vector of fixed dimensionality. $\theta_{rnn}$ is a Recurrent Neural Network (RNN) used to summarize the sequence history up to the current time step. $\theta_s$ is a Softmax layer appended at the output of each RNN time step for estimating the probability distribution over the tokens. Language modeling is a general task that requires understanding many aspects of language in order to perform well.
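
To show how the three components fit together, here is a minimal sketch that computes the negative log-likelihood of one token sequence. Everything concrete in it is an assumption made for illustration: the single-layer tanh recurrence, the zero initial state standing in for the empty history, the random untrained weights, the toy dimensions, and the names `sequence_nll`, `random_matrix`, `theta_t`, `theta_rnn` and `theta_s` as Python variables.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def log_softmax(logits):
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def sequence_nll(tokens, theta_t, theta_rnn, theta_s):
    """Sum over k of -log P(t_k | t_1, ..., t_{k-1}) for one token sequence."""
    W_in, W_rec = theta_rnn
    h = [0.0] * len(W_rec)  # zero state stands in for the empty history
    nll = 0.0
    for t in tokens:
        # Softmax layer: distribution over tokens given the history so far.
        log_probs = log_softmax(matvec(theta_s, h))
        nll -= log_probs[t]
        # Look-up table: token id -> embedding vector.
        e = theta_t[t]
        # RNN: fold the current token into the history summary.
        h = [math.tanh(a + b) for a, b in zip(matvec(W_in, e), matvec(W_rec, h))]
    return nll

rng = random.Random(0)
vocab_size, embed_dim, hidden_dim = 5, 3, 4   # toy sizes, chosen arbitrarily
theta_t = random_matrix(vocab_size, embed_dim, rng)
theta_rnn = (random_matrix(hidden_dim, embed_dim, rng),
             random_matrix(hidden_dim, hidden_dim, rng))
theta_s = random_matrix(vocab_size, hidden_dim, rng)

nll = sequence_nll([0, 3, 1, 4], theta_t, theta_rnn, theta_s)
```

Training would adjust all three parameter groups to lower this quantity over a corpus. The sketch only evaluates it, since gradient computation is outside what the notes describe.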

In other words, given an utterance represented as a sequence of acoustic feature vectors $(x_1, x_2, …, x_T)$, the RNN processes each sequence element $x_t$ one at a time and outputs a prediction $y_t$, where $x_t$ and $y_t$ have the same dimensionality.
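
The frame-by-frame processing described above can be sketched as follows. The recurrence (a single tanh layer followed by a linear projection back to the feature dimension), the random weights, and the toy sizes are assumptions for illustration. The `shifted_l1` helper is also an assumption: the notes do not state the training objective, so it is shown only as one plausible way to compare each $y_t$ with a frame $n \geq 1$ steps ahead.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rnn_predict(frames, W_in, W_rec, W_out):
    """Process frames x_1..x_T one at a time; return predictions y_1..y_T."""
    h = [0.0] * len(W_rec)
    outputs = []
    for x in frames:
        h = [math.tanh(a + b) for a, b in zip(matvec(W_in, x), matvec(W_rec, h))]
        # Project back to the feature dimension so y_t matches x_t in size.
        outputs.append(matvec(W_out, h))
    return outputs

def shifted_l1(frames, outputs, n):
    """Mean absolute error between y_t and x_{t+n}, n >= 1 (assumed objective)."""
    pairs = list(zip(outputs[:-n], frames[n:]))
    total = sum(abs(y_i - x_i) for y, x in pairs for y_i, x_i in zip(y, x))
    return total / (len(pairs) * len(frames[0]))

rng = random.Random(0)
feat_dim, hidden_dim, num_frames = 3, 4, 6    # toy sizes, chosen arbitrarily
frames = [[rng.uniform(-1.0, 1.0) for _ in range(feat_dim)]
          for _ in range(num_frames)]
W_in = random_matrix(hidden_dim, feat_dim, rng)
W_rec = random_matrix(hidden_dim, hidden_dim, rng)
W_out = random_matrix(feat_dim, hidden_dim, rng)

outputs = rnn_predict(frames, W_in, W_rec, W_out)
loss = shifted_l1(frames, outputs, n=2)
```

Each prediction depends only on frames up to the current time step, which mirrors how a language model conditions each token on its history.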