• character-based speech recognition : the acoustic model outputs characters (letters) directly, rather than phonemes or words

Some Tips

We explore unsupervised pre-training for speech recognition by learning representations of raw audio.

wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training.

We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task.

Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
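WER (word error rate) is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the scoring tool used in the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and the first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # a reference word is missing from the hypothesis
            insertion = curr[j - 1] + 1  # the hypothesis contains an extra word
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A relative reduction such as the 36% above compares two WER values: (baseline WER - new WER) / baseline WER.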

Our model, wav2vec, is a convolutional neural network that takes raw audio as input and computes a general representation that can be input to a speech recognition system.

The objective is a contrastive loss that requires distinguishing a true future audio sample from negatives.

Unlike previous work (van den Oord et al., 2018), we move beyond frame-wise phoneme classification and apply the learned representations to improve strong supervised ASR systems. wav2vec relies on a fully convolutional architecture, which can be easily parallelized over time on modern hardware, in contrast to the recurrent models used in previous work.

We introduce wav2vec, the first application of unsupervised pre-training to speech recognition with a fully convolutional model. Our approach achieves 2.43% WER on the test set of WSJ, a result that outperforms the next best known character-based speech recognition model in the literature (Amodei et al., 2016) while using two orders of magnitude less transcribed training data.

• pre-training improves resource-poor setups
• pre-training also helps in settings where all WSJ training data is used

Objective

Our model takes raw audio signal as input and then applies two networks.

• encoder network : embeds the audio signal in a latent space
• context network : combines multiple time-steps of the encoder to obtain contextualized representations

Given raw audio samples $x_i \in \mathcal{X}$, we apply the encoder network $f : \mathcal{X} \to \mathcal{Z}$, parameterized as a five-layer convolutional network.
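To get a feel for what a five-layer convolutional encoder does to the time axis, the sketch below computes how many latent frames an unpadded stack of strided 1-D convolutions produces, along with its receptive field and total stride. The kernel sizes and strides are assumptions taken from the wav2vec paper and are not stated in these notes; the function names are hypothetical.

```python
# Assumed encoder hyperparameters (from the paper, not from these notes)
KERNELS = (10, 8, 4, 4, 4)
STRIDES = (5, 4, 2, 2, 2)


def output_length(num_samples, kernels=KERNELS, strides=STRIDES):
    """Number of latent frames for an unpadded stack of 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


def receptive_field(kernels=KERNELS, strides=STRIDES):
    """Return (receptive field, total stride), both measured in input samples."""
    field, jump = 1, 1
    for kernel, stride in zip(kernels, strides):
        field += (kernel - 1) * jump  # each layer widens the field by (k - 1) input hops
        jump *= stride                # hop between neighbouring outputs, in input samples
    return field, jump


field, hop = receptive_field()
print(output_length(10000), field, hop)  # 60 465 160
```

Under these assumptions, at 16 kHz each latent $z_i$ covers 465 samples (about 29 ms) and consecutive latents are 160 samples (10 ms) apart.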

Next, we apply the context network $g : \mathcal{Z} \to \mathcal{C}$ to the output of the encoder network to mix multiple latent representations $z_i \dots z_{i-v}$ into a single contextualized tensor $c_i = g(z_i \dots z_{i-v})$ for a receptive field size $v$.
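The key property of $g$ is that $c_i$ depends only on the current and past latents $z_i \dots z_{i-v}$, never on future ones. The sketch below illustrates this with a single causal weighted sum over the window. It is a deliberately simplified stand-in for the learned convolutional context network; `causal_context` and its scalar `weights` are hypothetical and chosen only for illustration.

```python
def causal_context(z, weights):
    """Toy context network: c_i = sum_j weights[j] * z_{i-j} for j = 0..v.

    z is a list of latent frames (each a list of floats). Frames before the
    start of the sequence are treated as zeros (left padding).
    """
    v = len(weights) - 1
    dim = len(z[0])
    padded = [[0.0] * dim for _ in range(v)] + z
    context = []
    for i in range(len(z)):
        window = padded[i : i + v + 1]  # z_{i-v} ... z_i
        # reversed(weights) pairs weights[v] with z_{i-v} and weights[0] with z_i
        c_i = [
            sum(w * frame[d] for w, frame in zip(reversed(weights), window))
            for d in range(dim)
        ]
        context.append(c_i)
    return context


z = [[1.0], [2.0], [3.0], [4.0]]
print(causal_context(z, [0.5, 0.5]))  # [[0.5], [1.5], [2.5], [3.5]]
```

Because the window never reaches past index $i$, changing a later frame leaves all earlier context vectors unchanged, which is what makes predicting $z_{i+k}$ from $c_i$ a non-trivial task.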

We train the model to distinguish a sample $z_{i+k}$ that is $k$ steps in the future from distractor samples $\widetilde{z}$ drawn from a proposal distribution $p_n$, by minimizing the contrastive loss for each step $k = 1, \dots, K$:

$$\mathcal{L}_k = -\sum_{i=1}^{T-k}\left(\log\sigma(z_{i+k}^{\top}h_k(c_i))+\lambda\,\mathbb{E}_{\widetilde{z}\sim p_n}[\log\sigma(-\widetilde{z}^{\top}h_k(c_i))]\right)$$

$$\mathcal{L} =\sum_{k=1}^{K}\mathcal{L}_k$$

• $\sigma(x)=1/(1+\exp(-x))$ : sigmoid
• $\sigma(z_{i+k}^{\top}h_{k}(c_i))$ : the probability of $z_{i+k}$ being the true sample

We consider a step-specific affine transformation $h_k(c_i) = W_{k}c_{i}+b_{k}$ for each step $k$, which is applied to $c_i$ (van den Oord et al., 2018). We optimize the loss $\mathcal{L} =\sum_{k=1}^{K}\mathcal{L}_k$, summing the per-step loss $\mathcal{L}_k$ over different step sizes. In practice, we approximate the expectation by sampling ten negative examples, choosing distractors uniformly from each audio sequence, i.e., $p_n(z) = \frac{1}{T}$, where $T$ is the sequence length, and we set $\lambda$ to the number of negatives.
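The loss can be written out directly from the definitions above. The following is a minimal pure-Python sketch for toy inputs, not the fairseq implementation: the helper names (`log_sigmoid`, `affine`, `step_loss`, `total_loss`) are hypothetical, latents and contexts are plain lists of floats, and negatives are drawn uniformly from the same sequence as described in the text.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def affine(W, b, c):
    """Step-specific transform h_k(c) = W c + b."""
    return [dot(row, c) + bias for row, bias in zip(W, b)]


def step_loss(z, c, W, b, k, num_negatives=10, rng=random):
    """Contrastive loss L_k for one step size k (0-based frame indices)."""
    T = len(z)
    lam = num_negatives  # lambda is set to the number of negatives
    total = 0.0
    for i in range(T - k):
        h = affine(W, b, c[i])
        positive = log_sigmoid(dot(z[i + k], h))
        # Approximate the expectation with uniformly sampled distractors
        negatives = [z[rng.randrange(T)] for _ in range(num_negatives)]
        expectation = sum(log_sigmoid(-dot(n, h)) for n in negatives) / num_negatives
        total += positive + lam * expectation
    return -total


def total_loss(z, c, Ws, bs, num_negatives=10, rng=random):
    """L = sum over k = 1..K of L_k, with one (W_k, b_k) pair per step."""
    return sum(
        step_loss(z, c, W, b, k, num_negatives, rng)
        for k, (W, b) in enumerate(zip(Ws, bs), start=1)
    )
```

Since $\lambda$ equals the number of negatives, $\lambda$ times the sample mean is simply the sum over the sampled negatives. A quick sanity check: with $W_k = 0$ and $b_k = 0$, every term is $\log\sigma(0) = -\log 2$, so $\mathcal{L}_k = (T-k)(1+\lambda)\log 2$.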

Code Example

Pre-trained Model Usage

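• The encoder network embeds the audio signal in a latent space
• The context network combines multiple time-steps of the encoder to obtain contextualized representations

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

# Load the pre-trained checkpoint and rebuild the model from its saved arguments
cp = torch.load('/path/to/wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()  # evaluation mode (e.g. disables dropout)

# Dummy batch: one waveform of 10,000 samples, standing in for 16 kHz audio
wav_input_16khz = torch.randn(1, 10000)
z = model.feature_extractor(wav_input_16khz)  # encoder network output (latents)
c = model.feature_aggregator(z)               # context network output
```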