## Unknown

**character-based speech recognition**: what's character-based? (In a character-based, i.e., grapheme-based, system the acoustic model predicts characters directly rather than phonemes, so no pronunciation lexicon is needed.)

## Some Tips

We explore unsupervised pre-training for speech recognition by learning representations of raw audio.

**wav2vec** is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training.

**We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task.**

Our experiments on WSJ reduce the WER of a strong character-based **log-mel filterbank baseline** by up to 36% when only a few hours of transcribed data are available. Our approach achieves 2.43% WER on the nov92 test set, outperforming Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.

Our model, wav2vec, is a convolutional neural network that takes raw audio as input and computes a general representation that can be input to a speech recognition system.

**The objective is a contrastive loss that requires distinguishing a true future audio sample from negatives.**

Unlike previous work (van den Oord et al., 2018), we move beyond frame-wise phoneme classification and apply the learned representations to improve strong supervised ASR systems. wav2vec relies on a fully convolutional architecture which, compared to the recurrent models used in previous work, can be easily parallelized over time on modern hardware.

We introduce wav2vec, the first application of unsupervised pre-training to speech recognition with a fully convolutional model. Our approach achieves 2.43% WER on the test set of WSJ, a result that outperforms the next best known character-based speech recognition model in the literature (Amodei et al., 2016) while using two orders of magnitude less transcribed training data.

- improves resource-poor setups with little transcribed data
- also improves settings where all WSJ training data is used

## Objective

Our model takes the raw audio signal as input and then applies two networks:

- **encoder network**: embeds the audio signal in a latent space
- **context network**: combines multiple time-steps of the encoder to obtain contextualized representations

Given raw audio samples $x_i \in \mathcal{X}$, we apply the encoder network $f : \mathcal{X} \to \mathcal{Z}$, parameterized as a five-layer convolutional network, which outputs latent representations $z_i \in \mathcal{Z}$.

Next, we apply the context network $g : \mathcal{Z} \to \mathcal{C}$ to the output of the encoder network to mix multiple latent representations $z_i \dots z_{i-v}$ into a single contextualized tensor $c_i = g(z_i \dots z_{i-v})$ for a receptive field size $v$.
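To make the architecture concrete, here is a minimal PyTorch sketch of the two networks. The layer count (five encoder layers) and the causal convolutions in the context network follow the description above; the class names are my own, and the kernel sizes, strides, channel width, and normalization are illustrative assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """1-D convolution followed by group normalization and ReLU."""
    def __init__(self, in_ch, out_ch, kernel, stride):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride)
        self.norm = nn.GroupNorm(1, out_ch)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class Encoder(nn.Module):
    """f: X -> Z. Five conv layers; kernels/strides are assumptions."""
    def __init__(self, dim=512):
        super().__init__()
        kernels, strides = [10, 8, 4, 4, 4], [5, 4, 2, 2, 2]
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers.append(ConvBlock(in_ch, dim, k, s))
            in_ch = dim
        self.net = nn.Sequential(*layers)

    def forward(self, audio):           # audio: (batch, 1, samples)
        return self.net(audio)          # z: (batch, dim, frames)

class ContextNetwork(nn.Module):
    """g: Z -> C. Causal convs so c_i mixes only z_i ... z_{i-v}."""
    def __init__(self, dim=512, layers=9, kernel=3):
        super().__init__()
        blocks = []
        for _ in range(layers):
            # left-pad so each output step sees only current and past frames
            blocks.append(nn.ConstantPad1d((kernel - 1, 0), 0))
            blocks.append(ConvBlock(dim, dim, kernel, stride=1))
        self.net = nn.Sequential(*blocks)

    def forward(self, z):
        return self.net(z)              # c: (batch, dim, frames)

# usage: raw waveform in, contextualized features out
audio = torch.randn(1, 1, 16000)        # 1 second of 16 kHz audio
z = Encoder()(audio)                    # latent representations
c = ContextNetwork()(z)                 # contextualized representations
```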

**The objective is a contrastive loss that requires distinguishing a true future audio sample from negatives.**

We train the model to distinguish a sample $z_{i+k}$ that is $k$ steps in the future from distractor samples $\widetilde{z}$ drawn from a proposal distribution $p_n$, by minimizing the contrastive loss for each step $k = 1, \dots, K$:

$$\mathcal{L}_k = -\sum_{i=1}^{T-k}\left(\log\sigma\left(z_{i+k}^{\top}h_k(c_i)\right) + \lambda\,\mathbb{E}_{\widetilde{z}\sim p_n}\left[\log\sigma\left(-\widetilde{z}^{\top}h_k(c_i)\right)\right]\right)$$

$$\mathcal{L} =\sum_{k=1}^{K}\mathcal{L}_k$$

- $\sigma(x) = 1/(1+\exp(-x))$ : the sigmoid function
- $\sigma(z_{i+k}^{\top}h_k(c_i))$ : the probability of $z_{i+k}$ being the true sample

We consider a step-specific **affine transformation** $h_k(c_i) = W_k c_i + b_k$ for each step $k$, applied to $c_i$ (van den Oord et al., 2018). We optimize the total loss $\mathcal{L} = \sum_{k=1}^{K}\mathcal{L}_k$, summing $\mathcal{L}_k$ over the different step sizes. In practice, we approximate the expectation by sampling ten negative examples, uniformly choosing distractors from each audio sequence, i.e., $p_n(z) = \frac{1}{T}$, where $T$ is the sequence length, and we set $\lambda$ to the number of negatives.
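As a sketch of how this objective could be computed, the following PyTorch function evaluates $\mathcal{L} = \sum_k \mathcal{L}_k$ for a single sequence, drawing ten uniform negatives per step as described above. The function name, the single-sequence (unbatched) setup, and the choice of $K = 12$ in the example are my own simplifications for illustration, not fairseq's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def wav2vec_loss(z, c, step_transforms, num_negatives=10):
    """Contrastive loss L = sum_k L_k for a single sequence.

    z, c: (frames, dim) encoder / context outputs.
    step_transforms: K affine maps h_k, one nn.Linear per step size k.
    """
    T, dim = z.shape
    lam = num_negatives                 # lambda = number of negatives
    total = 0.0
    for k, h_k in enumerate(step_transforms, start=1):
        pred = h_k(c[: T - k])          # h_k(c_i) for i = 1..T-k
        pos = z[k:]                     # true future samples z_{i+k}
        # positive term: log sigmoid(z_{i+k}^T h_k(c_i))
        pos_loss = F.logsigmoid((pos * pred).sum(dim=-1)).sum()
        # negatives: p_n uniform over the sequence, ten draws per position
        idx = torch.randint(0, T, (num_negatives, T - k))
        neg = z[idx]                    # (num_negatives, T-k, dim)
        neg_logits = (neg * pred.unsqueeze(0)).sum(dim=-1)
        # expectation approximated by the mean over sampled distractors
        neg_loss = F.logsigmoid(-neg_logits).mean(dim=0).sum()
        total = total - (pos_loss + lam * neg_loss)
    return total

# Example: K = 12 step-specific affine transformations h_k(c) = W_k c + b_k
K, dim = 12, 512
h = nn.ModuleList(nn.Linear(dim, dim) for _ in range(K))
z = torch.randn(100, dim)               # 100 encoder frames
c = torch.randn(100, dim)
loss = wav2vec_loss(z, c, h)
```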

## Code Example

### Pre-trained Model Usage

- The encoder network embeds the audio signal in a latent space
- The context network combines multiple time-steps of the encoder to obtain contextualized representations

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

# load a pre-trained wav2vec checkpoint and rebuild the model from its saved args
cp = torch.load('/path/to/wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

# dummy input: one batch of 10,000 samples of 16 kHz audio
wav_input_16khz = torch.randn(1, 10000)
z = model.feature_extractor(wav_input_16khz)  # encoder: latent representations
c = model.feature_aggregator(z)               # context network: contextualized representations
```
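Note that both `z` and `c` are frame-level feature tensors of shape `(batch, channels, frames)`; for the released wav2vec checkpoints the channel dimension is 512, though the exact shapes depend on the checkpoint's architecture. Either representation can be fed to an acoustic model in place of log-mel filterbank features.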