Self-Supervised Pretrained Speech Representations for Speaker Recognition

The ongoing success of deep learning techniques depends on the quality of the representations automatically discovered from data. What is a good representation?

1. It performs well on a diverse set of downstream tasks using simple models.
2. It transfers to new tasks with only small amounts of labeled data.

[1] Introduction

Recently, Self-Supervised Learning (SSL) has emerged as a promising approach to data representation. SSL is a form of unsupervised learning in which the data itself provides the supervision, and its goal is to learn to represent the world before learning specific tasks. For instance, BERT is a well-known NLP model developed by Google for pre-training language representations. It leverages the enormous amount of plain text publicly available on the web, is trained without manual labels, and can map a variable-length sentence or word to a fixed-length vector for many NLP downstream tasks.

In the speech processing field, extracting and selecting the best parametric representation of the acoustic signal is a key step in the design of any speech recognition or speaker recognition system. However, the characteristics of the speakers in a speech signal are poorly captured by traditional acoustic features such as raw waveform amplitudes, log Mel spectrograms, Mel-frequency cepstral coefficients (MFCCs), or filter banks (FBanks).
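As a concrete illustration of such traditional features, log-Mel filterbank features can be computed from a raw waveform in a few lines of NumPy. This is a simplified sketch (no pre-emphasis or dithering); the frame, hop, and filterbank sizes are illustrative defaults, not values from the experiments below.

```python
import numpy as np

def log_mel_fbank(wave, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Compute log-Mel filterbank features from a mono waveform (sketch)."""
    # Frame the signal (25 ms-ish windows, 10 ms hop) and apply a Hann window.
    frames = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # Build a triangular Mel filterbank between 0 Hz and Nyquist.
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel2hz(np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return np.log(spec @ fbank.T + 1e-10)  # (frames, n_mels)

rng = np.random.default_rng(0)
feats = log_mel_fbank(rng.standard_normal(16000))  # 1 s of synthetic audio
print(feats.shape)
```

Taking the DCT of these log-Mel energies would yield MFCCs; wav2vec instead learns its representation directly from the raw waveform.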

The goal of self-supervised speech representation learning is to leverage the enormous amount of unlabeled speech data publicly available on the web, training without manual labels to find a transformation of the surface features that makes high-level properties of speech more accessible to downstream tasks such as speech recognition and speaker recognition.

Therefore, inspired by the idea of Self-Supervised Learning, I ran some experiments on self-supervised speech representation features for the speaker recognition task.

[2] Experiment

1. Toolkits

• fairseq: a powerful sequence-to-sequence modeling toolkit developed by Facebook AI Research. wav2vec is a sub-project of fairseq: it is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training.
• Kaldi: Kaldi is a powerful toolkit for speech signal processing, speech recognition and speaker recognition.
2. Dataset

• P100: a small speech command dataset containing 100 speakers.
• P80 (dev): a subset of P100 containing 80 speakers, used for ASV system training.
• P20 (enroll & test): a subset of P100 containing 20 speakers, used for ASV system evaluation.
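A speaker-disjoint split like P80/P20 can be sketched as follows. The function and speaker-ID format are hypothetical, for illustration only; the key property is that no speaker appears in both the training and evaluation sets.

```python
import random

def split_speakers(speaker_ids, n_dev=80, seed=0):
    """Split speakers into a dev set (ASV training) and an eval set
    (enroll/test). The two sets are speaker-disjoint, as in P80/P20."""
    ids = sorted(speaker_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    return set(ids[:n_dev]), set(ids[n_dev:])

dev, evl = split_speakers([f"spk{i:03d}" for i in range(100)])
print(len(dev), len(evl))  # 80 20
```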
3. Config

Group 1: Average Pooling

- baseline MFCC average: MFCC features with average pooling
- wav2vec c average: wav2vec pretrained feature c with average pooling
- wav2vec z average: wav2vec pretrained feature z with average pooling
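The average-pooling back-end is the simplest of the three: pool the frame-level features of each utterance into a single vector and score trials by cosine similarity. A minimal NumPy sketch, where the random 512-dimensional frames stand in for real MFCC or wav2vec features:

```python
import numpy as np

def utterance_embedding(frame_feats):
    """Average-pool frame-level features (frames, dim) into one vector."""
    return frame_feats.mean(axis=0)

def cosine_score(a, b):
    """Cosine similarity between two utterance embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
enroll = utterance_embedding(rng.standard_normal((200, 512)))  # enrollment utterance
test = utterance_embedding(rng.standard_normal((180, 512)))    # test utterance
print(cosine_score(enroll, test))
```

A trial is accepted as a target (same speaker) when the score exceeds a threshold tuned on the dev set.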

Group 2: i-vector

- baseline MFCC i-vector: MFCC features with an i-vector ASV system
- wav2vec c i-vector: wav2vec pretrained feature c with an i-vector ASV system
- wav2vec z i-vector: wav2vec pretrained feature z with an i-vector ASV system

Group 3: x-vector

- baseline MFCC x-vector: MFCC features with an x-vector ASV system
- wav2vec c x-vector: wav2vec pretrained feature c with an x-vector ASV system
- wav2vec z x-vector: wav2vec pretrained feature z with an x-vector ASV system
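The key component of the x-vector architecture is its statistics pooling layer, which maps a variable number of frame-level network activations to a fixed-length segment vector by concatenating the per-dimension mean and standard deviation. A minimal NumPy sketch (the 1500-dimensional activations are illustrative, not the exact configuration used here):

```python
import numpy as np

def stats_pooling(frame_feats):
    """x-vector style statistics pooling: concatenate the per-dimension
    mean and standard deviation over all frames of a segment."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

rng = np.random.default_rng(0)
h = rng.standard_normal((300, 1500))  # frame-level activations (frames, dim)
pooled = stats_pooling(h)
print(pooled.shape)  # (3000,)
```

In the full system, fully connected layers after this pooling produce the x-vector embedding used for scoring.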
4. Results

5. Analysis

The results compare the wav2vec pretrained features (c and z) against the MFCC baseline under each of the three back-ends: average pooling, i-vector, and x-vector.

See Yang Zhang’s Chinese Blog for more detail.

Contrastive Learning for ASV

Contrastive learning has recently shown encouraging progress in Self-Supervised Learning, e.g., in Momentum Contrast (MoCo) and SimCLR.
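The objective shared by MoCo and SimCLR is the InfoNCE loss: each anchor embedding should match its own positive (e.g., another augmented view, or another utterance of the same speaker) against the other samples in the batch. A minimal NumPy sketch with in-batch negatives; the batch size, dimension, and temperature are illustrative.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss (as in MoCo/SimCLR): cross-entropy over cosine
    similarities, with matching pairs on the diagonal as the targets."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))    # diagonal = positive pairs

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
loss = info_nce(x, x + 0.01 * rng.normal(size=(8, 64)))  # near-identical positives
print(loss)
```

When the positives are nearly identical to their anchors, the loss is close to zero; for an ASV encoder, minimizing this loss pulls utterances of the same speaker together and pushes different speakers apart.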