## Supervised Learning

Given:

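• a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$
• a loss function $\ell$

Goal:

$$\min_{\theta} \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell(f_\theta(x), y)\right]$$

where $f_\theta$ denotes the model with parameters $\theta$.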

• Works well when labeled data is abundant
• Learns useful representations from the supervision
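
As a rough illustration of the supervised objective above, the sketch below minimizes the empirical average of a loss over a small labeled dataset with plain gradient descent. The linear model, the squared loss, the synthetic data, and all function names (`model`, `loss`, `empirical_risk`, `fit`) are assumptions chosen for brevity; they are not taken from the notes.

```python
import random


def model(theta, x):
    """Hypothetical model f_theta: a 1-D linear map with weight w and bias b."""
    w, b = theta
    return w * x + b


def loss(pred, y):
    """Squared error, standing in for the generic loss in the objective."""
    return (pred - y) ** 2


def empirical_risk(theta, data):
    """Average loss over the dataset: a sample estimate of the expectation."""
    return sum(loss(model(theta, x), y) for x, y in data) / len(data)


def fit(data, lr=0.1, steps=500):
    """Minimize the empirical risk over theta = (w, b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic labeled data: y = 2x - 1 plus a little noise (assumed for the demo).
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(100)]
data = [(x, 2.0 * x - 1.0 + random.gauss(0, 0.01)) for x in xs]

theta_hat = fit(data)
print("estimated theta:", theta_hat)
print("empirical risk:", empirical_risk(theta_hat, data))
```

Every training pair here carries a label $y$, which is exactly the requirement that becomes expensive at scale.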

## Yann LeCun

In self-supervised learning, the system learns to predict part of its input from other parts of its input.

• Goal: Learning to represent the world before learning tasks.
• Predict any part of the input from any other part
• Predict the future from the recent past
• Predict the past from the present
• Predict the top from the bottom

Over the past decade, supervised deep learning models led to great strides in performance for speech processing technologies and applications.

However, unlike humans who are capable of self-learning through experiences and interactions, current real-world speech applications are heavily reliant on large volumes of human annotations.

For the next generation of speech processing systems to exhibit similar levels of cognitive intelligence as humans, they should be designed to exploit unlabeled, partially labeled, contextual and distant supervision data from multiple concurrent modalities, e.g., text and video, and learning signals from corrective user follow-ups in conversational applications.

The main motivation for self-supervised learning is rooted in our need to improve ASR systems when there is a limited amount of labeled data.

Self-supervised learning methods [LeCun 2016] construct proxy predictive tasks to train models for downstream scenarios by exploiting large amounts of unlabeled audio data, unpaired text and audio data in the same domain, or speech data with distant, loosely related labels, e.g., a text summary or the slides of an audio lecture.

Through these invented proxy tasks, models learn high-level representations that generalize well across different languages, domains, and deployment scenarios with very few in-domain labeled examples.

Self-supervised learning methods achieved major successes in Natural Language Processing (NLP) [Peters 2018, Devlin 2018, Radford 2019, Raffel 2019, Lewis 2019] and Computer Vision (CV) [Sun 2019, He 2019, Xie 2019, Misra 2019].

There has been a recent surge of speech processing research introducing predictive proxy tasks for model training and achieving impressive results in downstream applications such as ASR and speaker recognition. These self-supervised approaches include, but are not limited to:

• Future prediction: Learning an autoregressive model that generates distant future audio features from historical ones [Oard 2018, Chung 2019, Schneider 2019].
• Mask prediction: Learning a model that predicts masked parts of the input audio signal [Liu 2019, Song 2019, Baevski 2019a, Baevski 2019b].
• Generating contextual data: Learning a model to predict semantically related contextual information that accompanies the speech signal, e.g., using social media titles and comments as labels for the input audio [Singh 2019, Pascual 2019].
• Chaining ASR and TTS: Using unpaired audio and text data to train an ASR system and a TTS system jointly, where each generates paired training data for the other [Tjandra 2019, Hori 2019, Baskar 2019]. This family of self-supervised methods can be viewed as auto-encoders of speech signals through latent text representations. Effective use of external language models to regularize the text representations also falls into this category.
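
To make the first two proxy tasks in the list above concrete, the sketch below shows how training pairs for future prediction and mask prediction can be derived from an unlabeled sequence alone. It is a minimal illustration under stated assumptions: frames are plain floats rather than audio feature vectors, no model is trained, and the helper names `future_prediction_pairs` and `mask_prediction_example` are hypothetical, not taken from any of the cited systems.

```python
import random


def future_prediction_pairs(frames, context, shift):
    """Pair each window of `context` past frames with the frame `shift` steps ahead.

    With shift=1 the target is the frame immediately after the window.
    """
    pairs = []
    for t in range(context, len(frames) - shift + 1):
        past = frames[t - context:t]
        target = frames[t + shift - 1]
        pairs.append((past, target))
    return pairs


def mask_prediction_example(frames, mask_prob, mask_value=0.0, rng=random):
    """Hide random frames; return the corrupted input and the hidden originals.

    `targets` maps each masked position to the frame the model should recover.
    """
    masked, targets = [], {}
    for i, frame in enumerate(frames):
        if rng.random() < mask_prob:
            masked.append(mask_value)
            targets[i] = frame
        else:
            masked.append(frame)
    return masked, targets


# Toy "unlabeled audio": ten scalar frames (an assumption for readability).
frames = [float(i) for i in range(1, 11)]

pairs = future_prediction_pairs(frames, context=3, shift=2)
print("first future-prediction pair:", pairs[0])

rng = random.Random(0)
masked, targets = mask_prediction_example(frames, mask_prob=0.3, rng=rng)
print("masked input:", masked)
print("positions to reconstruct:", sorted(targets))
```

In both cases the supervision signal comes from the input itself, so no human annotation is needed to build the training set.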

## Workshop - ICML 2019 Self-Supervised Learning

Big data has driven a revolution in many domains of machine learning thanks to modern high-capacity models, but the standard approaches (supervised learning from labels, or reinforcement learning from a reward function) have become a bottleneck.

Even when data is abundant, getting the labels or rewards that specify exactly what the model must do is often intractable. Collecting simple category labels for classification is prohibitive for millions or billions of examples, and structured outputs (scene interpretations, interactions, demonstrations) are far worse, especially when the data distribution is non-stationary.

Self-supervised learning is a promising alternative in which proxy tasks are designed so that models and agents can learn without explicit supervision in a way that improves downstream performance on tasks of interest. One of the major benefits of self-supervised learning is increased data efficiency: achieving comparable or better performance with less labeled data or fewer environment steps (in reinforcement learning and robotics).

The field of self-supervised learning (SSL) is rapidly evolving, and the performance of these methods is creeping closer to the fully supervised approaches. However, many of these methods are still developed in domain-specific sub-communities, such as Vision, RL and NLP, even though many similarities exist between them. While SSL is an emerging topic and there is great interest in these techniques, there are currently few workshops, tutorials or other scientific events dedicated to this topic.

This workshop aims to bring together experts with different backgrounds and application areas to share inter-domain ideas and increase cross-pollination, tackle current shortcomings and explore new directions. The focus will be on the machine learning point of view rather than the domain side.

## ICML 2020 Self-supervision in Audio and Speech

The ongoing success of deep learning techniques depends on the quality of the representations automatically discovered from data.

These representations must capture important underlying structures from the raw input, e.g., intermediate concepts, features, or latent variables that are useful for the downstream task.

While supervised learning using large annotated corpora can leverage useful representations, collecting large amounts of annotated examples is costly, time-consuming, and not always feasible.

This is particularly problematic for a large variety of applications. In the speech domain, for instance, there are many low-resource languages, where progress is dramatically slower than in high-resource languages such as English. Moreover, annotations are often underspecified for many potential downstream applications, and the related supervised representations might be biased towards the task they are trained on, limiting their exportability to other applications.