TTS vs VC

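Voice Conversion (VC): voice conversion takes one speech clip as input and outputs another that differs from it in some respect. Usually the goal is to keep the spoken content and change the speaker's timbre.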

Voice conversion (VC) is a technique that modifies speech from a source speaker so that it sounds as if it were uttered by a target speaker, while keeping the linguistic content unchanged.

Text to Speech (TTS): a text-to-speech system converts ordinary written text into speech. The input is a piece of text and the output is a speech clip.
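The difference between the two tasks comes down to the input type, which a pair of interface definitions can make concrete. The following is only a sketch: `Utterance`, `TextToSpeech`, `VoiceConverter`, and `RelabelConverter` are hypothetical names chosen for illustration and do not come from any of the toolkits listed below.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Utterance:
    """A speech clip: raw samples plus a (hypothetical) speaker label."""
    samples: List[float]
    speaker: str


class TextToSpeech(Protocol):
    # TTS: text in, speech out.
    def synthesize(self, text: str) -> Utterance: ...


class VoiceConverter(Protocol):
    # VC: speech in, speech out; content is kept, speaker identity changes.
    def convert(self, source: Utterance, target_speaker: str) -> Utterance: ...


class RelabelConverter:
    """Placeholder VC that only relabels the speaker.

    A real system would also transform the samples so that the timbre
    matches the target speaker; this stub leaves them untouched.
    """

    def convert(self, source: Utterance, target_speaker: str) -> Utterance:
        return Utterance(samples=list(source.samples), speaker=target_speaker)
```

The stub satisfies the `VoiceConverter` interface structurally, so it can stand in for a real model while the rest of a pipeline is being wired up.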

Toolkits

PyTorch WaveNet vocoder

The goal of the repository is to provide an implementation of the WaveNet vocoder, which can generate high quality raw speech samples conditioned on linguistic or acoustic features.

PyTorch tacotron2

PyTorch implementation of Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

This implementation includes distributed and automatic mixed precision support and uses the LJSpeech dataset.

Distributed and Automatic Mixed Precision support relies on NVIDIA’s Apex and AMP.

PyTorch WaveGlow

WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression.

PyTorch MelGAN (official)

MelGAN shows that GANs can be trained reliably to generate high-quality coherent waveforms by introducing a set of architectural changes and simple training techniques.

VC articles & Papers

Army Engineering University: Research Status and Prospects of Voice Conversion Technology (2019)

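Voice conversion usually refers to modifying and transforming the personal characteristics of one speaker's voice so that it sounds like another person's voice, while keeping the spoken content unchanged. In recent years, with the rapid development of information processing and machine learning, voice conversion technology has also advanced rapidly. After a brief introduction to the basic concepts of voice conversion, this paper surveys the typical models and methods of recent years, analyzes the key techniques, lists the main application scenarios, sorts out several technical problems that remain open, and looks ahead to future research directions.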

An overview of voice conversion systems (2017)

Voice transformation (VT) aims to change one or more aspects of a speech signal while preserving linguistic information. A subset of VT, Voice conversion (VC) specifically aims to change a source speaker’s speech in such a way that the generated output is perceived as a sentence uttered by a target speaker. Despite many years of research, VC systems still exhibit deficiencies in accurately mimicking a target speaker spectrally and prosodically, and simultaneously maintaining high speech quality. In this work we provide an overview of real-world applications, extensively study existing systems proposed in the literature, and discuss remaining challenges.

Maigo: A Survey of Voice Conversion Techniques (2019)

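The mathematical derivations of i-vector and PLDA are notoriously complex; I once spent more than a week working through them in the summer of 2011. Here I avoid the mathematical details as far as possible and introduce the relevant concepts and methods in the simplest language I can.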

Several traditional voice conversion methods

  • Gaussian mixture models (GMM)
  • Frequency warping
  • Exemplar-based methods
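Of the traditional methods above, frequency warping is the easiest to illustrate in a few lines: the frequency axis of the source spectrum is stretched or compressed so that its peaks move toward where the target speaker's would be. The sketch below assumes a single global scale factor `alpha` applied to one magnitude spectrum; that is a simplification for illustration only, since practical systems generally estimate the warping function from source and target data. The name `warp_spectrum` is hypothetical.

```python
from typing import List


def warp_spectrum(magnitude: List[float], alpha: float) -> List[float]:
    """Stretch (alpha > 1) or compress (alpha < 1) the frequency axis.

    Output bin k takes its value from source position k / alpha,
    using linear interpolation between the two neighbouring bins.
    Positions beyond the last source bin are filled with 0.0.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n = len(magnitude)
    warped = []
    for k in range(n):
        src = k / alpha
        lo = int(src)  # floor, since src >= 0
        if lo >= n - 1:
            # Exactly on the last source bin, or past the end of the spectrum.
            warped.append(magnitude[-1] if src == n - 1 else 0.0)
            continue
        frac = src - lo
        warped.append((1.0 - frac) * magnitude[lo] + frac * magnitude[lo + 1])
    return warped


# Toy spectrum with a single peak at bin 10.
spectrum = [0.0] * 32
spectrum[10] = 1.0
shifted = warp_spectrum(spectrum, 1.5)
peak_bin = max(range(len(shifted)), key=shifted.__getitem__)  # 15
```

With `alpha = 1.5` the peak moves from bin 10 to bin 15, which is the kind of shift one would want when, for example, mapping a lower-pitched vocal tract onto a higher one. Only the spectral envelope position changes here; phase and prosody are not handled.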

Several modern voice conversion methods

  • Generative adversarial networks (GAN)
  • i-vector + PLDA (probabilistic linear discriminant analysis)
  • Autoencoders

TTS articles & Papers

Wang Dong (Tsinghua University): TTS