Recent advancements in adversarial speech synthesis, Mikołaj Bińkowski @ Google DeepMind

Generative Adversarial Networks have seen rapid development in recent years, leading to major improvements in image and video generation. For a long time, however, successful application of GANs to other domains, in particular audio and speech synthesis, remained a major challenge. Over the past year, several promising approaches have appeared, proving that high-fidelity speech synthesis is feasible with adversarial nets and leading to a new kind of efficient, feed-forward generator of raw speech. A common theme in these works is the concept of random local discriminators, which operate on subsampled audio at various frequencies. Like most speech synthesis models, GANs for text-to-speech rely on strong conditioning, either on linguistic features or on mel-spectrograms. In this talk, I will discuss these recent advancements in detail and show how combining them with a mel-spectrogram reconstruction loss and a new alignment architecture allows us to abandon the usual conditioning systems and obtain end-to-end, fully differentiable text-to-speech models.
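The idea of random local discriminators at multiple frequencies can be sketched roughly as follows: each sub-discriminator sees a randomly located window of the waveform, downsampled by a different factor, so that together they judge the audio at several time scales. This is a minimal NumPy illustration only; the window sizes, downsampling factors, and mean-pooling used here are assumptions for the sketch, not the hyperparameters or architecture from the talk.

```python
import numpy as np

def random_windows(audio, window_specs, rng):
    """Sample one random window per (size, factor) spec from a raw
    waveform, then downsample each window by its factor via mean-pooling.

    Each resulting array is what one local discriminator would score:
    a short excerpt of the signal at a particular effective sample rate.
    (Illustrative sketch; real models use learned strided convolutions.)
    """
    windows = []
    for size, factor in window_specs:
        start = rng.integers(0, len(audio) - size + 1)  # random location
        w = audio[start:start + size]
        # crude downsampling: average groups of `factor` samples
        w = w[: (len(w) // factor) * factor].reshape(-1, factor).mean(axis=1)
        windows.append(w)
    return windows

# With these (hypothetical) specs, every discriminator receives the same
# number of samples, but each covers a different span of real time:
rng = np.random.default_rng(0)
audio = np.random.default_rng(1).standard_normal(48_000)  # 1 s at 48 kHz
specs = [(240, 1), (480, 2), (960, 4)]  # (window length, downsample factor)
views = random_windows(audio, specs, rng)
```

Here each view has 240 samples, but the third spans four times as much audio as the first, so its discriminator can penalize longer-range artifacts while the first focuses on fine waveform detail.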

Mikołaj Bińkowski is a Research Scientist at DeepMind and a PhD student at Imperial College London. His research has focused mostly on various aspects of Generative Adversarial Networks, including training objectives, evaluation, and their application to new domains such as speech synthesis. Previously, he interned at Google Brain Amsterdam and at the Montreal Institute for Learning Algorithms, where he worked with Aaron Courville on image-to-image transfer. He holds an MSc in Financial Mathematics from the University of Warsaw and a BSc in Mathematics from the Jagiellonian University.