GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models

In November 2022, PhD student Matthew Baas and Professor Herman Kamper, both from the Department of Electrical and Electronic Engineering at Stellenbosch University, implemented a Generative Adversarial Network (GAN) for unconditional speech synthesis, which they call AudioStyleGAN (ASGAN).

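The model is designed to learn a disentangled latent space, which allows it to perform tasks it was not explicitly trained on, such as voice conversion and speech editing, in a zero-shot fashion without any additional training. In the researchers' tests, ASGAN outperformed existing diffusion and autoregressive models.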

What is ASGAN?

ASGAN is a type of GAN used to generate realistic-sounding audio clips. The ASGAN architecture is similar to that of a traditional GAN, with two main components: a generator and a discriminator. The generator is responsible for creating the synthetic audio clips, while the discriminator tries to distinguish between real and fake audio clips. 

To generate realistic-sounding audio, ASGAN must first learn the distribution of real-world audio, so it is trained on a dataset of real audio clips. Once trained, it can generate synthetic clips that sound realistic. ASGAN has been shown to be effective at this, and it has potential applications in speech synthesis and music generation.
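The tug-of-war between generator and discriminator can be made concrete with a small sketch of the standard GAN objective. This is a generic illustration in plain Python, not the loss used in ASGAN: the function names are hypothetical, and the discriminator outputs are assumed to be probabilities between 0 and 1 that a clip is real.

```python
import math

EPS = 1e-12  # guards against log(0)


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0.

    d_real and d_fake are lists of discriminator outputs in [0, 1].
    """
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: push D(fake) toward 1."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)


if __name__ == "__main__":
    # A discriminator that separates real from fake well has a low loss ...
    print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))
    # ... while the generator is then penalised for being caught out.
    print(generator_loss([0.1, 0.2]))
```

The discriminator's loss falls as it gets better at telling real clips from generated ones, while the generator's loss falls as its clips fool the discriminator more often. Training alternates between the two until the generated audio becomes hard to distinguish from real audio.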

Figure 1: The ASGAN generator (left) and discriminator (right).

Research proposition and findings

The researchers proposed ASGAN, a new GAN for unconditional speech synthesis. As in the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector, which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To train ASGAN successfully, the researchers introduced a number of new techniques, including a modification to adaptive discriminator augmentation that probabilistically skips discriminator updates.
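The idea of probabilistically skipping discriminator updates can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the article does not say how the skip probability is chosen, so `skip_prob` is a hypothetical fixed parameter here, and the actual gradient steps are replaced by placeholders.

```python
import random


def should_update_discriminator(skip_prob, rng):
    """Return True when the discriminator should take a gradient step.

    skip_prob is a hypothetical parameter: the probability of skipping
    the discriminator update on any given training step.
    """
    if not 0.0 <= skip_prob <= 1.0:
        raise ValueError("skip_prob must lie in [0, 1]")
    # rng.random() is uniform on [0, 1), so this is True with
    # probability 1 - skip_prob.
    return rng.random() >= skip_prob


def count_discriminator_updates(num_steps, skip_prob, seed=0):
    """Run a skeleton training loop and count discriminator updates."""
    rng = random.Random(seed)
    updates = 0
    for _ in range(num_steps):
        # Placeholder: the generator would be updated on every step.
        if should_update_discriminator(skip_prob, rng):
            # Placeholder: the discriminator gradient step would go here.
            updates += 1
    return updates


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(p, count_discriminator_updates(1000, p))
```

Skipping some discriminator steps is one way to stop the discriminator from overpowering the generator during training. In an adaptive scheme the skip probability would presumably change over the course of training rather than stay fixed as it does in this sketch.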

ASGAN achieved state-of-the-art results in unconditional speech synthesis on the Google Speech Commands dataset. It is also substantially faster than the top-performing diffusion models. Through a design that encourages disentanglement, ASGAN can perform voice conversion and speech editing without being explicitly trained to do so. ASGAN demonstrates that GANs are still highly competitive with diffusion models. Code, models, samples: 

Limitations and the way forward

One major limitation of the work described in this article is scale: once trained, ASGAN can only generate utterances of a fixed length, and the model struggles to generate coherent full sentences on datasets with longer utterances (a limitation shared by existing unconditional synthesis models). 

Future work will aim to address this shortcoming by considering which aspects of ASGAN can be simplified or removed to improve scaling. The researchers also plan to carry out more thorough subjective evaluations to quantify how ASGAN performs on unseen tasks.

Read the full article at