Google Assistant’s voice is now more realistic thanks to WaveNet

Getting speech synthesis algorithms to sound natural is a hard nut to crack. AI researchers have been working for years to make digital assistants sound more human, and the results so far are impressive.

However, something still doesn’t sound quite right, and users can usually tell whether they’re talking to another human being or a machine.

The good news is that recent progress in speech synthesis research now allows AI assistants to sound more natural.

Why does speech synthesis sound unnatural?

The classic speech synthesis tools are based on concatenative text-to-speech (concatenative TTS). Basically, this involves using a database of high-quality recordings collected from a single voice actor. These recordings are split into tiny chunks that can be recombined in different ways to form sounds and words.

The main disadvantage is that these systems often produce unnatural-sounding voices, with audible seams where the chunks are stitched together. Moreover, the output is entirely dependent on the initial database: changing the voice generally means recording a new one.
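
To make the idea concrete, here is a minimal sketch of the concatenative approach in Python. The `unit_database`, the unit names and the synthetic tones standing in for real recordings are all hypothetical; a production system would select among thousands of recorded units and use far more careful joining than a simple crossfade.

```python
import numpy as np

SAMPLE_RATE = 16_000

def fake_recording(freq, duration=0.15):
    """Stand-in for a short recorded chunk (in a real system: a diphone
    cut from a voice actor's studio sessions)."""
    t = np.linspace(0, duration, int(SAMPLE_RATE * duration), endpoint=False)
    return 0.3 * np.sin(2 * np.pi * freq * t)

# Hypothetical unit database of pre-recorded chunks, indexed by sound pair.
unit_database = {
    "h-e": fake_recording(220),
    "e-l": fake_recording(330),
    "l-o": fake_recording(440),
}

def synthesise(unit_sequence, crossfade=0.01):
    """Concatenate pre-recorded units with a short crossfade at each join --
    the joins are exactly where concatenative voices tend to sound glitchy."""
    fade = int(SAMPLE_RATE * crossfade)
    out = unit_database[unit_sequence[0]].copy()
    for name in unit_sequence[1:]:
        unit = unit_database[name].copy()
        # Linear crossfade between the tail of the output and the head of the unit.
        out[-fade:] = out[-fade:] * np.linspace(1, 0, fade) + unit[:fade] * np.linspace(0, 1, fade)
        out = np.concatenate([out, unit[fade:]])
    return out

audio = synthesise(["h-e", "e-l", "l-o"])
print(audio.shape)  # a few thousand samples of stitched-together "speech"
```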

Other systems overcome some of these problems by using a model called parametric TTS. Instead of stitching recordings together, this method uses a set of rules and parameters describing grammar and mouth movements to create a fully computer-generated voice. However, this solution is not perfect either, and it can also result in unnatural-sounding voices.
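
As a rough illustration of what generating a voice from parameters means, the sketch below uses a simple source-filter model: an impulse train at a chosen pitch is passed through a few resonant filters that mimic vocal-tract formants. The pitch and formant values are approximate, illustrative numbers, not taken from any real TTS system.

```python
import numpy as np
from scipy.signal import lfilter

SAMPLE_RATE = 16_000

def resonator(signal, freq, bandwidth):
    """Second-order all-pole filter approximating one vocal-tract resonance."""
    r = np.exp(-np.pi * bandwidth / SAMPLE_RATE)
    theta = 2 * np.pi * freq / SAMPLE_RATE
    return lfilter([1.0 - r], [1.0, -2.0 * r * np.cos(theta), r * r], signal)

def parametric_vowel(f0=120, formants=((700, 130), (1220, 70), (2600, 160)), duration=0.5):
    """Build a vowel-like sound purely from parameters: a pitch (f0) driving
    an impulse train, shaped by a few formant resonances."""
    n = int(SAMPLE_RATE * duration)
    excitation = np.zeros(n)
    excitation[::SAMPLE_RATE // f0] = 1.0      # glottal pulse train at the pitch period
    audio = excitation
    for freq, bw in formants:                  # apply each formant filter in cascade
        audio = resonator(audio, freq, bw)
    return audio / np.max(np.abs(audio))       # normalise to [-1, 1]

audio = parametric_vowel()
print(audio.shape)  # 8,000 samples: half a second of buzzy, robotic "speech"
```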

Google Assistant sounds more natural

WaveNet is a new deep neural network that generates raw audio waveforms, producing speech that sounds more realistic than any of the previous speech synthesis techniques could achieve. This technology now powers Google Assistant’s voices for US English and Japanese across all platforms.

It relies on a new approach, completely different from the techniques described above:

[a] deep generative model that can create individual waveforms from scratch, one sample at a time, with 16,000 samples per second and seamless transitions between individual sounds.

It was built using a convolutional neural network trained on a large dataset of speech samples. During this training phase, the network learned the underlying structure of speech, such as which tones follow each other and which waveforms are realistic (and which are not).

The trained network then synthesises a voice one sample at a time, with each generated sample taking into account the properties of the samples that came before it. The resulting voice contains natural intonation and other features such as lip smacks. Its “accent” depends on the voices it was trained on, opening up the possibility of creating any number of unique voices from blended datasets. As with all text-to-speech systems, WaveNet uses a text input to tell it which words it should generate in response to a query.
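
The sample-by-sample loop described above can be sketched in a few lines of Python. This toy version uses a single random weight matrix instead of WaveNet’s stacked dilated convolutional layers, so its output is just noise, but the mechanism is the same: predict a probability distribution over the next audio sample from the samples already generated, draw one, append it, and repeat 16,000 times for every second of audio. Names like `RECEPTIVE_FIELD` and `next_sample_distribution` are illustrative, not part of any real WaveNet implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

SAMPLE_RATE = 16_000       # WaveNet generates 16,000 samples per second
QUANT_LEVELS = 256         # each sample is one of 256 quantised amplitude values
RECEPTIVE_FIELD = 1024     # hypothetical: how many past samples the model "sees"

# Stand-in for a trained network: random weights only, so the audio is noise,
# but the prediction step has the right shape (context in, distribution out).
W = rng.normal(scale=0.01, size=(RECEPTIVE_FIELD, QUANT_LEVELS))

def next_sample_distribution(history):
    """Map the last RECEPTIVE_FIELD samples to a softmax over the 256
    possible values of the next sample."""
    context = np.asarray(history[-RECEPTIVE_FIELD:], dtype=np.float64)
    context = (context / (QUANT_LEVELS - 1)) * 2.0 - 1.0   # rescale to [-1, 1]
    logits = context @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(n_samples):
    """Autoregressive loop: each new sample is drawn conditioned on the
    samples generated before it, one at a time."""
    history = [QUANT_LEVELS // 2] * RECEPTIVE_FIELD        # start from silence
    for _ in range(n_samples):
        probs = next_sample_distribution(history)
        history.append(int(rng.choice(QUANT_LEVELS, p=probs)))
    return np.array(history[RECEPTIVE_FIELD:])

samples = generate(SAMPLE_RATE)   # one second of (noise-like, untrained) audio
print(samples.shape, samples[:10])
```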

Unfortunately, building up sound waves at such high fidelity is computationally expensive, which is why the original WaveNet research model could not be deployed in the real world. The version that now powers Google Assistant had to be heavily optimised to generate audio fast enough for production use.
