Google Assistant’s voice is now more realistic thanks to WaveNet

Getting speech synthesis algorithms to sound natural is a hard nut to crack. AI researchers have spent years trying to make digital assistants sound more human, and the results are increasingly impressive.

However, there’s still something that doesn’t sound quite right, and users can almost always tell whether they’re talking to another human being or a machine.

The good news is that recent progress in speech synthesis research now allows AI assistants to sound more natural.

Why does speech synthesis sound unnatural?

Classic speech synthesis tools are based on concatenative text-to-speech (concatenative TTS). Basically, this involves using a database of high-quality recordings collected from a single voice actor. These recordings are split into tiny chunks that can then be combined in different ways to generate sounds and words.
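To make the idea concrete, here is a minimal, purely illustrative sketch in Python (NumPy). A real system stores thousands of recorded phonetic units; this toy version stands them in with synthetic tones and crossfades the joins:

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second

# Hypothetical unit database: each "phoneme" maps to a short recorded
# waveform. Synthetic tones stand in for real voice-actor recordings.
def fake_recording(freq_hz, duration_s=0.1):
    t = np.linspace(0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t).astype(np.float32)

unit_db = {"h": fake_recording(200), "e": fake_recording(300),
           "l": fake_recording(250), "o": fake_recording(350)}

def synthesize(phonemes, crossfade=160):
    """Concatenate stored units, crossfading to soften the joins."""
    out = unit_db[phonemes[0]].copy()
    fade = np.linspace(0, 1, crossfade, dtype=np.float32)
    for p in phonemes[1:]:
        unit = unit_db[p]
        # Convex blend of the tail of the output and the head of the new unit.
        out[-crossfade:] = out[-crossfade:] * (1 - fade) + unit[:crossfade] * fade
        out = np.concatenate([out, unit[crossfade:]])
    return out

wave = synthesize(["h", "e", "l", "l", "o"])
```

The crossfade softens each join, but the output can only ever be a rearrangement of what was recorded, which is exactly the limitation described above.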

The main disadvantage is that these systems often produce unnatural-sounding voices. Moreover, the sound chunks are highly dependent on the initial database, so modifying the voice typically requires recording an entirely new one.

Other systems overcome some of these problems by using a model called parametric TTS. This method uses a series of rules and parameters about grammar and mouth movements to create a computer-generated voice. However, this solution is not perfect either and can also produce unnatural-sounding voices.
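A rough source-filter sketch of the parametric idea: a pulse train at the pitch frequency excites damped resonances at formant frequencies. The pitch, formant values, and decay rate below are illustrative assumptions, not any production rule set.

```python
import numpy as np

SAMPLE_RATE = 16_000

def parametric_vowel(pitch_hz, formants_hz, duration_s=0.2):
    """Source-filter sketch: a glottal pulse train excites damped
    oscillations at the given formant frequencies (illustrative only)."""
    n = int(SAMPLE_RATE * duration_s)
    out = np.zeros(n, dtype=np.float32)
    period = int(SAMPLE_RATE / pitch_hz)
    decay_t = np.arange(period) / SAMPLE_RATE
    # Each pitch period triggers a burst of decaying formant oscillations.
    burst = sum(np.exp(-60 * decay_t) * np.sin(2 * np.pi * f * decay_t)
                for f in formants_hz)
    for start in range(0, n - period, period):
        out[start:start + period] += burst.astype(np.float32)
    return out / np.max(np.abs(out))  # normalize to [-1, 1]

# Ballpark first two formants of an "ah"-like vowel.
wave = parametric_vowel(pitch_hz=120, formants_hz=[700, 1200])
```

Because everything is generated from parameters rather than recordings, the voice is flexible, but it tends to sound buzzy and robotic, which is the weakness the article points out.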

Google Assistant sounds more natural

WaveNet is a new deep neural network that generates raw audio waveforms that sound more realistic than any of the previous techniques used in speech synthesis. This technology now powers Google Assistant’s voices for US English and Japanese across all platforms.

It relies on a new approach, completely different from the ones listed above. DeepMind describes it as:

“[a] deep generative model that can create individual waveforms from scratch, one sample at a time, with 16,000 samples per second and seamless transitions between individual sounds.”

WaveNet was built using a convolutional neural network trained on a large dataset of speech samples. During this training phase, the network learned the underlying structure of speech, such as which tones follow one another and which waveforms are realistic (and which are not).

The trained network then synthesises a voice one sample at a time, with each generated sample taking into account the properties of the samples before it. The resulting voice contains natural intonation and other features such as lip smacks. Its “accent” depends on the voices it was trained on, which opens up the possibility of creating any number of unique voices from blended datasets. As with all text-to-speech systems, WaveNet takes a text input that tells it which words to generate in response to a query.
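The sample-by-sample generation loop can be sketched as follows. The toy linear predictor, its weights, and the tiny receptive field are stand-in assumptions, not WaveNet's real architecture; only the autoregressive feedback structure, where each new sample becomes input for the next, mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
RECEPTIVE_FIELD = 4  # WaveNet's real receptive field spans thousands of samples

# Toy "trained network": a fixed linear predictor plus a little noise,
# standing in for WaveNet's deep stack of causal convolutions.
weights = np.array([0.1, -0.2, 0.3, 0.6])

def predict_next(context):
    """Predict the next sample from the last RECEPTIVE_FIELD samples."""
    return float(context @ weights) + rng.normal(scale=0.01)

def generate(n_samples):
    """Autoregressive loop: every generated sample is fed back as input."""
    audio = [0.0] * RECEPTIVE_FIELD  # silent priming context
    for _ in range(n_samples):
        context = np.array(audio[-RECEPTIVE_FIELD:])
        audio.append(predict_next(context))
    return np.array(audio[RECEPTIVE_FIELD:])

wave = generate(200)
```

Generating one sample at a time like this is what makes the output so smooth, and also what makes the approach so computationally expensive: at 16,000 samples per second, one second of audio requires 16,000 sequential network evaluations.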

Unfortunately, building up sound waves at such high fidelity is computationally expensive. This is why the original WaveNet, however promising, could not initially be deployed in the real world; the version that now powers Google Assistant is a heavily optimized successor that generates audio far faster than the research prototype.


Mark Therol

Mark is a Computer Science student who wants to become an AI software developer.