TheNextPort

In just three seconds, Microsoft's VALL-E will perfectly imitate any voice

Microsoft's VALL-E is a new technology that can imitate any voice, raising concerns about potential misuse such as scams and identity fraud.

By Matthew Rutherford

Imagine saying just one short sentence into a microphone, and that being enough for a voice generator to learn to speak in your own voice.

Microsoft's experimental VALL-E technology can reportedly do just that. Its researchers have presented it in a paper on arXiv and in several examples on GitHub. The model needs only three seconds of recording, after which it can estimate how a person would sound, even when speaking with different emotions.

A person uploads a 3-second long recording and the VALL-E text-to-speech generator learns their entire voice. Everything else was learned from 60,000 hours of general voice recordings. Source: Microsoft VALL-E

To achieve this, VALL-E first had to learn how a human voice is represented in digital form. For that, it used 60,000 hours of English recordings from the Libri-Light dataset and a neural audio codec called EnCodec. Engineers from Facebook helped build both technologies.
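
To give a rough sense of what "a voice in digital form" means here, the sketch below counts how many discrete codec tokens a three-second prompt and the 60,000-hour training set would turn into. The frame rate, number of codebooks, and bits per index are assumptions chosen for illustration (they are not stated in this article), and the function names are hypothetical.

```python
# Assumed codec settings (illustrative, not taken from the article):
# 75 frames per second, 8 codebook indices per frame, 10 bits per index.
FRAMES_PER_SECOND = 75
CODEBOOKS_PER_FRAME = 8
BITS_PER_INDEX = 10  # i.e. 1024 possible entries per codebook


def tokens_for(seconds: int) -> int:
    """Number of discrete codec tokens needed to represent `seconds` of audio."""
    return seconds * FRAMES_PER_SECOND * CODEBOOKS_PER_FRAME


def bitrate_bps() -> int:
    """Bits per second implied by the assumed codec settings."""
    return FRAMES_PER_SECOND * CODEBOOKS_PER_FRAME * BITS_PER_INDEX


prompt_tokens = tokens_for(3)                 # the three-second speaker prompt
training_tokens = tokens_for(60_000 * 3600)   # 60,000 hours of recordings

print(prompt_tokens)    # 1800
print(bitrate_bps())    # 6000
print(training_tokens)  # 129600000000
```

Under these assumptions, the speaker prompt is only a couple of thousand tokens, while the general training data runs to over a hundred billion, which is why the short recording only has to supply the voice itself and not the knowledge of how speech works.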

Another tool for fake news, but also "peaceful" uses.

Any technology, including VALL-E, can be used for good or bad purposes. VALL-E raises ethical questions, because anyone could misuse it to create fake news and put words into people's mouths that they never said. But as Microsoft points out, there are also good uses.

Imagine, for example, that you could listen to an audiobook in the voice of your favorite actor. The actor would only need to provide a short clip of their voice, without having to come to the studio. They would simply license their voice, and VALL-E would take care of the rest.

Or imagine that someone has lost their voice due to illness. A single short recording from the past would be enough for the technology to preserve their original voice.

Machine correction of mispronunciations.

Microsoft VALL-E can also be used for sound editing tasks, such as correcting mispronunciations. Such edits can be made quickly, which could make it a useful tool for media professionals.

VALL-E, so far a study on arXiv with a few examples on GitHub, has the potential to imitate any voice. However, it is not perfect and can sometimes struggle to pronounce words correctly, resulting in an artificial or robotic sound. Whether it will eventually become a popular, widely used product, like the image- and text-generating AI tools released last year, remains to be seen.

The study shows that if the voice cloning program is trained on more audio recordings, it can be made even more accurate. Microsoft has not yet made VALL-E available to the public, probably to prevent it from being misused.

The direction is set, and news like this will become more frequent. We should be prepared for the possibility that, within this decade, generative AI will be able to simulate practically any digital data.
