Developers: | Microsoft |
Release date: | January 2023 |
Industries: | Information Technology |
Technology: | Speech technology |
2023: Neural Network Announcement
On January 5, 2023, Microsoft introduced a new artificial intelligence (AI) model that converts text into speech while closely imitating the voice of a specific person. The project is called VALL-E.
Microsoft calls the solution a "neural codec language model." The model can recreate a person's voice from a speech sample only three seconds long. It imitates not only the voice but also the speaker's emotional tone.
The VALL-E neural network is based on EnCodec technology, which Meta (recognized as an extremist organization; activities on the territory of the Russian Federation are prohibited) presented in October 2022. Unlike other text-to-speech techniques, which typically manipulate sound waves, VALL-E analyzes a person's speech by breaking it into discrete components called "tokens." The neural network then uses what it learned during training to predict how that voice would sound when speaking other phrases. The model was trained on Meta's LibriLight audio library, which contains about 60,000 hours of English-language speech from more than 7,000 speakers (mainly from LibriVox public-domain audiobooks).
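The idea of representing speech as discrete tokens can be illustrated with a toy sketch. This is not EnCodec or VALL-E code: the hand-written codebook, the frame size, and the function names below are all hypothetical, chosen only to show how a waveform can be mapped to a sequence of integer tokens and approximately reconstructed from them. A real neural codec learns its codebooks from data and operates on learned features rather than raw samples.

```python
# Toy illustration of audio tokenization (NOT the EnCodec algorithm).
# Each short frame of the waveform is replaced by the index of the
# closest entry in a small codebook; those indices are the "tokens".

FRAME_SIZE = 2  # hypothetical: two samples per frame

# Hypothetical hand-made codebook; a real codec would learn it from data.
CODEBOOK = [
    [0.0, 0.0],    # token 0: silence
    [1.0, 1.0],    # token 1: sustained high
    [-1.0, -1.0],  # token 2: sustained low
    [1.0, -1.0],   # token 3: falling edge
]


def make_frames(samples, frame_size):
    """Split a waveform into consecutive frames, dropping any remainder."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to the frame."""
    def squared_error(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))


def encode(samples, codebook, frame_size):
    """Turn a waveform into a list of integer tokens."""
    return [nearest_code(f, codebook) for f in make_frames(samples, frame_size)]


def decode(tokens, codebook):
    """Rebuild an approximate waveform by concatenating codebook entries."""
    out = []
    for token in tokens:
        out.extend(codebook[token])
    return out


samples = [0.1, -0.1, 0.9, 1.1, -0.8, -1.2, 0.9, -1.0]
tokens = encode(samples, CODEBOOK, FRAME_SIZE)
approx = decode(tokens, CODEBOOK)
print(tokens)
print(approx)
```

Once speech is expressed as a token sequence like this, generating new speech can be treated as predicting the next tokens, much as a text language model predicts the next words.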
VALL-E is also reported to reproduce the acoustic environment of the original recording well. If the sample sounds like a phone call, the synthesized phrases will sound the same way. In addition, the neural network mimics accents well, at least British and American ones and several European ones.
VALL-E can be used, for example, to simulate the voices of actors or to create voice chatbots. On the other hand, such a neural network can be a powerful tool in the hands of attackers. Fraudsters, for example, could call a person's relatives on the phone and imitate that person's speech after obtaining a three-second recording of a conversation. Fake statements in the voices of politicians and other public figures could also be created.[1]