
Microsoft VALL-E

Product
Developers: Microsoft
System premiere date: January 2023
Industries: Information Technology
Technology: Speech technology

2023: Neural Network Announcement

On January 5, 2023, Microsoft introduced a new artificial intelligence (AI) model that converts text into speech while closely imitating the voice of a specific person. The project is called VALL-E.

Microsoft calls the solution a "neural codec language model." The model can recreate a person's voice from a speech sample only three seconds long. It imitates not only the voice itself but also the speaker's emotional tone.

Microsoft introduced a neural network that can imitate a person's voice

The VALL-E neural network builds on EnCodec, an audio codec technology that Meta (recognized as an extremist organization; activities on the territory of the Russian Federation are prohibited) presented in October 2022. Unlike other text-to-speech methods, which typically synthesize speech by manipulating waveforms, VALL-E analyzes a person's speech and breaks it into discrete components called "tokens." The neural network then uses what it learned during training to synthesize arbitrary phrases in that voice. Training used LibriLight, an audio library assembled by Meta that contains about 60,000 hours of English-language speech from more than 7,000 speakers, drawn mainly from LibriVox public-domain audiobooks.
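
The idea of turning audio into discrete tokens can be illustrated with a toy sketch. This is not EnCodec or VALL-E code: the hand-written four-entry codebook, the frame size, and the function names `tokenize` and `detokenize` are assumptions made purely for illustration. A real neural codec learns its codebooks from data and works on encoded features rather than on raw samples.

```python
import math

FRAME_SIZE = 4  # samples per frame (hypothetical value for the demo)

# Hypothetical hand-made codebook: each entry is one short waveform shape.
# A real codec would learn thousands of such entries from training audio.
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],      # token 0: silence
    [0.0, 0.5, 1.0, 0.5],      # token 1: positive bump
    [0.0, -0.5, -1.0, -0.5],   # token 2: negative bump
    [1.0, 0.0, -1.0, 0.0],     # token 3: fast oscillation
]


def make_tone(freq_hz, num_samples=400, sample_rate=8000):
    """Generate a sine wave as a stand-in for a recorded speech sample."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(num_samples)]


def split_frames(samples, frame_size=FRAME_SIZE):
    """Split a waveform into non-overlapping frames, dropping a ragged tail."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]


def nearest_entry(frame, codebook=CODEBOOK):
    """Return the index of the codebook entry closest to the frame."""
    def sq_dist(k):
        return sum((a - b) ** 2 for a, b in zip(codebook[k], frame))
    return min(range(len(codebook)), key=sq_dist)


def tokenize(samples, codebook=CODEBOOK, frame_size=FRAME_SIZE):
    """Map each frame of the waveform to a discrete token (a codebook index)."""
    return [nearest_entry(f, codebook) for f in split_frames(samples, frame_size)]


def detokenize(tokens, codebook=CODEBOOK):
    """Rebuild an approximate waveform by concatenating codebook entries."""
    samples = []
    for t in tokens:
        samples.extend(codebook[t])
    return samples


if __name__ == "__main__":
    tone_tokens = tokenize(make_tone(440))
    print(len(tone_tokens), "tokens for 400 samples")
    print(tone_tokens[:10])
```

Once audio is a sequence of integers like this, a language model can be trained to predict token sequences much as it would predict words, and a decoder can turn predicted tokens back into sound.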

VALL-E is reported to do an excellent job of recreating the acoustic environment of the original recording. If the sample sounds like a phone call, the synthesized phrases will sound the same way. The neural network also mimics accents well, at least British and American ones as well as several European ones.

VALL-E could be used, for example, to imitate actors' voices or to build voice chatbots. On the other hand, such a neural network could become a powerful tool in the hands of attackers. Fraudsters, for example, could phone a person's relatives and imitate that person's speech after obtaining a three-second recording of a conversation. Fake statements in the voices of politicians and others could also be created.[1]

Notes