
Sber GigaAM (Giga Acoustic Model)

Product
Developers: SaluteDevices (formerly SberDevices)
Premiere date: 2024/04/08
Last Release Date: 2024/12/13
Technology: Speech technology

2024

Support for fine-tuning and inference with Flash Attention

On December 13, 2024, Sberbank announced a major update to GigaAM (Giga Acoustic Model), its family of open-source machine learning models for speech and emotion recognition.

According to Fyodor Minkin, CTO of GigaChat, the updated GigaAM acoustic models feature improved data preparation and an improved training procedure for the base model. This made it possible to significantly reduce the Word Error Rate (WER) when recognizing Russian-language queries: on this metric, the strongest model in the GigaAM-RNNT family is 25% better than the previous version and 56% better than OpenAI Whisper-large-v3.
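
For context, Word Error Rate counts word-level substitutions, deletions, and insertions relative to a reference transcript and divides by the number of reference words. A minimal sketch of the computation, using the third-party jiwer library and made-up sentences rather than Sberbank's benchmark data:

# Illustration of the WER metric cited above; the sentences are invented examples.
import jiwer

reference = "включи свет на кухне"      # what was actually said
hypothesis = "включи свет на кухня"     # what the recognizer produced

# (substitutions + deletions + insertions) / number of reference words
wer = jiwer.wer(reference, hypothesis)
print(f"WER = {wer:.2%}")               # 1 wrong word out of 4 -> 25.00%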

In addition, thanks to a switch to a different positional encoding, the updated GigaAM model line supports fine-tuning and inference with Flash Attention, which gives a significant speed-up on modern GPUs, Sberbank noted. To make the models easier to adopt, the team simplified the code, reduced the number of dependencies, and prepared conversion to the ONNX format (Open Neural Network Exchange). The updated models are published under the MIT license, which allows commercial use.
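
The article does not include code, but the pieces it names are standard tools: in PyTorch, Flash Attention kernels are typically reached through torch.nn.functional.scaled_dot_product_attention, and ONNX conversion is usually done with torch.onnx.export plus onnxruntime for inference. A hedged sketch with a toy stand-in encoder, not the real GigaAM architecture:

# Sketch: export an acoustic encoder to ONNX and run it with onnxruntime.
# ToyEncoder is a placeholder module, not GigaAM's actual architecture.
import torch
import torch.nn as nn
import onnxruntime as ort

class ToyEncoder(nn.Module):
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, features):           # features: (batch, time, n_mels)
        return self.net(features)          # frame embeddings: (batch, time, hidden)

model = ToyEncoder().eval()
dummy = torch.randn(1, 200, 64)            # 200 feature frames of a fake utterance

torch.onnx.export(
    model, dummy, "encoder.onnx",
    input_names=["features"], output_names=["embeddings"],
    dynamic_axes={"features": {0: "batch", 1: "time"}},  # variable batch and audio length
)

session = ort.InferenceSession("encoder.onnx")
embeddings = session.run(None, {"features": dummy.numpy()})[0]
print(embeddings.shape)                    # (1, 200, 128)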

Introduction of GigaAM

On April 8, 2024, SberDevices introduced GigaAM, a family of open source machine learning models for speech and emotion recognition.

GigaAM is an audio foundation model trained on a large and diverse corpus of Russian speech. It is well suited for adaptation to various audio tasks, including speech recognition, emotion recognition, speaker identification, and others.
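
The adaptation pattern described here is the usual one for foundation models: freeze the pretrained encoder and train a small task-specific head on its embeddings. A sketch under that assumption, with a placeholder encoder standing in for GigaAM (whose weights are loaded through the project's own tooling, not reproduced here):

# Freeze a pretrained audio encoder, train only a small head on top of it.
# The GRU below is a stand-in for the real pretrained encoder.
import torch
import torch.nn as nn

pretrained_encoder = nn.GRU(input_size=64, hidden_size=256, batch_first=True)
for p in pretrained_encoder.parameters():
    p.requires_grad = False                # keep foundation weights frozen

head = nn.Linear(256, 4)                   # trainable head: e.g. 4 emotion classes
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, 300, 64)         # fake batch: 8 clips, 300 frames, 64 mel bins
labels = torch.randint(0, 4, (8,))         # fake emotion labels

embeddings, _ = pretrained_encoder(features)   # (8, 300, 256)
clip_vectors = embeddings.mean(dim=1)          # average over time -> one vector per clip
loss = loss_fn(head(clip_vectors), labels)
loss.backward()
optimizer.step()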

GigaAM-CTC is an open model for recognizing Russian-language speech. In quality evaluations on 7 data sets (from smart-speaker queries to telephone-channel recordings), the model makes 20-35% fewer word errors on short queries than popular solutions such as NeMo-Conformer-RNNT and Whisper-large-v3.
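
CTC here stands for Connectionist Temporal Classification: the model emits a symbol (or a blank) for every audio frame, and the text is obtained by collapsing repeats and dropping blanks. A toy greedy-decoding sketch with an invented vocabulary (GigaAM's real tokenizer is different):

# CTC greedy decoding: best symbol per frame, collapse repeats, drop blanks.
BLANK = "_"
vocab = [BLANK, "п", "р", "и", "в", "е", "т"]

# Pretend per-frame argmax indices from an acoustic model for the word "привет".
frame_ids = [1, 1, 0, 2, 3, 3, 0, 4, 5, 5, 0, 6]

decoded = []
prev = None
for i in frame_ids:
    symbol = vocab[i]
    if symbol != BLANK and symbol != prev:  # skip blanks and repeated frames
        decoded.append(symbol)
    prev = symbol

print("".join(decoded))                     # -> привет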

GigaAM-Emo is an acoustic model for emotion recognition. It showed the best result among publicly known models on the Dusha dataset. All models are publicly available under a non-commercial license and can be used, for example, in preparing dissertations and scientific papers.

Improved versions of these models are available to businesses through Sber's speech synthesis and recognition platform, the SaluteSpeech API; individuals can also use them in the SaluteSpeech App.