Sber ruRoBERTa Language model

Product
Developers: SberDevices, Sberbank
Technology: Speech Technology

Main article: Speech technology: on the way from recognition to understanding

2021: SberDevices language models recognized as the best at understanding texts in Russian

The finetuned ruRoberta-large model developed by SberDevices was recognized as the best at text understanding by Russian SuperGLUE, the Russian-language benchmark for evaluating large text models, second only to humans in accuracy. The top six also included four more SberDevices models: ruT5-large-finetune, ruBert-large finetune, ruT5-base-finetune and ruBert-base finetune. Sberbank announced this on August 25, 2021.

After training the ruBERT language model, Sber began developing a more advanced version, ruRoBERTa. Architecturally it is the same BERT, trained on a large text corpus solely on the task of restoring masked tokens, with a larger batch size and the BBPE tokenizer from the ruGPT-3 neural network. Training the model on the Christofari supercomputer took three weeks; the final dataset (250 GB of text) was similar to the one used for ruGPT-3, but English text and part of the "dirty" Common Crawl were removed from it.
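The masked-token objective mentioned above can be sketched as follows. This is a minimal illustration, not Sber's actual training code: the 15% masking rate and the 80/10/10 corruption split are the standard BERT recipe and are assumed here, and the toy vocabulary is hypothetical.

```python
import random

MASK = "[MASK]"
# Hypothetical toy vocabulary used only for the "random token" branch.
VOCAB = ["кот", "сидит", "на", "окне", "дом", "стол"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: pick ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% become a random token, and 10% are
    left unchanged. Returns (corrupted tokens, {position: original token});
    the model is then trained to restore the originals at those positions."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: token stays as-is but is still a prediction target
    return corrupted, targets

corrupted, targets = mask_tokens(["кот", "сидит", "на", "окне"] * 5)
```

Unlike the original BERT, which also used a next-sentence-prediction objective, the RoBERTa recipe trains on this masked-token task alone.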

A model's place in the Russian SuperGLUE rating depends on how well the neural network performs tasks testing logic, common sense, goal-setting and understanding of the meaning of a text. It is an open project used by data scientists working with Russian-language neural networks.

The leaderboard's assessment of general language understanding begins with a diagnostic dataset (LiDiRus), a set of tests covering various linguistic phenomena. It shows how well the finetuned ruRoberta-large model understands particular features of the Russian language. A high LiDiRus score suggests that the model is not merely memorizing tasks or guessing the answer, but is actually learning diverse phenomena of the language.

Each model is also evaluated on various tasks, including DaNetQA, a set of common-sense and knowledge questions with "yes" or "no" answers; RCB (Russian Commitment Bank), classification of the presence of causal links between a text and a hypothesis drawn from it; and PARus (Plausible Alternatives for Russian), choosing the more plausible of two alternatives on the basis of common sense.
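Scoring a DaNetQA-style yes/no task, and combining per-task scores into an overall leaderboard figure, can be illustrated with a toy computation. The predictions and gold answers below are hypothetical, and the simple mean over tasks is an assumption for illustration; Russian SuperGLUE uses its own evaluation scripts and metrics.

```python
def accuracy(predictions, gold):
    """Fraction of answers the model got right on one task."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def overall_score(task_scores):
    """Toy aggregate: plain mean of per-task scores."""
    return sum(task_scores.values()) / len(task_scores)

# Hypothetical model answers vs. reference answers for five DaNetQA questions.
preds = ["yes", "no", "no", "yes", "yes"]
gold = ["yes", "no", "yes", "yes", "no"]

scores = {
    "DaNetQA": accuracy(preds, gold),  # 3 of 5 correct -> 0.6
    "RCB": 0.5,    # hypothetical scores for the other tasks
    "PARus": 0.7,
}
total = overall_score(scores)
```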

"Sber specialists have been improving neural networks for the Russian language for several years. For their objective assessment, we developed the Russian SuperGLUE leaderboard, which clearly shows progress in this work. Our ultimate goal is to create reliable intelligent systems for solving diverse problems in Russian, which could become the forerunners of domestically built strong artificial intelligence," said David Rafalovsky, executive vice president of Sberbank, CTO of Sber and head of the Technology block.