Developers: Moscow State University (MSU)
Date of the premiere of the system: December 2023
Branches: Education and Science
2023: Creating a neural network
On December 22, 2023, Russian researchers from the Institute of Artificial Intelligence of Lomonosov Moscow State University announced the development of SciRus-tiny, a neural network designed to produce semantic vector representations (embeddings) of scientific texts in Russian. The model is suited to a wide range of applied tasks, from search and classification to the extraction of scientific terms.
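To illustrate how embeddings support a search task of the kind mentioned above, here is a minimal sketch that ranks documents by cosine similarity to a query vector. It does not call SciRus-tiny itself: the vectors are made-up toy values, and the function names and document identifiers are hypothetical, chosen only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumption: toy 4-dimensional vectors stand in for real model embeddings.
documents = {
    "paper_on_graphene": [0.9, 0.1, 0.0, 0.2],
    "paper_on_protein_folding": [0.1, 0.8, 0.3, 0.0],
    "paper_on_carbon_nanotubes": [0.8, 0.2, 0.1, 0.3],
}
query = [1.0, 0.0, 0.0, 0.2]

ranking = rank_by_similarity(query, documents)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a real system the vectors would come from the embedding model, and a vector index would usually replace the linear scan once the collection grows large.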
The eLibrary portal took part in the project. The neural network, which has been made publicly available, will form the basis of a search and recommendation system for scientists; testing of that system is due to begin in 2024. The trained model is reported to achieve high metric values despite its small number of parameters. This lowers the computing resources it requires, so SciRus-tiny can perform tasks efficiently under high load.
The SciRus-tiny model was trained on a corpus of 1.5 billion tokens of scientific texts in Russian and English. It is a RoBERTa-architecture model with 29 million parameters and an embedding dimension of 312. The model's vocabulary contains 50,265 tokens, and the maximum context length is 2 thousand tokens. SciRus-tiny is the first in a family of models for producing semantic embeddings of scientific texts in different languages.
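The figures above can be put in perspective with some back-of-the-envelope arithmetic. The sketch below assumes the usual transformer layout, in which the token embedding table has one row per token and one column per embedding dimension, and it assumes embeddings are stored as 32-bit floats. Neither detail is stated in the source, and the one-million-document collection is a hypothetical size picked for illustration.

```python
VOCAB_SIZE = 50_265        # tokens, from the article
EMBEDDING_DIM = 312        # from the article
TOTAL_PARAMS = 29_000_000  # rounded figure from the article

# Assumption: the token embedding table has shape (VOCAB_SIZE, EMBEDDING_DIM).
token_embedding_params = VOCAB_SIZE * EMBEDDING_DIM
embedding_share = token_embedding_params / TOTAL_PARAMS

# Assumption: each output embedding is stored as float32 (4 bytes per value).
BYTES_PER_FLOAT32 = 4
bytes_per_embedding = EMBEDDING_DIM * BYTES_PER_FLOAT32

# Hypothetical collection size, used only to show the scale of storage.
NUM_DOCUMENTS = 1_000_000
collection_bytes = NUM_DOCUMENTS * bytes_per_embedding

print(f"Token embedding table: {token_embedding_params:,} parameters")
print(f"Share of the rounded total: {embedding_share:.0%}")
print(f"One stored embedding: {bytes_per_embedding:,} bytes")
print(f"{NUM_DOCUMENTS:,} embeddings: {collection_bytes / 1e9:.2f} GB")
```

Under these assumptions the token embedding table alone accounts for roughly half of the parameter budget, and a million stored vectors fit in a little over a gigabyte, which is consistent with the article's point about modest resource requirements.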
As part of the project, the participants also publicly released ruSciBench, a benchmark for evaluating embeddings of scientific texts. The benchmark consists of 14 tasks built on parallel abstracts (almost 400 thousand) in Russian and English.
"For general-language topics, there are many multilingual benchmarks (sets of test tasks) for assessing the quality of embeddings produced by different models. With these benchmarks, you can compare models and choose the right one for your task. Unfortunately, for embeddings of scientific texts the choice is not as wide, especially for the Russian language. Thanks to the data provided to us by the eLibrary portal, we were able to take the next step and prepare the ruSciBench benchmark, which contains much more data on more topics," the researchers say.[1]