SciRus-tiny

Product

Developers:	Moscow State University (MSU)
Release date:	December 2023
Industries:	Education and Science

2023: Creating a neural network

On December 22, 2023, researchers from the Institute of Artificial Intelligence at Lomonosov Moscow State University announced SciRus-tiny, a neural network that produces semantic vector representations (embeddings) of scientific texts in Russian. The model is suited to a wide range of applied tasks, from search and classification to the extraction of scientific terms.
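To illustrate how embeddings support search, here is a minimal sketch that ranks documents by cosine similarity to a query vector. It is not the SciRus-tiny code: the function names, document identifiers and three-dimensional vectors are all hypothetical stand-ins for real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
docs = {
    "paper_on_catalysis": [0.9, 0.1, 0.0],
    "paper_on_topology": [0.0, 0.2, 0.9],
    "paper_on_enzymes": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a real search system the vectors would come from the model and the ranking would usually run over an approximate nearest-neighbour index rather than a full scan.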

The eLibrary portal took part in the project. The neural network, which has been released publicly, will form the basis of a search and recommendation system for scientists; testing of that system is due to begin in 2024. The trained model is reported to score well on quality metrics despite its small number of parameters, which lowers its computing requirements and lets SciRus-tiny handle tasks efficiently under high load.

SciRus-tiny was trained on a corpus of 1.5 billion tokens of scientific texts in Russian and English. It uses the RoBERTa architecture, with 29 million parameters and an embedding dimension of 312. The model's vocabulary contains 50,265 tokens, and the maximum context length is 2 thousand tokens. SciRus-tiny is the first in a family of models for producing semantic embeddings of scientific texts in different languages.
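The figures above allow a rough back-of-the-envelope check of where the parameters sit. The sketch below makes two assumptions that the source does not state: that the token embedding table has one 312-dimensional row per vocabulary entry, as is usual for RoBERTa-style models, and that weights are stored as 32-bit floats. The 29 million total is itself a rounded figure, so the results are estimates only.

```python
VOCAB_SIZE = 50_265        # vocabulary size, from the article
EMBEDDING_DIM = 312        # embedding dimension, from the article
TOTAL_PARAMS = 29_000_000  # rounded total, from the article
BYTES_PER_FLOAT32 = 4      # assumption: weights stored as float32

# Assumption: one EMBEDDING_DIM-sized row per vocabulary token.
token_embedding_params = VOCAB_SIZE * EMBEDDING_DIM
share_of_total = token_embedding_params / TOTAL_PARAMS
weights_megabytes = TOTAL_PARAMS * BYTES_PER_FLOAT32 / 1_000_000

print(f"Token embedding table: {token_embedding_params:,} parameters")
print(f"Share of all parameters: {share_of_total:.0%}")
print(f"Approximate weight size: {weights_megabytes:.0f} MB")
```

Under these assumptions the token embedding table alone would hold roughly half of the model's parameters, and the full set of weights would take on the order of a hundred megabytes.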

As part of the project, the participants also publicly released ruSciBench, a benchmark for evaluating embeddings of scientific texts. It consists of 14 tasks built on almost 400 thousand parallel abstracts in Russian and English.
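The article does not list the 14 tasks. One kind of evaluation that parallel abstracts make possible is cross-language retrieval: given the embedding of a Russian abstract, check whether its nearest English embedding is its own translation. The sketch below is an illustration of that idea only, not the ruSciBench code; the function names and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def recall_at_1(ru_vecs, en_vecs):
    """Fraction of Russian vectors whose nearest English vector is their pair.

    Assumes ru_vecs[i] and en_vecs[i] embed the same abstract.
    """
    if not ru_vecs or len(ru_vecs) != len(en_vecs):
        raise ValueError("expected two non-empty lists of equal length")
    hits = 0
    for i, ru in enumerate(ru_vecs):
        nearest = max(range(len(en_vecs)), key=lambda j: cosine(ru, en_vecs[j]))
        if nearest == i:
            hits += 1
    return hits / len(ru_vecs)


# Toy embeddings (hypothetical values); the third pair is deliberately mismatched.
ru = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
en = [[0.9, 0.1], [0.1, 0.9], [1.0, -0.5]]

score = recall_at_1(ru, en)
print(f"recall@1 = {score:.2f}")
```

A model that places translations close together in the embedding space scores near 1.0 on such a measure; a model that ignores meaning across languages scores near chance.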

"For general-language topics, there are many multilingual benchmarks (sets of test tasks) for assessing the quality of embeddings produced by different models. With these benchmarks, you can compare models and choose the right one for your task. Unfortunately, for embeddings of scientific texts the choice is far narrower, especially for Russian. Thanks to the data provided to us by the eLibrary portal, we were able to take the next step and prepare the ruSciBench benchmark, which contains much more data on more topics," the researchers say.^[1]

Notes

↑ themes/uchenye-mgu-obuchili-neyroset-analizu-nauchnykh-tekstov.html MSU scientists trained a neural network to analyze scientific texts

Source: «https://tadviser.com/index.php/Product:SciRus-tiny»
