The name of the base system (platform): Artificial intelligence (AI)
Developers: Higher School of Economics (HSE)
Date of the premiere of the system: 2024/12/06
Branches: Education and Science
Technology: Speech technology
The main articles are:
- Neural networks
- Speech Recognition (Technology, Market)
- Speech technology: On the path from recognition to understanding
2024: AIpom System Introduction
A team of researchers including Alexander Shirnin from the Higher School of Economics has created two models for detecting AI-generated passages in scientific texts. The AIpom system combines two types of models, a decoder and an encoder, which allows it to find generated insertions more effectively. The Papilusion system is suited to recognizing edits made with synonyms and short summaries generated by a neural network; it relies on models of a single type, encoders. In the future, such models will help verify the originality and reliability of scientific publications. HSE announced this on December 6, 2024.
The more popular language models such as ChatGPT or GigaChat become, and the more they are used, the harder it is to distinguish original human-written text from generated text. Scientific publications and student theses are already being written with the help of artificial intelligence, so it is important to develop tools that help identify AI-generated insertions in texts.
"Models do a good job with familiar topics, but if you give them a new topic, the results get worse. It is like a student who has learned to solve one type of problem but cannot solve a problem on an unfamiliar topic or from another subject as easily and correctly," said Alexander Shirnin, one of the authors of the article and a research trainee at the Scientific and Educational Laboratory of Models and Methods of Computational Pragmatics, Faculty of Computer Science, HSE.
To improve the efficiency of the system, the researchers decided to combine two models: a decoder and an encoder. At the first stage, a decoder was used, a neural network that receives an instruction plus the original text as input and outputs the fragment of text presumably generated by AI. Then the point in the original text where, according to the model's prediction, the generated fragment begins was marked with a <BREAK> label. The encoder worked with the text marked at the first stage and refined the decoder's predictions: it classified each token, the minimum unit of text such as a word or part of a word, and indicated whether it was written by a person or by AI.
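For illustration only, here is a minimal sketch of how such a decoder-plus-encoder pipeline could be wired together with the Hugging Face transformers library; the checkpoint names, the prompt, and the <BREAK> handling are placeholders and assumptions, not the authors' actual implementation.

```python
# Rough sketch of a two-stage detector: a decoder proposes the AI-generated
# fragment, an encoder then labels every token as human- or AI-written.
# Checkpoint names and the prompt are placeholders, not the authors' models.
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          AutoModelForTokenClassification)
import torch

text = "Introduction written by a human ... possibly generated continuation ..."

# Stage 1: the decoder receives an instruction plus the text and outputs the
# fragment it believes was generated by AI.
dec_name = "some/instruction-tuned-decoder"          # placeholder checkpoint
dec_tok = AutoTokenizer.from_pretrained(dec_name)
decoder = AutoModelForCausalLM.from_pretrained(dec_name)
prompt = f"Find the machine-generated part of this text:\n{text}\nGenerated part:"
inputs = dec_tok(prompt, return_tensors="pt")
out = decoder.generate(**inputs, max_new_tokens=128)
predicted_fragment = dec_tok.decode(out[0, inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True).strip()

# Mark the predicted boundary in the original text with a <BREAK> label.
start = text.find(predicted_fragment)
marked = text if start == -1 else text[:start] + " <BREAK> " + text[start:]

# Stage 2: the encoder refines the prediction by classifying every token
# as human-written (0) or AI-generated (1).
enc_name = "some/token-classification-encoder"       # placeholder checkpoint
enc_tok = AutoTokenizer.from_pretrained(enc_name)
encoder = AutoModelForTokenClassification.from_pretrained(enc_name, num_labels=2)
enc_inputs = enc_tok(marked, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = encoder(**enc_inputs).logits             # shape (1, seq_len, 2)
token_labels = logits.argmax(-1)[0].tolist()          # 0 = human, 1 = AI
```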
The Papilusion system also distinguished human-written text from generated text. With its help, sections of text are assigned to one of four categories: written by a person, corrected using synonyms, generated by the model, or briefly summarized. The task was to correctly determine each category; the number of categories and the length of the insertions varied from text to text.
In this case, the developers used three models, but all of one type: encoders. Each was trained to predict one of the four categories for every token in the text, and the models were trained independently of one another. When a model made a mistake, it was penalized and trained further, while the lower layers of the model were kept frozen.
{{Quote 'Each model has a different number of layers depending on the architecture. When we train the model, we can leave, for example, the first ten layers untouched and change the weights only in the last two. This is done so that training does not wipe out some of the important information stored in the first layers. You can compare it to an athlete who makes a mistake with a hand movement: we need to explain only that to him, not erase his knowledge and retrain him from scratch, because then he might forget how to move properly at all. The same logic works here. The method is not universal and may be ineffective on some models, but in our case it worked,' said Alexander Shirnin.}}
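As an illustration of the freezing technique described above, here is a minimal sketch of fine-tuning an encoder for four-category token classification while keeping its lower layers frozen; the checkpoint, the number of frozen layers, and the label names are assumptions rather than details from the article.

```python
# Sketch: fine-tune an encoder for 4-way token classification while keeping
# its lower layers frozen so previously learned representations are preserved.
# The checkpoint, label set and "first 10 layers" are illustrative choices.
from transformers import AutoModelForTokenClassification
import torch

labels = ["human", "synonym_edit", "generated", "summarized"]
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

# Freeze the embeddings and the first 10 encoder layers; only the upper
# layers and the classification head will receive gradient updates.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:10]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the parameters that remain trainable are passed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5)
```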
As the researchers note, AI-detection models work well but still have limitations: above all, they handle data outside the training distribution poorly, and in general there is a shortage of diverse data for training such models.
"To get more data, you need to focus on collecting it, and both companies and laboratories are doing this. For this type of task specifically, you need to collect datasets in which the texts involve several AI models and several editing methods. That is, not simply continue a text with one model, but create more realistic situations: somewhere ask the model to supplement the text, rewrite the beginning so that it fits better, remove something from it, or try to generate part of it in a new style with a different prompt (instruction) for the model. It is also important, of course, to collect data in other languages and on different topics," added Alexander Shirnin.
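A rough sketch of what assembling such more varied training examples could look like; the generate function is a placeholder for whichever language model is used, and the prompts, operation names, and span labelling are purely illustrative, not part of the described work.

```python
# Sketch: creating more varied training examples by asking a language model
# to edit a human-written text in different ways. `generate` is a placeholder
# for whatever model API is used; prompts and field names are illustrative.
def generate(prompt: str) -> str:
    """Placeholder for a call to some language model."""
    raise NotImplementedError

EDIT_PROMPTS = {
    "continue":   "Continue this text:\n{text}",
    "supplement": "Add a paragraph that fits this text:\n{text}",
    "restyle":    "Rewrite this text in a more formal style:\n{text}",
    "summarize":  "Briefly retell this text:\n{text}",
}

def make_examples(human_text: str) -> list[dict]:
    """Produce one labelled example per edit operation for a single source text."""
    examples = []
    for op, template in EDIT_PROMPTS.items():
        ai_text = generate(template.format(text=human_text))
        if op in ("continue", "supplement"):
            # The AI part is appended, so its span is everything after the original.
            text = human_text + " " + ai_text
            ai_span = (len(human_text) + 1, len(text))
        else:
            # The model's output replaces the original, so the whole text is AI-edited.
            text, ai_span = ai_text, (0, len(ai_text))
        examples.append({"text": text, "operation": op, "ai_span": ai_span})
    return examples
```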