Why is the Russian State Library teaching a machine to read newspapers?
The meeting of the Russian State Library (RSL) and School 21, which trains IT specialists using innovative educational methods, on a joint project turned out to be a fortunate one: tasks involving library collections, which are highly relevant for the RSL, are ideal for honing the skills of School 21 students. The meeting took place at a hackathon organized by the RSL at the School 21 site on November 28-29, 2020.
The RSL's digitized collections offer a broad field for applying advanced computational linguistics technologies to problems of varying complexity. The first area is modern information support for bibliographic work: developing advanced digital catalogs, up to and including refining the parameters of literary sources. The second is support for the research of specialists such as historians and literary scholars who work with literary texts. The third is the research of RSL staff, who are actively participating in a large federal project to create a representative corpus of the Russian language.
Why the library needs advanced IT
"Library and archival collections are the true wealth, memory and history of our country," emphasizes Vadim Duda, Director General of the RSL. "The former Leninka today holds about 47 million documents, which are actively being converted to digital form. The challenge of the modern information society is to integrate our documents, information, and knowledge into the digital space. Readers need convenient, modern navigation across the entire collection. This requires not only the "meta-fields" of bibliographic descriptions but also the ability to work with full texts, marking them up and tagging them dynamically in a modern context, within today's information field and systems of scientific classification."
Corpus linguistics is a new direction in linguistics, directly tied to information technology. A corpus is a collection of texts in a particular language in electronic form, supplied with special markup. Markup can be of different kinds, for example grammatical markup, which assigns each word its grammatical parameters. The ultimate goal of the Russian project is to create a so-called complete electronic corpus of the Russian language: a large, representative sample of texts as diverse as possible (prose and poetry, official documents and letters, and so on), all supplied with special markup.
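As a minimal sketch of what grammatical markup looks like as data, the Python fragment below stores each word with its lemma, part of speech, and grammatical parameters, and then filters words by those parameters. The `Token` class, the `find` helper, the tag names, and the hand-annotated sample phrase are illustrative assumptions, not the markup scheme of the RSL project.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str            # word as it appears in the text
    lemma: str           # dictionary form
    pos: str             # part of speech
    features: frozenset  # grammatical parameters (hypothetical tag names)

# Hand-annotated sample phrase "мама мыла раму" ("mom washed the frame").
SENTENCE = [
    Token("мама", "мама", "NOUN", frozenset({"nom", "sg", "fem"})),
    Token("мыла", "мыть", "VERB", frozenset({"past", "sg", "fem"})),
    Token("раму", "рама", "NOUN", frozenset({"acc", "sg", "fem"})),
]

def find(tokens, pos=None, features=()):
    """Return tokens that match a part of speech and carry all required features."""
    required = set(features)
    return [t for t in tokens
            if (pos is None or t.pos == pos) and required <= t.features]

# Lemmas of all nouns in the accusative case.
accusative_nouns = [t.lemma for t in find(SENTENCE, pos="NOUN", features=["acc"])]
```

With markup of this kind, a query such as "all accusative nouns" becomes a simple filter over tokens rather than a search over raw text.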
According to Yuri Apresyan, an outstanding Russian scholar of natural-language semantics, in the information era a national corpus becomes as necessary an element of a rigorous scientific description of a language as a dictionary and a grammar. Indeed, the appearance of language corpora can be compared to a revolution in linguistics: it becomes possible to analyze texts along many different dimensions practically in real time, and to do so on real, "living" texts. First, however, serious work is needed to prepare the texts.
"For the newspaper Krasnaya Zvezda, we need fairly complex algorithms that will turn a scan of a newspaper page into a structure of related elements (texts, headlines, illustrations), that is, an XML structure," Vadim Duda said, describing the current direction of work with the RSL newspaper collection.
The second task is to extract geographical names, awards, proper names, dates, and other information from this data array.
"As a result, from the scan of a newspaper page we get an incredibly valuable body of information to work with. In fact, we are laying the foundation of a completely new library discipline: digital bibliography!" said Vadim Duda.
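A rough sketch of what such a structure could look like is given below, using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`page`, `article`, `headline`, `text`, `illustration`, `entity`), the `build_page_xml` helper, and the placeholder content are all assumptions made for illustration; the article does not describe the actual schema used by the RSL.

```python
import xml.etree.ElementTree as ET

def build_page_xml(issue_date, page_number, articles):
    """Assemble one recognized page into a tree of related elements."""
    page = ET.Element("page", {"issue": issue_date, "number": str(page_number)})
    for art in articles:
        article = ET.SubElement(page, "article")
        ET.SubElement(article, "headline").text = art["headline"]
        ET.SubElement(article, "text").text = art["text"]
        for caption in art.get("illustrations", []):
            ET.SubElement(article, "illustration", {"caption": caption})
        # Entities from the second task: places, awards, names, dates.
        for kind, value in art.get("entities", []):
            ET.SubElement(article, "entity", {"type": kind}).text = value
    return page

# Placeholder input standing in for the output of page segmentation and OCR.
sample_articles = [{
    "headline": "Placeholder headline",
    "text": "Placeholder body text mentioning Moscow.",
    "illustrations": ["Placeholder caption"],
    "entities": [("place", "Moscow")],
}]

page = build_page_xml("1942-01-01", 1, sample_articles)  # hypothetical issue date
xml_string = ET.tostring(page, encoding="unicode")
```

Keeping entities as child elements of the article they occur in preserves the link between a name or date and its surrounding text, which is what makes the array usable for bibliographic work.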
"The layout is extremely dense, because as much information as possible had to fit on the newspaper page, and the typeface is unstable. In addition, the newspaper archive suffered flooding. All this significantly degrades the quality of automatic text recognition programs," explains Ilya Kutukov.
"The language sounds unusual to the modern ear: a thesaurus of political information with a large number of abbreviations, neologisms, and vocabulary specific to that time. Working with the digitized texts using software that can extract entities, we saw how the vocabulary of the language changed as World War II unfolded."
The hackathon with School 21 students was devoted specifically to the problems of further work with the digitized wartime issues of Krasnaya Zvezda.
The RSL and School 21 hackathon
Among the parameters that characterize any literary work, dating holds an important place. On the one hand, it matters for bibliography: when did the author live and work? For the RSL this question is also a practical one: researching the copyright on a work that is being made publicly available takes a lot of time and effort. On the other hand, time is one of the basic parameters for analyzing the content of a work, as well as for research based on newspaper sources: it helps establish causal relations between people, objects, and events. In other words, it helps answer the basic questions: who, where, when? "We will not be able to move forward in our work in the digital space, either in cataloging or in research, if we cannot work with dates," Ilya Kutukov explained.
At the same time, identifying dates and determining their exact position on the timeline is a very difficult task. Current commercial text analytics systems are generally good at identifying dates written in the formats used in modern electronic documents. However, a century indicated in Roman numerals on a paper newspaper page is a serious test even at the recognition stage. A classic example: the Roman numeral XVIII is recognized as the abbreviation "HUSH" (that is, as the Cyrillic letters ХУШ). Commercial text analytics systems are also not trained to identify archaic ways of indicating time, for example "R.K." or "n.v." And relative references such as "the first month of spring" or "two weeks after Christmas...", not to mention referential dates like "shortly before...", are entirely beyond them.
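The Roman-numeral problem can be illustrated with a small Python sketch that maps Cyrillic look-alike letters back to Roman glyphs before parsing the number. The look-alike table is an assumed, incomplete mapping chosen for illustration, and the function names are hypothetical; none of this is the method used by the hackathon teams. The parser does not reject malformed numerals such as "IIII".

```python
# Cyrillic capitals that OCR may emit in place of Roman-numeral glyphs
# (an assumed, incomplete mapping for illustration).
LOOKALIKES = str.maketrans({
    "\u0425": "X",    # Cyrillic Kha
    "\u0423": "V",    # Cyrillic U
    "\u0428": "III",  # Cyrillic Sha
    "\u041f": "II",   # Cyrillic Pe
    "\u0406": "I",    # Cyrillic dotted I
})

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral):
    """Convert a Roman numeral to an integer using the subtractive rule."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (IV, IX, XC, ...).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total

def century_from_ocr(token):
    """Return the century number for an OCR token, or None if it is not a numeral."""
    normalized = token.upper().translate(LOOKALIKES)
    if not normalized or any(ch not in ROMAN for ch in normalized):
        return None
    return roman_to_int(normalized)

def century_to_years(century):
    """Return the first and last year of a century, e.g. 18 -> (1701, 1800)."""
    return 100 * (century - 1) + 1, 100 * century
```

In this sketch the Cyrillic string ХУШ is normalized to XVIII and parsed as 18, which `century_to_years` then places on the timeline as the years 1701 to 1800. A real system would also need context to decide whether a given token is a numeral or an ordinary abbreviation.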
"We need to find every mention of time in a document and bring it to a unified representation on a single timeline, so that individual texts can then be correlated with one another," Ilya Kutukov said, explaining the task.
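One possible unified representation is a closed interval of calendar dates: an exact date becomes a one-day interval, a month becomes a month-long interval, and a relative expression is resolved against a context year. The Python sketch below illustrates the idea under several assumptions: mentions are in English, only three patterns are handled, spring is taken to begin in March, and the `normalize` and `overlaps` helpers are hypothetical names, not part of any hackathon solution.

```python
import calendar
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: i for i, name in enumerate(MONTH_NAMES, start=1)}

def month_span(year, month):
    """Return the first and last day of a month as an interval."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)

def normalize(mention, context_year):
    """Map a time mention to an inclusive (start, end) interval, or None."""
    text = mention.strip().lower()
    m = re.fullmatch(r"(\d{1,2}) ([a-z]+) ([1-9]\d{3})", text)
    if m and m.group(2) in MONTHS:
        try:
            day = date(int(m.group(3)), MONTHS[m.group(2)], int(m.group(1)))
        except ValueError:  # e.g. "31 february 1943"
            return None
        return day, day
    m = re.fullmatch(r"([a-z]+) ([1-9]\d{3})", text)
    if m and m.group(1) in MONTHS:
        return month_span(int(m.group(2)), MONTHS[m.group(1)])
    if text == "the first month of spring":
        # Assumption: spring starts in March; the year comes from context.
        return month_span(context_year, 3)
    return None  # unresolved mentions such as "shortly before..."

def overlaps(a, b):
    """True if two inclusive intervals share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Once every mention is an interval, two texts can be correlated by checking whether their intervals overlap, regardless of how the dates were originally written.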
Because the task was so complex, developers were free to use any available tools and ready-made libraries. To test their solutions, teams were given access to the open part of the current version of the language corpus being created by RSL specialists.
"In fact, you solved the problem of automatic dataset verification. This is a very necessary task."
Indeed, the results of digitizing paper works need to be checked, but the library, of course, does not have the resources for this. That makes the task of identifying errors in digitized texts a relevant one.
"This is not what the task asked for, but it is what we need for our work."
In general, according to his estimates, many teams approached the task with unexpected thoroughness and achieved good results. Following the hackathon, the organizers from the RSL will create a dedicated open repository where the participants' code will be published along with datasets prepared by the RSL. "Our ultimate goal is maximum open access to the RSL's structured information," he explained, adding that the teams participating in the hackathon can rightfully be called contributors to the national corpus of the Russian language. "Machine readability of the Russian language is our joint work with you, and it is very important for studying how our language functions," Ilya Kutukov emphasized.