Processing of natural language documents and texts
Intelligent automatic processing of natural language texts is an area of software development that traditionally occupies numerous teams of researchers and developers in different countries. In the corporate sector, it covers a wide range of capabilities for processing text documents: from structured forms to contract texts and correspondence with clients on a corporate forum. Which of these capabilities have become routine, commonplace services today, and which are at the cutting edge of intelligent processing?
According to analysts at the MIPT Artificial Intelligence Competence Center, authors of the Artificial Intelligence Almanac 2021 published in April, the natural language processing (NLP) solutions segment occupies 32.8% of the total Russian AI market.
For the global AI market, NLP solutions are also a significant segment. According to analysts at Statista, its annual growth rate will remain at 20.3% until 2026. According to Frost & Sullivan researchers, the natural language processing technology market should reach $43.5 billion by 2024.
The modern (fifth) stage of research and development in this field is characterized by the automatic processing of real documents and web content in general, rather than artificial (model) texts; by the processing of multilingual collections of documents rather than single texts; and by the fact that the documents being processed contain typos, spelling errors, agrammatisms and other real obstacles to their correct interpretation.
In addition, the specialist emphasizes, the purpose of processing a document is not just to obtain an internal representation of its meaning, but to present the results in formats convenient for effectively storing knowledge, taking into account its constant replenishment and subsequent use.
Modern information retrieval and analytical systems work with text in broad or unrestricted subject areas, that is, areas covering thousands of different classes of entities connected by an unlimited variety of relationships. Therefore, as Natalya Lukashevich, professor at the Department of Theoretical and Applied Linguistics of Moscow State University, notes in her article in the journal "Ontology of Design" (Vol. 5, No. 1, 2015), a characteristic feature of modern methods of text processing in such systems is the minimal use of knowledge about the world and about language, and a reliance on statistical methods that account for the frequency of word occurrence in a sentence, text or document set, the co-occurrence of words, and so on.
This approach is fundamentally different from how a person performs such operations: a person identifies the main content of a document, its main topic and subtopics, and usually draws on a large amount of knowledge about language, the world and the organization of coherent text to do so. The lack of linguistic and ontological knowledge (knowledge of the world) in automatic text processing systems leads to a variety of problems, above all incorrect recognition of the meaning of a text and its individual details.
Under these conditions, the main attention of researchers and developers is focused on systems designed to extract data from natural-language texts: Text Mining systems that help identify useful information in texts, as well as semantic classification and clustering systems that assign a given text to a particular category of documents.
For example, the main problem for the correct operation of a chatbot system is correctly determining the topic of a request in order to give the right answer, explains Sanzhar Dosov, an ML developer at Globus: text classification approaches solve this task well.
Classification of texts
The task of text classification is to assign a target text to the class that matches its meaning; solving it requires at least minimal markup. Sanzhar Dosov gives an example of such markup:
Target text | Class
Добрый день! Хотелось бы сменить тарифный план оператора. ("Good afternoon! I would like to change my operator's tariff plan.") | Tariff change
It is a big plus if there is additional markup in the form of key phrases from customer requests that are characteristic of individual classes, the expert adds.
He says that today there are two main approaches to solving the problem of classifying texts:
- Regular expressions. This is a special language, used in some text processing programs, for searching and manipulating substrings in text. Today this approach is considered legacy, but it is still used quite often, including in combination with classical machine learning algorithms (a minimal code sketch follows this list).
- Neural networks.
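To illustrate the first approach: a minimal sketch of regex-based intent matching (the class names and patterns here are hypothetical, invented for illustration rather than taken from Globus):

```python
import re

# Hypothetical intent patterns; a production system would maintain
# hundreds of such hand-written rules.
INTENT_PATTERNS = {
    "tariff_change": re.compile(r"(сменить|поменять|изменить).*тариф", re.IGNORECASE),
    "balance_check": re.compile(r"баланс|остаток", re.IGNORECASE),
}

def classify(text: str) -> str:
    """Return the first intent whose pattern matches the request."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return intent
    return "unknown"

print(classify("Добрый день! Хотелось бы сменить тарифный план оператора."))
# -> tariff_change
```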
A huge advantage of neural network approaches is that they generalize well and cope with a large number of classes in classification problems, says Sanzhar Dosov.
A striking representative of this approach is the popular BERT architecture, which in recent years has shown the best results in text classification.
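A minimal inference sketch for the neural approach, using the Hugging Face transformers library (the model path is a placeholder: in practice you would first fine-tune a BERT/RuBERT checkpoint on your own labeled requests):

```python
from transformers import pipeline

# "path/to/finetuned-rubert" is a hypothetical checkpoint fine-tuned on
# labeled customer requests such as the example above.
classifier = pipeline("text-classification", model="path/to/finetuned-rubert")

result = classifier("Добрый день! Хотелось бы сменить тарифный план оператора.")
print(result)  # e.g. [{'label': 'tariff_change', 'score': 0.98}]
```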
Identifying Entities in Text
Every industry, and even every company, has a large number of its own entity names. For example, banks have types and names of bank cards (debit, credit, with a certain cashback), loans (consumer, mortgage, business), and deposits (ruble, foreign currency, etc.). A chatbot should be able to understand all of this.
They also make excellent assistants for HR departments, with their own set of everyday terms: "vacation," "salary," "sick leave insurance," etc., comments Andrey Kulyashov, director of business development at ISS.
Let's say the task is to detect the "settlement" entity and extract its name, says Stanislav Ashmanov. If an assistant is being developed for the railway, the list of stations is finite, and it is easier to determine the station in a request using a closed list provided by the customer. This is the dictionary approach.
Dictionaries also make it possible to specify synonymous names (for example, the colloquial "Piter" for St. Petersburg) and to normalize a name found in a request in any grammatical case to the nominative, in order to form a query to the ticket sales service.
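A minimal sketch of the dictionary approach, assuming the pymorphy2 morphological analyzer for normalization (the station list and synonyms are illustrative):

```python
import pymorphy2  # morphological analyzer for Russian

morph = pymorphy2.MorphAnalyzer()

# Hypothetical closed list: synonyms and colloquial names mapped
# to one canonical station name.
STATIONS = {
    "санкт-петербург": "Санкт-Петербург",
    "петербург": "Санкт-Петербург",
    "питер": "Санкт-Петербург",
}

def find_station(request):
    """Lemmatize each word (reduce it to its base, nominative form)
    and look it up in the closed station list."""
    for word in request.split():
        lemma = morph.parse(word.strip(",.!?"))[0].normal_form
        if lemma in STATIONS:
            return STATIONS[lemma]
    return None

print(find_station("Два билета до Питера, пожалуйста"))  # -> Санкт-Петербург
```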
If we are talking about chit-chat, that is, the ability to "chat" with the machine about anything, and the developer of the assistant wants to pick out any arbitrary settlement name in a request, even a fictional one or one from a literary work, then machine-learning-based NER is more suitable, continues Stanislav Ashmanov, because this technology is robust to new names of the extracted entity.
Sanzhar Dosov from Globus notes that for the NER task, in addition to neural network tools, a rule-based approach is also used. One of the popular rule-based approaches for the Russian language is implemented, for example, in the Natasha project.
This is not a research project: the underlying technologies are built for production, its authors emphasize. A number of libraries are combined in one convenient API that solves the basic NLP tasks for the Russian language: tokenization, sentence segmentation, word embeddings, morphological tagging, lemmatization, phrase normalization, syntactic parsing, NER tagging, and fact extraction.
This combination of approaches helps to solve the problem successfully, notes Sanzhar Dosov, because the rule-based approach makes it possible to perform morphological analysis of the text, which helps distinguish different forms of the same entities, and to cover edge cases, while the neural network approach gives the algorithm good generalization ability.
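A minimal NER sketch with the Natasha library mentioned above (this follows Natasha's documented API; the input sentence is illustrative):

```python
from natasha import (
    Segmenter, MorphVocab, NewsEmbedding,
    NewsMorphTagger, NewsNERTagger, Doc,
)

segmenter = Segmenter()
morph_vocab = MorphVocab()
emb = NewsEmbedding()
morph_tagger = NewsMorphTagger(emb)
ner_tagger = NewsNERTagger(emb)

doc = Doc("Наталья Лукашевич работает в МГУ в Москве.")
doc.segment(segmenter)       # tokenization and sentence segmentation
doc.tag_morph(morph_tagger)  # morphological tags, needed for normalization
doc.tag_ner(ner_tagger)      # PER / LOC / ORG spans

for span in doc.spans:
    span.normalize(morph_vocab)  # reduce the span to the nominative case
    print(span.type, span.text, "->", span.normal)
```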
Text recognition
Another basic mechanism of NLP systems is optical character recognition (OCR). For most corporate NLP systems, it implements the first phase of working with documents: converting text on paper into electronic form. Today, OCR functionality has gone far beyond transforming a paper document into an electronic counterpart; the race is on to teach AI to analyze texts: verify the information they contain, detect possible errors, identify the data needed by other systems, and transfer it to them.
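The commercial systems described below go far beyond this, but the basic first phase can be sketched with the open-source Tesseract engine (the file name is a placeholder; the Russian language pack must be installed):

```python
from PIL import Image
import pytesseract

# Hypothetical scan of a paper document; lang="rus" requires the
# Russian traineddata for Tesseract to be installed.
image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image, lang="rus")
print(text)
```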
At the beginning of the year, Smart Engines presented a system for intelligent automatic recognition of all pages of the Russian passport. The program itself finds passport pages in images and searches for stamps on them. It automatically recognizes the data of the passport's main spread, whether printed or handwritten, all typewritten stamps about the place of registration, and information about previously issued passports. For every page of the passport, the AI automatically determines the page's serial number in the document and recognizes the document number, regardless of whether it is printed or applied by laser perforation.
Smart Engines experts say that AI algorithms made it possible to solve a major problem of traditional OCR solutions: the need to manually verify the results of automatic recognition. Images containing client data are never passed for manual input to external contractors or crowdsourcing services, the company emphasizes.
Our system can recognize a Russian passport in real time as the user shows the document to their phone camera page by page, or automatically check that the photographs or scans in a client's application contain all the necessary pages of a single passport and recognize the required data, comments Vladimir Arlazarov, Ph.D., CEO of Smart Engines. When performing recognition, the program treats the entire passport as a single set of data spread across different pages.
In the summer, Smart Engines released a recognition system for primary accounting and financial documentation on mobile phones, with the same high quality that requires no manual verification. The system automatically classifies and recognizes invoices, TORG-12 forms, UPD universal transfer documents, delivery notes, acts and bills for payment, and also ensures that information from documents and forms is entered into an ERP system or any other accounting information system, with the ability to check the completeness of data and cross-verify it within a single set.
According to the company, recognizing a primary document on a modern phone in a mobile application takes 1-3 seconds per page, while in server mode on a 32-core high-performance computer (HPC) without a GPU, recognition speed in streaming scanning at traditional input centers can reach 600 pages per minute. "Now business can move to a new level of mobility, abandoning streaming scanners for imaging and dedicated workstations or servers for recognizing primary documents," says Vladimir Arlazarov. "Employees can scan and extract data with a mobile phone directly when receiving documents from counterparties."
The CORRECT software platform from TKset also automates the entry of accounting, personnel, transport, legal and other documents into the customer's accounting systems: it extracts information from many different types of documents according to pre-configured templates, performs arithmetic checks as well as checks against national directories (USRUL/USRIP registers, addresses, names of managers, etc.), and matches the extracted information against the local catalog of items.
The company says the CORRECT architecture is based on two OCR engines and has its own item-matching methodology that takes into account the range of SKUs in stock. Full-text document recognition, barcode and QR code reading, and checking for seals and signatures are also supported. Through an open API, the platform can be integrated with various accounting systems ("1C," SAP, Oracle, Dynamics, etc.), RPA platforms (ElectroNeek, Sherpa, Lexema, etc.), and crowdsourcing platforms (Yandex.Toloka). The CORRECT software can be installed on the customer's site or provided as a SaaS service with per-page pricing.
Intelligence is not only document recognition, but also classification and training on historical data. With this approach, it becomes possible to pull all the data together, obtain a result, make allowance for its probabilistic nature, and speed up the registration of a document in the information system, notes Vitaly Astrakhantsev, evangelist of the AI direction at Directum.
These aspects are especially important when working with large volumes of heterogeneous text information, for example, in electronic archive tasks.
A large volume of incoming documents was channeled into simple and convenient entry and addition to the financial archive, says Vitaly Astrakhantsev.
At the same time, the accuracy of classifying document types reached 97%, and the completeness of extracting attributes from incoming documents reached 85%.
The ABBYY FlexiCapture platform is designed to collect information from various types of documents: texts of letters, attachments, electronic documents, photos, scans. Algorithms implemented in FlexiCapture make it possible to recognize and classify documents, extract and check the correctness of data, transfer them to the corporate information system.
For example, telecom operator Tele2, which is developing self-service channels, allows new customers to remotely register SIM cards using terminals with face recognition and identity document scanning. In this solution, ABBYY technologies automatically extract passport data and send it to the operator for verification.
If a customer's specialist enters the name of a product, he will receive all the relevant information: from the PLM system, about the product itself; from the ERP system, financial indicators related to costs, says Alexander Rodionov, Director of Document Management and Head of the LANIT Innovation Center.
It is worth noting that recognizing the popular fonts of text documents is a mass-demand task with ready solutions. But tasks such as recognizing GOST-standard fonts on a technological diagram are much more complex. Nanosemantics specialists note that in such situations a product aimed at the mass consumer shows low quality; customization is required here.
One of the most important elements of solutions for extracting data from natural-language texts is the flexibility of the software system in adapting to new document types.
Thanks to classifier fine-tuning and auto-tagging mechanisms, you can export document cards, build a markup project on their basis, and train fact extraction models for a new type of document. We are also working on clustering uploaded documents in order to prepare training samples and improve the quality of the final models. In Directum RX, from the point of view of intelligent recognition, almost all the conditions have been created for bringing new types and new forms of documents into operation simply and quickly.
What else have NLP systems learned to extract from natural-language texts? Solutions for assessing the tone (emotional coloring) of text messages have been under development for a long time. They are used, first of all, to analyze negative and positive customer feedback on forums, websites and social networks.
In May, at the interdepartmental conference "Artificial Intelligence in the Police Service," developers from Vyatka State University presented a system for automatically recognizing the author's point of view in Russian- and English-language texts.
Specific Tasks in Processing Natural-Language Texts
A separate segment of AI solutions is associated with adequately representing so-called transcribed text, that is, a text copy of an audio message. Speech-to-text conversion is becoming more and more popular and is reaching the level of a mass service. For example, Telegram Premium subscribers can use a button to request that a specific voice message be converted into text (as Pavel Durov explained, Google technology is used for this).
One of the interesting problems of this class is placing punctuation marks in the resulting text. Solving it requires non-trivial analysis of the audio fragment: the specifics of pauses between words, the nature of intonation changes, and so on. Sberbank developers have created their own GPU-based SmartSpeech speech recognition system, with a unique decoder and acoustic model. The technology automatically identifies the speaker by voice, and it also automatically detects pauses, the end of an utterance and the speaker's emotions, and places punctuation marks in the transcript of the conversation in real time.
The SmartSpeech system is used, in particular, in the video conferencing solutions that Sberbank brings to market: at any point in a meeting, a participant can download the full text of the conversation, and the text of the dialogs is searchable.
In the spring, the Yandex Cloud platform updated Yandex SpeechKit, its machine-learning-based speech synthesis and recognition service. Now, when converting voice to text, the service automatically places the necessary punctuation marks. The text recognized by the neural network is as close as possible to literary text, the company says.
The new Yandex SpeechKit feature, called the Punctuator, works both in real-time recognition for voice assistant scenarios and in recognition of pre-recorded audio files. The Punctuator uses two sequential machine learning models: the first converts voice into text, the second places punctuation marks in accordance with the norms of the Russian language.
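A hedged sketch of the second stage of such a pipeline: restoring punctuation in raw ASR output with a token-classification model via transformers (the checkpoint name is a placeholder; this is not Yandex's implementation, just the general shape of the approach):

```python
from transformers import pipeline

# "path/to/punctuation-model" is a hypothetical checkpoint trained to
# predict which punctuation mark, if any, follows each word.
punctuator = pipeline(
    "token-classification",
    model="path/to/punctuation-model",
    aggregation_strategy="simple",
)

asr_output = "добрый день подскажите пожалуйста как сменить тариф"
for group in punctuator(asr_output):
    # each predicted label names the punctuation mark to insert (or none)
    print(group["word"], "->", group["entity_group"])
```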
Based on these basic technologies for extracting data from natural-language texts of various kinds, modern solutions are implemented for a wide range of corporate tasks:
- Registration of incoming documentation
- Full-text recognition
- Entering documents into information systems
- Work with internal documentation (prompt extraction of information from orders, instructions and other regulatory documents for easy access, reconciliation and search on them)
- Document comparison (different versions of the same document)
- Processing of design and estimate documentation (automatic comparison of data from design and estimate documentation with data in CAD)
- Classification of incoming requests (automating the classification of cases, determining the importance of the request and finding a ready response, which is used primarily in user technical support systems, customer consulting, contact centers, etc.).
- Intelligent search (classification, extraction of objects and related facts, tonality analysis)
For example, Naumen Enterprise Search is a single search engine that all company employees can use. Searches can be run not only across electronic documents, but also in image and video archives, emails and presentations, application texts, e-books and web pages, as well as corporate information systems: CRM, ERP, e-mail, master data systems, wikis, task trackers, meter-reading collection and storage systems, etc.
The ABBYY Intelligent Search engine, according to the company, performs corporate search across all data sources; moreover, the search is by meaning, not by keywords.
Search by meaning is also supported by RCO Zoom, a specialized search engine that combines the functionality of traditional search engines and information analysis. The system uses ontological engineering automation methods to extract knowledge from text, which makes it possible to implement both contextual and object-based text search. The morphology and syntax of Russian and English are supported, as is the ability to work with brand-related information. With its high-performance, self-contained database, the system can be used as a highly reliable document repository.
The RCO Zoom system is integrated with the RCO Fact Extractor SDK, designed to extract facts from natural-language texts. A Python interface makes it possible to implement all kinds of add-ons for solving diverse problems beyond storage and search: from finding informational duplicates of documents to classifying and clustering them.
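A minimal sketch of search "by meaning" with sentence embeddings (not the implementation of any of the products above; the model name is one public multilingual encoder among many):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

documents = [
    "Приказ об отпуске сотрудника",    # employee vacation order
    "Договор поставки оборудования",   # equipment supply contract
    "Инструкция по настройке VPN",     # VPN setup guide
]
doc_emb = model.encode(documents, convert_to_tensor=True)

query_emb = model.encode("как уйти в отпуск", convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]  # cosine similarity to each document
print(documents[scores.argmax().item()])      # the vacation order wins on meaning
```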
Understanding the text
From a practical point of view, natural language understanding (NLU) technologies and systems link together the facts, events and entities that occur in a text and the specific relationships between them. Together, facts and the connections between them form a unified picture of what the text describes. This is a kind of "machine understanding" of the text, which makes it possible, for example, to give correct answers to questions about that description. The Nanosemantics company, for one, uses machine learning for the NLU task: it offers a ready-made constructor platform within which you can obtain high-quality ML models without even being a data science specialist.
Technologically important operations for NLU include resolving frequently encountered linguistic constructs, for example (a small code illustration follows the list):
- ellipsis: restoring omitted words, which a person easily recovers from an understanding of what has been read. For example, in the sentence "The company's revenue in the first quarter increased by 20%, and in the next - by 30%," the words "quarter" and "increased" are omitted;
- anaphora: a pronoun stands in for a noun. For example, in the phrase "The ruble continues to strengthen against the dollar; yesterday it rose by 1 ruble," the pronoun "it" refers to the noun "ruble";
- homonymy: a word with multiple possible interpretations. For example, the Russian word "курс" can mean a currency exchange rate, a ship's course at sea, or a course of study.
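A small illustration of how modern NLP separates homonyms without explicit world knowledge: contextual embeddings give the same word different vectors in different contexts (the RuBERT checkpoint is a public model chosen for illustration):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")

def word_vector(sentence, word):
    """Mean contextual embedding of the sub-tokens that make up `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(word_ids) + 1):   # locate the word's sub-token span
        if ids[i:i + len(word_ids)] == word_ids:
            return hidden[i:i + len(word_ids)].mean(dim=0)
    raise ValueError("word not found in sentence")

v1 = word_vector("Курс доллара снова вырос.", "Курс")       # exchange rate
v2 = word_vector("Курс лекций по физике начался.", "Курс")  # course of lectures
print(torch.cosine_similarity(v1, v2, dim=0).item())        # noticeably below 1.0
```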
Smart processing of entities identified in the text implies the transformation of data into knowledge (facts related to persons or certain objects defined by connections). For example, this mechanism is implemented in the ABBYY InfoExtractor SDK.
In this case, we are talking about extracting not just named entities (NER), but all kinds of data from unstructured texts.
A key feature of entity extraction for subsequent conversion into facts in this solution is the use of an ontological approach. This means that in the course of parsing the text, a semantic hierarchy of concepts is built, represented as a tree, which makes it possible to avoid ambiguous interpretations. For example, the word "management" can be interpreted as a department of a company or as an action. Because this word appears in different branches of the tree, the system can choose the right interpretation depending on the context of the phrase.
The ontology creates a pragmatic layer of text analysis: it defines the terminology for a specific subject area and the rules for extracting the desired objects. As a result, all information is presented in the form the customer requires, ABBYY says.
At the same time, data for the most common tasks is included in the solution's basic ontologies, while for other, industry-specific topics it may be necessary to develop an appropriate industry ontology from scratch.
To this is added supervised machine learning on a labeled training sample, which ultimately ensures a quick start for extracting data from texts based on an ontological model of the subject area.
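A toy illustration of the ontological idea (not ABBYY's actual implementation): the same word sits in different branches of a concept tree, and the surrounding context decides which branch applies:

```python
# Hypothetical mini-ontology: the word "управление" ("management") lives
# in two branches, each with context words typical for that sense.
ONTOLOGY = {
    "OrganizationalUnit": {"управление": ["начальник", "департамент", "отдел"]},
    "Action":             {"управление": ["процессом", "рисками", "автомобилем"]},
}

def disambiguate(word, sentence):
    """Pick the branch whose context words overlap the sentence most."""
    tokens = sentence.lower().split()
    best, best_score = "unknown", 0
    for branch, concepts in ONTOLOGY.items():
        if word in concepts:
            score = sum(ctx in tokens for ctx in concepts[word])
            if score > best_score:
                best, best_score = branch, score
    return best

print(disambiguate("управление", "начальник управления подписал приказ"))
# -> OrganizationalUnit
```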
An important stage of this work is establishing links between objects and events, which makes it possible to identify facts significant for the purposes of analyzing (understanding) the text. In particular, coreference links are established: links between two different mentions that actually refer to the same element of reality.
For example: "The questions of love and death did not concern Ippolit Matveyevich Vorobyaninov, although he was in charge of these matters, by the nature of his service, from 9 a.m. to 5 p.m. daily, with a half-hour break for breakfast." Here the pronoun "he" and "Ippolit Matveyevich Vorobyaninov" refer to the same person.
Based on these technologies, Sberbank has created a system for monitoring the news flow. The solution automatically analyzes the content of messages about counterparty banks in Russian and finds various risk factors in them. The entire stream of news about more than a thousand counterparty banks passes through ABBYY models in real time, but only data useful from a risk standpoint automatically enters the dossier.
RCO Fact Extractor SDK is a fact-extraction technology based on syntactic-semantic text analysis, developed by RCO. The library performs linguistic analysis of text taking into account the grammar and semantics of the language, and provides a software interface for reading the analysis results and using them in other programs, for example, to visualize the extracted data, build reports and tables, organize search by objects, and so on.
The result of text analysis is the entities extracted from the text - names of organizations, persons and geographical objects, various alphanumeric constructs (car or insurance policy numbers, addresses) - as well as entity classes. In addition, the analysis identifies a network of syntactic-semantic relationships between text entities, along with data structures describing the events and facts mentioned in the text.
The RCO Fact Extractor SDK library is universal: it can be configured to work with different subject areas and even different languages. Moreover, it can help in compiling subject-area ontologies. All kinds of add-ons on top of the basic library allow different problems to be solved, the company says: from detecting borrowed content (plagiarism) and building a semantic portrait of a document to anonymizing personal data in texts.
OntosMiner is a linguistic processor (developed by Avicomp Services, currently part of the United Instrument Manufacturing Corporation) that recognizes named entities in natural-language texts and the semantic relationships between them (facts), and can also determine the overall tonality of a document. The OntosMiner linguistic processor performs intelligent text processing using ontologies and dictionaries available for user editing, as well as special heuristics.
Within the framework of the Ontos project, in particular, the OntoDix ontological dictionary support system was developed, in which dictionary entries, before being loaded into the knowledge base, are processed by a dialog module for parsing collocations. The results of processing all dictionary entries, after "approval" by the user, are compiled into an efficient automaton that is connected as a dictionary resource to the corresponding OntosMiner family systems at execution time.
OntosMiner is actively used in government agencies to analyze and systematize large volumes of correspondence, corporate documents and news publications. For example, in one federal department the system analyzes and classifies incoming correspondence of more than 1,000 requests per day into a thousand categories in automated mode.
Prospects
High expectations are associated with the automatic generation of natural-language texts. They are largely tied to the achievements of the GPT-3 algorithm developed by the OpenAI laboratory. Given just a few examples, it can perform many tasks directly or indirectly related to text (in English): write poetry and news, translate, produce descriptions, solve anagrams, and structure information.
The model, which uses 175 billion parameters, was trained on 570 GB of text data. The training corpus includes data from the open Common Crawl library, Wikipedia, book datasets, and texts from WebText sites. The size of the trained model is about 700 GB. Last summer, OpenAI opened private access to its developer tools (API) and the GPT-3 model, and it is gradually connecting more and more developers.
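GPT-3 itself is available only through OpenAI's gated API, but the shape of such text generation can be sketched with its open predecessor GPT-2 via transformers (the small English checkpoint is used here purely for brevity):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Artificial intelligence will change document processing because"
out = generator(prompt, max_length=60, num_return_sequences=1)
print(out[0]["generated_text"])  # the model continues the opening phrase
```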
Six months after GPT-3 appeared, Google presented a language model with 1 trillion parameters, and six months after that announced the use of such models in its products. Microsoft, together with Nvidia, developed the Megatron-Turing Natural Language Generation model with 530 billion parameters, which is also designed to generate natural language and is three times larger than GPT-3.
At the same time, Microsoft works closely with OpenAI. In the summer of 2021, the companies introduced Copilot for GitHub, which lets developers automate part of the code-writing process. Four months after launch, the companies reported that the tool had helped write about 30% of the new code posted on the site.
As for practical applications of these technologies, their most impressive successes have perhaps come in cognitive confrontation: the creation of propaganda information packages. According to Elena Larina, a leading analyst at the Institute for Systems and Strategic Analysis and a member of the community of competitive intelligence practitioners, in her article "New Habitat" (journal "Free Thought," No. 3, 2021), starting in the second half of 2019 so-called cyber-writers appeared, created first as applications of GPT-2 and then of GPT-3. The largest language model has been developed in China: the WuDao 2.0 model uses 1.75 trillion parameters. The Russian leader is YaLM, a language model developed by Yandex with 13 billion parameters. On its basis, the developers built the demonstration service "Balaboba," where users can ask the neural network to "dream up" a sentence or whole passages of text from just a short opening phrase.
Today there are no tools on the market that can learn from user actions, and we are working on that. Our goal is to create intelligence that will gradually accumulate data for training and, upon reaching the required volume, will build models and begin to help the user.
Periodically, the AI tools will adjust to the entered information and build a recognition model. If the models meet the specified criteria, they will be put to work and regularly improved. If they do not, we will continue to collect data until we achieve the result and can deliver the maximum benefit to the system's users, says Vitaly Astrakhantsev.