Developers: ABBYY Infopoisk
Last Release Date: 2014/12/15
Technology: Office applications
The Compreno technology is intended for building systems for the analysis, translation, and search of texts in different languages.
Compreno is a technology for translating any human language into a universal language of concepts. Accordingly, Compreno also includes this universal language of concepts, which ABBYY had been quietly developing in its research laboratories since the 1990s.
2015: Start of promotion of Compreno in the corporate segment
On March 25, 2015, ABBYY announced that it was developing a corporate business line in the field of intelligent information processing, which is an integral part of the ECM, ERP, text analytics, and search markets.
As of March 25, 2015, three solutions had been implemented on the ABBYY Compreno platform:
- ABBYY InfoExtractor SDK,
- ABBYY Smart Classifier SDK,
- ABBYY Intelligent Search SDK.
Pilot projects using Compreno have started at the State Duma, IES-Holding, and large oil, power, and other companies in various industries. ABBYY Compreno solutions make it possible to:
- analyze and extract the necessary information from arrays of unstructured data (internal and external sources);
- route the flow of incoming documents to departments and responsible persons;
- improve search systems.
"Given the enormous growth in the volume of unstructured data, there is a large and constantly growing worldwide demand for solutions in the field of intelligent information processing. This area is an integral part of the ECM and ERP, text analytics, and search markets. ABBYY technologies can be embedded in various information systems, extending and complementing them with unique capabilities for extracting, analyzing, and searching for the necessary information," noted Maxim Mikhaylov, senior vice president and director of the ABBYY Compreno department.
Intelligent information processing technologies, 2014
2012: Announcement of the revolutionary Compreno technology
In 2012 ABBYY presented the Compreno technology. The Universal Semantic Hierarchy (USH), the core of the language of concepts, contained at that time 60 thousand elements in the universal section of the cognitive model, 80 thousand in the Russian section, and 90 thousand in the English section. Nothing even remotely similar exists anywhere in the world.
As of February 2012, Compreno has no analogs in the world, although research in similar directions is under way at some universities. However, a 15-year head start and the enormous human and material resources invested give reason to hope that ABBYY will manage to secure the exclusive position of pioneer. The company also benefits from the fact that over the last 10 years the overwhelming majority of research worldwide has been conducted within the statistical machine translation model.
Compreno is a full-fledged technological revolution without precedent in history. The scale of this revolution and its significance for people (for all people, not just computer enthusiasts) are comparable only to the invention of the World Wide Web or e-mail. No less. For clarity, this revolution can be translated into tangible, monetary terms: if ABBYY calmly and without fuss commercializes Compreno in at least a tenth of its possible practical applications and then goes public, the company's capitalization will eclipse all of today's idols, from Apple, which competently and stylishly exploits solutions that are very, very mediocre in technological terms, to Google, which manages to drive most of its own promising undertakings into a dead end. (Sergey Golubitsky, columnist at Computerra, February 2012[1])
How Compreno works
Traditional translation models
Success was also ensured by the right initial choice of direction for developing the automatic translation system. In the 1990s one model reigned supreme: the Rule-Based Translation Model (RBTM), the classical translation model based on a limited set of ready-made rules for a given pair of languages. One problem of RBTM is the accumulation of more and more rules, which at some point simply begin to conflict with one another. When a sentence is analyzed, different sets of rules can be applied, while the machine does not know their priorities. Translation based on RBTM, as a rule, does not bother with full parsing: instead, the sentence is divided into frames, to which the rules existing in the system are then applied to obtain a translation. RBTM systems do not take semantics into account[1].
At the beginning of the 21st century, through Google's efforts, the world got hooked on a new translation algorithm, the so-called statistical model (SM). SM is based on an extensive database of translations in many directions. We give the statistical engine a sentence to translate; it looks in the database, as in a dictionary, for existing translations of similar text and, after minor changes, produces a quite decent result.
The changes are not very substantial. Suppose we need to translate the sentence "in the room there is a red chair", and the statistical database already contains the translated phrase "in the room there is a green table". The solution is elementary: the existing translation template is taken and the new words are simply substituted from the dictionary.
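The substitution described in the chair-and-table example can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the one-entry translation memory, the small English-Spanish dictionary, and the function name translate_by_template are all hypothetical, and real statistical engines work with probabilities over large parallel corpora rather than exact template matches.

```python
# Toy translation memory: one stored (source, target) sentence pair.
# All data here is illustrative, not taken from any real system.
MEMORY = [
    ("in the room there is a green table", "en la habitación hay una mesa verde"),
]

# Toy bilingual dictionary used to swap the words that differ.
# Grammatical agreement is ignored: "roja" is simply listed in the form
# that happens to fit the example.
DICTIONARY = {"green": "verde", "red": "roja", "table": "mesa", "chair": "silla"}


def translate_by_template(sentence):
    """Reuse a stored translation, replacing only the words that differ."""
    words = sentence.split()
    for source, target in MEMORY:
        source_words = source.split()
        if len(source_words) != len(words):
            continue
        # Pairs (word in stored sentence, word in new sentence) that differ.
        diffs = [(old, new) for old, new in zip(source_words, words) if old != new]
        if all(old in DICTIONARY and new in DICTIONARY for old, new in diffs):
            output = target.split()
            for old, new in diffs:
                output[output.index(DICTIONARY[old])] = DICTIONARY[new]
            return " ".join(output)
    return None  # no usable template: the "low coverage" case


print(translate_by_template("in the room there is a red chair"))
```

When no stored sentence is close enough, the function returns None, which is the toy counterpart of a translation direction or subject with too few parallel texts.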
Since SM relies on ready-made human translations of known high quality, the output is quite decent, because producing the translation does not require delving into syntax, the phraseology of a specific language, and so on.
All this works wonderfully, however, only until it comes to translation directions with so-called low coverage (say, Romanian-Russian or Thai-Hungarian).
Where can parallel texts be found? According to Sergey Andreyev, danger also lurks when moving into specialized subject domains even in popular directions, because far fewer parallel texts exist there than for everyday and conversational topics. The combination of a specialized domain and a less popular translation direction yields weak results. Take IT, for example. It would seem, what difficulties could machine translation face with a text on information technology? Indeed, none, if we are dealing with Russian-English translation. But they arise immediately in the Russian-French field! The statistical base in this direction is extremely scanty, and gaps arise constantly.
Within SM, only a palliative way out has been found for such situations: when working with low-coverage languages or subjects, English is used as an intermediary. That is, the text is first translated from Russian into English, and then from English into, say, Romanian or Thai. The result is a very noticeable decline in translation quality.
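One reason translation through an intermediary language loses quality can be shown with a small Python sketch. The three word tables below are hypothetical toy data chosen for illustration: Russian distinguishes informal "ты" from polite "вы", English collapses both into "you", and the second hop therefore cannot recover which Romanian form was intended.

```python
# Toy word tables (illustrative assumptions, not real system data).
RU_EN = {"ты": "you", "вы": "you"}              # Russian -> English
EN_RO = {"you": "tu"}                           # English -> Romanian
RU_RO = {"ты": "tu", "вы": "dumneavoastră"}     # direct Russian -> Romanian


def pivot_translate(word):
    """Translate Russian -> Romanian through English as an intermediary."""
    return EN_RO[RU_EN[word]]


def direct_translate(word):
    """Translate Russian -> Romanian directly."""
    return RU_RO[word]


for word in ("ты", "вы"):
    print(word, "pivot:", pivot_translate(word), "direct:", direct_translate(word))
```

The information lost in the first hop is not available to the second one, so errors and ambiguities of both hops accumulate.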
Saddest of all, the coverage density problem cannot in principle be solved within SM. The only way out is to hire hundreds of thousands of translators and have them fill the gaps in all directions with low statistical coverage. As you understand, nobody can or will do this.
Besides the difficulties with the low density of translations in directions outside the narrow mainstream, SM has a number of smaller defects. For example, the statistical model copes very poorly with translating proper names. Many remember Yushchenko being translated as Yanukovych, and Russia as Canada. Negation (the particle "not") is a very difficult obstacle. The particle can be positioned correctly only through linguistic analysis of the text, which SM does not perform. As a result, sentences containing negation are often translated by statistical engines with exactly the opposite meaning.
In any case, ABBYY rejected the Rule-Based Translation Model from the outset and set its sights on a new-generation computer translation system. It must be said that nothing fundamentally new had to be invented. The universal language of concepts has existed in structural linguistics as an old pipe dream since the time of Ludwig Wittgenstein. Even Noam Chomsky in his early works only deepened the existing utopia.
Universal Semantic Hierarchy (USH)
The Compreno project proceeded from three fundamental premises:
- the use of high-quality, uncompromising parsing;
- the creation of a universal cognitive model of language, whose possibility rests on the axiom that people, although they live in different conditions and speak different languages, on the whole think alike. The forms of expressing a thought differ, but the conceptual framework coincides;
- automated corpus-based training: linguistic descriptions are verified and supplemented on the basis of statistical processing of corpus data.
From these premises, the idea of the Universal Semantic Hierarchy (USH), capable of describing phenomena from the general to the particular, was formulated. Compiling this hierarchy took ABBYY 15 years. As of February 2012 it comprises 70 thousand concepts in the universal part of the cognitive model, more than 80 thousand in the Russian part, and more than 90 thousand in the English part.
Machine translation algorithm based on USH
The machine translation algorithm based on USH looks as follows:
- Lexical analysis of the text (identification of words, punctuation marks, digits, and other text units);
- Morphological analysis (determination of the grammatical characteristics of lexemes);
- Parsing (establishment of the sentence structure);
- Semantic analysis (identification of the meaning expressed, in terms of the language system);
- Synthesis of the output-language sentence from the universal semantic structure.
As a result, words for the translation are selected not directly from the source language but from the set of concepts that, figuratively speaking, "hangs" on the same branch of the universal semantic tree, only now in the target language.
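The five stages listed above can be sketched as a toy pipeline in Python. This is a heavily simplified illustration, not ABBYY's implementation: the lexicons, concept identifiers, and function names are all hypothetical, the "parser" merely records word order, and the synthesis step picks ready-made Spanish word forms instead of generating them.

```python
import re

# Hypothetical toy lexicons; real systems hold tens of thousands of entries.
EN_MORPHOLOGY = {          # word form -> (lemma, grammatical features)
    "the": ("the", {}),
    "cat": ("cat", {"number": "sg"}),
    "sleeps": ("sleep", {"person": 3, "number": "sg"}),
}
EN_CONCEPTS = {"the": "DEFINITE", "cat": "CAT", "sleep": "TO_SLEEP"}
ES_WORDS = {"DEFINITE": "el", "CAT": "gato", "TO_SLEEP": "duerme"}


def lexical_analysis(text):
    """Stage 1: split the text into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def morphological_analysis(tokens):
    """Stage 2: attach a lemma and grammatical features to every token."""
    return [EN_MORPHOLOGY.get(tok, (tok, {})) for tok in tokens]


def parse(analyzed):
    """Stage 3: build a sentence structure (here only a flat placeholder)."""
    return {"nodes": analyzed}


def semantic_analysis(tree):
    """Stage 4: map lemmas to language-independent concept identifiers."""
    return [EN_CONCEPTS.get(lemma, lemma) for lemma, _ in tree["nodes"]]


def synthesize(concepts):
    """Stage 5: choose target-language words attached to the same concepts."""
    return " ".join(ES_WORDS.get(concept, concept) for concept in concepts)


def translate(text):
    tokens = lexical_analysis(text)
    return synthesize(semantic_analysis(parse(morphological_analysis(tokens))))


print(translate("The cat sleeps"))
```

The point of the sketch is the shape of the data flow: the target words are looked up from concept identifiers, never directly from the source words.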
Because the USH model is end-to-end, subordinate elements of the hierarchy inherit the features of higher elements. This seemingly simple circumstance makes it possible to achieve unprecedented accuracy of machine translation, since each word of the translated sentence is described by the maximum set of conceptual equivalents, with not only its specific but also its generic properties at all levels of the semantic hierarchy.
USH also provides for relations between structure members belonging to different classes, and these relations are likewise structured and formalized, which makes possible a multilevel conceptual analysis of the text that further improves translation quality[1].
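The inheritance of features down the hierarchy can be pictured with a short Python sketch. The class name SemanticClass, the three example concepts, and their features are assumptions made purely for illustration; the actual structure of USH is not described in the source.

```python
class SemanticClass:
    """A node of a toy semantic hierarchy (hypothetical structure)."""

    def __init__(self, name, parent=None, **features):
        self.name = name
        self.parent = parent
        self.own_features = features

    def features(self):
        """Own features merged over everything inherited from ancestors."""
        inherited = self.parent.features() if self.parent is not None else {}
        return {**inherited, **self.own_features}

    def ancestors(self):
        """Names of all higher elements, from nearest to most general."""
        chain, node = [], self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


# From the general to the particular: ENTITY -> FURNITURE -> CHAIR.
entity = SemanticClass("ENTITY", countable=True)
furniture = SemanticClass("FURNITURE", parent=entity, artifact=True)
chair = SemanticClass("CHAIR", parent=furniture, used_for="sitting")

print(chair.ancestors())
print(chair.features())
```

A word attached to CHAIR is thus described both by its specific feature and by the generic features of every level above it.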
Application options
The prospects opened up by Compreno are boundless and varied:
- machine translation of text from any language into any other at a level of quality incomparable with all translation systems in widespread use today;
- full-fledged intelligent search without specialized query syntax (search by meaning; extraction of facts and relations between the objects of search or monitoring; monitoring of companies and persons and generation of analytical reports based on parameters of various types, etc.);
- artificial intelligence systems of the most varied profiles and applications;
- automatic speech recognition;
- document classification and search for documents similar in meaning;
- sentiment analysis in monitoring;
- summarization and annotation (writing summaries of long documents);
and this is only the beginning.
Difficulties in using the technology
The bottleneck for global application of semantic-syntactic analysis in mass search systems is the very high computing power required to index information arrays at the conceptual level. These requirements are incomparably higher than for existing forms of traditional indexing. However, already today the technique of semantic-syntactic analysis can be applied effectively (and ABBYY does apply it: the author saw a fully functional prototype of the search engine with his own eyes) to more targeted and narrow search in closed corporate systems.
2011: Creation of the company ABBYY Infopoisk and receipt of a 450 million ruble grant from Skolkovo
In February 2011 the Skolkovo Foundation approved a grant of 450 million rubles to ABBYY. The money is allocated for creating Compreno, a technology for automatic text processing; the total project budget will be 950 million rubles (the remaining investment will be made by ABBYY). The Skolkovo grant is non-repayable and does not imply a return on investment, ABBYY noted. Earlier, Skolkovo president Victor Vekselberg named ABBYY among the companies with which the fund intends to cooperate.
A separate company, ABBYY Infopoisk, was created to carry out the work under the grant. It will become one of the first residents of the Skolkovo innovation center, Andreyev notes; the staff working on Compreno technology and the corresponding intellectual property will be transferred to it. For now, however, the new company will operate outside Skolkovo because the innovation center's infrastructure is not ready.
As of February 2011, most of the work on creating the Compreno technology has already been done, said Sergey Andreyev, president of ABBYY; 300 specialists are working on it, and 1 thousand man-years have been spent on the project. The president of ABBYY estimates the costs already invested in Compreno at approximately $50 million. Commercial products based on the new technology should appear within two to three years.
The 1990s: Start of development
Development of Compreno began in the 1990s, when ABBYY (in those years still BIT Software) already had two icebreakers in its arsenal: the Lingvo dictionaries and the FineReader text recognition program. The products were sold worldwide, were hits, and brought in stable profit, a godsend for romantic projects like Compreno, whose strain no outside investor would have endured (invest millions of dollars in something utterly revolutionary and with unknown prospects? And what if nothing comes of it?).
ABBYY did without outside money, and this saved Compreno, allowing a project with such enormous material and human costs to be carried through to completion.