Developers: I-Teco (iTeco), I-Teco Center of Cognitive Technologies
Technology: Data Mining, OLAP, corporate portals, office applications, EDMS, EDMS: stream recognition systems
"Analytical Courier" is a software product for analytical investigation of the information space. It was developed to extract knowledge from data that arrive in real time from a very large number of sources in diverse formats. Using a semantic analysis methodology, the system analyzes unstructured information in detail, establishes connections between objects, events, and subjects, forecasts the emergence of particular situations, and identifies the sources of information campaigns, attacks, and so on. The product includes a fault-tolerant, scalable Big Data storage that can process billions of documents and can also be used as a standalone solution. The product is used for monitoring market segments, analyzing the image of persons and organizations, and competitive analysis, as well as in the work of insurance and credit institutions, law enforcement agencies, and intelligence agencies. Components of the system are also used in Rospatent's patent search system.
Functions:
- parallel processing of diverse unstructured information from different sources: administrative and legal documents, media reports, news agency reports, analytical materials of various kinds, Internet resources, etc.;
- search for resources on the Internet via search engines or across a list of monitored websites;
- multilingual semantic search using a modern thesaurus of Russian and other languages, and natural-language query processing for text in European languages;
- output of the list of relevant documents as a thematically structured tree (ontology);
- automatic general and thematic summarization of collections or individual documents;
- thematic categorization of documents and publications;
- determination of the sentiment of documents and individual objects, extraction of mentions and quotations;
- determination of the information significance index of a monitored object;
- automatic identification of thematic groups in a set of documents retrieved by a query (cluster analysis of publications);
- identification of the key topics of documents and document collections, and construction of their interrelations as a semantic network;
- creation of a digest (overview) for each object or document topic;
- frequency analysis of headings and publications, multidimensional analytical data processing, study of how the problems presented in documents develop over time, visualization of frequency distributions on a map;
- entity recognition in Russian and English;
- search in users' personal libraries, automatic delivery of new documents on a selected topic;
- scheduled generation of analytical reports;
- logging of significant user actions.
Architecture of the software package
The Analytical Courier system is implemented on the .NET platform for Windows. It has a three-tier architecture with a thin client and provides users with a web interface.
For especially critical applications, an architecture with components working in separate networks is implemented. For example, web robots monitor the Internet, and the results are transferred to the internal network and automatically loaded into the storage, where all the tools for joint processing of closed and open information are available.
The analytical data storage is implemented for MS SQL Server and Oracle DBMS.
The Analytical Courier system is being developed toward higher-quality text analysis, a wider range of supported foreign languages, support for more server and portal platforms, and improvements to the analyst and administrator interfaces.
Competitive advantages
"Analytical Courier" lets users get up to speed quickly in new subject areas, structure the problem area, and prepare reports and information-analytical materials. A unique feature of the system is the combined use of different visual knowledge-extraction methods on a single set of documents. For example, an ontology of the document set can be built first, followed by its cluster analysis, then a semantic network of topics for the selected cluster, then a frequency analysis of the time series of documents on interrelated problems, and so on. The system implements a unique method for determining the sentiment of publications.
The system's broad functionality has led to its deployment in organizations that process large volumes of documents and messages of varying structure.
An important advantage of the system is its minimal operating cost compared with the best-known systems on the market.
Sample screens of the system
Sample thematic cluster map of messages:
Sample semantic map of interrelations between message topics:
Examples of use
- Analytical divisions and security services of banks:
analysis of client solvency, identification of entities making suspicious payments, detection of leaks of confidential information, etc.
- Insurance companies:
detection of fraudsters who have repeatedly caused losses, dishonest policyholders, their connections, and patterns (by place and time) in the events that happen to the company's clients.
- Analytical divisions of manufacturing companies:
analysis of the most frequent malfunctions, analysis of market reaction to product quality, decision support.
- Marketing divisions of enterprises (market research for drugs and other products):
providing decision makers with information for developing the optimal solution to the problem at hand.
- Special services, law enforcement agencies:
monitoring of events, objects, and problems, and analysis of interrelations among the entities under study.
New opportunities
Dynamic ontology of document search results
Selecting the right documents from all the documents in a search result list is a highly relevant problem for users of search systems. The "Analytical Courier" knowledge extraction system already uses cluster and semantic analysis methods for this purpose. The search servers of companies such as Google and Vivisimo implement the ability to build a thematic tree (ontology) in which each node holds a group of thematically homogeneous documents from the results.
Classical methods of thematic (cluster) analysis, based on proximity measures between documents, split search results into groups of similar documents, so-called clusters. Documents in a cluster are pairwise similar to each other, but the nature of the similarity can differ between pairs within one cluster. For example, one pair may be similar on the entity "economic development" and another on the entity "demographic crisis".
To increase the reliability of cluster analysis, we used a biclustering method (object-attribute, or conceptual, clustering), in which the similarity of the documents grouped into one cluster is expressed through common structural features (entities, topics) extracted from the documents. The advantage of the method is that every topic of a cluster is present in each of its documents. It is also important that the method works well with a small number of documents in the set. Biclustering is based on formal concept analysis (FCA), a powerful data analysis method that has been applied successfully in practice. To obtain a thematic tree, a lattice of formal concepts is first formed as a two-dimensional array whose rows correspond to documents and whose columns correspond to the entities extracted from them. If a document contains a particular entity, the cell at the intersection of that row and column holds the frequency of the entity's occurrence in the document. The lattice thus contains all the information about the interdependence between documents and entities. The tree of document clusters is a visual representation of the dependencies between formal concepts revealed in the lattice.
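The document-by-entity array described above can be illustrated with a minimal Python sketch. The function name `build_context`, the sample documents, and the treatment of entities as single lowercase words are assumptions made for illustration only; the product's semantic processor extracts entities in a far more elaborate way.

```python
from collections import Counter


def build_context(docs, entities):
    """Build a table with one row per document and one column per entity.

    Each cell holds how many times the entity occurs in the document.
    Assumption: an entity is a single lowercase word.
    """
    table = []
    for text in docs:
        counts = Counter(text.lower().split())
        table.append([counts[entity] for entity in entities])
    return table


# Hypothetical sample data, not taken from the product.
docs = [
    "economy growth economy reform",
    "demography crisis economy",
    "demography crisis migration",
]
entities = ["economy", "demography", "crisis"]
table = build_context(docs, entities)

for row in table:
    print(row)
```

A zero in a cell means the entity does not occur in that document, so the same table can be reduced to a yes/no relation when only the presence of an entity matters.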
How does it work?
The document set obtained from a search query is first processed by the semantic processor, which extracts entities from the documents. The analytical processor then builds a lattice of formal concepts from the documents and entities. Based on its analysis, linear dependencies between documents and entities are identified and removed: similar documents and repeated or insignificant entities are dropped, leaving only documents and entities that are independent of each other. Based on the shared significant entities, the documents of the initial set are divided into clusters, which are visualized as a multilevel tree.
The speed of the biclustering program is practically independent of the size of the document set. The speed of tree visualization depends only on the number of users working concurrently. The system's response time when working with cluster documents has also barely increased, so we expect this tool to be widely used by our many users.
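The grouping of documents by shared entities can be sketched with a brute-force enumeration of formal concepts. This is only an illustration under stated assumptions: the function name `formal_concepts` and the sample data are hypothetical, entity frequencies are reduced to presence or absence, the removal of linearly dependent documents and entities is not modeled, and the enumeration is exponential in the number of documents, so it says nothing about how the product achieves its performance.

```python
from functools import reduce
from itertools import combinations


def formal_concepts(context):
    """Enumerate formal concepts of a binary document-entity relation.

    context maps a document id to the set of entities it contains.
    A concept is a pair (extent, intent): the extent is the set of all
    documents that contain every entity of the intent.
    """
    docs = list(context)
    all_entities = set().union(*context.values())
    # The intent of the empty extent is the full entity set.
    intents = {frozenset(all_entities)}
    # Every other intent is the intersection of some documents' entity sets.
    for size in range(1, len(docs) + 1):
        for group in combinations(docs, size):
            common = reduce(lambda a, b: a & b, (context[d] for d in group))
            intents.add(frozenset(common))
    concepts = []
    for intent in intents:
        extent = frozenset(d for d in docs if intent <= context[d])
        concepts.append((extent, intent))
    # Largest clusters first, then alphabetically by entities.
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1])))


# Hypothetical sample data.
context = {
    "d1": {"economy"},
    "d2": {"economy", "demography", "crisis"},
    "d3": {"demography", "crisis"},
}
concepts = formal_concepts(context)

for extent, intent in concepts:
    print(sorted(extent), sorted(intent))
```

In this sample, document `d2` falls into both the "economy" cluster and the "demography, crisis" cluster, and every document of a cluster contains all of that cluster's entities.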
Example of use
The search results in this example were obtained in the Analytical Courier system for the query "[(the journalist the editor the correspondent) & (persecutions murder beating the conclusion "dismissal under pressure" attack repression threat arrest prosecution detention criminal pressure)]".
The result of processing the search query in the Analytical Courier system is shown below. In the left part of the screen the user can browse the cluster tree and select a cluster of interest; the system then displays the documents belonging to that cluster in the right part. Each document in a cluster contains all the entities listed in the hierarchy of tree nodes that lead to it. A document can belong to several clusters at the same time.
The method of conceptual document clustering is available in the current version of the Analytical Courier system.
Development of components for linguistic-semantic analysis of text in Russian and English
Linguistic analysis of the text
A software component has been developed that performs the following functions:
- lexical analysis (splitting text into sentences and lexemes),
- morphological analysis (determining the morphological characteristics of words, such as part of speech, gender, number, case, etc., and synthesizing inflected forms),
- pre-parsing (extraction of groups of lexemes: syntagmas, etc.),
- parsing (building the parse tree of a sentence and determining the syntactic roles of words in the sentence: subject, predicate, object, adverbial modifier, etc.),
- post-parsing (extraction of typed entities, …).
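The first stage in the list above, lexical analysis, can be sketched in a few lines of Python. The functions `split_sentences` and `tokenize` and their regular expressions are assumptions for illustration, not the product's component. The sentence rule is deliberately naive: it would wrongly split after an initial such as "D." in "D. Medvedev", which is one reason real lexical analyzers use abbreviation lists and further heuristics.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical sample text.
text = "The system parses text. It extracts entities!"
sentences = split_sentences(text)
tokens = [tokenize(sentence) for sentence in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, sentence_tokens)
```

Because `\w` matches Unicode letters in Python 3, the same tokenizer works on Cyrillic text without changes.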
Semantic analysis of the text
The subsequent semantic analysis of the text assigns types to entities (individuals, legal entities, animate objects, dates, regions, and many other types) and normalizes them. Various heuristic methods are used to identify entities that are referred to indirectly (for example, by a pronoun):
- Resolution of anaphoric references. Example: if a found fact contains pronouns ("he", "she", etc.), the object that the reference points to is identified.
- Resolution of abbreviations. Example: if an abbreviation occurs in a fact as the name of an object, the full name of the object behind the abbreviation is identified. For example, if NLMK occurs in the text, the system scans the text and, on finding Novolipetsk Metallurgical Plant, records it as a synonym for NLMK.
- Identification of geographical objects. Example: if a fact contains the name of a geographical object, for example "village of Ivanovo", the system searches the text for other geographical objects, for example Moscow region. This makes it possible to link the found village of Ivanovo to the entry in the reference book of countries and regions that is located in the Moscow region.
- Search for the fullest name of a person. Example: if a fact contains a "person" object, the system looks for a fuller name in the text. For example, if the fact contains the person "D. Medvedev" and earlier in the text there is the person "President of Russia D. Medvedev", the system takes the latter as the fullest name in this text.
Many types of entities (addresses, phone numbers, etc.) are extracted using rules that can be extended, including by the user.
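Two of the heuristics described above, abbreviation resolution and the search for the fullest name of a person, can be sketched as follows. The functions `expand_abbreviation` and `fullest_name`, the initial-letter matching rule, and the "MSU" sample are assumptions for illustration; the source does not say how the product matches an abbreviation to its expansion, and initial-letter matching would not work for an abbreviation such as NLMK, whose letters come from the Russian name.

```python
import re


def expand_abbreviation(abbr, text):
    """Return the first run of capitalized words whose initials spell abbr.

    Assumption: the expansion has one capitalized word per letter of abbr.
    Returns None when no such run is found.
    """
    if len(abbr) < 2:
        return None
    words = re.findall(r"\w+", text)
    size = len(abbr)
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        initials = "".join(word[0] for word in window)
        if initials == abbr and all(word[0].isupper() for word in window):
            return " ".join(window)
    return None


def fullest_name(short_name, mentions):
    """Pick the longest mention that contains short_name, else short_name itself."""
    candidates = [mention for mention in mentions if short_name in mention]
    return max(candidates, key=len, default=short_name)


# Hypothetical sample data.
text = "Moscow State University opened a new laboratory. MSU staff will run it."
expansion = expand_abbreviation("MSU", text)

mentions = ["President of Russia D. Medvedev", "D. Medvedev", "V. Putin"]
best = fullest_name("D. Medvedev", mentions)

print(expansion)
print(best)
```

Both functions fall back to a neutral result (`None` or the short name) when the text offers nothing better, so a caller can keep the original mention unchanged.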
Thesaurus of Russian
Development of a new modern thesaurus of Russian, compatible with the WordNet 3.0 standard, is complete. It is unique in size: it contains more than 160 thousand synonym groups, 700 thousand links between them, 170 thousand lexemes, and 13 types of semantic relations.
A web service has been developed for managing the thesaurus. It can be used in the "Analytical Courier" and "X-Files" systems as well as in other systems. Its distinguishing feature is the ability to work simultaneously with both the general thesaurus and the customer's thematic thesauruses. A tool for creating a new thesaurus or editing an existing one is included with the software component.
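A WordNet-style thesaurus of the kind described above, with synonym groups connected by typed semantic relations, can be modeled with a small in-memory structure. The classes `Synset` and `Thesaurus`, their method names, and the English sample entries are assumptions for illustration only and do not reflect the product's data model or its web service interface.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A group of synonyms plus typed links to other groups."""
    synset_id: str
    lemmas: list
    relations: dict = field(default_factory=lambda: defaultdict(list))


class Thesaurus:
    def __init__(self):
        self.synsets = {}
        self.by_lemma = defaultdict(set)  # lemma -> ids of synsets containing it

    def add(self, synset_id, lemmas):
        self.synsets[synset_id] = Synset(synset_id, list(lemmas))
        for lemma in lemmas:
            self.by_lemma[lemma].add(synset_id)

    def link(self, source_id, relation, target_id):
        self.synsets[source_id].relations[relation].append(target_id)

    def synonyms(self, lemma):
        """All other lemmas that share a synonym group with lemma."""
        result = set()
        for synset_id in self.by_lemma.get(lemma, ()):
            result.update(self.synsets[synset_id].lemmas)
        result.discard(lemma)
        return result

    def related(self, lemma, relation):
        """Lemmas reachable from lemma through one link of the given type."""
        result = set()
        for synset_id in self.by_lemma.get(lemma, ()):
            for target_id in self.synsets[synset_id].relations.get(relation, ()):
                result.update(self.synsets[target_id].lemmas)
        return result


# Hypothetical sample entries.
thesaurus = Thesaurus()
thesaurus.add("s1", ["car", "automobile"])
thesaurus.add("s2", ["vehicle"])
thesaurus.link("s1", "hypernym", "s2")

print(thesaurus.synonyms("car"))
print(thesaurus.related("car", "hypernym"))
```

A search component could use such lookups to expand a query term with its synonyms or broader terms before matching documents.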
Evolution of the Analytical Courier system's functions
Key topics
Search: federated search, multilingual search, content analytics, content classification, categorization and clustering, fact and entity extraction, taxonomy creation and management, information presentation (for example, visualization) to support analysis and understanding.
Information search: search across multiple sources, multilingual search, analytical processing of text information, tools for visual analytical processing of text information, document content classification, categorization and clustering, entity recognition, relation extraction, fact extraction, creation of taxonomies and ontologies, information visualization using geographic information services.
Conclusion
I-Teco's patented product "Analytical Courier" forms the core of a company's arsenal of analytical investigation systems and provides qualitatively new competitive advantages, security, and dynamic development.