2020/07/09 10:25:26

Speech technologies: on the way from recognition to understanding

For decades, speech technologies developed within their own narrow niche. A few years ago, however, came a breakthrough: niche technologies moved into the mass market of commercial products. The Center of Speech Technologies (CST) estimates that the Russian speech technology market could grow by 25% by the end of 2019. The practical success of speech solutions, however, depends largely on how well the two interlocutors, person and computer, understand each other. Development today centers on training "conversational" systems, above all the various chat-bots that act as assistants and consultants. Many barriers remain, primarily scientific ones, that researchers must overcome before people can chat about this and that with an autonomous robot. This article is part of the TAdviser overview "Technologies and solutions of artificial intelligence: change point".


Evolution of technologies

"Applying deep learning methods made it possible to considerably improve the quality of continuous speech recognition. And here, as we know, the key success factor is the completeness and quality of the training corpus used to build the acoustic and language models of speech," explains Alexey Lyubimov.

The sharp improvement in speech recognition quality, combined with other factors, became the "trigger" for the commercial launch of this segment: inexpensive hardware, available open-source libraries, and convenient services for training neural networks drastically cut the cost of using speech technologies and made them far more accessible. The opportunities that opened up pushed developers of speech solutions to move forward actively.

Achievements of today

"The problem of recognizing two or more speakers in one audio recording is being solved quite successfully. A dialog in two or more languages is not a problem either. Even if the room is noisy, speech can be recognized when the microphone is close to the sound source, unless, of course, we are talking about a production floor where the noise level is so high that human speech cannot be isolated," says Alexey Lyubimov of 3iTech.

Google's assistant services on the Google Home platform and Amazon Alexa have already learned to distinguish the voices of different people in one family and to interact with them according to their differing preferences. In the near future, assistants are expected to be able to hold a separate context for each interlocutor and to switch between these contexts.

IT solutions based on speech technologies: what is on sale in the market today

Specialized professional-grade solutions still form a separate market segment. One example is the professional GNOM dictaphones (made by CST), which record in any conditions, keep speech intelligible in very difficult acoustic environments, and hold a conclusion from the expert center of the Ministry of Internal Affairs of the Russian Federation on the suitability of their recordings for voice and speech identification studies.

Between these highly specialized solutions and simple free applications, such as the dictation tools bloggers use to speed up writing their posts, lies a broad class of commercial speech analytics systems, which can be divided into several main groups of products and services:

  • The simplest type: corporate-level Speech-to-Text systems or services, i.e. conversion of speech into text (the so-called transcript). Suppliers usually offer a system with a basic dictionary that can be supplemented at delivery with pre-tuning to the vocabulary of a particular field, for example law or finance. The package may include tools that let the client extend the vocabulary on their own, and the supplier can also add a required dictionary for an additional fee. On the Russian market these are offered as cloud services (3iVOX, Yandex Speech Kit), as boxed software products (Caesar-R by CST), and as solutions deployed on the customer's platform.
  • Text-to-Speech is the inverse task of speech recognition. Today such products are used mainly in voice assistants and in text voicing systems, for example for reading the news aloud.
  • Contact centers. Communication with the company's clients is supported across different channels: phone, e-mail, chats in mobile applications and on websites, social networks, messengers (omnichannel analytics). Such solutions use specialized acoustic models geared to processing telephone traffic. They also combine several types of analytics: speech, text, and business analytics. The most popular lexical models are adapted to the terminology and tasks of customer and technical support in e-commerce, telecommunications, banking, and medical services.

The systems compete on the quality of speech analytics and the depth of analysis, i.e. the ability of a system to collect data from different sources and from large arrays of unstructured information and to produce reports containing information valuable to the company.

Example of a contact center system report with a detailed analysis of operators' conversations. Source: TouchPoint software, 3iTech, 2019.
  • Biometric identification. CST has created VoiceKey.PLATFORM, a platform for multimodal biometric authentication of users in remote service channels. In other words, it performs two-factor identification: by face and by voice. The solution includes a liveness detector, which reveals attempts to pass verification using a voice recording or a photo.

The screen of a system that identifies users by voice and face
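The two-factor decision described above (face plus voice, gated by a liveness check) can be illustrated with a minimal sketch. It is not the VoiceKey.PLATFORM API: the `BiometricSample` type, the score ranges, and both thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BiometricSample:
    face_score: float   # similarity to the enrolled face, 0..1 (assumed scale)
    voice_score: float  # similarity to the enrolled voice, 0..1 (assumed scale)
    is_live: bool       # verdict of a liveness detector


def verify(sample: BiometricSample,
           face_threshold: float = 0.8,
           voice_threshold: float = 0.8) -> bool:
    """Accept the user only if the sample is live and both factors match."""
    # A replayed recording or a photo is rejected before scores are compared.
    if not sample.is_live:
        return False
    return (sample.face_score >= face_threshold
            and sample.voice_score >= voice_threshold)
```

Requiring both factors to pass separately is the strictest way to combine them; a real system might instead fuse the scores into one weighted value.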

Growth areas for applied speech analytics solutions

  • Speech analytics in workplaces, for example at points of sale. Unlike contact center solutions, these use acoustic models built for microphone recordings. A number of experts see this market segment as a new growth area for applied speech analytics. For example, 3iTech deploys systems that improve the quality of staff work in retail and services: the checkout area or a field employee is equipped with a microphone, and the conversation is analyzed in real time. The company says that during a pilot project at a federal retail chain, a system was trained to track whether cashiers correctly followed the scripts of promotional campaigns. As a result of the project, the average receipt grew by 2-3% thanks to higher sales of promotional goods.

"I think this is a demand for a self-organizing organization, where an AI system not only monitors the work of employees but also shows them points of professional and financial growth, a kind of virtual supervisor," believes Alexey Lyubimov.

IT solutions assess service quality, including for each individual operator. Source: 3iTech, 2019.
"Machine learning has opened fundamentally new opportunities for generating productive hypotheses from the available data. The biggest technological leap came in image recognition algorithms, where machines today far outperform humans. In audio transcription, however, certain gaps remain, although the progress is evident. The appearance of voice assistants confirms this," says Alexey Vyskrebentsev, head of the solution expertise center at Foresight.

The "hereditary disease" of speech analytics, its dependence on pre-tuning to the vocabulary of a specific subject area, has not yet responded to "treatment". For instance, if a model is tuned to process television news stories, it will not perform well on a recording of a talk at an IT conference: the resulting transcript, i.e. the text produced by computer speech-to-text conversion, will be of poor quality. Its quality can be raised slightly by detecting and removing from the transcript the redundant and filler words that often clutter speech. The future of speech technologies, believes Dmitry Dyrmovsky, head of the CST group, lies in the transition from speech recognition to speech understanding.

This means semantic analysis of speech, with which computer systems should learn to extract the meaning of what was said, its main idea, and to build the transcript around it. How do today's systems manage without this ability?
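The transcript cleanup mentioned above, removing filler words, can be sketched in a few lines. This is only an illustration under stated assumptions: the `FILLERS` list is hypothetical and English-only, and a production system would tune it per language and subject area.

```python
import re

# Hypothetical filler list, for illustration only.
FILLERS = ("you know", "i mean", "um", "uh", "er")


def strip_fillers(transcript: str, fillers=FILLERS) -> str:
    """Remove whole-word filler phrases from a transcript."""
    # Longer phrases go first so that multi-word fillers are matched whole.
    alternatives = "|".join(
        re.escape(f) for f in sorted(fillers, key=len, reverse=True)
    )
    # Also swallow a comma and the whitespace that follow the filler.
    pattern = re.compile(rf"\b(?:{alternatives})\b,?\s*", flags=re.IGNORECASE)
    cleaned = pattern.sub("", transcript)
    # Collapse any leftover runs of whitespace.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Such a filter only tidies the text; it does nothing for the vocabulary mismatch that causes the recognition errors in the first place.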

Chat-bots: conversational intelligence at the start of 2020

Natalia Lemeshevskaya, marketing director of Nanosemantika Laboratory, suggests considering two broad areas of application for "conversational intelligence": internal communications among company staff and the company's external communications with its clients and partners.

"Such projects are carried out both by us and by other companies, in Russia and abroad. Their number is gradually growing, because it is obvious that the technology optimizes internal processes and relieves employees of routine," says Natalia Lemeshevskaya.

The corporate HR department holds a special place among the interested departments: it turned out to be very convenient to run initial candidate screening with a virtual recruiter in automated mode, and then to automate follow-up calls that remind the applicant of an interview or confirm that they will attend. Voice bots cope with these tasks very well.

"Everything is moving toward putting work with clients on a 'conveyor belt', handing part of the tasks over to robotic dialogue systems, and then using the data obtained from the dialogs to improve products, services, or customer service," believes Natalia Lemeshevskaya.

The Gold chat-bot advises visitors to the Belarusbank website on the organization's services and products

Chat-bot: how to make one

"A solution built with a constructor tool suits a small customer base (up to 100 people). A person without special skills can build it. As a rule, a constructor is an intuitive program in which the 'creator' writes a list of questions and answers, with no deviation to the left or right. Communication proceeds strictly according to the script," says Lemeshevskaya.
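A strictly scripted bot of the kind described in the quote above amounts to a lookup in a list of questions and answers. The sketch below assumes a hypothetical `SCRIPT` table and fallback message; it does not reproduce any particular constructor product.

```python
# Hypothetical script: normalized question -> answer.
SCRIPT = {
    "what are your opening hours": "We are open from 9:00 to 18:00 on weekdays.",
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
}
FALLBACK = "Sorry, I can only answer questions from my list."


def normalize(text: str) -> str:
    """Lowercase the text and drop punctuation so small variations still match."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def reply(user_text: str, script=SCRIPT) -> str:
    """Answer strictly from the script; anything else gets the fallback."""
    return script.get(normalize(user_text), FALLBACK)
```

For example, `reply("What are your opening hours?")` returns the scripted answer, while any question outside the list falls through to `FALLBACK`, which is exactly the rigidity the quote describes.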

"These systems have moved very far from tables of rigidly written scripts in something like a Word document. Perhaps only a general description of the virtual consultant's goals and tasks still lives in a Word-type document, which is often what the customer brings to the developer," says Anna Vlasova, head of the linguistics department at Nanosemantika Laboratory.

"The platform provides a user-friendly interface, connectors to popular messengers so that the created system can be put to use right away, visual dialog designers, technologies for determining the intentions of the human interlocutor and for handling typos, editors for preparing formal rules of dialog management, visual editors for simpler dialogs, and much more," explains the head of the linguistics department at Nanosemantika Laboratory.

Solutions of tomorrow

Virtual assistants

Today the market offers a huge number of assistant applications ready to advise a device owner on all sorts of questions by voice, by text query, or by scanning an object or barcode in the case of retail. But the future, according to Accenture consultants, belongs to assistants that need no special interface. According to Accenture, by 2024 most interfaces will have no screen and virtual assistants will be integrated into daily tasks, and by 2027 digital assistants will work around the clock in the background at employees' workplaces.

Russian experts also expect rapid growth in this market segment. Kirill Petrov, founder and managing director of Just AI, estimates that by 2022 there will be more than 500 million smart speakers worldwide, and that by 2025 their number will exceed one billion. Such forecasts rest, in particular, on the expected synergy of several factors: the development of biometric technologies, better speech recognition quality, and improved NLU algorithms, together with the trend of integrating assistants with B2C services from "the real world".

Holographic assistants

A separate interesting direction is the use in real services of assistants with a holographic image, which gives a realistic touch of humanity to a device built around a computer board. The high-tech industry headed in this direction a few years ago. Today holographic employees can be found at airports around the world and behind counters in shopping centers. For example, in Accenture's solution for retail, a 3D projection of a salesperson talks to the customer, moving its lips and expressing emotions.

The first passengers of the Simferopol airport, which opened in 2018, were greeted not only by real employees but also by a hologram


The projection can also display additional information, for example products of interest or navigation around the surrounding space. Special software removes the excess noise that hinders recognition of the customer's voice in real conditions.

Meanwhile, holographic assistants continue to move into the private sector. Today they are trying to settle on the living room coffee table as an elegant trinket, most often in the form of a "talking" cylinder. For the Obexx AI Box virtual assistant, made by the Chinese company of the same name that specializes in innovative voice assistants, the owner can create a personal avatar image; the corresponding tool is built into the assistant's application.

The Obexx AI Box holographic virtual assistant has a visual image that can be changed
"Talking" hologram assistants are becoming elements of a modern interior

However, to become a "real person", this "talking hologram" still has to acquire a number of skills and integrate them into a unified communication environment, a single context in which several important aspects must be distinguished: mood, the specifics of the surrounding situation, prediction of the user's wishes, and so on.

Challenges facing the industry. Can a smart program "talk like a person"?

Alexey Ushakov, head of product management for remote service automation at the CST group, is sure that new trends in the development of virtual assistants will be tied to parallel listening, analysis of mood and surroundings, and prediction of the user's wishes. Such an integrated approach will, in turn, stimulate the emergence of new open projects capable of consolidating advanced innovative developments, for example platforms around which an ecosystem of virtual consultant skills will form.

Technology tasks that still have to be solved

Not all the particular problems of high-quality speech recognition have been solved. For example, 3iTech specialists are now working on the "remote microphone" problem: recognizing speech from a microphone whose distance to the sound source constantly changes. It is very hard, for instance, to recognize a voice picked up by a static microphone on a sales floor if the employee whose speech is to be analyzed moves around the hall.

"Accent variability is very high, and each accent needs its own training sets. So training sets are needed for Russian spoken with an English accent, an Armenian accent, a Chinese accent, and so on. And while recognition is still of high quality with a weak accent, with a strong accent, alas…," explains Alexey Lyubimov.

"There are as yet no training sets with which a system can be taught to understand a person with a strong accent or a speech defect," states Lyubimov.

Problems of scaling speech solutions

Can we say that today's virtual consultants "understand" the user's speech roughly as another person would? Yes, if the conversation is on a narrow subject.

Alexey Vyskrebentsev sums up the current state of practical implementations:

"All solutions are now being considerably simplified, otherwise they cannot be scaled because of their dependence on data availability. The most advanced solutions (image recognition and voice-to-text transcription) are probabilistic in nature. Algorithms have recently been 'getting smarter', but integrating them into companies' solutions requires additional tuning and further training of the systems to raise the quality of their work."

Few solutions, even among the most advanced, deliver a fast, high-quality result "out of the box". This state of affairs explains why the market is skeptical about such technologies, especially in companies that still keep paper logbooks or have problems with data quality, the expert notes. But industry specialists regard successful implementations, above all in contact centers, as a good base of reference projects for further promotion in the market.

Market fragmentation: a barrier to development

One of the biggest problems of the Russian speech technology market is its fragmentation. In effect, each developer tackles the shortage of training sets alone. One slow step toward consolidating the efforts of separate development teams is the SOVA project (Smart Open Virtual Assistant), launched by the companies Nanosemantika Laboratory and "Ashmanov's Neuronets".

SOVA is a voice virtual assistant and a free open platform for building virtual assistants; its closest analog is Amazon Alexa. SOVA consists of a set of software libraries, utilities, and services; its basic elements are engines for speech recognition, for the chat-bot, and for speech synthesis. The developers state that SOVA software can run on practically any hardware, and that its capabilities are extended by intelligence modules, special plug-ins that are developed by the developer community and add new functionality to SOVA.

The project manifesto declares: "We want to bring together a community that will train and improve all elements of the virtual assistant's intelligence, from speech recognition to decision-making systems, moving us along the path to General AI". In August 2019 the project received 300 million rubles of financing from the Foundation for Support of Projects of the National Technology Initiative (NTI), set up by RVC, to create a software package for equipping devices and applications with a voice interface and for developing voice assistants.

"How successful this project will be, and whether the private company that received the grant will develop speech technologies for the benefit of all market participants, remains a question," reflects Alexey Lyubimov.

For his team, the joint project of NVidia and Amazon Web Services matters more today, since it gives developers a real chance to train neural networks.

"As for the role of the state, it would be great if Russia had an organization genuinely interested in developing AI in the country, because it is necessary to coordinate the efforts of the developer community and to bring together libraries and speech corpora for training neural networks. To support export solutions that use speech recognition, it makes sense to create training sets in foreign languages that are available to domestic developers. Both widespread languages, such as English, French, and Chinese, and dialects, such as American English and Latin American Spanish, are needed. There are clearly not enough training sets in local languages: Vietnamese, Indonesian, Swahili. This would give a powerful new impetus to the domestic speech technology market," Lyubimov is sure.

How to talk to the chat-bot "properly"?

Today's speech IT systems are at the level of creating autonomous skills. They still have to clear a serious barrier on the way to true erudition, whose key feature is the integration of various skills into a single, unified situational context.

Forming a complete context is perhaps the key challenge of the near future. Without it, a computer system cannot make decisions based on the full set of data related, in one way or another, both to the user and their request and to the situation relevant to the decision.

"Today chat-bots, including voice ones, are applied successfully only in narrow domains, within contextually limited scenarios. Moreover, specialized question answering systems that use question classifiers and question-answer databases often work more adequately than systems built on neural networks. And there are no text chat-bots at all that can communicate adequately on a wide range of questions," claims Alexey Lyubimov.

And what about voice assistants and smart speakers, such as Alice created by Yandex? Or the Alexa voice assistant from Amazon, or Google Assistant? In effect, they define the leading edge of speech technologies for the mass user today.

"Developer companies collect data from all their devices in order to build training sets. Only once the appropriate training sets have been created can a breakthrough in 'conversational intelligence' be expected. It will happen when it becomes possible to apply machine learning to dialogue systems and still obtain acceptable quality. For now, voice assistants and smart speakers work as question answering systems," explains Alexey Lyubimov.

In other words, whoever owns training sets covering the broadest range of topics will be able to create a universal chat-bot with which one can hope to talk properly. Perhaps bots need a training program along the lines of a general secondary school?

How to train the virtual consultant?

"Even a person has no universal model of knowledge. A person will understand nothing at a medical conference unless they are a physician. But creating a computerized system is much simpler, cheaper, and faster than training a human specialist. Digital systems have no need to reproduce the way a human specialist is trained. Creating an electronic person should not be the goal at all. Systems should and will improve, and will learn to solve very complex problems. But these will be specific, contextually limited tasks."

Comparing a person and a virtual character, the expert notes that training a person to work in a given field takes ten to fifteen years or more, while artificial intelligence can be trained much faster. In other words, an intelligent program has its advantages, but the breadth of outlook that people have is so far beyond its reach.

"The virtual consultant has access to the corporate customer's internal databases or corporate content and constantly supplements or updates its knowledge. For example, virtual consultants serving mobile operators receive information about changes in the tariff line (new tariffs, changes in the cost of SMS, calls, tariff switches, additional services, etc.), and virtual consultants in retail receive data on changes in store opening hours, new promotions, and much more."

"For example, all remarks of the human interlocutor that the virtual consultant did not understand are processed by clustering algorithms, and the resulting clusters are then analyzed by knowledge specialists, who link them to a particular type of answer in the virtual consultant's knowledge base," explains Anna Vlasova.
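The grouping of misunderstood remarks described in the quote above can be illustrated with a deliberately simple stand-in: greedy clustering by word overlap (Jaccard similarity). The source does not say which clustering algorithm is used, so both the method and the `threshold` value here are assumptions made for the sketch.

```python
def tokens(text: str) -> set:
    """Split a remark into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Share of common words among all words of the two remarks."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_remarks(remarks, threshold: float = 0.5):
    """Greedily assign each remark to the first cluster it resembles.

    The first remark of a cluster serves as its representative.
    """
    clusters = []
    for remark in remarks:
        for cluster in clusters:
            if jaccard(tokens(remark), tokens(cluster[0])) >= threshold:
                cluster.append(remark)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([remark])
    return clusters
```

Each resulting cluster is then a candidate for one new entry in the knowledge base, which a specialist reviews by hand as the quote describes.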

There are also other ways to train an automated system continuously. The more such methods a virtual consultant uses to replenish its knowledge, the more successfully it works in the end, Vlasova believes.

"The successful implementation of a speech solution depends above all on a correct problem statement. High speech recognition quality is not always needed: in archaic smart home systems, for example, making any sound was enough to turn on the light in a room. Today 70% recognition accuracy is quite enough to evaluate an operator's work in a contact center; with the appropriate metrics, statistical models, and so on, the system will work successfully. If the task is automatic information extraction, higher accuracy is needed, no less than 90%. NLU/NLP technologies (Natural Language Understanding and Natural Language Processing), on the other hand, are needed only when building dialogue systems. Our civilization is only just approaching this area."
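The accuracy thresholds quoted above can be made concrete with a small sketch. It assumes that "accuracy" is measured as one minus the word error rate (WER), a common metric that the source does not name explicitly; the task labels in `THRESHOLDS` are hypothetical, while the 70% and 90% values come from the quote.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Thresholds from the quote; the task names are illustrative labels.
THRESHOLDS = {"operator_quality": 0.70, "information_extraction": 0.90}


def accurate_enough(reference: str, hypothesis: str, task: str) -> bool:
    """Check whether a transcript meets the accuracy required by the task."""
    accuracy = 1.0 - word_error_rate(reference, hypothesis)
    return accuracy >= THRESHOLDS[task]
```

With two wrong words out of ten, for instance, a transcript reaches 80% accuracy under this definition: enough for evaluating an operator, but not for automatic information extraction.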

Thus, a successful speech technology implementation project today is a compromise between the narrowness of the chosen application area and the effort spent on training the computer program. In practice this most often shows up as a business insight: a business idea that makes it possible to derive clear benefit from existing technologies at reasonable cost. Further breakthroughs, however, can be expected only with the arrival of much more universal computer systems capable of understanding continuous human speech.
