2019/07/30 11:39:25

Data labeling

The global machine learning market is growing at a rate of about 50% a year. In 2018 it was worth $1.8 billion, and by 2023 it is projected to reach almost $20 billion; the deep learning market alone is forecast to be worth $18.16 billion by 2023, growing at a CAGR of 41.7%. This covers not only the obvious components, hardware, software and services, but also a qualitatively new kind of production that has been named data labeling. The journalist Leonid Chernyak looks in more detail at the emergence of this term and the work it describes in this material, prepared specially for TAdviser.

The emergence of data labeling is tied to the need to feed training systems large volumes of specially prepared data. Discussion of this usually stops at the simple observation that Big Data forms the basis of machine learning. Meanwhile, according to Cognica Research, the data labeling segment will reach $1.2 billion in 2023[1].

The need for a labeling industry arises because practical value lies not in some abstract AI (Artificial Intelligence), but in a practically oriented subset of it known by the same abbreviation AI, here standing for Augmented Intelligence, i.e., AI that amplifies human capabilities. Augmented Intelligence covers problems such as image understanding, working with texts in natural languages, and vehicle control. To do their work, all these AI applications need information about the outside world.

The bustle around data labeling is a reason to appreciate once more the wisdom of the mathematician Clive Humby, who said "Data is the new oil" in 2006. The Economist confirmed this wisdom in a 2017 report titled "The world's most valuable resource is no longer oil, but data." Yet raw data, like crude oil, has no consumer value in itself; that is their main similarity. To turn oil into fuel, lubricants and other useful products, a huge oil-refining industry was built, and the biggest profits go not to the oil-producing countries but to the global corporations that specialize in refining. A similar procedure must be applied to data to turn it into goods. Unlike oil, however, there are as yet no tools for automating data preprocessing, and none are expected in the near future, so this tedious work is performed manually by low-skilled employees (manual data labeling). They might be called the "blue collars" of a machine learning industry that until now was represented only by "white collars." The volume of manual work is enormous: annotating a single image of a person, for example, requires marking from 15 to 40 points, and all of it is done with ordinary human-computer interface tools.
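The keypoint annotations described above are normally stored as structured records. A minimal sketch of such a record is shown below; the field names loosely follow the widely used COCO keypoint convention and are an assumption of this illustration, not something specified in the article:

```python
# Sketch of a keypoint annotation record for one human figure.
# Field names loosely follow the COCO keypoint format (assumed here).

KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow", "left_wrist", "right_wrist",
    "left_hip", "right_hip", "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]  # 15 points, the lower bound mentioned in the text

def make_annotation(image_id, points):
    """points: dict mapping name -> (x, y, visible), set by a human labeler."""
    flat = []
    for name in KEYPOINT_NAMES:
        x, y, v = points.get(name, (0, 0, 0))  # (0, 0, 0) = not labeled
        flat.extend([x, y, v])
    return {
        "image_id": image_id,
        "keypoints": flat,  # 3 numbers per point
        "num_keypoints": sum(1 for n in KEYPOINT_NAMES if n in points),
    }

ann = make_annotation(42, {"nose": (120, 80, 2), "left_eye": (112, 70, 2)})
print(ann["num_keypoints"])  # 2
```

Even this small example makes the labor cost visible: every one of those coordinate triples is placed by a person, one click at a time.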

China has an obvious chance to become the super-monopolist in the field of data labeling. The country has the necessary number of highly qualified specialists, state programs for the development of AI, and at the same time an unlimited pool of low-level contractors willing to take on the role. They work from home or in cramped conditions at so-called "tagging factories," earning extremely low wages of less than a dollar and a half per hour.


A typical tagging factory[2] employs more than 10,000 home-workers who label data for optical character recognition (Optical Character Recognition, OCR) and the processing of texts in natural languages (Natural Language Processing, NLP). Its clients include large companies, Microsoft among them, as well as universities. Its head put it this way:

We are the construction workers of the digital world: we lay brick upon brick, yet we play a noticeable role in AI. Without us it is impossible to build skyscrapers.


Although labeling may appear to be a trivial operation, the insertion of tags into an image or a text, these words carry a deep meaning. In the course of labeling a qualitative conversion takes place: raw data is supplemented with metadata and turns into information. The most utilitarian definition of information reads: "Information is data plus metadata"[3].
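The definition "information is data plus metadata" can be made concrete with a small sketch: a raw sentence (data) plus the entity tags a human annotator attaches to it (metadata). The sentence, spans and labels below are purely illustrative, not taken from the article:

```python
# Sketch of "information = data + metadata" for a text labeling task.

raw_data = "Microsoft hired annotators in Beijing"  # the data

# Metadata produced by a human labeler: character spans plus entity types.
# Span values are illustrative and chosen to match the string above.
metadata = [
    {"start": 0, "end": 9, "label": "ORG"},    # "Microsoft"
    {"start": 30, "end": 37, "label": "LOC"},  # "Beijing"
]

# Data plus metadata: a record a training system can actually learn from.
information = {"text": raw_data, "entities": metadata}

for ent in information["entities"]:
    span = information["text"][ent["start"]:ent["end"]]
    print(f"{span} -> {ent['label']}")
# Microsoft -> ORG
# Beijing -> LOC
```

The raw string on its own tells a machine nothing; only the combined record carries the meaning a learning algorithm can use.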

Technologies and markup languages for images are a new phenomenon; the first publications on the subject date to 2016. The idea of marking up text is much older and comes from the printing trade: the proofreading marks entered into manuscripts were the first markup languages. The revolution in markup was made by Charles Goldfarb, a researcher at IBM who is called "the father of modern markup languages." He created the Generalized Markup Language (GML), a language understood by the machine rather than by the typesetter. Tim Berners-Lee, the founder of the WWW, used it as the prototype for HTML, the hypertext markup language of the first WWW project. In the mid-1990s Jon Bosak proposed a version of SGML for the Web. The working version of the modern language was developed in 1996 by a working group of eleven people headed by James Clark, a well-known expert in open-source programming; it was Clark who proposed the now-accepted name, XML. For image labeling today there are both freely distributed tools (Sloth, Visual Object Tagging Tool) and commercial ones (Diffgram, Supervisely), among others. The list of text labeling tools used in natural language processing (NLP) is considerably longer.
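XML, the endpoint of the lineage described above, is in fact a common carrier for image labels today. As an illustration, here is a minimal sketch that emits a bounding-box annotation in a layout loosely modeled on the Pascal VOC format; the exact schema is an assumption of this example, not something the article specifies:

```python
# Sketch: serializing an image bounding-box annotation as XML,
# loosely following the Pascal VOC layout (assumed, not from the article).
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, boxes):
    """boxes: list of (label, xmin, ymin, xmax, ymax) from a human labeler."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        box = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"),
                            (xmin, ymin, xmax, ymax)):
            ET.SubElement(box, tag).text = str(val)
    return ET.tostring(root, encoding="unicode")

xml_text = voc_annotation("street.jpg", 640, 480,
                          [("person", 10, 20, 110, 220)])
print(xml_text)
```

The same structural idea that Goldfarb introduced for documents, machine-readable tags wrapped around content, is here applied to pixels rather than prose.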

What unites all these markup technologies is that they turn data into information. This information then becomes a source of knowledge in the applications falling under the definition of AI, which perform the next function of intelligence: the transformation of information into knowledge.

The existence of this natural technological chain distinguishes machine learning from the symbolic approach to AI, with its artificial attempts to transfer human knowledge into the machine. Perhaps labeling will one day be automated, but that will require qualitatively new sensors and text-processing tools. With their appearance, today's technologies for working with data, which are everywhere mistakenly called information technologies, will become information technologies in the true sense of the word.