Data labeling
The global machine learning market is growing by approximately 50% per year. In 2018, its volume amounted to $1.8 billion, and for 2023 it is estimated at almost $20 billion (see "Deep Learning Market worth 18.16 Billion USD by 2023 with a growing CAGR of 41.7%"). This includes not only the obvious components (hardware, software, and services) but also a qualitatively new type of production called data labeling or data markup. More information about the origin of this term and the use of such operations is in the material prepared specifically for TAdviser by journalist Leonid Chernyak.
The emergence of data labeling is tied to the need to feed large amounts of specially prepared data into systems being trained. Discussion of this need is most often limited to the simple statement that Big Data is the basis of machine learning. Meanwhile, the data labeling segment, according to Cognica Research, will reach $1.2 billion in 2023[1].
The need for a markup industry stems from the fact that practical significance belongs not to some abstract AI (Artificial Intelligence) but to its practically oriented subset, known by the same abbreviation AI but standing for Augmented Intelligence, that is, AI that enhances human capabilities. Augmented Intelligence includes image recognition, natural language text processing, transport control tools, and so on. All these AI applications require information about the outside world to work.
The hustle and bustle around data markup invites a fresh look at the words of mathematician Clive Humby, who said in 2006, "Data is the new oil." The Economist echoed the idea in its 2017 report "The world's most valuable resource is no longer oil, but data." But raw data, like crude oil, has no consumer value on its own, and this is their main similarity. To turn oil into fuel, lubricants, and other useful products, a giant oil refining industry was created. The largest profits are earned not by oil-producing countries but by global corporations specializing in oil processing. Data must undergo similar processing to become a product. Unlike oil refining, however, data preprocessing cannot be automated now or in the foreseeable future, so this tedious manual work will be carried out by low-skilled workers (handmade data labeling). They can be called the "blue collars" of the machine learning industry, which until now has been represented exclusively by "white collars." These workers have to do a huge amount of work by hand. For example, annotating a single image of a person requires placing 15 to 40 points, all of it done with ordinary human-machine interface tools.
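To give a sense of what one such manual annotation might look like as data, here is a minimal Python sketch. The `Keypoint` class, the `validate_annotation` function, and the point names are hypothetical and not taken from any particular labeling tool; only the 15 to 40 range comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Keypoint:
    """One point placed by a human labeler on an image of a person."""
    name: str   # hypothetical point name, e.g. "point_3"
    x: float    # horizontal pixel coordinate
    y: float    # vertical pixel coordinate


def validate_annotation(points, min_points=15, max_points=40):
    """Return True if the number of points falls in the expected range."""
    return min_points <= len(points) <= max_points


# Simulate a labeler placing 17 points on one image (illustrative values).
points = [Keypoint(f"point_{i}", x=10.0 * i, y=5.0 * i) for i in range(17)]
print(validate_annotation(points))
```

A check like this is the kind of simple quality control a labeling pipeline could run before accepting a worker's submission.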
China has an obvious chance of becoming a super-monopoly in data labeling. The country has the necessary number of highly qualified specialists, state programs for AI development have been adopted there, and at the same time there is an unlimited supply of people willing to act as low-level performers. They work at home or in cramped conditions in so-called "tagging factories," receiving extremely low wages of less than one and a half dollars per hour.
A typical example of a markup factory is Mada Code[2], which has more than 10,000 homeworkers who perform data markup for Optical Character Recognition (OCR) and Natural Language Processing (NLP) tasks. Its clients include large companies, among them Microsoft, and universities. Its head said:
Although markup seems a trivial operation (adding tags to an image or text), these words carry a deep meaning. The markup process performs a qualitative transformation: raw data is supplemented with metadata and turned into information. The most utilitarian definition of information is as follows: "Information is data plus metadata"[3].
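The "data plus metadata" definition can be illustrated with a minimal Python sketch. The `LabeledItem` class, the `label` function, and the tag names are hypothetical, chosen only for illustration; real labeling tools use their own formats.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledItem:
    """A raw data item together with the metadata added during markup."""
    data: str                                     # raw data, e.g. a sentence
    metadata: dict = field(default_factory=dict)  # tags added by a labeler


def label(raw: str, **tags) -> LabeledItem:
    """Attach tags (metadata) to a raw item: data plus metadata."""
    return LabeledItem(data=raw, metadata=dict(tags))


# The raw sentence alone is data; with tags attached it becomes
# information that a training system can use.
item = label("The cat sat on the mat.", language="en", topic="animals")
print(item.metadata)
```

The raw string is left unchanged; the markup step only adds a layer of description on top of it.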
Technologies and languages for image markup are new; the first publications on the topic date back to 2016. The idea of marking up texts is much older and comes from printing. The first markup languages were the proofreading marks written into manuscripts. The real markup revolution was made by Charles Goldfarb, an IBM researcher who is called the "father of modern markup languages." He created the Generalized Markup Language (GML), which was meant to be understood by the machine rather than the typesetter. GML later evolved into the Standard Generalized Markup Language (SGML), which WWW creator Tim Berners-Lee used as a prototype for the HTML hypertext markup language in the first WWW project. In the mid-90s, Jon Bosak proposed his own version of the language, "SGML for the Web." A working version of the new language was developed in 1996 by a working group of 11 people led by James Clark, a well-known expert in open-source software. It was he who proposed the name now adopted: XML. For image markup there are now free tools (Sloth, Visual Object Tagging Tool), commercial ones (Diffgram, Supervisely), and others. The list of tools for marking up texts in natural language processing (NLP) is considerably longer.
What all these markup technologies have in common is that they turn data into information. This information then becomes a source of knowledge in applications that fall under the definition of AI and perform the next function of intelligence, which is to turn information into knowledge.
This natural technological chain distinguishes machine learning from the symbolic approach to AI, with its artificial attempts to transfer human knowledge to the machine. Markup may someday be automated, but that will require qualitatively new sensors and tools for working with texts. With their advent, today's data technologies, universally and mistakenly called information technologies, will become information technologies in the full sense of the word.
Data Markup in Russia
2022: Russian data markup market up 70%
The Russian data markup market grew by 70% in 2022 and amounted to 1.6 billion rubles. ABK announced this on August 17, 2022.
At the same time, the number of users registered on specialized data markup platforms grew by 60% in the first half of 2022. A year earlier, the growth rate was lower, at 20%.
In particular, a significant increase in users on the Elementary platform came from the self-employed. As of August 2022, they account for 85% of all platform users. This became possible thanks to the connection of a free Sberbank service for corporate clients, "Register payments to self-employed within the framework of a salary project." With it, registry-based crediting of funds to the accounts of the self-employed is carried out quickly, and their receipts are generated automatically. A platform user (an individual) only needs to obtain self-employed status (payer of professional income tax) through the "Svoy Delo" service in the SberBank Online mobile application, or simply connect this service if already registered as self-employed with the Federal Tax Service or another bank. Registration in the service takes only a few minutes.
At the end of 2021, Elementary ran a pilot to involve employees of several Sberbank territorial branches who were on maternity leave in labeling work. In the first 3 days of the pilot, more than 1,000 people registered on the platform. As of August 2022, women on maternity leave make up 10% of the labelers working on the platform. Another 5% are people with limited mobility and mothers of children with disabilities.
"As the artificial intelligence market develops and AI solutions grow in demand and popularity, so does the need to label the data required to train high-quality machine learning models. On specialized platforms such as Elementary, hundreds of thousands of data items are labeled daily, and for the people doing this work it is a good opportunity to earn extra income. The ability to work from home on a flexible schedule is especially important for women on maternity leave and people with limited mobility. When creating our platform, we initially conceived it as a partly social project, and we see that our expectations have been met," said Dmitry Teplitsky, head of the Elementary platform.
In 2022, the pilot to involve employees on maternity leave in labeling work is being scaled to all of Sberbank.
2020: Sberbank will pay 400 unemployed people in the Caucasus for viewing and marking pictures of food
On September 11, 2020, Sberbank told TAdviser that it would give unemployed residents of seven regions of the North Caucasus Federal District the opportunity to earn money on the TagMe data markup platform.
In particular, unemployed residents of the North Caucasus regions and those who need part-time work will be invited to label audio recordings and images of food for a number of SberAI projects in speech recognition and computer vision. At the first stage, up to 400 residents of the North Caucasus Federal District will take part in the project. If the pilot is successful, it can be scaled up, including to other federal districts.