2022/06/06 13:15:59

Data

Data is a reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing (definition according to ISO/IEC 2382-1:1993).

The volume of generated digital data in the world

Data for 2018 with forecast for subsequent years

How data became the raw material of the 21st century

In this article the ordinal "fourth" appears three times: the fourth transformation in the representation of data, the fourth paradigm of science and the "Fourth Industrial Revolution." Why it is the fourth each time is hard to say, but it is natural that all three are tied to data, which has become a critical raw material of the 21st century. It is no coincidence that data has been called the oil of the Fourth Industrial Revolution. In this material prepared for TAdviser, journalist Leonid Chernyak discusses the fundamental changes in humanity's attitude to data.

Difference between data and information

Back in the mid-2000s, this was hard to imagine. Data was simply not discussed as a component of computing. Since the advent of computers in the mid-1940s, attention was focused first on hardware and later on software, while data was treated as something obvious and taken for granted. The result was a strange one-sidedness that set IT apart from other industries. Any production can be pictured as two things: a set of technologies, and raw material that passes along the technological chain and turns into the final product. In IT, however, the technological process of converting source data into results has remained "behind the scenes."

The reassessment of values, that is, the recognition of the significance of data and of data processing, began around 2010 and took only a few years. Ironically, data now often receives excessive attention. Part of the computing community and of those around it clearly suffers from a morbid condition that could be called datamania. One of its symptoms is the abuse of the term "Big Data."

Another misunderstanding in IT is that the concepts of "data" and "information" were long treated as synonyms. This was certainly encouraged by the statistical theory of information, which would be more accurately called a theory of data transmission. The name "information theory" was suggested by John von Neumann to Claude Shannon, who was extremely modest in his own claims. In this theory the measure of transmitted information is bits and bytes, although by definition these refer to data represented in binary form.
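The point can be illustrated with a minimal sketch of Shannon's entropy formula, H = -sum(p * log2(p)). The function name and the sample strings are assumptions made for illustration. The measure depends only on how often each symbol occurs, not on what the message means, which is why it describes data rather than information in the everyday sense.

```python
import math
from collections import Counter


def shannon_entropy_bits(message: str) -> float:
    """Average number of bits per symbol: H = -sum(p * log2(p))."""
    if not message:
        return 0.0
    counts = Counter(message)
    total = len(message)
    # Only symbol frequencies enter the formula; meaning plays no role.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Illustrative strings: one symbol, two symbols, eight distinct symbols.
    for text in ("aaaaaaaa", "abababab", "abcdefgh"):
        print(text, shannon_entropy_bits(text))
```

Two messages with the same symbol statistics receive the same value, whether or not either of them makes sense to a reader.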

It is telling that for many years the author, using a journalist's privilege, put the same question to interviewees at every opportunity: "What do you see as the difference between data and information?" Not once (!) did he receive a meaningful answer. Almost no one had considered that so-called information technologies deal with data and not with information at all. This disregard for the nature of data meant that for decades, up to the 2010s, only engineering techniques for transferring, storing and processing data were developed. All one needed to know about data came down to binary or decimal units for measuring its amount, formats, and forms of organization (arrays, bytes, blocks and files).

But the situation around data has changed dramatically. This is reflected in the popular slogan "It's the data, stupid," which captures the growing role of data in modern science, business and other areas of human activity. The shift of focus to data is a consequence of a major cultural transformation.

Four fundamental transitions can be distinguished, each of which is characterized by an increase in content availability:

  • The invention of paper and the transition from clay and wax tablets, parchment and birch bark to a practical and inexpensive carrier.
  • The invention of the printing press and the transition from manual copying of manuscripts to publications replicated by machines.
  • The transition from a material carrier, most often paper, to a digital one; the separation of content from its physical medium.
  • The transformation of content into data that can be processed and analyzed automatically.

The main feature of the last transition is that in the 21st century data was abstracted from its carrier. The tools needed to work with data were created, which opened up unlimited opportunities for extracting information from it.

From data to knowledge, DIKW model

In fairness, the academic community began to think about the significance of data as a source of knowledge, and about its place in the system of knowledge accumulation, earlier than business did, from about the late 1980s. That is when the four-level DIKW model, now a classic, took shape: data, information, knowledge, and wisdom (deep understanding).

  • Data is obtained from the outside world as a result of human activity or from sensors and other devices.
  • Information is created by analyzing the links and relationships between fragments of data, as a result of answering the questions: Who? What? Where? How much? When? Why?
  • Knowledge is the hardest concept to define; it results from the synthesis of the information received and the human mind.
  • Wisdom (deep understanding) serves as the basis for decision-making.
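The lower levels of the model can be sketched in a few lines. This is an illustration only: the sensor readings, the function names and the threshold rule are all made up, and the rule merely stands in for the human judgment that the model places at the knowledge and wisdom levels.

```python
from statistics import mean

# Data: raw readings as (sensor_id, hour, temperature_celsius); values are invented.
READINGS = [
    ("s1", 9, 21.5), ("s1", 10, 22.0), ("s1", 11, 27.5),
    ("s2", 9, 19.0), ("s2", 10, 19.5), ("s2", 11, 20.0),
]


def to_information(rows):
    """Information: relate the fragments (which sensor? how much on average?)."""
    by_sensor = {}
    for sensor, _hour, temp in rows:
        by_sensor.setdefault(sensor, []).append(temp)
    return {sensor: mean(temps) for sensor, temps in by_sensor.items()}


def to_decision(info, limit=23.0):
    """Decision support: a hypothetical rule, not a model of real expertise."""
    return [sensor for sensor, avg in info.items() if avg > limit]


if __name__ == "__main__":
    info = to_information(READINGS)
    print(info)
    print(to_decision(info))
```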

For several decades, the DIKW model has remained the basis for research in what is called Knowledge Management (KM). It is generally accepted that KM studies the processes of creating, preserving, distributing and applying the basic elements of intellectual capital necessary for the work of the organization, allowing the transformation of intellectual assets into means for increasing productivity and efficiency.

KM did not manage to produce tangible results, to move beyond general reasoning or to create corresponding tools. It has been and remains an area of interest for a very limited community of scholars. The failure of KM has several causes: the desire to manage knowledge was ahead of its time, and the need to work with knowledge had not yet formed. Most importantly, level D of the DIKW model remained outside KM's field of view.

However, the failure of KM does not mean that the problem of automating the extraction of knowledge from data does not exist. Nature abhors a vacuum, and in the second decade of the 21st century KM's place was taken by a new field with the not entirely apt name Data Science. The role and place of Data Science in the system of knowledge accumulation is shown in the figure below.

A traditional researcher observes the system directly, whereas a data scientist works with accumulated data

For thousands of years, people observed the world around them with the tools they had and recorded their knowledge in whatever form was available. Today the process is split into the accumulation of data and the analysis of that data. A striking example is modern astronomy or geophysics, where observation with data accumulation and the subsequent analysis of the data are independent tasks.

Data Science

The term Data Science was proposed in the early 2000s by William Cleveland, a professor at Purdue University and one of the best-known specialists in statistics, data visualization and machine learning. At about the same time, the Data Science Journal of CODATA (the Committee on Data for Science and Technology of the International Council for Science) appeared. Data Science was then defined as a discipline that combines various areas of statistics, data mining, machine learning and the use of databases to solve complex problems related to data processing.

Data Science is an umbrella term. Under this general name, many different methods and technologies for analyzing large amounts of data have been collected. In the strict sense, for example as Karl Popper defined science, Data Science cannot be called a science. Nevertheless, specialists in the field use what is called the scientific method, so they can quite rightly be called data scientists. The classical cycle of the scientific method is shown in the figure below.

Scientific method cycle

The general concept of Data Science divides into two directions. One, less popular, would be more accurately called Data-Intensive Science; the other, widely publicized, is the application of Data Science to business.

Fourth paradigm of science

Data-Intensive Science can be described as scientific research that makes significant use of data. The term refers to a new, data-driven and exploration-centered style of research that relies heavily on computing infrastructure and software for handling, analyzing and distributing data. In 2006, astronomer and futurologist Alex Szalay and the outstanding computer scientist Jim Gray proposed their own name for it: "The Fourth Paradigm of Science."

They divided the scientific past of humanity into three periods according to how data was used. In ancient times, science was limited to describing observed phenomena and drawing logical conclusions from observations. In the 17th century there was more data, and people began to create theories, using analytical models as evidence. In the 20th century, computers made numerical modeling possible. Finally, in the 21st century, scientific methods based on data analysis (eScience) began to develop, in which theory-synthesizing, statistical and other methods of extracting useful information are applied to colossal amounts of data.

Szalay and Gray wrote: "In the future, working with large amounts of data will involve sending calculations to data, rather than loading data into a computer for subsequent processing." The future arrived much sooner than expected: already in 2013 Szalay wrote about the era of Data-Intensive Science as a fait accompli.

By 2017, eScience methods were being applied not only in data-intensive fields such as astronomy, biology and physics. They also found their way into the humanities, significantly expanding the field called Digital Humanities, which combines traditional humanities (history, philosophy, linguistics, literary criticism, art criticism, archeology, musicology and others) with computer science. The first works that used digitized and born-digital materials date back to the late 1940s. At some universities, such as the National Research University Higher School of Economics (HSE), data analysis has been introduced as a compulsory subject in all faculties.

Data Science in Business

The use of Data Science methods in business is driven by the explosive growth in data volumes characteristic of the second decade of the 21st century, figuratively called the data flood, data surge or data deluge. An information explosion is not a new phenomenon; it has been discussed since about the mid-1950s. Previously, the growth in volumes kept pace with hardware development under Moore's Law, so traditional technologies could cope. But the avalanche set off by numerous Internet services with billions of users, and by the smart sensor revolution, requires entirely different approaches. Database administrators and managers alone were no longer enough. What was needed were specialists, or teams of specialists, able to extract useful knowledge from data and provide it to decision makers. The means used by these specialists are shown in the figure below.

Data Science Methods

The tools that a data scientist uses can be likened to conventional IT technologies in the sense that raw data goes in and processed data and information for decision-making come out. The technological cycle implements the classical cycle of the scientific method. It can be roughly divided into several stages:

  • Formulation of the problem.
  • Collection of raw data.
  • Data wrangling (from wrangler, a ranch hand who tends horses): preparing raw data for subsequent analytics by converting data stored in arbitrary formats into the formats required by analytical applications.
  • Preliminary data analysis, identification of general trends and properties.
  • Selection of tools for deep data analysis (R, Python, SQL, mathematical packages, libraries).
  • Creation of a data model and checking it against actual data.
  • Depending on the task, statistical analysis, machine learning, or recursive analysis.
  • Comparison of results obtained by different methods.
  • Visualization of results.
  • Interpretation of the data and preparation of the resulting information for transfer to decision-makers.
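A few of the stages listed above (wrangling, building a model and checking it against actual data) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the raw rows, the function names and the choice of an ordinary least-squares line as the "model" are all hypothetical, and a real project would normally rely on the tools named in the list.

```python
from statistics import mean

# Raw data: hypothetical (x, y) records kept as loosely formatted text rows.
RAW = ["1, 3", "2, 5", " 3,7", "bad row", "4, 9", "5, 11"]


def wrangle(lines):
    """Wrangling: parse text rows into numeric pairs, skipping malformed ones."""
    rows = []
    for line in lines:
        try:
            x, y = (float(part) for part in line.split(","))
        except ValueError:
            continue  # not exactly two numeric fields
        rows.append((x, y))
    return rows


def fit_line(rows):
    """Model: ordinary least-squares line y = a + b * x."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


def max_abs_residual(rows, a, b):
    """Check: largest gap between the model and the actual data."""
    return max(abs(y - (a + b * x)) for x, y in rows)


if __name__ == "__main__":
    data = wrangle(RAW)
    a, b = fit_line(data)
    print("rows kept:", len(data))
    print("intercept:", a, "slope:", b)
    print("worst residual:", max_abs_residual(data, a, b))
```

A large residual at the checking step would be the signal to go back to an earlier stage, for instance to wrangle differently or to choose another model.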

The process looks roughly as shown in the Data Science Technology Cycle figure.

Data Science Technology Cycle

In practice, the process of extracting knowledge from data is rarely linear. After completing one step, it may be necessary to return to a previous one to refine the methods used, or even the problem statement itself. It also happens that satisfactory results raise follow-up questions, and the cycle has to be run again.

Both in science and in business, knowledge is extracted from data using Data Science methods, so it is quite fair to paraphrase Maxim Gorky's well-known aphorism: "Love data, the source of knowledge."

Data Code of Ethics

Main article: Code of Ethics for the Use of Data

Data management

The topic of Data Governance becomes more relevant every year. The need to organize processes that make the collection, processing, storage and use of data as a valuable asset more efficient is already evident to almost all companies. Much has been said about the benefits that properly structured data management processes bring to companies, and many organizations have already begun such initiatives. At the same time, organizations often make similar mistakes that slow down implementation and reduce the effectiveness of the resulting data management processes. In a material prepared for TAdviser, Svetlana Bova, Chief Data Officer of VTB Bank, explains what these mistakes are, how to avoid them and what questions an organization should answer when implementing Data Governance.

Data Quality Management

Main Article: Data Quality Management

Data quality is defined as a generalized notion of the usefulness of data, formalized as a specific set of criteria. For corporate data in management information systems, six criteria are usually distinguished: demand, accuracy, consistency, timeliness, availability and interpretability. For each criterion, a set of key performance indicators (KPIs) is defined and practices that improve them are worked out.
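As an illustration of how a criterion can be turned into a KPI, the sketch below computes two simple indicators, one for timeliness and one for consistency. Everything in it is an assumption made for the example: the record layout, the 90-day freshness threshold and the rule that a closing date may not precede an opening date. Real KPI sets are defined by each organization.

```python
from datetime import date

# Hypothetical account records; field names and values are illustrative only.
RECORDS = [
    {"id": 1, "opened": date(2021, 1, 10), "closed": None, "updated": date(2022, 5, 30)},
    {"id": 2, "opened": date(2021, 3, 1), "closed": date(2021, 2, 1), "updated": date(2022, 6, 1)},
    {"id": 3, "opened": date(2020, 7, 15), "closed": date(2021, 7, 15), "updated": date(2021, 8, 1)},
    {"id": 4, "opened": date(2022, 2, 2), "closed": None, "updated": date(2022, 6, 5)},
]


def timeliness(records, today, max_age_days=90):
    """Share of records updated within the last max_age_days days."""
    fresh = sum((today - r["updated"]).days <= max_age_days for r in records)
    return fresh / len(records)


def consistency(records):
    """Share of records whose closing date, if any, is not before the opening date."""
    ok = sum(r["closed"] is None or r["closed"] >= r["opened"] for r in records)
    return ok / len(records)


if __name__ == "__main__":
    print("timeliness:", timeliness(RECORDS, today=date(2022, 6, 6)))
    print("consistency:", consistency(RECORDS))
```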

Data visualization

Main article: Data visualization

Data breaches

Main article: Data breaches

Data protection

Main Article: Data Protection

Improving Data Management Models

White Paper: Improving Data Management Models

Data Disclosure, Use and Sale

2022

How to benefit from simple and secure data sharing

With advances in data sharing technologies, the first half of 2022 brought the opportunity to buy and sell potentially valuable information on highly efficient cloud marketplaces. Combining this data with a new array of privacy technologies, such as fully homomorphic encryption (FHE) and differential privacy, makes it possible to share encrypted data and compute over it without decrypting it first. This opens a new possibility: sharing data while maintaining security and privacy. All of this has given rise to promising new trends. Stores of sensitive data that sat idle on servers around the world because of privacy concerns or regulatory requirements are beginning to generate value for enterprises in the form of new business models and capabilities. In 2022, a growing number of organizations are expected to start exploring seamless and secure data sharing, along with opportunities to monetize their own information assets and to pursue business goals using other parties' data.
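The idea of computing over encrypted data can be shown with a toy example. Note the substitution: the sketch below uses the Paillier cryptosystem, which is only additively homomorphic (it supports addition of hidden values, not arbitrary computation), so it is not FHE. The tiny primes and the function names are assumptions chosen for readability and offer no security whatsoever.

```python
import math
import random

# Toy parameters for illustration only; real keys use very large primes.
P, Q = 61, 53
N = P * Q
N2 = N * N
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1), private
MU = pow(LAMBDA, -1, N)  # modular inverse (Python 3.8+), valid for g = N + 1


def encrypt(m: int) -> int:
    """Encrypt 0 <= m < N with the public key N, using generator g = N + 1."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Recover the plaintext with the private values LAMBDA and MU."""
    x = pow(c, LAMBDA, N2)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the hidden plaintexts (modulo N)."""
    return (c1 * c2) % N2


if __name__ == "__main__":
    total = add_encrypted(encrypt(20), encrypt(22))
    print(decrypt(total))
```

The party that calls `add_encrypted` never sees 20 or 22; only the holder of the private key can read the sum.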

The data sharing trend is gaining momentum. A March 2021 survey by Forrester Research[1] found that more than 70% of data and analytics decision-makers are expanding their ability to use external data, and another 17% plan to do so in 2022.

Moreover, the global FHE market alone is growing at an annual rate of 7.5% and is expected to be $437 million by 2028. In 2022, health and finance are the sectors that lead most FHE research.

What explains this growth? Simply put, data gains value when it is shared. Gartner predicts that by 2023, organizations that promote data sharing will outperform their competitors on most business value metrics.[2][3]

Data sharing in action can be illustrated by the following scenarios:

  • Using aggregated data to safely achieve common goals. Even competing organizations will be able to work together toward shared goals, such as a deeper understanding of customers or the detection of fraud patterns across a sector.

  • Increased collaboration in research. Sharing fundamental or early results can accelerate critical research initiatives without compromising a hard-won competitive advantage.

  • Protection of intellectual property. Ultra-sensitive data, such as data for AI training, can be stored in public clouds and still be better protected.

  • Data encryption in motion. In high-frequency trading, robotic surgery and smart factory manufacturing, sensitive data must move quickly between many points. FHE allows users to access critical data quickly without encryption keys.

Such opportunities to monetize data by sharing and combining it can offer considerable competitive advantages to those who move first. Competitors, seeing the "pioneers" use the technology effectively, will then also want to rebuild their business around an organizational structure driven by data and artificial intelligence.

But, again, unlocking this potential requires a different approach to data management, this time adding innovative technologies and techniques that free information assets from traditional privacy and security constraints.

The data sharing trend in 2022 has three main dimensions: new opportunities, ease of use, and privacy.

New Business Models and Opportunities

Shared data can create shared opportunities and new business models. As data sharing trends evolve, Deloitte expects more organizations to engage in "data collaboration" to address common challenges and to pursue mutually beneficial operational and research opportunities.

In addition, the ability to securely communicate with external data management service providers can help organizations streamline data management processes and reduce associated costs.

Consider the following possibilities that data sharing can open up:

  • "Industry vertical marketplaces." Even the fiercest competitors often face common challenges that are best addressed through collaboration. Take suppliers in the food industry: if they all anonymized their confidential sales and supply data and pooled it for analysis, they might be able to unlock the "secret" of supply and demand. Banks in developing regions could pool anonymized credit data to create an interbank credit risk assessment system. Or, one of the biggest opportunities of all: could pharmaceutical researchers and physicians working within a protected ecosystem pool data to understand how to bring vital innovations to market faster?

  • "Partners in the value chain." Many manufacturers and retailers buy consumer data from third-party data brokers, but often there is not enough quality data to make the right decision. What if partners in the value chain, from suppliers to manufacturers to marketers, combined their customer data to create a more nuanced picture of demand?

  • "Let someone else do AI model training." AI models are often considered highly sensitive intellectual property. Because they usually fit on a flash drive, they also pose a high security risk, so many organizations have traditionally kept their modeling in-house. Thanks to encryption technologies, this may soon change. With modeling data protected, chief data officers can safely outsource AI modeling and training to third parties.

  • "Data providers optimize delivery." On data sharing platforms, buying access to real-time market or logistics data will be as easy as pressing a button. Data providers will no longer need to provide APIs or upload files.

Easily retrieve external data at the touch of a button

Cloud-based data sharing platforms help organizations seamlessly share, buy and sell data. These highly virtualized, high-performance data marketplaces are typically built on a data-sharing-as-a-service model in which, for a fee, subscribers can manage, curate and tailor data. They can also protect their data to a certain extent using platform-provided "clean rooms," secure spaces with defined rules where organizations can pool their data assets for analysis. Finally, subscribers can aggregate their data and sell access to it to other subscribers. Data buyers get standard or customized insights into different aspects of markets, products or research.

The fundamental business strategy behind this sharing-as-a-service model has already demonstrated its effectiveness in other important areas of information and content sharing, such as music file sharing and social media. In these, the provider provides an easy-to-use platform for data sharing, and customers provide content.[4]

The data marketplace sector is going through the early phase of a gold rush, with startups such as Databricks, Datarade, Dawex and Snowflake, and hyperscale cloud service providers such as AWS, Azure, Google and Salesforce, looking to stake a claim in this promising market. And it is promising indeed: the combination of data growth, democratization and digital transformation is helping usher in a revolution in which demand for external data is skyrocketing.[5]

Data has ceased to be just a tool for informing decision-makers; it is now a business-critical asset that can be sold, bought, exchanged and shared. The platform that makes such sharing easiest and most efficient could eventually become the standard for data sharing in industry verticals or even in entire markets.

Examples of data sharing in use, and in some areas success stories, are multiplying as more organizations begin to seize opportunities to monetize and expand their data assets. For example:

  • In the early days of the COVID-19 pandemic, fiercely competing global pharmaceutical companies sought ways to share preclinical research data[6] through data sharing platforms[7].

  • COVID-19 vaccine administrators have used centralized government platforms to share daily micro-level vaccination and testing data with public health agencies.

  • Investment managers at a global financial services company collect and analyze data from their back, middle and front offices in real time. As a result, the time needed to start sharing investment data with customers has dropped from "months to minutes."[8]

As of the first half of 2022, there are no reliable forecasts of how particular aspects of the data sharing platform market will develop. While some consolidation and standardization will eventually occur, a market of multiple platforms may also take root. For example, partner ecosystems may form private data marketplaces, or public marketplaces focused on specific needs may emerge organically. Whatever form data marketplaces eventually take, the gold rush is expected to keep gaining momentum, especially as vendors do serious work on security and more organizations sign up to these platforms, increasing the amount of external data available for consumption.

Share data without compromising privacy

Data gains value when it is shared. However, data privacy policies and competitive secrecy requirements have historically made it hard to realize this value. A new class of computational approaches known collectively as "privacy-preserving computing" (or "confidential computing") is poised to free organizations and their data from the shackles of privacy. Approaches such as FHE, differential privacy and functional encryption allow organizations to enjoy the benefits of data sharing without compromising privacy.
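Differential privacy can be illustrated with its simplest building block, the Laplace mechanism applied to a counting query. This is a sketch under stated assumptions: the function names and the sample data are hypothetical, and the noise is generated as the difference of two exponential variates, which follows a Laplace distribution. A counting query changes by at most 1 when one person is added or removed, so the noise scale is 1/epsilon.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of matching values; sensitivity 1, so scale = 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()
    ages = [23, 35, 41, 58, 62, 67, 70, 29, 33, 51]  # made-up data
    # Smaller epsilon means more noise and stronger privacy.
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_count(ages, lambda a: a >= 60, eps, rng))
```

Each released answer is deliberately imprecise, but averages over many individuals remain useful, which is what makes sharing such statistics between organizations plausible.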

Six Ways to Keep Your Data Private

Privacy-preserving techniques can also promote collaboration between competitors. Consider several financial institutions that compete with each other in various areas of financial services. While they compete for customers, collectively they may want to cooperate on common goals such as detecting excessive risk concentration, sophisticated fraud schemes or financial crimes. Another scenario involves organizations that do not compete but complement each other within an industry such as tourism. Here data sharing has lucrative uses: airlines, hotels and car rental agencies can exchange information for joint marketing and discount campaigns. Each participating company would like to know about customer behavior and the activities of the others in order to offer its end consumers greater value and a better customer experience. Yet each of them is obliged to protect customer information. Privacy-preserving computing could be the catalyst for a breakthrough that lets these companies interact and collaborate more closely.

Development prospects

While privacy-preserving computing and advanced data sharing technologies help the organizations at the forefront of this trend get more out of data, they are not a panacea for all data management requirements and tasks. Robust data management, including the application of tags and metadata, is still needed.

Moreover, new tools and approaches will not change a company's existing data culture overnight. Established companies, for example, often have long-standing processes and standards for managing and using data, whereas start-ups and digital natives may take a more relaxed approach. And because of the highly personal relationships that influence decision-making and strategy, family-owned businesses tend to be more reluctant to share data outside the enterprise, even in anonymized form. These and similar problems are expected to be only small obstacles on the way to a fundamentally new era of transformative data sharing.

2021

Shadayev: in 2022, the data marketplace will be launched

In 2022, the Ministry of Digital Development will launch a data marketplace. Data sets will be posted by both state bodies and businesses, said Minister of Digital Development Maksut Shadayev, answering questions from the IT industry at TAdviser SummIT on November 23, 2021.

"The state will ask businesses to provide anonymized data free of charge in certain areas that are critical for the public administration system. And businesses will get access to anonymized state data sets in order to use this data, develop their solutions and build advanced analytics," he said.

The full text and video of Maksut Shadayev's speech at TAdviser SummIT are available separately.

Overview of models for providing access to government data

On September 10, 2021, CPUR experts presented an analytical review titled "Models for institutionalizing researchers' access to state data." It discusses the approaches of different countries to resolving the "usability vs. privacy" dilemma.

The microdata accumulated by government agencies contains sensitive information whose disclosure poses risks to the safety of the data subjects, so such data is usually not published. However, its volume is so large that it is impractical for the state to process it all on its own and keep it entirely inside the public administration system.

By choosing a particular model of data access for researchers, the state decides how to balance the level of detail of the disclosed microdata, and therefore its usefulness and applicability, against confidentiality.

The review focuses mainly on three aspects of data protection: the rules for determining users, projects and access "settings." It thus covers organizational and infrastructural ways of resolving the "usability vs. privacy" dilemma.

The analytical review identifies three basic models for organizing access to microdata used in the world:

  • extending the functions of the state statistical agency without creating intermediary organizations (unmediated access approach);
  • creation or co-founding by the state of a separate research center (research data center approach);
  • partnership with universities or other independent research organizations (research-practice partnership).

For Russia, the minimum (or starting) scenario for organizing researchers' access to microdata could be a model based on creating a dedicated research data center, either directly subordinate to individual authorities or interdepartmental in nature.

The CPUR experts consider the optimal scenario for Russia, to be adopted once sustainable practices of providing access to data have formed, to be a partner network of organizations that carry out the full range of work with state microdata, from processing it to organizing access. In addition, regardless of the model chosen, the range of information published in the public domain, including as open data, should be expanded in parallel.[9]

Global Data Volumes

2020: Data generated reached 64.2 ZB, with less than 2% retained - IDC

In 2020, 64.2 zettabytes of data were created worldwide, but less than 2% of this new data was still being stored in 2021; most of it was created or replicated temporarily for use and then deleted or overwritten by newer data. These are the findings of an IDC study.
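The scale of these figures is easier to grasp with a short calculation. The sketch below only restates the two numbers from the paragraph above; the function name and the decimal definition of the units (1 ZB = 1,000 EB) are the assumptions here. Since the source says "less than 2%," the result is an upper bound, not an estimate.

```python
EB_PER_ZB = 1000  # decimal prefixes: 1 zettabyte = 1,000 exabytes


def retained_upper_bound_zb(created_zb: float, max_share: float) -> float:
    """Upper bound on retained data, given the created volume and a maximum share."""
    return created_zb * max_share


if __name__ == "__main__":
    bound_zb = retained_upper_bound_zb(64.2, 0.02)
    print("retained, at most:", round(bound_zb, 3), "ZB")
    print("retained, at most:", round(bound_zb * EB_PER_ZB), "EB")
    print("not retained, at least:", round(64.2 - bound_zb, 3), "ZB")
```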

According to IDC, the volume of data created, consumed and transmitted in 2020 grew significantly because of a sharp increase in the number of people who had to work and study remotely under COVID-19 restrictions. The pandemic also increased the volume of multimedia content transmitted.

The volume of data created in 2020 reached 64.2 ZB, of which less than 2% was saved

The researchers say that the Internet of Things, not counting data from video surveillance systems, is the fastest growing segment of the data market. It is followed by social media. Data created in the cloud is not growing as fast as data stored in the cloud, but data creation at the edge is growing as fast as in the cloud. IDC also notes that the corporate "DataSphere" is growing twice as fast as the consumer one because of the increasing role of the cloud for storage and consumption.

"The installed base of storage capacity (the StorageSphere) reached 6.7 zettabytes in 2020 and is growing steadily, but at a slower annual rate than the DataSphere, meaning that each year we store a smaller share of the data we create," said the research vice president for DataSphere at IDC.

IDC has identified three reasons why humanity should store more of the data it creates. First, data is critical to any organization's efforts to achieve digital resilience, that is, the ability to adapt quickly to business disruptions by using digital capabilities not only to restore operations but also to capitalize on the changed conditions. Second, digitally transformed companies use data to develop new, innovative solutions for the future of the enterprise. Third, companies need to keep a finger on the pulse of their employees, partners and customers in order to maintain the high level of trust and empathy that ensures customer satisfaction and loyalty. Data is the source for tracking these metrics.[10]

1955: What 5MB of data looked like

This is what 5 MB of data looked like in 1955: 62,500 punched cards weighing 110 kg.
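The figures are easy to check. The sketch below assumes decimal megabytes (5 MB = 5,000,000 bytes) and uses hypothetical function names. The resulting 80 bytes per card agrees with the standard 80-column punched card, if each column is taken to hold one character.

```python
def bytes_per_card(total_bytes: int, cards: int) -> float:
    """Average payload of one punched card."""
    return total_bytes / cards


def grams_per_card(total_kg: float, cards: int) -> float:
    """Average weight of one punched card in grams."""
    return total_kg * 1000 / cards


if __name__ == "__main__":
    print("bytes per card:", bytes_per_card(5_000_000, 62_500))
    print("grams per card:", grams_per_card(110, 62_500))
```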

Why Data Scientist is sexier than a BI analyst

The growing popularity of data science (DS) raises two obvious questions. First, what is the qualitative difference between this recently formed field and business intelligence (BI), which has existed for several decades and is actively used in industry? Second, and perhaps more important from a practical point of view, how do the functions of two related specialists, the data scientist and the BI analyst, differ? The answers to these questions are given in a separate TAdviser material.

The Problem of Digital Hoarding, or Pathological Data Storage

Main article: Digital Hoarding

The ability to analyze large volumes of data, commonly called Big Data, is perceived as an unqualified blessing. But is that really the case? What can runaway data accumulation lead to? Most likely, to what Russian psychologists, speaking of individuals, call pathological accumulation, syllogomania or, figuratively, "Plyushkin syndrome." In English, the compulsion to collect everything indiscriminately is called hoarding (from hoard, "a stockpile"). In the classification of mental illnesses, hoarding is listed as a mental disorder. In the digital age, digital hoarding has joined traditional physical hoarding; both individuals and entire enterprises and organizations can suffer from it.

Data types

Read also

Notes