
Data

Data — a representation of information in a formalized form suitable for transmission, communication, or processing, and admitting repeated interpretation (definition from ISO/IEC 2382-1:1993).


How data became the raw material of the 21st century

The ordinal "fourth" occurs three times in this article: the fourth transformation in the view of data, the fourth paradigm of science, and "the fourth industrial revolution". Where it came from is unclear, but it is natural that all three are united by data, which have become the crucial raw material of the 21st century. It is no accident that data have been called the oil of "the fourth industrial revolution". In this material prepared for TAdviser, journalist Leonid Chernyak describes the fundamental changes in humanity's relationship to data.

How data differ from information

In the mid-2000s it was hard to imagine anything of the kind. Data as a component of computing was not even a topic of discussion. From the moment computers appeared, i.e. from the mid-1940s, attention was focused first on hardware and later on software. Data were treated as something obvious and self-evident. The result was a strange one-sidedness of IT that set it apart from other industries. Any production can be pictured as consisting of two things: a set of technologies, and raw material that passes along a technological chain and turns into a finished product. In IT, the process of converting input data into results stayed, as it were, "behind the scenes".

The re-evaluation, the recognition of the importance of data and of data processing, began around 2010 and took only a few years. By an irony of fate, data now often receive excessive attention. Part of the computing and near-computing community clearly suffers from a condition that could be called data-mania. One of its manifestations is the abuse of the term "Big Data".

Another misunderstanding associated with IT is that the concepts "data" and "information" were long treated as synonyms, something encouraged by the statistical theory of information, which would more accurately be called a theory of data transmission. The name "information theory" was suggested by John von Neumann to Claude Shannon, who was extremely modest in his own claims. In this theory, bits and bytes serve as the measure of the transmitted data, although by definition they apply to data represented in binary form.
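Shannon's measure, stated here only as an illustration (it is a standard formula, not part of the original article), makes the point concrete. The entropy

    H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

counts the average number of bits needed to encode messages drawn with probabilities p_i. It quantifies the volume of data to be transmitted and says nothing about their meaning, which is exactly why bits measure data rather than information.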

It is telling that for many years the author, using a journalist's opportunities, asked interlocutors the same question at every chance: "What do you see as the difference between data and information?", and never (!) received an informative answer. Almost nobody considered that the so-called information technologies deal with data, and not with information at all. Neglect of the nature of data meant that for decades, right up to the 2010s, only engineering methods for transmitting, storing and processing data were developed. Everything one needed to know about data came down to binary or decimal units for measuring quantities of data, and to formats and forms of organization (arrays, bytes, blocks and files).

But the situation around data has changed sharply. One reflection of this is the popular slogan "It's the data, stupid", which captures the growing role of data in modern science, business and other fields of human activity. The shift of emphasis toward data is a consequence of a major cultural transformation.

Four fundamental transitions can be identified, each marked by an increase in the availability of content:

  • The invention of paper: the transition from clay and wax tablets, parchment and birch bark to a practical and inexpensive medium.
  • The invention of printing: the transition from manual copying of manuscripts to machine-replicated editions.
  • The transition from material, most often paper, media to digital ones; the separation of content from its physical carrier.
  • The transformation of content into data that can be processed and analyzed automatically.

The main feature of the last transition is that in the 21st century data have been abstracted from the medium. The tools needed to work with them have been created, opening unlimited opportunities for extracting information from data.

From data to knowledge: the DIKW model

To be fair, the academic community began to think about the value of data as a source of knowledge, and about their place in the system of knowledge accumulation, earlier than business did: roughly from the end of the 1980s. That is when the four-level DIKW model, since become classical, appeared: data, information, knowledge, wisdom.

  • Data are obtained from the outside world as a result of human activity or from various sensors and other devices.
  • Information is created by analyzing the relations and interconnections between fragments of data, answering the questions: Who? What? Where? How many? When? Why?
  • Knowledge is the hardest of these concepts to define; it results from the synthesis of the acquired information and the human mind.
  • Deep understanding (wisdom) forms the basis for decision making.
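The four levels can be pictured as successive transformations, as in the following purely illustrative Python sketch; the sensor, the readings and the 30 °C threshold are all invented for the example:

    # Illustrative DIKW sketch; the sensor, values and threshold are invented.
    readings = [21.4, 21.9, 35.2, 22.1]      # Data: raw values without context

    # Information: data placed in context, answering "what?" and "where?"
    info = {"sensor": "room-7", "max_temp_c": max(readings)}

    # Knowledge: a generalization synthesized from information and experience
    overheating = info["max_temp_c"] > 30.0  # "above 30 °C means overheating"

    # Wisdom: knowledge used as the basis for a decision
    if overheating:
        print(f"Reduce the load on {info['sensor']}")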

For several decades the DIKW model remained the basis for research in the field called Knowledge Management (KM). KM is generally taken to study the processes of creating, preserving, distributing and applying the basic elements of the intellectual capital an organization needs for its work, allowing knowledge assets to be turned into means of improving performance and efficiency.

KM did not manage to achieve notable results or to move beyond general reasoning to working tools. KM was and remains of interest to a very limited community of scientists. The failure of KM has several explanations: the desire to manage knowledge was ahead of its time, and the need for knowledge processing had not yet taken shape. But the main thing is that level D of the DIKW model stayed outside KM's field of view.

It does not follow from KM's failure, however, that there is no such problem as automating the extraction of knowledge from data. As the saying goes, "nature abhors a vacuum", and in the second decade of the 21st century the place of KM was taken by a new direction with the not entirely fortunate name Data Science. The role and place of Data Science in the system of knowledge accumulation are shown in the figure below.

A traditional researcher observes a system directly, while a data scientist uses accumulated data

For millennia, people observed the world around them using one tool or another and recorded knowledge in an accessible form. Today the process has been split into data accumulation and the analysis of those data. A striking example is modern astronomy or geophysics, where observation with data capture and the subsequent analysis of those data are independent tasks.

Data Science

The term Data Science was proposed in the mid-2000s by William Cleveland, a professor at Purdue University and one of the best-known specialists in statistics, data visualization and machine learning. At about the same time the international council CODATA (International Council for Science: Committee on Data for Science and Technology) appeared, along with the CODATA Data Science Journal it publishes. Data Science was then defined as a discipline combining various branches of statistics, data mining, machine learning and the use of databases to solve complex problems related to data processing.

Data Science is an umbrella term. Under the common name Data Science is gathered a set of different methods and technologies for analyzing large volumes of data. In the strict sense of the philosophy of science, as science was defined, for example, by Karl Popper, Data Science cannot be called a science. Nevertheless, specialists in Data Science use what is called the scientific method, so it is quite fair to call them data scientists. The classical cycle of the scientific method is shown in the figure below.

The cycle of the scientific method

The general concept of Data Science splits into two directions. One, less popular, would more accurately be called Data-Intensive Science; the other, widely advertised one, is the application of Data Science to business.

The fourth paradigm of science

Data-Intensive Science can be translated as research that makes substantial use of data. The term denotes a new, data-driven, exploration-centered style of research that relies on data and makes wide use of computer infrastructures and software for managing, analyzing and distributing those data. In 2006 the astronomer and futurologist Alex Szalay and the outstanding computer scientist Jim Gray proposed their own name for it: "The Fourth Paradigm of Science".

They divided humanity's scientific past into three periods in the use of data. In antiquity, science was limited to describing observed phenomena and to logical conclusions drawn from observations. By the 17th century there were more data, and people began to build theories, using one analytical model or another as proof. In the 20th century computers opened the way to methods of computational modeling. Finally, in the 21st century, scientific methods based on data analysis (eScience) began to develop; here, synthesizing theories and statistical and other techniques for extracting useful information are applied to enormous amounts of data.

Szalay and Gray wrote: "In the future, working with large volumes of data will mean moving computations to the data, rather than loading the data into the computer for post-processing." The future arrived much earlier: in 2013 the same Szalay wrote about the era of Data-Intensive Science as an accomplished fact.
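A minimal sketch of that principle, with SQLite standing in for a real large store (the table and the numbers are invented): the aggregate is computed where the data live, and only the result crosses over to the application.

    import sqlite3

    # A stand-in for a large remote dataset; in reality it would not fit in memory.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (value REAL)")
    conn.executemany("INSERT INTO measurements VALUES (?)",
                     [(float(v),) for v in range(1000)])

    # Moving computation to the data: the store computes the aggregate,
    # and a single number is transferred to the application.
    (avg,) = conn.execute("SELECT AVG(value) FROM measurements").fetchone()

    # The approach Szalay and Gray warn against: load everything, post-process locally.
    rows = conn.execute("SELECT value FROM measurements").fetchall()
    avg_again = sum(v for (v,) in rows) / len(rows)

    print(avg, avg_again)  # same answer, very different data movement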

By 2017, eScience methods had found application not only in such data-intensive fields as astronomy, biology and physics. They also found their way into the humanities, considerably broadening the field known as the Digital Humanities. The first works using digitized and born-digital materials date back to the late 1940s. They combine the traditional humanities (history, philosophy, linguistics, literary studies, art history, archeology, musicology and others) with computer science. At some universities, such as the Higher School of Economics National Research University, data analysis has been introduced as a compulsory subject at all faculties.

Data Science in business

The application of Data Science methods in business is driven by the explosive growth in data volumes characteristic of the second decade of the 21st century. It is figuratively called a data flood, a data surge or a data deluge. The information explosion is not a new phenomenon; it has been discussed since roughly the mid-1950s. Previously, the growth of volumes stayed in step with the development described by Moore's law, and traditional technologies could cope with it. But the avalanche unleashed by the appearance of numerous Internet services with billions of users, and by the smart sensor revolution, demands entirely different approaches. Database administrators and managers alone turned out to be insufficient. What was needed were specialists, or groups of specialists, able to extract useful knowledge from data and deliver it to those who make decisions. The tools these specialists use are shown in the figure below.

Data Science methods

The tools used by data scientists let us liken IT to any ordinary production technology: raw data come in, and data and information processed for decision making come out. This production cycle implements the classical cycle of the scientific method. It can be divided, roughly, into the following stages (a minimal end-to-end sketch in code follows the list):

  • Formulating the problem.
  • Collecting raw data.
  • Data wrangling (from "wrangler", a worker who breaks in horses): preparing raw data for subsequent analytics, converting raw data stored in arbitrary formats into the forms required by analytical applications.
  • Preliminary data analysis, identifying general trends and properties.
  • Choosing tools for deep data analysis (R, Python, SQL, mathematical packages, libraries).
  • Building a data model and checking it against real data.
  • Depending on the task, performing statistical analysis, applying machine learning or regression analysis.
  • Comparing the results obtained by different methods.
  • Visualizing the results.
  • Interpreting the data and preparing the extracted information for hand-off to decision makers.
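A minimal end-to-end sketch of these stages in Python; the dataset, the column names and the model choice are all invented for illustration, with pandas and scikit-learn standing in for the tools listed above.

    # Toy Data Science cycle; the data and names are invented for illustration.
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Collecting raw data (hard-coded here in place of a real source)
    raw = pd.DataFrame({
        "ad_budget": ["100", "200", "300", "400", None],
        "sales":     [11.0, 19.5, 31.2, 39.8, 25.0],
    })

    # Data wrangling: drop unusable records, fix types
    df = raw.dropna().copy()
    df["ad_budget"] = df["ad_budget"].astype(float)

    # Preliminary analysis: overall trends and properties
    print(df.describe())

    # Modeling: fit a regression and check it against the real data
    model = LinearRegression().fit(df[["ad_budget"]], df["sales"])
    print("R^2 =", model.score(df[["ad_budget"]], df["sales"]))

    # Interpretation for decision makers
    print(f"Each extra unit of budget adds ~{model.coef_[0]:.2f} units of sales")

In practice each print would be a report or a visualization, and an unsatisfactory check would send the analyst back up the list, possibly as far as the problem statement.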

This process may look roughly as shown in the figure "The production cycle of Data Science".

The production cycle of Data Science

In practice, the process of extracting knowledge from data is rarely linear. After completing one step or another, it may prove necessary to return to a previous one to refine the methods used, right back to the problem statement. Sometimes, even after satisfactory results have been obtained, clarifying questions arise and the cycle has to be run again.

Both in science and in business, Data Science methods extract knowledge from data, so it is quite fair to paraphrase Maxim Gorky's well-known aphorism: "Love data, the source of knowledge".

Data quality management

Data quality is defined as a generalized notion of the usefulness of data, formalized in a certain set of criteria. For corporate information management systems it is customary to distinguish six criteria: relevance, accuracy, consistency, timeliness, availability and interpretability. For each criterion a set of key performance indicators (KPIs) is defined, and practices for improving them are studied [1].
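A toy sketch of how KPIs for some of these criteria might be computed; the table, the validity check and the thresholds are invented for the example.

    # Toy data-quality KPIs; the dataset, checks and thresholds are invented.
    import pandas as pd

    records = pd.DataFrame({
        "customer_id":      [1, 2, 3, 4],
        "email":            ["a@x.com", None, "c@x.com", "not-an-email"],
        "updated_days_ago": [1, 400, 12, 3],
    })

    # Accuracy: share of present values passing a (crude) validity check
    accuracy = records["email"].dropna().str.contains("@").mean()

    # Availability: share of values that are present at all
    availability = records["email"].notna().mean()

    # Timeliness: share of records updated within the last year
    timeliness = (records["updated_days_ago"] < 365).mean()

    print(f"accuracy={accuracy:.0%} availability={availability:.0%} "
          f"timeliness={timeliness:.0%}")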

Why a data scientist is sexier than a BI analyst

With the growing popularity of data science (DS), two obvious questions arise. The first: what is the qualitative difference between this recently formed scientific direction and business intelligence (BI), which has existed for several decades and is actively used in industry? The second, perhaps more important from a practical point of view: how do the functions of the two related specialists, the data scientist and the BI analyst, differ? Answers to these questions are given in a separate TAdviser material.

The problem of digital hoarding, or pathological data accumulation

The ability to analyze big data, popularly called Big Data, is perceived as an unambiguous good. But is that really the case? What can unrestrained data accumulation lead to? Most likely, to what Russian psychologists call, when speaking of a person, pathological hoarding, syllogomania, or, figuratively, "Plyushkin's syndrome". In English, the vicious passion to collect everything is called hoarding (from "hoard", a stockpile). In classifications of mental illness, hoarding is ranked among the mental disorders. In the digital era, digital hoarding has been added to traditional material hoarding, and both individuals and whole enterprises and organizations can suffer from it [2].
