
Big Data

The Big Data category covers information that can no longer be processed by traditional means: structured data, media, and arbitrary objects. Some experts believe that working with such data requires new massively parallel solutions, which have come to replace traditional monolithic systems.


What is Big Data?

Simplest definition

From the name, one might assume that the term "big data" refers simply to the management and analysis of large amounts of data. According to the McKinsey Institute report "Big data: The next frontier for innovation, competition and productivity," the term "big data" refers to datasets whose size exceeds the ability of typical databases (DBs) to capture, store, manage, and analyze information. And the world's data repositories certainly continue to grow. A mid-2011 report by the analytical firm IDC, "Digital Universe Study," sponsored by EMC, predicted that the total global volume of data created and replicated in 2011 would be about 1.8 zettabytes (1.8 trillion gigabytes) - roughly 9 times what was created in 2006.

Figure: data growth according to IDC; Gartner's Nexus of Forces
Figure: traditional databases vs. big data

A more complex definition

Yet "big data" implies more than just analyzing vast amounts of information. The problem is not that organizations create huge volumes of data, but that most of it comes in formats that fit poorly into the traditional structured database model: web logs, video recordings, text documents, machine code or, for example, geospatial data. All of this is stored in many different repositories, sometimes even outside the organization. As a result, corporations may have access to a huge amount of their own data yet lack the tools needed to establish relationships within that data and draw meaningful conclusions from it. Add the fact that data is now updated more and more frequently, and you get a situation in which traditional methods of information analysis cannot keep up with huge volumes of constantly refreshed data - which ultimately opens the way for big data technologies.

Best definition

In essence, the concept of big data involves working with information of huge volume and diverse composition, frequently updated and scattered across different sources, with the aim of increasing efficiency, creating new products and improving competitiveness. The consulting company Forrester puts it briefly: "Big data combines techniques and technologies that extract meaning from data at the extreme limit of practicality."

How big is the difference between business intelligence and big data?

Craig Bati, chief marketing and technology officer at Fujitsu Australia, has pointed out that business analysis is a descriptive process of examining the results a business achieved over a certain period, whereas the processing speed of big data makes the analysis predictive, capable of offering the business recommendations for the future. Big data technologies also allow more data types to be analyzed than business intelligence tools can, which makes it possible to look beyond structured storage alone.

Matt Slocum of O'Reilly Radar believes that while big data and business analytics have the same goal (finding answers to a question), they differ from each other in three ways.

  • Big data is designed to handle larger amounts of information than business analytics, and that of course fits the traditional definition of big data.
  • Big data is designed to handle faster and more changing information, which means deep exploration and interactivity. In some cases, results are generated faster than loading a web page.
  • Big data is designed to handle unstructured data, whose uses we are only beginning to explore once we have learned to collect and store it, and we need algorithms and interactive tools to make it easier to find the trends contained in these arrays.

According to Oracle's white paper "Oracle Information Architecture: An Architect's Guide to Big Data," when working with big data we approach information differently than when conducting business analysis.

Working with big data is not like the usual business intelligence process, where simply adding up known values produces a result: for example, adding up data on paid invoices yields the year's sales volume. When working with big data, the result is obtained in the process of refining the data through sequential modeling: first a hypothesis is put forward, then a statistical, visual or semantic model is built, the validity of the hypothesis is checked against it, and then the next hypothesis is put forward. This process requires the researcher either to interpret visual results, to compose interactive queries based on domain knowledge, or to develop adaptive "machine learning" algorithms capable of obtaining the desired result. Moreover, the lifetime of such an algorithm can be quite short.
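
As a rough illustration of this hypothesis-model-check loop, the sketch below iterates over candidate feature sets ("hypotheses"), fits a model for each, and keeps the first one whose cross-validated score clears a threshold. The column names, synthetic data and acceptance threshold are illustrative assumptions, not part of the Oracle paper.

```python
# A minimal sketch of the iterative "hypothesis -> model -> check" loop.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# tiny synthetic stand-in for a real customer dataset (hypothetical columns)
df = pd.DataFrame({
    "visits_last_30d":  [1, 12, 2, 15, 3, 10, 0, 14, 4, 11],
    "avg_basket_value": [10, 60, 15, 70, 20, 55, 5, 65, 25, 50],
    "churned":          [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# each "hypothesis" is a candidate set of explanatory features
hypotheses = [["visits_last_30d"],
              ["visits_last_30d", "avg_basket_value"]]

for features in hypotheses:
    model = LogisticRegression(max_iter=1000)
    score = cross_val_score(model, df[features], df["churned"], cv=5).mean()
    print(f"{features}: mean CV accuracy = {score:.2f}")
    if score > 0.8:  # arbitrary acceptance threshold
        print("Hypothesis retained; refine it further or stop.")
        break
```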

Big Data ≠ Data Science



Big Data is:

  • ETL/ELT
  • Storage technologies for large volumes of structured and unstructured data
  • Technologies for processing such data
  • Data Quality Management
  • Technologies for providing data to the consumer

Data Science is:

Big Data Analysis Techniques

There are many different methods for analyzing data sets, based on tools borrowed from statistics and computer science (for example, machine learning). The list does not claim to be complete, but it reflects the most popular approaches across various industries. It should be understood that researchers continue to work on creating new methods and improving existing ones. In addition, some of these techniques do not apply exclusively to big data and can be used successfully on smaller arrays (for example, A/B testing or regression analysis). Of course, the larger and more diverse the array being analyzed, the more accurate and relevant the results obtained at the output.

A/B testing. A technique in which the control sample is alternately compared with others. Thus, it is possible to identify the optimal combination of indicators to achieve, for example, the best response of consumers to a marketing offer. Big data allows you to conduct a huge number of iterations and thus get a statistically reliable result.
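
As a minimal sketch of such a comparison, the example below checks whether the response rates of a control and a test group differ significantly; the counts are made-up illustration values.

```python
# A minimal A/B-testing sketch: comparing conversion in a control and a test group.
from scipy.stats import chi2_contingency

# rows: variant A (control), variant B; columns: converted, did not convert
table = [[1200, 18800],
         [1350, 18650]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in response is statistically significant.")
else:
    print("No significant difference detected; keep iterating.")
```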

Association rule learning. A set of techniques for identifying relationships, i.e. associative rules, between variables in large data arrays. Used in data mining.
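
A toy sketch of the idea, assuming a tiny hand-written transaction list: it counts support and confidence for a single candidate rule ("bread -> butter") rather than running a full apriori implementation.

```python
# Computing support and confidence for one association rule from toy transactions.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "beer"},
    {"bread", "milk"},
]

n = len(transactions)
support = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            support[frozenset(itemset)] += 1

antecedent, itemset = frozenset({"bread"}), frozenset({"bread", "butter"})
confidence = support[itemset] / support[antecedent]
print(f"support(bread & butter) = {support[itemset] / n:.2f}")
print(f"confidence(bread -> butter) = {confidence:.2f}")
```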

Classification. A set of techniques that allows you to predict consumer behavior in a certain market segment (purchasing decisions, outflow, consumption, etc.). Used in data mining.
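
A minimal classification sketch on hypothetical churn data: a decision tree is trained on two made-up features and then scores a new customer.

```python
# Predicting customer churn from two illustrative features with a decision tree.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# [monthly_spend, months_as_customer] and churn labels - invented data
X = [[20, 3], [55, 24], [15, 2], [70, 36], [25, 5], [60, 30]]
y = [1, 0, 1, 0, 1, 0]  # 1 = churned, 0 = stayed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy on the held-out set:", clf.score(X_test, y_test))
print("prediction for a new customer:", clf.predict([[30, 4]]))
```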

Cluster analysis. Statistical method of classifying objects by group by identifying previously unknown common features. Used in data mining.
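
A minimal clustering sketch on made-up customer attributes: k-means groups the objects without any labels being known in advance.

```python
# Grouping customers by two illustrative attributes without predefined classes.
from sklearn.cluster import KMeans

# [average purchase, visits per month] - invented values
customers = [[10, 1], [12, 2], [80, 15], [85, 14], [40, 7], [42, 8]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print("cluster labels:", kmeans.labels_)
print("cluster centers:", kmeans.cluster_centers_)
```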

Crowdsourcing. A method for collecting data from a large number of sources.

Data fusion and data integration. A set of techniques that allows you to analyze comments from social media users and compare with sales results in real time.

Data mining. A set of methods that allows you to identify the most susceptible categories of consumers for the promoted product or service, identify the features of the most successful employees, and predict the behavioral model of consumers.

Ensemble learning. This method involves many predictive models, thereby improving the quality of the forecasts made.

Genetic algorithms. In this technique, possible solutions are represented as 'chromosomes' that can combine and mutate. As in natural evolution, the fittest individuals survive.
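
A toy genetic algorithm, assuming a deliberately simple fitness function (the number of ones in a bit string); the population size, mutation rate and number of generations are arbitrary choices for illustration.

```python
# Evolving bit strings ("chromosomes") toward a simple fitness target.
import random

def fitness(chromosome):
    return sum(chromosome)  # count of ones

def mutate(chromosome, rate=0.05):
    return [1 - g if random.random() < rate else g for g in chromosome]

def crossover(a, b):
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:]

population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]  # "the fittest survive"
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(len(population) - len(parents))]
    population = parents + offspring

print("best fitness after evolution:", fitness(max(population, key=fitness)))
```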

Machine learning. A field of computer science (historically, the name 'artificial intelligence' has been attached to it) that pursues the goal of creating self-learning algorithms based on the analysis of empirical data.

Natural language processing (NLP). A set of techniques borrowed from computer science and linguistics for recognizing the natural language of a person.

Network analysis. A set of methods for analyzing links between nodes in networks. With regard to social networks, it allows you to analyze the relationships between individual users, companies, communities, etc.
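
A minimal network-analysis sketch using the networkx library (assumed available) on a tiny invented social graph: it ranks users by degree centrality.

```python
# Measuring how central each user is in a small, made-up social graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("carol", "dave"), ("dave", "erin"),
])

centrality = nx.degree_centrality(G)
for user, score in sorted(centrality.items(), key=lambda x: -x[1]):
    print(f"{user}: {score:.2f}")
```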

Optimization. A set of numerical methods for redesigning complex systems and processes to improve one or more metrics. Helps in making strategic decisions, for example, the composition of the product line brought to the market, conducting investment analysis, etc.

Pattern recognition. A set of techniques with elements of self-learning for predicting consumer behavior models.

Predictive modeling. A set of techniques for building, in advance, a mathematical model of a likely scenario. An example is analyzing a CRM system's database for conditions that might push subscribers to switch providers.

Regression. A set of statistical methods for identifying a pattern between changes in a dependent variable and one or more independent variables. Often used for forecasting and prediction. Used in data mining.
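
A minimal regression sketch on invented numbers: a linear model is fitted to advertising spend versus sales and then used for a simple forecast.

```python
# Fitting hypothetical advertising spend against sales and forecasting.
from sklearn.linear_model import LinearRegression

ad_spend = [[10], [20], [30], [40], [50]]  # thousands, illustrative
sales = [120, 190, 260, 340, 400]

model = LinearRegression().fit(ad_spend, sales)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("forecast for spend of 60:", model.predict([[60]])[0])
```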

Sentiment analysis. Techniques for assessing consumer sentiment are based on technologies for recognizing human natural language. They make it possible to isolate, from the general information flow, messages related to a subject of interest (for example, a consumer product), and then to evaluate the polarity of the judgment (positive or negative), the degree of emotionality, and so on.
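
A toy lexicon-based polarity sketch; the word lists are tiny illustrative stand-ins for a real sentiment lexicon, and production systems rely on far richer language models.

```python
# Classifying short reviews as positive/negative with a tiny word lexicon.
POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "slow", "broken", "hate"}

def polarity(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = ["Great phone, love the camera",
           "Delivery was slow and the box arrived broken"]
for r in reviews:
    print(polarity(r), "-", r)
```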

Signal processing. A set of techniques borrowed from radio engineering, which pursues the goal of signal recognition against the background of noise and its further analysis.

Spatial analysis. A set of methods, partly borrowed from statistics, for analyzing spatial data: terrain topology, geographic coordinates, the geometry of objects. The source of big data in this case is often geographic information systems (GIS).

Statistics. The science of data collection, organization and interpretation, including the development of questionnaires and the conduct of experiments. Statistical methods are often used to evaluate judgments about the relationships between certain events.

Supervised learning. A set of techniques based on machine learning technologies that allow you to identify functional relationships in the analyzed data arrays.

Simulation. Modeling the behavior of complex systems, often used for forecasting, prediction and working through different scenarios in planning.

Time series analysis. A set of methods, borrowed from statistics and digital signal processing, for analyzing sequences of data points collected over time. Obvious applications include tracking the securities market or patient morbidity.
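
A minimal time-series sketch on a synthetic daily series: a rolling weekly mean smooths out noise to expose the underlying trend.

```python
# Smoothing a noisy synthetic daily series with a rolling mean.
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=90, freq="D")
values = np.linspace(100, 130, 90) + np.random.normal(0, 5, 90)  # trend + noise
series = pd.Series(values, index=dates)

weekly_trend = series.rolling(window=7).mean()
print(weekly_trend.tail())
```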

Unsupervised learning. A set of techniques based on machine learning technologies that allow you to identify hidden functional relationships in the analyzed data arrays. It shares features with Cluster Analysis.

Visualization. Methods of graphically presenting the results of big data analysis in the form of diagrams or animated images, used to simplify interpretation and make the results easier to understand.

Main article: Data Visualization

A visual representation of the results of big data analysis is of fundamental importance for their interpretation. It is no secret that human perception is limited, and scientists continue to conduct research in the field of improving modern methods of presenting data in the form of images, diagrams or animations.
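
As a small illustration, the sketch below renders hypothetical segment sizes from an analysis as a bar chart with matplotlib; the segment names and counts are invented.

```python
# Plotting illustrative analysis results as a simple bar chart.
import matplotlib.pyplot as plt

segments = ["price-sensitive", "loyal", "new", "at-risk"]
sizes = [420_000, 310_000, 150_000, 90_000]

plt.bar(segments, sizes)
plt.ylabel("customers")
plt.title("Customer segments identified in the analysis")
plt.tight_layout()
plt.show()
```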

Analytical Toolkit

As of 2011, some of the approaches listed in the previous subsection, or certain combinations of them, make it possible to implement analytical engines for working with big data in practice. Among free or relatively inexpensive open systems for Big Data analysis, the following can be recommended:[1]

Of particular interest in this list is Apache Hadoop, open source software that has proven itself as a data analyzer for most stock trackers over the past five years[2]. As soon as Yahoo opened the Hadoop code to the open source community, a whole line of Hadoop-based products promptly emerged in the IT industry. Almost all modern big data analysis tools provide Hadoop integration. Their developers include both startups and well-known global companies.

Big Data Management Markets

Big Data Platform (BDP) as a means of combating digital hoarding

The ability to analyze big data, colloquially called Big Data, is perceived as an unequivocal blessing. But is that really the case? What can unrestrained data accumulation lead to? Most likely, to what domestic psychologists, in relation to a person, call pathological accumulation, syllogomania or, figuratively, "Plyushkin syndrome." In English, the vicious passion to collect everything indiscriminately is called hoarding (from the English hoard - "stockpile"). According to the classification of mental diseases, hoarding is classified as a mental disorder. In the digital age, digital hoarding is added to traditional physical hoarding; both individuals and entire enterprises and organizations can suffer from it (more).

World and Russian market

Big data Landscape - Major Suppliers

Almost all leading IT companies showed interest in tools for collecting, processing, managing and analyzing big data, which is quite natural. Firstly, they directly face this phenomenon in their own business, and secondly, big data opens up excellent opportunities for mastering new niches of the market and attracting new customers.



Many startups have appeared on the market that make business on processing huge amounts of data. Some of them use the ready-made cloud infrastructure provided by large players like Amazon.

Big Data Theory and Practice in Industries

Main article: Big Data Theory and Practice in Industries

How to use analytics to develop quality IT services

White Paper - Using Analytics to Develop IT Services

History of development

2024: Big Data and Instant Solutions: Daria Kalishina's Tools

Daria Kalishina, a business consultant with international experience in business analysis, told how using big data analytics helps companies make strategic decisions. Read more here.

2017: TmaxSoft forecast: Next Big Data "wave" will require DBMS upgrade

With data generated by Internet-connected devices, sensors and other technologies on the rise, big data revenues will grow from $130 billion in 2016 to more than $203 billion by 2020, according to an IDC report.[3] This, according to experts at TmaxSoft, will require companies to upgrade their IT infrastructures.

Enterprises know that the huge amounts of data they have accumulated contain important information about their business and customers. If the company can successfully apply this information, then it will have a significant advantage over competitors, and it will be able to offer better products and services than them. However, many organizations are still unable to use big data effectively because their legacy IT infrastructure is unable to provide the storage capacity, data exchange processes, utilities, and applications needed to process and analyze large amounts of unstructured data to extract valuable information from them, TmaxSoft said.

In addition, the increased processing power required to analyze ever-increasing amounts of data may require significant investment in the organization's legacy IT infrastructure, as well as additional maintenance resources that could be used to develop new applications and services.

According to Andrei Reva, executive director of TmaxSoft Russia, these factors will mean that organizations which continue to rely on legacy infrastructure will in the future either be forced to pay much more for the transition to current technologies or fail to gain any benefit from the big data revolution.

"The big data phenomenon has led many enterprises to realize the need to collect, analyze and store structured and unstructured data. However, implementing these processes requires an action plan and the right optimization tools. And many companies are genuinely unable to get a tangible effect from big data because they use legacy DBMSs that lack the necessary functionality and scalability; as a result, the big data revolution does not help their business in any way," Andrei Reva explained his forecast.

According to the TmaxSoft representative, enterprises need a strategy that takes into account, among other things, data sources for extraction, data lifecycle, compatibility between relational DBMSs and storage scalability.

2016

EMC Forecast: BigData and Real-Time Analytics Will Merge

In 2016, we will see a new chapter in the history of big data analytics as a two-tier processing model takes shape. The first tier will be "traditional" BigData analytics, in which large volumes of data are not analyzed in real time. The new, second tier will provide the ability to analyze relatively large volumes of data in real time, mainly through in-memory analytics technologies. In this new phase of BigData development, technologies such as DSSD, Apache Spark and GemFire will be as important as Hadoop. The second tier will offer both new and familiar ways of using "data lakes" - "analytics on the fly" intended to influence events at the moment they occur. This opens up business opportunities on a scale no one has seen before.

But for in-memory analytics to become a reality, two things must happen. First, supporting technologies must be developed that provide enough memory to accommodate truly large data sets. It is also necessary to think about how to move data efficiently between large object stores and the systems that perform in-memory analysis. These two elements operate in fundamentally different modes, and IT groups will need to create conditions under which data can move back and forth at the required speed and transparently to users. Work is already underway on new object stores, dedicated rack-mounted flash arrays, and technologies that can combine them into a single system. Open source initiatives will play an important role in answering this challenge.

Second, large-scale in-memory computing environments require data stability and dynamism. The problem is that by making data in memory persistent, we also make any of its defects persistent. As a result, in 2016 we will see the emergence of storage systems for in-memory environments. They will provide deduplication, snapshots, tiered storage, caching, replication, and the ability to determine the last state in which the data was correct and the system worked properly. These capabilities will be critical as we move to real-time analytics, when safer in-memory technologies become commercially available in 2016.

2015

Gartner excluded "Big Data" from popular trends

On October 6, 2015, it became known that big data had been excluded from Gartner's 2015 Technology Maturity Cycle (Hype Cycle) report. The researchers explained this by the blurring of the term: the technologies covered by the concept of "big data" have become an everyday reality of business[4].

Gartner's Hype Cycle for Emerging Technologies report surprised the industry by omitting big data collection and processing technologies. The company's analysts explained their decision by the fact that the concept of "big data" covers a large number of technologies that are actively used in enterprises, partly belong to other popular areas and trends, and have become an everyday working tool.

Gartner chart "Hype Cycle for Emerging Technologies 2015"

"Initially, the concept of" big data "was deciphered through a definition of three" V ": volume, velocity, variety. By this term was meant a group of technologies for storing, processing and analyzing large data, with a variable structure and high update rate. But reality has shown that the benefits in business projects are carried out according to the same principles as before. And the described technological solutions themselves did not create any new value, only speeding up the processing of a large amount of data. Expectations were very high, and the list of big data technologies was growing intensively. Obviously, as a result of this, the boundaries of the concept blurred to the limit, "said Sviatoslav Stumpf the chief expert of the product marketing group Peter-Service."

Dmitry Shepelyavy, Deputy General Director of SAP CIS, believes that the topic of big data has not disappeared but has been transformed into many different scenarios:

"Examples here are state repairs, precision farming, anti-fraud systems, systems in medicine that allow diagnosing and treating patients at a qualitatively new level. And real-time logistics and transportation planning, advanced business intelligence to support and maintain the core functions of companies. One of the main trends now is the Internet of Things, which allows you to connect machines (machine-to-machine). The electronic sensors you install produce millions of transactions per second, and you need a reliable solution that can transform, store, and work with them in real time. "

In May 2015, Andrew White, vice president of research at Gartner, reflected on his blog:

"[[Internet of Things (IoT)|Internet of Things (Internet of Things, IoT)]] will eclipse big data as too focused technology. It may give rise to several more effective solutions and tools, but the Internet of Things will become the platform of the future, which in the long term will increase our productivity. "

Similar ideas had been voiced earlier, following the 2014 Gartner report, by Forbes columnist Gil Press.

According to Dmitry Shepelyavy, an era has come when it is important not only to be able to accumulate information, but to extract business benefits from it. The first to come to this conclusion were industries that directly work with the consumer: telecommunications and banking, retail. Now interaction processes are reaching a new level, allowing you to establish communication between different devices using augmented reality tools and opening up new opportunities for optimizing business processes of companies.

"The concept of" big data "has lost interest for real business, on the Gartner diagram other technologies with a clearer and more understandable sound for business took its place," Svyatoslav Stumpf emphasized.

This is, first of all, machine learning - a means of finding rules and connections in very large volumes of information. Such technologies allow not only testing hypotheses but also searching for previously unknown factors of influence. Also in this group are the segment of storage and parallel-access solutions (NoSQL databases), data marshalling, and advanced analytics with self-service delivery. In addition, according to the expert, data mining tools (Business Intelligence and Data Mining) remain important and are reaching a new technological level.

In the understanding of Yandex, according to the company's press service, big data has not disappeared or transformed anywhere. To process large amounts of data, the company uses the same technologies and algorithms that it uses in Internet search, the Yandex.Traffic service, in the machine translator, in the recommendation platform, in advertising. Algorithms are based on the company's ability to accumulate, store and process large amounts of data and make them useful to the business. The scope of Yandex Data Factory is not limited - the main thing is that there is data for analysis. The focus of the company as of October 6, 2015:

More data is no better

Big data and price discrimination of customers

Below are selected excerpts from an article by Morgan Kennedy published on February 6, 2015 on InsidePrivacy on the issue of privacy protection[5][6].

On February 5, 2015, the White House released a report discussing how companies use "big data" to set different prices for different buyers - a practice known as "price discrimination" or "personalized pricing." The report describes the benefits of "big data" for both sellers and buyers, and its authors conclude that many of the problematic issues raised by the rise of big data and differential pricing can be addressed under existing anti-discrimination and consumer rights laws.

The report notes that at this time there are only isolated facts showing how companies use big data in the context of individualized marketing and differentiated pricing. This information shows that sellers use pricing methods that can be divided into three categories:

  • study of the demand curve;
  • Steering and differentiated pricing based on demographic data; and
  • targeted behavioral marketing (behavioral targeting) and individualized pricing.

Demand curve study: To clarify demand and study consumer behavior, marketers often conduct experiments in which customers are randomly assigned to one of two possible price categories. "Technically, these experiments are a form of differentiated pricing, since they result in different prices for different customers, even if they are 'non-discriminatory' in the sense that all customers have the same chance of 'landing' on the higher price."

Steering: This is the practice of presenting products to consumers based on their belonging to a specific demographic group. Thus, a computer company's website may offer the same laptop to different types of buyers at different prices, set on the basis of information they have reported about themselves (for example, whether the user represents a government agency, a scientific or commercial institution, or is a private individual) or of their geographical location (for example, determined from the computer's IP address).

Targeted behavioral marketing and individualized pricing: In these cases, personal customer data is used for targeted advertising and individualized pricing of certain products. For example, online advertisers use data collected by advertising networks and through third-party cookies about user activity on the Internet in order to target their advertising materials. This approach, on the one hand, allows consumers to receive advertising of goods and services of interest to them. However, it may cause concern to those consumers who do not want certain types of their personal data (such as information about visiting sites related to medical and financial issues) to be collected without their consent.

Although targeted behavioral marketing is widespread, there is relatively little evidence of individualized pricing in the online environment. The report suggests that this may be due to the fact that appropriate methods are still being developed, or the fact that companies are in no hurry to use individual pricing (or prefer to keep quiet about it) - perhaps fearing a negative reaction from consumers.

The authors of the report believe that "for the individual consumer, the use of big data is undoubtedly associated with both potential returns and risks." While acknowledging that issues of transparency and discrimination emerge when big data is used, the report argues at the same time that existing anti-discrimination and consumer protection laws are sufficient to address them. However, the report also highlights the need for "ongoing monitoring" in cases where companies use sensitive information in an opaque manner or in ways that are not covered by the existing regulatory framework.

This report is a continuation of the White House's efforts to study the use of "big data" and discriminatory pricing on the Internet, and the corresponding consequences for American consumers. Earlier it was reported[7] that the White House working group on big data published its report on this issue in May 2014. The Federal Trade Commission ( FTC) also addressed these issues during its September 2014 seminar on discrimination over the use of big data[8]

2014

Gartner dispels 'Big Data' myths

In a fall 2014 research note, Gartner listed a number of myths about Big Data common among IT leaders and cited their refutations.

  • Everyone is implementing Big Data systems faster than us

Interest in Big Data technologies is at a record high: in 73% of organizations surveyed by Gartner analysts this year, relevant projects are already underway or planned. But most of these initiatives are still at the earliest stages, and only 13% of respondents have already implemented such solutions. The hardest part is determining how to extract income from Big Data and deciding where to start. Many organizations get stuck at the pilot stage because they cannot tie the new technology to specific business processes.

  • We have so much data that there is no need to worry about minor errors in them

Some IT executives believe that small flaws in the data do not affect the overall results of the analysis of huge volumes. When there is a lot of data, each error individually really affects the result less, analysts say, but there are more errors themselves. In addition, most of the data analyzed are external, unknown structure or origin, so the likelihood of errors is growing. Thus, in the world of Big Data, quality is actually much more important.

  • Big Data Technology Will Eliminate the Need for Data Integration

Big Data promises the ability to process data in its original format, with the schema formed automatically as the data is read. It is believed that this will allow information from the same sources to be analyzed using several data models. Many also believe that this will enable end users to interpret any dataset themselves at their own discretion. In reality, most users often need the traditional approach with a ready-made schema, where the data is formatted appropriately and there are agreements on the level of integrity of the information and how it should relate to the use case.

  • There is no point in using data stores for complex analytics

Many administrators of information management systems believe that there is no point in wasting time creating a data warehouse, taking into account that complex analytical systems use new types of data. In fact, many complex analytics systems use information from the data warehouse. In other cases, new data types need to be further prepared for analysis in Big Data processing systems; You have to make decisions about the suitability of data, the principles of aggregation and the required level of quality - this preparation can occur outside the storage.

  • Data warehouses will be replaced by data lakes

In reality, suppliers mislead customers by positioning data lakes as a replacement for data warehouses or as critical elements of the analytical infrastructure. The technologies underlying data lakes lack the maturity and breadth of functionality inherent in warehouses. Therefore, Gartner advises, managers responsible for data management should wait until the lakes reach the same level of development.

Accenture: 92% of big data implementers are happy with the result

According to a study by Accenture (fall 2014), 60% of companies have already successfully completed at least one big data project. The vast majority (92%) of representatives of these companies turned out to be quite satisfied with the result, and 89% said that big data had become an extremely important part of their business transformation. Among the remaining respondents, 36% had not considered introducing this technology, and 4% had not yet completed their projects.

More than 1,000 company executives from 19 countries took part in the Accenture study. The study was based on data from an Economist Intelligence Unit survey of 1,135 respondents around the world[9][10].

Among the main advantages of big data, the respondents named:

  • "search for new sources of income" (56%),
  • "customer experience improvement" (51%),
  • "new products and services" (50%) and
  • "inflow of new customers and retention of old loyalty" (47%).

With the introduction of new technologies, many companies faced traditional problems. For 51%, security became a stumbling block, for 47% - the budget, for 41% - the lack of necessary personnel, and for 35% - difficulties in integrating with the existing system. Almost all surveyed companies (about 91%) plan to soon solve the problem with a shortage of personnel and hire specialists on big data.

Companies are optimistic about the future of big data technologies. 89% believe that big data will change business as much as the Internet did. 79% of respondents noted that companies that do not engage with big data will lose their competitive advantage.

However, the respondents disagreed on what exactly should be considered big data. 65% of respondents believe that these are "large data files," 60% are sure that they are "advanced analytics and analysis," and 50% that they are "data from visualization tools."

Madrid spends €14.7m managing big data

In July 2014, it became known that Madrid would use big data technologies to manage urban infrastructure. The project cost is 14.7 million euros, the basis of the solutions being implemented will be technologies for analyzing and managing big data. With their help, the city administration will manage the work with each service provider and pay for it accordingly, depending on the level of services.

We are talking about the administration's contractors, who monitor the condition of streets, lighting, irrigation and green spaces, clean the territory, and remove and process garbage. During the project, 300 key performance indicators of urban services were developed for specially appointed inspectors, on the basis of which 1.5 thousand different checks and measurements will be carried out daily. In addition, the city will begin using an innovative technology platform called Madrid iNTeligente (MiNT) - Smarter Madrid.

Read more: Why does Madrid need analytics and big data?

2013

Experts: Fashion peak at Big Data

All vendors in the data management market, without exception, are developing technologies for Big Data management at this time. This new technological trend is also actively discussed by the professional community, both developers and industry analysts and potential consumers of such solutions.

As Datashift found out, as of January 2013, the wave of discussions around "big data" exceeded all conceivable sizes. After analyzing the number of references to Big Data on social networks, Datashift calculated that in 2012 this term was used about 2 billion times in posts created by about 1 million different authors around the world. This is equivalent to 260 posts per hour, with mentions peaking at 3,070 per hour.

Discussions of Big Data on the network are very active. Moreover, as can be seen from the pie charts presented above, the peak of discussions is only growing: if in the first quarter of 2012 there were more than 504 thousand references to the term, then in the fourth quarter - already more than 800 thousand. The main topics of discussion in relation to big data are myths and reality, experience in use, the human factor, return on investment, new technologies. Among vendors, Apache, 10gen, IBM, HP and Teradata were most often mentioned.

Gartner: Every second Chief information officer is ready to spend money on Big data

After several years of experimenting with Big Data technologies and the first implementations in 2013, the adoption of such solutions will increase significantly, Gartner predicts[11]. Researchers surveyed IT leaders around the world and found that 42% of respondents have already invested in Big Data technologies or plan to make such investments within the next year (data as of March 2013).

Companies are forced to spend money on big data technologies as the information landscape changes rapidly, demanding new approaches to information processing. Many companies have already recognized that large amounts of data are critically important, and that working with them can yield benefits unavailable with traditional information sources and processing methods. In addition, constant discussion of "big data" in the media fuels interest in the relevant technologies.

Frank Buytendijk, vice president of Gartner, even urged companies to temper the fervor, as some show concern that they are lagging behind competitors in mastering Big data.

"You
shouldn't worry, the possibilities for implementing ideas based on big data technologies are virtually endless," he said.

According to Gartner forecasts, by 2015, 20% of Global 1000 companies will take a strategic focus on "information infrastructure."

In anticipation of the new opportunities that big data technologies will bring with them, many organizations are already organizing the process of collecting and storing various kinds of information.

For educational and government organizations, as well as industrial companies, the greatest potential for business transformation lies in combining accumulated data with so-called dark data: e-mail messages, multimedia and other similar content. According to Gartner, the data race will be won by those who learn to handle the widest variety of information sources.

Cisco Survey: Big Data Will Help Increase IT Budgets

A study (spring 2013) called the Cisco Connected World Technology Report, conducted in 18 countries by the independent analytics company InsightExpress, surveyed 1,800 college students and the same number of young professionals aged 18 to 30 years. The survey was conducted to find out the level of readiness of IT departments for the implementation of Big Data projects and to get an idea of the related problems, technological flaws and strategic value of such projects.

Most companies collect, record and analyze data. Still, the report said, many companies face a range of complex business and information technology challenges with Big Data. For example, 60 percent of respondents admit that Big Data solutions can improve decision-making processes and increase competitiveness, but only 28 percent said that they already receive real strategic advantages from the accumulated information.

More than half of the IT managers surveyed believe that Big Data projects will help increase IT budgets in their organizations, as there will be increased requirements for technology, personnel and professional skills. At the same time, more than half of respondents expect that such projects will increase IT budgets in their companies in 2012. 57 percent are confident that Big Data will increase their budgets over the next three years.

81 percent of respondents said that all (or at least some) Big Data projects would require cloud computing. Thus, the proliferation of cloud technologies can affect the speed of distribution of Big Data solutions and the business value of these solutions.

Companies collect and use a wide variety of data types, both structured and unstructured. According to the Cisco Connected World Technology Report, respondents draw data from the following sources:

  • 74 percent collect current data;
  • 55 percent collect historical data;
  • 48 percent take data from monitors and sensors;
  • 40 percent use real-time data and then erase it. Most commonly, real-time data is used in India (62 percent), the United States (60 percent) and Argentina (58 percent);
  • 32 percent of respondents collect unstructured data - for example, video. China leads in this area: 56 percent of respondents there collect unstructured data.

Nearly half (48 percent) of IT executives predict that the load on their networks will double over the next two years. (This is especially true of China, where 68 percent of respondents expect it, and Germany, with 60 percent.) 23 percent of respondents expect network load to triple over the next two years. At the same time, only 40 percent of respondents declared themselves ready for an explosive increase in network traffic.

27 percent of respondents admitted that they need better IT policies and information security measures.

21 percent need bandwidth expansion.

Big Data opens up new opportunities for IT departments to build value and build close relationships with business units, allowing them to increase revenues and strengthen their financial position. Big Data projects make IT a strategic partner for business units.

According to 73 percent of respondents, it is the IT department that will become the main locomotive for the implementation of the Big Data strategy. At the same time, the respondents believe, other departments will also connect to the implementation of this strategy. First of all, this applies to the departments of finance (24 percent of respondents named it), research (20 percent), operational (20 percent), engineering (19 percent), as well as marketing (15 percent) and sales (14 percent).

Gartner: Managing Big Data Requires Millions of New Jobs

Global IT spending will reach $3.7 trillion by 2013, which is 3.8% more than IT spending in 2012 (the year-end forecast is $3.6 trillion). The big data segment will develop at a much faster pace, according to Gartner[12].

By 2015, 4.4 million jobs in information technology will be created to serve big data, 1.9 million of them in the United States. Moreover, each such job will entail the creation of three additional jobs outside the IT sphere, so in the United States alone, 6 million people will be working to sustain the information economy over the next four years.

According to Gartner experts, the main problem is that the industry lacks the talent for this: both the private and public educational systems, in the United States for example, are unable to supply the industry with enough qualified personnel. As a result, only one out of three of the new IT jobs mentioned will be filled.

Analysts believe that the role of nurturing qualified IT personnel should be taken over directly by companies that are in dire need of them, since such employees will become a pass for them to the new information economy of the future.

2012

The first skepticism about "Big Data"

Analysts at Ovum and Gartner suggest that for the big data theme, fashionable in 2012, a time of liberation from illusions may be coming.

The term "Big Data" at this time usually refers to the ever-growing amount of information coming online from social media, sensor networks and other sources, as well as the growing range of tools used to process data and identify important business trends based on it.

"Because of (or despite) the hype about the big data idea, manufacturers in 2012 looked at the trend with great hope," noted Tony Bayer, an analyst at Ovum.

Bayer said DataSift conducted a retrospective analysis of big data mentions on Twitter for 2012. By limiting the search to manufacturers, analysts wanted to focus on the market's perception of this idea, rather than the broad user community. Analysts have identified 2.2 million tweets from more than 981 thousand authors.

These data varied across countries. Although it is generally accepted that the United States leads in terms of installed platforms for working with big data, users from Japan, Germany and France were often more active in discussions.

The idea of Big Data attracted so much attention that even the business press, and not just specialized publications, widely wrote about it.

The number of positive reviews of big data from manufacturers was three times the number of negative, although in November there was a surge in negativity due to HP's purchase of Autonomy.

Much harsher times are expected for the concept of big data, but once past them, this ideology will reach maturity.

"For big
data supporters, the time is coming for parting with illusions," explained Svetlana Sikular, an analyst at Gartner. She referred to the mandatory stage included in the classic popularity cycle curve (Hype Cycle), which is used in Gartner.

Even among those customers who have made the most strides using Hadoop, many are "losing their illusions."

"They
by no means feel they are ahead of others and believe success falls to others while they are going through hard times. These organizations have amazing ideas, and now they are disappointed because of the difficulties in developing reliable solutions, "Sikular said
.

However, supporters of big data may find a source of optimism at this time in the fact that the next stages on the popularity curve have very promising names: the "slope of enlightenment" and the "plateau of productivity."

Slow data storage systems (DSS) are holding back Big Data

While the performance of modern computing systems has grown by many orders of magnitude over several decades and bears no comparison with the first personal computers of the early 1980s, the situation with data storage systems (DSS) is much worse. Of course, available capacities have increased many times over (though they are still in short supply), and the cost of storing a bit of information has dropped sharply (although ready-made systems are still too expensive), but the speed of retrieving and searching for the necessary information leaves much to be desired.

If you do not count still-expensive and not entirely reliable or durable flash drives, storage technology has not advanced very far. We still have to deal with hard drives, whose platters spin at no more than 15,000 rpm even in the most expensive models. Since we are talking about big data, a considerable (if not overwhelming) share of it obviously resides on drives with a spindle speed of 7,200 rpm. Quite prosaic and sad.

The problem identified here lies on the surface and is well known to companies' Chief Information Officers. However, it is far from the only one[13]:

  • Technological lag.

Big data can turn into a big headache or open up great opportunities for government agencies, if only they can take advantage of it. Such conclusions were reached in the second quarter of 2012 by the authors of a study with the telling name The Big Data Gap (from the English gap - "discrepancy," in this context between theoretical benefits and the real state of affairs). According to a survey of 151 Chief Information Officers, over the next two years the amount of data stored in government agencies will grow by 1 petabyte (1024 terabytes). At the same time, it is becoming more difficult to derive benefit from constantly growing information flows: the shortage of available space in storage systems is taking its toll, access to the necessary data is difficult, and there is a lack of computing power and qualified personnel.

The technologies and applications at the disposal of IT managers lag significantly behind the requirements of real tasks whose solution could bring additional value from large data. 60% of representatives of civilian and 42% of defense departments are still only studying the phenomenon of big data and searching for possible points of application in their activities. The main one, according to the Chief Information Officers of federal authorities, should be an increase in work efficiency - so believe 59% of respondents. In second place is an increase in the speed and accuracy of decision-making (51%), and in third place the ability to make forecasts (30%).

Be that as it may, the flows of processed data continue to grow. 87% of the surveyed Chief Information Officers reported an increase in the volume of stored information over the past two years, and 96% of respondents (with an average increase of 64%) expect this trend to continue over the next two years. It will take the institutions taking part in the survey an average of three years to be able to enjoy all the benefits big data promises. So far, only 40% of authorities make strategic decisions based on accumulated data, and only 28% interact with other organizations to analyze distributed data.

  • Poor data quality.

In a large house it is always harder to keep things in order than in a tiny apartment. A complete analogy can be drawn with big data: when working with it, it is very important to follow the formula "garbage in - gold out." Unfortunately, modern master data management tools are not effective enough and often produce the reverse situation ("gold in - garbage out").

  • Metadata: forewarned is forearmed.

A query that does a good job of finding hundreds of rows out of a million might not handle a table of one hundred billion rows. If data changes frequently, it is critical to log and audit. The implementation of these simple rules will allow you to have important information on the volume of data, speed and frequency of its change, which is important for developing a methodology for storing and working with data.

  • Tell me what company you keep and I will tell you what you are.

Only a handful of trained specialists can correctly interpret the trends and relationships hidden in big data arrays. To some extent they can be replaced by filters and structure recognizers, but the quality of the results obtained at the output still leaves much to be desired.

  • Visualization.

The eponymous section of the article clearly illustrates the complexity and ambiguity of the approaches used to visualize big data. At the same time, presenting results in a perceptually accessible form is sometimes critical.

  • Time is money.

Viewing data in real time means a constant recalculation, which is far from always acceptable. We have to compromise and resort to a retrospective way of analytics, for example, based on cubes, and put up with partly outdated results.

  • Firing a cannon at sparrows.

You can never know in advance at what time interval big data is of particular value and most relevant. But collecting, storing, analyzing, creating backups requires considerable resources. It remains to hone the storage policy and, of course, do not forget to apply it in practice.

Oracle: The answer to Big Data lies in data center modernization

A study by Oracle suggests that many companies appear to be caught off guard by the "big data" boom.

"Fighting big data looks set to be the biggest IT challenge for companies in the next two years," said Luigi Freguia, senior vice president of hardware at Oracle in the region EMEA. - By the end of this period, they will either cope with it, or will significantly lag behind in business and will be far from both threats and the capabilities of "big data."

The task of "mastering" big data is unique, Oracle recognizes. The main response of companies to big data challenges should be the modernization of corporate data centers (data centers).

To assess the degree of readiness of companies for changes within data centers, for almost two years Oracle, together with the analytical company Quocirca, collected data for the Oracle Next Generation Data Centre Index (Oracle NGD Index) study. This index assesses the progress of companies in the thoughtful use of data centers to improve the performance of IT infrastructure and optimize business processes.

The study consisted of two phases (cycles), and analysts noticed significant changes in all key indicators already at the threshold of the second stage. The average score on the Oracle NGD Index among survey participants from Europe and the Middle East was 5.58. The maximum score, 10.0, reflects the most thoughtful strategy for using data centers.

The average score (5.58) is higher than in the first cycle of the study, conducted in February 2011, when it was 5.22. This suggests that companies are increasing investment in data center strategies in response to the big data boom. All countries and industries covered by the study improved their Oracle NGD Index in the second cycle compared to the first.

Scandinavia and the DCH region (Germany/Switzerland) are leading in sustainability with a 6.57 Sustainability Index. Next in this ranking is Benelux (5.76) and, then, Great Britain with a score of 5.4, which is already below the average.

Russia, which was included in the list of countries/regions only in the second cycle of the study and did not participate in the first, has significant potential for growth (indicator 4.62), analysts say.

According to the study, Russian organizations consider supporting business growth an important reason for investing in data centers. More than 60% of companies see the need for such investments today or in the near future, suggesting that organizations will soon find it incredibly difficult to compete if they have not yet made the appropriate investments.

In general, in the world, the share of respondents with their own corporate data centers decreased from 60% according to the results of the first research cycle to 44% in the second research cycle, on the contrary, the use of external data centers increased by 16 points to 56%.

Only 8% of respondents said that they do not need new data center capacities in the foreseeable future. 38% of respondents see the need for new data center capacities within the next two years. Only 6.4% of respondents reported that their organization does not have a sustainable development plan related to the use of data centers. The share of data center managers who review copies of electricity bills increased from 43.2% to 52.2% over the entire study period.

Investment in Big Data startups

In mid-October 2012, three American startups at once received investments in the development of applications and services for working with Big Data. These companies exemplify the undiminished and indeed growing interest of venture capital in this segment of the IT business, as well as the need for new infrastructure for working with data, TechCrunch wrote on October 21, 2012.

Investors' interest in Big data is explained by Gartner's positive forecast for the development of this segment until 2016. According to the study, solutions for Big data will amount to about $232 billion in the IT spending structure of companies.

At the same time, many companies and startups in the Big data segment are beginning to move away from the scheme of work of industry pioneers (Google, Amazon), when big data solutions were only part of the data center. Now they have transformed into a separate direction of the IT market.

Big Data now includes both infrastructure offerings and applications, boxed as well as cloud-based; it is a working tool not only for large corporations but also for medium-sized and sometimes small businesses.

This market shift is forcing vendors to look at Big Data differently and change their approach to working with it; it is also changing their view of customers, who are now not only telecommunications or financial corporations.

India braces for big data boom

The Indian IT market is gradually beginning to slow down and the industry has to look for new ways to maintain the usual growth dynamics or ways not to collapse after other industries during the global economic crisis. Software and application developers are starting to offer new uses for the latest technologies. So some Indian companies analyze purchasing activity based on large amounts of unstructured data (Big Data) and then offer the results of research to large stores and retail chains. This was reported on October 8, 2012 by Reuters.

Video surveillance cameras, reports on purchases, requests on the Internet, reports on completed purchases using a particular web resource fell under close scrutiny.

"This data can let us know about the visitor's tendency to buy one or another, and therefore this information gives the key to concluding a profitable deal for all parties," said Dhiraj Rajaram, CEO of Bangalore company Mu Sigma (Dhiraj Rajaram), one of the largest organizations engaged in the analysis of Big Data.

Dhiraj Rajaram noted that the bulk of such analysis is done in the United States, but now that the rapid development of the Indian IT market has begun to weaken, companies are paying increasingly close attention to this promising segment.

At the same time, Indian companies, when working with Big Data, most often use cloud technologies to store and process data and the results of their activities.

The volume of global data produced in 2011 is estimated, according to Dhiraj Rajaram, at about 1.8 zettabytes - 1.8 billion terabytes, the equivalent of 200 billion full-length high-definition films.

In addition to analyzing search queries and the results of CCTV image processing, Dhiraj Rajaram sees huge scope for work in the amount of information from users and buyers that appears on social media. In his opinion, this relatively new segment of the IT market may soon become a driver for the entire industry.

India's National Association of Software and Services Companies (Nasscom) forecasts a sixfold growth in the Big Data solution segment to $1.2 billion.

At the same time, the global big data market will more than double, growing from $8.25 billion now to about $25 billion over the next few years, according to Nasscom.

2011

The 'Big Data' fashion blossoms

In 2011, "big data" was generally understood to mean volumes of data that contemporary software tools could not process within a reasonable time. The threshold itself is highly conventional and keeps shifting upward as computing technology improves and becomes more widely available. In June 2011, Gartner in particular considered "big data" along three dimensions at once: growth in volume, growth in the speed of data exchange, and increasing diversity of information[14].

At that time, the main feature of approaches within the big data concept was considered to be the ability to process the entire data set and thereby obtain more reliable analysis results. Previously, analysts had to rely on a so-called representative sample, a subset of the information, and the errors of that approach were naturally higher. It also required additional resources to prepare the data for analysis and bring it into the required format.
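
The gain from processing the entire array rather than a sample can be illustrated with a small simulation. The sketch below uses purely synthetic numbers (an invented "population" of purchase amounts) and simply shows that the error of a sampled estimate shrinks roughly as 1/sqrt(n) and vanishes only when the whole array is scanned.

```python
import random
import statistics

# Illustrative only: a synthetic "population" of 1,000,000 purchase amounts.
random.seed(42)
population = [random.expovariate(1 / 50.0) for _ in range(1_000_000)]
true_mean = statistics.fmean(population)  # the "process everything" answer

# Estimate the same quantity from representative samples of growing size.
for sample_size in (100, 10_000, 1_000_000):
    sample = random.sample(population, sample_size)
    estimate = statistics.fmean(sample)
    error_pct = abs(estimate - true_mean) / true_mean * 100
    print(f"n={sample_size:>9,}  estimate={estimate:8.2f}  error={error_pct:.2f}%")
# Sampling error shrinks roughly as 1/sqrt(n); only the full scan has zero error.
```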

According to media reports during this period, "it is difficult to find an industry for which the big data issue would be irrelevant." The ability to operate on large amounts of information, analyze the relationships within it, and make informed decisions carries, on the one hand, the potential for companies in various verticals to increase profitability and efficiency. On the other hand, it is a great opportunity for vendors' partners, integrators and consultants, to earn additional revenue.

To highlight the benefits of developing and implementing big data tools, McKinsey offered the statistics below. They refer mainly to the US market, but are easy to extrapolate to other economically developed regions.

  • The potential annual value of big data to US healthcare is estimated at $300 billion. Part of this enormous sum will flow into the adoption of modern IT, and big data will clearly not be left out.

  • The use of big data analysis tools in retail chains could potentially increase operating margins by up to 60%.

  • In the United States alone, 140,000-190,000 analysts and over 1.5 million data-savvy managers will be needed to process "big data" effectively and manage the resulting information arrays.

  • American companies in 15 out of 17 sectors of the economy have more data than the US Library of Congress.

Why the data got big

In 2011, proponents of the big data concept declared that the modern world has a great many sources of big data, including:

  • continuously transmitted data from measuring devices,
  • events from radio frequency identifiers,
  • streams of messages from social networks,
  • meteorological data,
  • earth remote sensing data,
  • mobile network subscriber location data streams,
  • audio and video recording devices.

The massive spread of these technologies and fundamentally new models for using devices and Internet services served as the starting point for the penetration of big data into almost all areas of human activity: primarily research, the commercial sector, and public administration.

File:1 BigData1.jpg

Data growth (left) as analog storage is displaced (right). Source: Hilbert and López, "The world's technological capacity to store, communicate, and compute information," Science, 2011.

A few telling facts of this time:

  • In 2010, corporations around the world accumulated 7 exabytes of data, while another 6 exabytes were stored on home PCs and laptops.
  • All the music in the world could fit on a disk drive costing $600.
  • In 2010, mobile operators' networks served 5 billion phones.
  • Every month, 30 billion new pieces of content are posted on Facebook.
  • The amount of stored information grows by 40% a year, while global IT spending grows by only 5%.
  • As of April 2011, the US Library of Congress held 235 terabytes of data.
  • American companies in 15 out of 17 sectors of the economy have more data than the US Library of Congress.

File:2 BigData.png

Growth in computing power (left) against the background of the shift in the data paradigm (right). Source: Hilbert and López, "The world's technological capacity to store, communicate, and compute information," Science, 2011.

For example, sensors installed on an aircraft engine generate about 10 TB of data in half an hour, and roughly the same flows are typical of drilling rigs and oil refineries. The short message service Twitter alone, despite its 140-character limit, generates a stream of 8 TB per day. If all such data were accumulated for further processing, its total volume would be measured in tens and hundreds of petabytes. Additional complexity arises from the variability of the data: its composition and structure change constantly as new services are launched, advanced sensors are installed, or new marketing campaigns are deployed.
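
A rough back-of-envelope calculation, using only the figures quoted above plus one hypothetical assumption (that the engine runs eight hours a day), shows how such streams reach the petabyte scale over a year:

```python
TB_PER_PB = 1000  # decimal prefixes are precise enough for an estimate

# Figures quoted in the text
engine_tb_per_half_hour = 10   # one aircraft engine
twitter_tb_per_day = 8         # Twitter's daily stream

# Hypothetical assumption, for illustration only: the engine runs 8 hours a day
engine_tb_per_day = engine_tb_per_half_hour * 2 * 8

for label, tb_per_day in [("one aircraft engine", engine_tb_per_day),
                          ("Twitter", twitter_tb_per_day)]:
    pb_per_year = tb_per_day * 365 / TB_PER_PB
    print(f"{label}: {tb_per_day} TB/day -> ~{pb_per_year:.0f} PB/year")
# One engine alone accumulates roughly 58 PB a year; a handful of such sources
# quickly pushes the total into the hundreds of petabytes.
```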

Recommendations to Chief information officers

The unprecedented variety of data arising from an enormous number of transactions and interactions provides an excellent foundation for a business to refine forecasts, assess the prospects of products and entire business areas, control costs better, and evaluate efficiency; the list can be continued for as long as you like. On the other hand, big data poses difficult challenges for any IT department, 2020vp.com experts wrote in 2011. Not only are these challenges fundamentally new in nature, but solving them also requires taking into account the limits the budget places on capital and operating expenditures.

A chief information officer who intends to benefit from large structured and unstructured data should be guided by the following technical considerations[15]:

  • Divide and conquer.

Data movement and integration are necessary, but both raise the capital and operating costs of extract, transform, load (ETL) tooling. Therefore, standard relational environments such as Oracle and analytical data stores such as Teradata should not be neglected (a minimal ETL sketch is shown after this list).

  • Compression and deduplication.

Both technologies have advanced significantly; multi-level compression, for example, can reduce the volume of raw data tenfold. However, it is always worth estimating how much of the compressed data will need to be restored, and deciding whether to apply compression based on each specific situation (see the sketch after this list).

  • Not all data is the same.

Depending on the situation, business intelligence queries vary widely. Often a simple SQL query is enough to obtain the necessary information, but there are also deep analytical queries that require full-fledged business intelligence tools with dashboards and visualization. To avoid a sharp increase in operating costs, carefully compile a balanced mix of the necessary proprietary technologies and open-source software such as Apache Hadoop.

  • Scalability and manageability.

Organizations have to deal with heterogeneous databases and analytical environments, so the ability to scale both horizontally and vertically is of fundamental importance. It is precisely the ease of horizontal scaling that has become one of the main reasons for the rapid spread of Hadoop, especially given the possibility of processing information in parallel on clusters of ordinary servers, which does not demand highly specialized skills from staff and thus saves investment in IT resources (see the word-count sketch below).
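
As a minimal illustration of the ETL step mentioned under "Divide and conquer," the sketch below extracts records from a CSV file, applies a trivial transformation, and loads the result into a relational table. The file name, column names, and schema are hypothetical and serve only to show the pattern.

```python
import csv
import sqlite3

# Hypothetical input file and schema, for illustration only.
SOURCE_CSV = "web_log.csv"  # columns: user_id, url, bytes_sent

def etl(csv_path: str, db_path: str = "analytics.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_views (user_id TEXT, url TEXT, kb REAL)"
    )
    with open(csv_path, newline="") as f:
        rows = []
        for rec in csv.DictReader(f):
            # Transform: convert bytes to kilobytes, drop malformed records.
            try:
                kb = int(rec["bytes_sent"]) / 1024
            except (KeyError, ValueError):
                continue
            rows.append((rec["user_id"], rec["url"], kb))
    # Load into the relational store in one batch.
    conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    etl(SOURCE_CSV)
```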
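
The compression and deduplication trade-off can be demonstrated in a few lines. The sketch below, using synthetic and deliberately repetitive "raw" records, compresses them with gzip and then deduplicates identical records by hashing; real ratios will of course depend on the data.

```python
import gzip
import hashlib

# Synthetic, highly repetitive "raw" data, typical of logs and sensor feeds.
records = [f"sensor-{i % 100};status=OK;temp=21.5" for i in range(100_000)]
raw = "\n".join(records).encode()

# Compression: repetitive data typically shrinks dramatically.
compressed = gzip.compress(raw)
print(f"raw: {len(raw):,} bytes, gzip: {len(compressed):,} bytes "
      f"(ratio ~{len(raw) / len(compressed):.0f}x)")

# Deduplication: keep only one copy of each distinct record.
seen = set()
unique = []
for rec in records:
    digest = hashlib.sha256(rec.encode()).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique.append(rec)
print(f"{len(records):,} records -> {len(unique):,} unique after deduplication")
# Restoring compressed data later costs CPU time, which is the trade-off
# the text warns about.
```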
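
Finally, the ease of horizontal scaling is easiest to see in the classic word-count example. The sketch below follows the Hadoop Streaming convention (mapper and reducer read standard input and emit tab-separated pairs), so the same code can be tested on a laptop with ordinary shell pipes and then spread across a cluster of commodity servers; the script and file names are illustrative.

```python
#!/usr/bin/env python3
"""Word count in the Hadoop Streaming style: mapper and reducer both read
stdin and write tab-separated key/value pairs, so the work can be sharded
across a cluster of ordinary servers. Local test run:
    cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
"""
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    # Hadoop delivers mapper output grouped by key; `sort` emulates that locally.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```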

Growing demand for big data administrators

Of the IT directors surveyed at the end of 2011 by the recruitment agency Robert Half, 46% called database administration the most in-demand specialty. Network administration was named by 41% of respondents, Windows system administration by 36%, desktop application support by 33%, and business analytics and reporting tools by 28%.

Processing large volumes of data is becoming a serious problem for many companies, which increases demand for database management specialists, Robert Half concludes. Besides the growth of unstructured data (for example, social network messages), demand is driven by preparations for new regulatory requirements in Europe, including the Solvency II standards for insurers and the Basel III capital and liquidity standards for banks.

Analysts at Robert Half also predict a shortage of mobile and cloud specialists. Their conclusion is based on the fact that 38% of the surveyed chief information officers named mobile technologies as their main investment area, and 35% named virtualization.

2008: Emergence of the term "Big Data"

The term "big data" itself appeared in use only in the late 2000s. It is one of the few names with a fairly reliable date of birth - September 3, 2008, when a special issue of the oldest British scientific journal Nature was published, dedicated to finding an answer to the question "How can technologies that open up the possibility of working with large amounts of data affect the future of science?" The special issue summed up previous discussions about the role of data in science in general and in electronic science (e-science) in particular[16].

Several reasons can be identified for the new wave of interest in big data. Information volumes were growing exponentially, and the lion's share consisted of unstructured data; in other words, correctly interpreting these information flows became both increasingly relevant and increasingly complex. The IT market reacted immediately: large players acquired the most successful specialized companies and began developing their own tools for working with big data, while the number of relevant startups exceeded all expectations.

With the growth of computing power and the development of storage technologies, big data analysis is gradually becoming available to small and medium-sized businesses, ceasing to be the exclusive prerogative of large companies and research centers. This is facilitated to a large extent by the cloud computing model.

At the time, it was expected that as IT penetrated further into business and daily life, the information flows to be processed would keep growing: if in the late 2000s big data meant petabytes, in the future it would mean exabytes and beyond. It was also predicted that, for the foreseeable future, tools for working with such gigantic arrays of information would remain overly complex and expensive.

1970s: The Age of Mainframe - the Emergence of the Big Data Concept

The concept of "big data" itself arose back in the days of mainframes and related scientific computer computing[17] As you know, science-intensive computing has always been difficult and is usually inextricably linked to the need to process large amounts of information.

See also

Notes