Developer: Apache Software Foundation
First release: 2010
Latest release date: 2014/01/21
Technology: DBMS
Apache Hadoop is a free Java framework that supports the execution of distributed applications running on large clusters built from commodity hardware. Hadoop transparently provides applications with reliability and high-speed data operations. Hadoop implements the computational paradigm known as MapReduce, in which an application is divided into a large number of small tasks, each of which can be executed on any node of the cluster. In addition, it provides a distributed file system that uses the cluster's compute nodes for data storage, which makes it possible to achieve very high aggregate cluster throughput. This design allows applications to scale easily to thousands of nodes and petabytes of data.
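To make the MapReduce paradigm more concrete, the classic word-count job can be written against Hadoop's Java API roughly as follows. This is a minimal sketch assuming the Hadoop 2.x `org.apache.hadoop.mapreduce` API; the input and output HDFS paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each small task processes one block of input lines independently,
    // on whichever cluster node is assigned that block.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: partial results for the same word are gathered and summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, such a job is typically submitted with `hadoop jar wordcount.jar WordCount <input> <output>`; the framework splits the input stored in HDFS across many map tasks on different cluster nodes and merges their partial counts in the reduce phase.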
Materials on the Google File System (GFS) served as a source of inspiration for the Hadoop developers.
Hadoop is a top-level Apache project, and a worldwide community of developers participates in its development.
The Hadoop project was named after the toy elephant of the child of the project's founder, Doug Cutting. Initially the project was developed as a system to support distributed computing for the Nutch project. Hadoop contains the distributed computing platform that was originally part of Nutch; it includes the Hadoop Distributed File System (HDFS) and an implementation of map/reduce.
According to a report published in August 2011 by Ventana Research, organizations are using the Apache Hadoop system for distributed data processing more and more widely; however, it does not replace established technologies but is applied alongside them.
Hadoop's advantage is its ability to break very large data sets into small blocks distributed across a cluster of commodity hardware for accelerated processing. Facebook, Amazon, eBay and Yahoo, the first companies to use Hadoop, apply the technology to analyze petabytes of unstructured data that relational database management systems would struggle to handle. According to Ventana, whose staff surveyed more than 160 companies, a growing number of enterprises use Hadoop for similar purposes. In most of them the system analyzes huge volumes of machine-generated information: system logs, search results and social network content. In 66% of these organizations Hadoop performs kinds of analysis that were previously unavailable to them. Much less often Hadoop is used to analyze traditional structured data such as transactions, customer information and call records; relational DBMSs are still usually applied in these cases.
The open-source Hadoop framework will be integrated into new versions of the MS SQL Server database and the Windows Azure cloud platform. Microsoft vice president Ted Kummert announced this at the opening of the PASS Summit 2011 conference in Seattle in October. According to him, the integration of Hadoop with SQL Server and Azure will help satisfy the needs of users who require efficient processing of large volumes of data. "The next stage is to integrate data processing technologies with cloud computing and gain capabilities that were impossible even to imagine just a few years ago," the vice president told the five thousand SQL Server users in attendance.
Hadoop is a free framework for organizing distributed computing, developed by the Apache Software Foundation community on the basis of the MapReduce and Google File System Storage technologies. Hadoop is used in large web projects such as Facebook, Twitter, Rackspace and eBay, as well as in commercial software products from IBM, EMC, Dell and Oracle. The largest contribution to the project is made by Yahoo, in particular its spin-off Hortonworks, with which Microsoft has also signed an agreement on integrating the framework into its products.
This is not the first time Microsoft has paid attention to Hadoop: the framework was used for some time to implement semantic search in Bing, until it was replaced with a closed analog. The company made no larger-scale attempts at integration, concentrating instead on the development of Dryad, a closed Hadoop analog based on its own Cosmos technology. Unlike Hadoop, which is developed in Java, this Microsoft product is based on .NET, and its development will continue in parallel with the adoption of the open framework.
Within the strategic cooperation agreement with Hortonworks, Microsoft has already released Hadoop Connectors for MS SQL Server 2008 R2, facilitating data exchange between the two systems. The companies are now jointly working on a Hadoop distribution adapted to run on Windows, unlike the original, which was designed for use on Linux systems. Microsoft claims it will be fully compatible with the original Apache product and also promises to open the project's code to the community.
Azure users will see the first Hadoop from Microsoft at the end of December 2011. Thanks to this innovation, developers using the cloud platform will be able to create Hadoop applications on it without deploying anything in their own data centers.
The new version of MS SQL Server, which officially replaced the code name Denali with SQL Server 2012, will be released in 2012 and will contain both the MS SQL Server database and Hadoop. The former will be used to process structured data, while Hadoop will handle unstructured arrays of information. The two components are to be connected to each other through Hadoop Connectors.
2016: DIY Hadoop: hopes and risks
September 2016. In anticipation of the Internet of Things (IoT) and explosive growth in data volumes, companies of all sizes and in almost all industries are trying to control and manage their growing volumes of information. In an attempt to cope with this avalanche, a number of organizations deploy technological solutions on the Apache Hadoop platform.
But acquiring, deploying, configuring and optimizing a Hadoop cluster on one's own for use within an existing infrastructure can be a much more difficult task than companies believe, even with the assistance of specialists capable of performing this work[1].
Managing Big Data is not only a matter of extracting and storing information. A whole set of confidentiality and security issues must be addressed. Information security failures can not only damage a company's reputation (as Sony and Target have experienced in recent years) but also lead to sanctions from regulators.
At the beginning of 2015 the US Federal Trade Commission (FTC) published a report on the Internet of Things containing guidelines for protecting personal consumer information and ensuring security. In the document Careful Connections: Building Security in the Internet of Things, the FTC strongly recommends that companies apply a risk-based approach and follow the best practices developed by security experts, such as the use of strong encryption and authentication.
The FTC noted that both business and law enforcement agencies have an interest in IoT solutions meeting consumer expectations on security. The report recommends that companies processing IoT data apply proven practices, including:
- building security measures into products and services from the very beginning of design rather than adding them afterwards;
- adhering to the principle of defense in depth, which provides security measures at several levels.
Company executives and IT departments that decide to follow the FTC's recommendations for Big Data security are highly likely to run into difficulties, particularly when trying to integrate Hadoop with existing IT infrastructure. The problem with Hadoop is that the product was originally developed without security requirements; it was created only to solve the tasks of storing and quickly processing large volumes of distributed data. This has led to a number of threats:
- risks are inherent in a Hadoop cluster built in-house: it is often developed without proper security protections, by a small group of specialists, in isolation from the production environment. As the cluster grows from a small project into a corporate Hadoop environment, every stage of growth (deploying patches, configuration, version control of Hadoop modules, OS libraries and utilities, user management, and so on) becomes more difficult and labor-intensive.
- Hadoop is founded on the principle of "democratization of data" (all data is available to all users of the cluster), which makes it difficult to comply with a number of regulatory requirements, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Payment Card Industry Data Security Standard (PCI DSS). This stems from the lack of data access management tools, including password management, authorization for access to files and databases, and auditing (the sketch after this list illustrates the basic file-level controls that do exist).
- when using the Hadoop environment it is difficult to determine the origin of a specific data set and its data sources. As a result, crucial business decisions may be based on the analysis of suspicious or doubtful data.
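As a minimal illustration of the coarse, file-level controls that Hadoop does provide out of the box (in contrast to the password management, fine-grained authorization and auditing mentioned above), the hedged Java sketch below sets an owner and restrictive POSIX-style permissions on an HDFS directory through the standard `org.apache.hadoop.fs.FileSystem` API. The path, user and group names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class RestrictHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical directory holding regulated (e.g. HIPAA / PCI DSS) data.
        Path sensitive = new Path("/data/claims");

        // Restrict access: owner read/write/execute, group read/execute, others none.
        fs.setPermission(sensitive, new FsPermission(
                FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));

        // Hand the directory to a dedicated service account and group (hypothetical names).
        fs.setOwner(sensitive, "claims_etl", "compliance");

        fs.close();
    }
}
```

Note that `setOwner` requires HDFS superuser privileges, and such permission bits say nothing about job submission, auditing or data lineage; stronger mechanisms such as Kerberos authentication and HDFS ACLs have to be enabled and managed separately.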
Independent deployment of Apache Hadoop clusters attracts many company executives and IT departments with the apparent cost savings from using commodity hardware and free software. However, despite the initial savings, a Hadoop cluster built in-house is not always the optimal option for organizations that need a corporate-grade Big Data solution, in terms of both security and performance.
2015: Contribution of Huawei to Hadoop development
2014
HDP 2.0
On January 22, 2014 Microsoft announced the release of Hortonworks Data Platform 2.0 for Windows. The product is certified for Windows Server 2008 R2 and Windows Server 2012/2012 R2.
The updated platform gives developers a convenient Windows installer for deploying Hadoop 2.0 on a single computer. A "sandbox" emulating a cluster of several nodes is also available.
As part of the HDP 2.0 release for Windows, Hortonworks announced an update of the NoSQL DBMS Apache HBase to version 0.96 (it is now possible to take database snapshots).
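For illustration, taking and restoring such a snapshot from Java looks roughly like the hedged sketch below, which assumes the 0.96-era `HBaseAdmin` client API; the table and snapshot names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class HBaseSnapshotExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate ZooKeeper and the cluster.
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            // Take a point-in-time snapshot of a (hypothetical) table.
            admin.snapshot("orders_snap_20140122", "orders");

            // Materialize the snapshot as a new table, e.g. for testing or recovery.
            admin.cloneSnapshot("orders_snap_20140122", "orders_restored");
        } finally {
            admin.close();
        }
    }
}
```

The same operations are also available interactively through the `snapshot` and `clone_snapshot` commands of the HBase shell.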
The second phase of the Stinger project, an accelerator for the Apache Hive engine that supports SQL queries over data in Hadoop, has begun. Against the background of Cloudera's recent publication of test results for its similar Impala engine, which outperformed Hive by tens of times, this announcement of a 100-fold acceleration of Hive on petabyte-scale volumes is very relevant.
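For context, the SQL-over-Hadoop workloads that Stinger targets are usually issued through HiveServer2's JDBC interface; the sketch below shows an ordinary aggregate query in Java, assuming the standard `org.apache.hive.jdbc.HiveDriver` driver and a hypothetical server address and `web_logs` table.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, database and credentials are hypothetical.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-server.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

            // An aggregate query of the kind Stinger is meant to speed up.
            ResultSet rs = stmt.executeQuery(
                "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status");
            while (rs.next()) {
                System.out.println(rs.getString("status") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```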
Apache Hadoop 2.0
The main innovation of the platform is the YARN task management mechanism, designed to simplify Hadoop application development. Previously, task processing in Hadoop, performed by the MapReduce mechanism, was possible only in serial mode; YARN allows tasks to be executed in parallel. The new mechanism creates containers for applications, monitors their resource requirements and allocates additional resources as necessary. Whereas MapReduce used to be responsible for both task scheduling and resource management, YARN separates these functions.
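To make the separation of roles concrete, the hedged sketch below talks to the YARN ResourceManager directly through the public client API (`org.apache.hadoop.yarn.client.api.YarnClient`) to read cluster capacity and list running applications of any type; this resource-management view is now provided by YARN itself rather than by MapReduce. It is an illustrative snippet only and assumes a default Hadoop 2.x configuration on the classpath.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnClusterMetrics;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();

        // Cluster-wide view provided by the ResourceManager, not by MapReduce.
        YarnClusterMetrics metrics = yarn.getYarnClusterMetrics();
        System.out.println("NodeManagers in cluster: " + metrics.getNumNodeManagers());

        // Applications of any type (MapReduce, Tez, etc.) run side by side in containers.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getApplicationType()
                    + "  " + app.getYarnApplicationState());
        }

        yarn.stop();
    }
}
```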
According to experts, the new task scheduling mechanism may give rise to a whole wave of new analytical applications for Hadoop. This process has already begun: for example, Apache Tez, a real-time data analysis system that speeds up query execution by processing data in memory, uses YARN facilities.
Hadoop 2.0 also adds several new components, including tools for ensuring high availability and for expanding the scale of individual clusters (Hadoop environments can consist of several clusters): each of them can now contain up to 4 thousand servers.
The Hadoop platform will become the industry standard in 2015
On December 1, 2014 analysts at Forrester Research published a forecast according to which the Apache Hadoop platform will become the de facto standard for the IT infrastructure of all large companies in 2015. Noticeable growth is expected in the number of specialists and in the speed of implementation of Hadoop-based systems[2].
The market shows a corresponding trend toward mandatory integration of Hadoop, which analysts have called Hadooponomics; it should provide the ability to scale both storage and data processing linearly. Hadooponomics is closely connected with the possibility of further active use of cloud solutions at large enterprises.
According to the analysts' report, not all enterprises actively use Hadoop, but the platform's importance has been proven by many companies leading their industries: WalMart, Fidelity Investments, Sears, Verizon, USAA, Cardinal Health, Wells Fargo, Procter & Gamble, Cablevision, Nasdaq, AutoTrader, Netflix and Yelp.
It is expected that ANSI-compatible SQL capabilities on the Hadoop platform will give Hadoop everything it needs to become a useful data platform for enterprises, since these options are familiar to data management professionals and available on existing systems. All this will make it possible to create a sandbox for data analysis that was not available before.
"The cloud elasticity", possibility of synchronization of computing and network powers with the stored data will become one of key factors for decrease in an expense of means, experts consider. Therefore, it is expected that the Hadoop platform will be applied more and more actively in cloud solutions against the background of the growing demand for specialized analytics.
The emergence of new Hadoop distributions from the likes of HP, Oracle, SAP, Software AG and Tibco seems very probable. Microsoft, Red Hat, VMware and other operating system vendors may find no reason to refuse to integrate the platform into their own OSes.
An important factor is the availability of qualified personnel capable of working with the Hadoop platform. Experts believe such specialists will appear, and their participation will make faster and more effective implementation of Hadoop projects possible.
Sergey Zablodsky, director of the data warehouse department at IBS, believes: "It is a strong exaggeration to call Hadoop an obligatory component of the IT architecture of most companies. Even if we are talking about the main application of Hadoop, i.e. processing Big Data, not all companies use Hadoop clusters for this purpose. It must be understood that the Big Data problem did not arise yesterday, and Hadoop is not the only tool for working with such data. Today there are a number of successful alternative solutions, both among commercial software and among open source software. Certainly, Hadoop is popular, it has its niche, and its use for solving certain tasks will grow, but I would not call this technology obligatory, since there are Big Data tasks for which Hadoop is not the best choice. For most customers, implementing Hadoop is not a relevant task today and will not become one in the short term. Meanwhile, projects using Hadoop in our market are more likely experiments with the technology and attempts to feel out the limits of its applicability. Some telecom operators and banks are adopting Hadoop rather actively, but these are not always attempts to solve analytical problems. One of the directions of experimentation is cheap storage systems for rarely used unstructured data. In other industries no significant interest in solutions of this class is noticeable yet: for the majority of essential tasks, classical data technologies and hardware and software systems are sufficient."
2010: Apache Hadoop 0.23
Version 0.23 of the Apache Hadoop distributed data processing system, now under development, will be able to run on clusters of 6 thousand machines, each with 16 or more processor cores, and execute up to 10 thousand tasks in parallel, said Arun Murthy, vice president of the Apache Foundation and founder of Hortonworks, at the O'Reilly Strata conference. Version 0.23 is currently at the alpha testing stage; its release should take place this year. Hadoop is already being tested on clusters of 4 thousand machines.
The new version of Hadoop is also expected to add federation support and high-availability features to the HDFS file system. The MapReduce platform on which Hadoop is built will be updated too: its new version, under the working name YARN, offers increased performance, particularly on large clusters.
At the same conference, Hortonworks, which specializes in work with Hadoop systems, and MarkLogic announced integration of their platforms, thanks to which users will be able to combine MapReduce facilities with MarkLogic's indexing and real-time interactive analysis tools.
Gartner analysts are sure that by 2015 two thirds of analytical application packages with advanced analytics functions will in one form or another include the Hadoop cluster DBMS. Although working with Hadoop still requires scarce, highly qualified specialists, organizations are beginning to understand the solution's capabilities for processing Big Data, in particular unstructured data, text, behavior analysis and so on, analysts note. Meanwhile, developers of analytical packages are taking the next step: including Hadoop directly in their packages. This means it is time for companies to think about making their infrastructure compatible with Hadoop, to identify the analytic functions that business units use in their work, and to identify valuable internal projects based on Hadoop.
As its popularity grows, Hadoop will enter, in Gartner terminology, the "trough of disillusionment" phase, when after the initial rise users lose interest in a project that has not achieved the expected results. Further along the technology development curve come the "slope of enlightenment" and the exit to the "plateau of productivity", when the technology becomes mature.
Why Hadoop analytics projects often fail to achieve their goals
Sean Suchter, co-founder and CEO of Pepperdata, a company that creates software for real-time cluster optimization, managed the first commercial Hadoop deployment while working on a team at Yahoo and is familiar with the shortcomings of Hadoop that affect business. To optimize their use of Hadoop, companies should be well acquainted in advance with its most widespread defects, many of which affect reliability, predictability and visibility. Below, based on information from Suchter, we examine the main reasons why Hadoop may not deliver the expected return[3].
Tasks fail or slow down severely
Hadoop deployment often begins in a "sandbox" environment. Over time workloads grow, and a single cluster increasingly takes on the job of supporting production applications governed by service-level agreements. Ad hoc queries and other low-priority tasks compete for system resources with business-critical applications, and as a result high-priority tasks are executed with unacceptable delays.
Inability to track cluster performance in real time
Hadoop's diagnostic tools are static, and their log files provide information about task execution on the cluster only after tasks finish. Hadoop does not make it possible to track at a sufficiently granular level what happens while many tasks are running. As a result, it is difficult and often impossible to take corrective actions that would prevent operational problems before they arise.
Lack of macro-level visibility and control over the cluster
Various Hadoop diagnostic tools make it possible to analyze statistics of individual tasks and to examine the activity of separate cluster nodes. Developers can also tune their code for optimal performance of individual tasks. What is missing, however, is the ability to track, analyze and control what happens to all users, jobs and tasks across the whole cluster, including the use of each hardware resource.
Insufficient ability to set and enforce task priorities
Although task schedulers and resource managers provide basic capabilities, such as task queueing, scheduling by time and events, and node selection, they are insufficient to ensure the most effective use of cluster resources during task execution.
Underutilized and wasted resources
Organizations usually size their clusters for maximum peak loads. Rarely used resources are often expensive and not really necessary.
Insufficient ability to control cluster resource allocation in real time
When non-standard tasks running on a cluster, inefficient or resource-intensive queries, or other processes adversely affect performance, Hadoop operators often cannot take the necessary corrective measures in time to prevent violations of service-level agreements.
Lack of a granular overview of cluster resource use
When tasks fail or finish with a big delay, it is difficult for Hadoop operators and administrators to diagnose performance problems. Hadoop does not provide ways of monitoring and reviewing cluster performance with sufficient context and detail. For example, it is impossible to isolate problems by user, job or task and to identify bottlenecks related to the network, RAM or disk.
Inability to predict when the cluster will run out of capacity
Additional and more diverse tasks, expanding volumes and variety of data types, more complex queries and many other factors load cluster resources more and more over time. And often the need for additional cluster resources is recognized only after an incident (say, the customer-facing website stops working or a crucial report is not generated). The result can be disappointed customers, missed business opportunities, unplanned capital expenditures and more.
Competition between HBase and MapReduce
Competition between HBase and MapReduce tasks for system resources can strongly affect overall performance. The inability to optimize resource use when different types of workloads run simultaneously forces many organizations to incur costs and deploy separate dedicated clusters.
Lack of essential dashboards
Interactive exploration and fast diagnosis of performance problems in a cluster remain unsolved issues. The static reports and detailed log files produced by schedulers and resource managers are not suited to simple and fast diagnosis of problems. Sifting through huge data arrays in search of the causes of malfunctions can take hours or even days. Hadoop operators need the ability to quickly visualize, analyze and determine the causes of performance problems and to find opportunities to optimize resource use.