Base system (platform): | Apache Hadoop |
Developers: | IBM |
Technology: | BI, Big Data, Data Mining |
At the end of 2011, IBM released the InfoSphere BigInsights and InfoSphere Streams software, which lets clients quickly gain insight into the information streams relevant to their business.
BigInsights overview
BigInsights is a data analysis platform that lets companies turn complex, Internet-scale data sets into knowledge. The platform includes an easy-to-install Apache Hadoop distribution and a set of related tools for application development, data movement and cluster management. Thanks to its simplicity and scalability, Hadoop, an open source implementation of the MapReduce framework, has earned recognition across many industries and scientific fields. In addition to Hadoop, BigInsights includes the following open source technologies (all of them, except Jaql, are Apache Software Foundation projects):
- Pig is a platform built around a high-level language for describing programs that analyze large data sets. Pig includes a compiler that turns Pig applications into sequences of MapReduce jobs that run in the Hadoop environment.
- Hive is a data warehousing solution built on top of Hadoop. It implements the familiar concepts of relational databases: tables, columns and partitions. It also includes a set of SQL statements (HiveQL) for working in the unstructured Hadoop environment. Hive queries are compiled into MapReduce jobs that run in the Hadoop environment.
- Jaql is a query language with an SQL-like interface, developed by IBM and designed for JavaScript Object Notation (JSON). Jaql handles nesting well, is highly function-oriented and is extremely flexible. The language is well suited to working with loosely structured data; it also serves as the interface to the HBase column store and is used for text analysis.
- HBase is a column-oriented, non-SQL data store designed to support large, sparse tables in Hadoop.
- Flume is a distributed, reliable and available service for efficiently moving large volumes of generated data. Flume is well suited to collecting event logs from multiple systems and moving them into the Hadoop Distributed File System (HDFS) as they are generated.
- Lucene is a search engine library that provides high-performance full-text search.
- Avro is a data serialization technology that uses JSON to define data types and protocols. It serializes data in a compact binary format.
- ZooKeeper is a centralized service for maintaining configuration information and naming; it provides distributed synchronization and group services.
- Oozie is a workflow scheduling system for organizing and managing the execution of Apache Hadoop jobs.
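Several of the tools above (Pig, Hive) compile their programs into MapReduce jobs. The following is a minimal, single-process sketch of the MapReduce pattern itself, written in plain Python rather than against the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are illustrative assumptions, not part of BigInsights or Hadoop.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in one process; Hadoop would distribute them."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS.
    sample = ["big data big insights", "data streams"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run on many nodes in parallel and the shuffle step moves data across the network; the sketch only shows how the three phases fit together.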
In addition to the products listed above, the BigInsights distribution includes the following IBM technologies:
- BigSheets is a browser-based, spreadsheet-style interface for exploring and analyzing data that draws on the full power of Hadoop; it lets users collect and analyze data easily. It includes built-in data readers that can handle several common formats, including JSON, CSV (comma-separated values) and TSV (tab-separated values).
- Text analytics is a prebuilt library of text annotators for common business entities. It includes a rich language and tools for building custom location annotators.
- Adaptive MapReduce is a solution developed by IBM Research that speeds up the execution of small MapReduce jobs by changing the way they are processed.
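To make the three formats named for the BigSheets readers concrete, here is a small sketch that loads the same records from JSON, CSV and TSV using only the Python standard library. It is not BigSheets code; the helper names and the sample records are hypothetical and serve only to show how the formats differ on the wire while carrying the same rows.

```python
import csv
import io
import json


def read_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def read_delimited(text, delimiter):
    """Parse CSV or TSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sample data: the same two records in each format.
JSON_TEXT = '[{"user": "ann", "city": "Oslo"}, {"user": "bob", "city": "Rome"}]'
CSV_TEXT = "user,city\nann,Oslo\nbob,Rome\n"
TSV_TEXT = "user\tcity\nann\tOslo\nbob\tRome\n"

if __name__ == "__main__":
    for name, rows in [
        ("JSON", read_json(JSON_TEXT)),
        ("CSV", read_delimited(CSV_TEXT, ",")),
        ("TSV", read_delimited(TSV_TEXT, "\t")),
    ]:
        print(name, rows)
```

Note that CSV and TSV carry every value as text, so numeric columns would need explicit conversion, whereas JSON preserves numbers and nesting.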
InfoSphere platform
InfoSphere is a comprehensive information integration platform that includes tools for data storage and analysis, information integration, master data management, lifecycle management, and data protection and privacy. InfoSphere makes application development more efficient, allowing organizations to save time, reduce integration costs and improve the quality of their information.
BigInsights, as part of the IBM Big Data platform, has integration points with the platform's other components, including data storage and data integration systems, management mechanisms and third-party data analysis tools. BigInsights can also be integrated with the InfoSphere Streams platform.
A new computing paradigm
Stream computing is a new paradigm driven by new data generation scenarios: the ubiquity of mobile devices, location-based services and the widespread use of sensors of all kinds. Together these have created an acute need for scalable computing platforms and parallel architectures capable of processing the huge volumes of streaming data being generated.
BigInsights technologies are not suited to processing streaming data in real time, because they are oriented mainly toward batch processing of static data. With static data, the query "select all users connected to the network" returns a single result set. With streaming data processed in real time, a continuous query can be run instead, for example "select all users connected to the network in the last 10 minutes". Such a query updates its results continuously. In the world of static data, the user searches for the proverbial needle in a haystack; in the world of streaming data, the user searches for that needle while the wind blows the hay off the stack.
The figure illustrates the difference between computations performed on static data and computations performed on streaming data.
With static data (left side of the figure), queries are run against data at rest. With streaming data (right side of the figure), the data flows continuously through fixed, standing queries.
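The contrast between a one-off query over static data and a continuous query over a stream can be sketched as follows. This is a plain Python illustration, not the InfoSphere Streams API: the `static_query` function, the `ContinuousQuery` class and the sample events are hypothetical, and the sketch assumes that events arrive in timestamp order.

```python
from collections import deque


def static_query(connections):
    """One-off query over data at rest: all users that ever connected."""
    return sorted({user for _, user in connections})


class ContinuousQuery:
    """Standing query: users who connected in the last `window_seconds`."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.events = deque()  # (timestamp, user) pairs, oldest first

    def on_event(self, timestamp, user):
        """Feed one connection event and return the refreshed result."""
        self.events.append((timestamp, user))
        # Evict events that have slid out of the time window
        # (assumes timestamps never decrease).
        while self.events and self.events[0][0] <= timestamp - self.window:
            self.events.popleft()
        return sorted({u for _, u in self.events})


if __name__ == "__main__":
    # Hypothetical events: (seconds since start, user name).
    events = [(0, "ann"), (120, "bob"), (660, "cy")]

    print("static:", static_query(events))

    query = ContinuousQuery(window_seconds=600)
    for timestamp, user in events:
        print("t =", timestamp, "->", query.on_event(timestamp, user))
```

The static query produces one answer for the whole data set, while the continuous query produces a fresh answer on every event, dropping users whose last connection is older than the 10-minute window.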
The IBM InfoSphere Streams platform supports real-time processing of streaming data and periodically refreshes the results of continuous queries. The insight an organization needs can be extracted from data streams while they are still in motion.