Base system (platform): Apache Hadoop
Developers: IBM
Technology: BI
IBM has released new software, InfoSphere BigInsights and InfoSphere Streams, that lets clients quickly gain insight into the information streams relevant to their business. The software, which draws on more than 50 patents, analyzes the traditional structured data found in databases as well as unstructured data such as text, video, audio, images, social media content, and the click streams that record users' paths through websites, giving decision makers the ability to act on this information with unprecedented efficiency.
InfoSphere Streams, which also originated in IBM Research, analyzes data in real time as it arrives in the organization, monitoring it for any changes that may signal a new pattern, structure, or trend. This capability allows organizations to spot important information in time, make better-informed decisions, and react quickly to events as they unfold.
New improvements in InfoSphere Streams make it possible to analyze diverse Big Data (tweets, blog posts and comments, video frames, electrocardiogram readings, GPS coordinates, sensor data, and stock exchange feeds) 350% faster than before. InfoSphere BigInsights complements InfoSphere Streams by applying analytics both to the organization's accumulated data and to the data flowing through Streams. Together they form a continuous analysis loop that grows more powerful as the volume of available data increases, feeding real-time analysis results back into models so that they can be refined.
A long-standing supporter of open source software, IBM chose the Hadoop project as the cornerstone of its Big Data strategy. While continuing to focus on advanced analytical solutions for the enterprise, IBM builds on these open technologies and extends them with the improved management, security, and reliability that business requires. Hadoop's ability to process a broad spectrum of data types across numerous computing platforms, combined with IBM's analytical capabilities, lets clients address today's growing Big Data challenges. IBM's portfolio of Hadoop-based offerings includes IBM Cognos Consumer Insight, which integrates social media content with traditional business intelligence, and IBM Coremetrics Explore, which segments consumer behavior models and makes it possible to explore mobile data in detail. In addition, Hadoop is the core software infrastructure that the IBM Watson computing system uses to distribute its information-processing workload, supporting the system's breakthrough ability to understand natural language and give specific answers to questions at very high speed.
Streams is both a development environment and a runtime environment for applications that work with streaming data. The product includes sets of libraries (toolkits) for building analytical applications that process different data types: financial, text, video, audio. For example, using the appropriate toolkit it is possible to write a Streams application that processes video camera feeds, compares every detected face against a given database, and performs some action when a match is found. An application built with the audio toolkit could, for instance, separate a human voice from the noise of the sea. The financial toolkit makes it possible to parse the formats used for this type of data. Streams provides development tools for the Streams Processing Language (SPL), a special-purpose language created at IBM. After an SPL program has been written and debugged, it is moved to the Streams production environment, and at this point the optimization technology comes into play. When writing the program, the developer does not know in what environment it will run. Depending on the intensity of the data stream, the application may be deployed on a single notebook or on a cluster of a hundred powerful machines processing data in parallel. The elegance of IBM's solution is that a program written in a declarative language is transformed into efficient machine code. Streams first receives the task of performing certain data transformations, and then the information that those transformations must be applied, for example, to a data stream of 1 TB per hour; the application is then deployed on a hardware configuration that is optimal for that processing rate.
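To make this separation of processing logic from deployment scale concrete, here is a minimal plain-Java sketch (not the InfoSphere Streams runtime; the parallelism property, class name, and sample data are invented for the example) in which the same per-record transformation runs unchanged whether the deployment-time parallelism is one thread or many:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ScaleOutSketch {

    // The transformation itself: written once, independent of where it runs.
    static String transform(String record) {
        return record.trim().toUpperCase();
    }

    public static void main(String[] args) throws InterruptedException {
        // Deployment-time decision: 1 thread on a notebook, many on a cluster node.
        int parallelism = Integer.parseInt(System.getProperty("parallelism", "1"));
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);

        List<String> incoming = List.of(" alpha ", " beta ", " gamma ");
        for (String record : incoming) {
            pool.submit(() -> System.out.println(transform(record)));
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

In Streams, the analogous decision is made by the platform itself, based on the declared transformations and the expected stream intensity, rather than by a command-line property as in this sketch.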
Description
The platform is useful to business because it delivers the needed data promptly and performs well in applications that are sensitive to response time, such as fraud detection or network management. InfoSphere Streams can also combine streams, helping to extract new knowledge from several streams at once, as shown in the figure.
Main tasks of InfoSphere Streams
- Responding quickly to events and to changing business conditions and requirements.
- Supporting continuous data analysis at speeds that exceed those of existing systems by orders of magnitude.
- Adapting quickly to changes in data formats and types.
- Providing high availability, management of heterogeneous data, and support for a new streaming paradigm.
- Ensuring the protection and confidentiality of information made available for shared access.
InfoSphere Streams includes a programming model and an integrated development environment for defining data sources and analytical software modules: operators, which are combined into executable processing elements. InfoSphere Streams provides the infrastructure that turns these components into scalable stream processing applications.
Principal components of the platform
- Runtime environment: includes standardized services and a scheduler for deploying and monitoring Streams applications on one or more integrated nodes.
- Programming model: Streams applications are created in the declarative Streams Processing Language (SPL). The developer describes the required result, and the runtime environment is responsible for choosing the best way to process the request. In this model a Streams application is represented as a graph consisting of operators and the streams that connect them (a minimal plain-Java analogue of such a graph is sketched after this list).
- Monitoring tools and administration interfaces: Streams applications move data far faster than ordinary operating system monitors can track, so InfoSphere Streams ships with tools designed for this environment.
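As a rough illustration of the operator-and-stream graph described above, the following plain-Java sketch connects three hand-written "operators" (source, filter, sink) with blocking queues standing in for streams. It is a conceptual analogue only, not the SPL runtime or its API, and all names and sample tuples are invented for the example:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class OperatorGraphSketch {

    public static void main(String[] args) throws InterruptedException {
        // Queues stand in for the streams that connect the operators.
        BlockingQueue<String> sourceOut = new ArrayBlockingQueue<>(100);
        BlockingQueue<String> filterOut = new ArrayBlockingQueue<>(100);

        // Source operator: produces tuples, then a terminating marker.
        Thread source = new Thread(() -> {
            for (String t : new String[] {"ok:1", "spam:2", "ok:3", "STOP"}) {
                try { sourceOut.put(t); } catch (InterruptedException e) { return; }
            }
        });

        // Filter operator: forwards only the tuples of interest.
        Thread filter = new Thread(() -> {
            try {
                String t;
                while (!(t = sourceOut.take()).equals("STOP")) {
                    if (t.startsWith("ok")) filterOut.put(t);
                }
                filterOut.put("STOP");
            } catch (InterruptedException e) { /* stop */ }
        });

        // Sink operator: consumes the filtered stream.
        Thread sink = new Thread(() -> {
            try {
                String t;
                while (!(t = filterOut.take()).equals("STOP")) {
                    System.out.println("sink received " + t);
                }
            } catch (InterruptedException e) { /* stop */ }
        });

        source.start(); filter.start(); sink.start();
        source.join(); filter.join(); sink.join();
    }
}
```

In SPL the same topology would simply be declared, and the runtime would decide how the operators are grouped into processing elements and distributed across nodes.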
The Streams Processing Language
SPL, the programming language of InfoSphere Streams, is a language for creating distributed stream processing applications. It is a full-featured programming language, comparable to C++ or Java, and supports user-defined data types. Custom functions can be written either in SPL or in C++ or Java, while user-defined operators can be written in C++ or Java.
A continuously running InfoSphere Streams application is described by a directed graph of connected operators working with several data streams. Data streams can enter the system from outside or be generated by applications within it. SPL applications consist of the following main components:
- Stream: an infinite sequence of structured tuples. A stream can be processed by operators tuple by tuple or on the basis of a defined window.
- Tuple: a structured list of attributes and their types. Each tuple in a stream conforms to the schema defined by the stream type.
- Stream type: defines the name and data type of each attribute in the tuple.
- Window: a finite, consecutive group of tuples. A window can be based on a count, on time, on an attribute value, or on punctuation markers (a plain-Java sketch of a count-based window follows this list).
- Operator: the fundamental building block of SPL. Operators process the data in streams and can produce new streams.
- Processing element: the basic executable unit; it can consist of one or several operators.
- Job: a Streams application deployed for execution; it consists of one or more processing elements. In addition to the processing elements, the SPL compiler generates a file in the Application Description Language (ADL) describing the structure of the application. The ADL file contains detailed information about each processing element: which binary file must be loaded and executed, scheduler constraints, descriptions of stream formats, and the operator's internal dataflow graph.
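To illustrate the window concept only (this is plain Java, not SPL syntax), here is a minimal sketch of a count-based tumbling window that aggregates every three tuples and then starts over; the sample readings and class name are invented for the example, and SPL itself also supports sliding, time-based, attribute-based, and punctuation-based windows:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CountWindowSketch {

    public static void main(String[] args) {
        double[] readings = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0};
        int windowSize = 3;

        // Count-based tumbling window: aggregate and clear every 3 tuples.
        Deque<Double> window = new ArrayDeque<>();
        for (double value : readings) {
            window.addLast(value);
            if (window.size() == windowSize) {
                double sum = window.stream().mapToDouble(Double::doubleValue).sum();
                System.out.printf("window average = %.2f%n", sum / windowSize);
                window.clear();   // tumbling: the next window starts from scratch
            }
        }
    }
}
```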
The figure shows the life cycle of an SPL application in the InfoSphere Streams environment.
Life cycle of an InfoSphere Streams application
Development environment
InfoSphere Streams includes a development environment consisting of the Eclipse IDE, Streams Live Graph, and the Streams Debugger. In addition, the platform ships with toolkits that speed up and simplify the development of solutions for particular types of tasks or domains:
- Standard toolkit: contains the operators delivered with the product by default
- Interface (adapter) operators
- Utility operators
- Internet toolkit
- Database toolkit: supports various DBMSs, including DB2, Netezza, Oracle Database, SQL Server, and MySQL (a generic JDBC sketch of this kind of integration follows this list).
- Other built-in toolkits for financial data, text, and Big Data, as well as for deep data analysis (data mining).
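As a generic illustration of what the database toolkit is used for, the following plain-Java sketch writes stream results into a relational table over JDBC. It does not use the toolkit's own operators; the connection URL, table, columns, and credentials are hypothetical, and an appropriate JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DatabaseSinkSketch {

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details; any JDBC-capable DBMS would do.
        String url = "jdbc:mysql://localhost:3306/streamsdemo";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO readings (sensor_id, value) VALUES (?, ?)")) {

            // Pretend these rows arrived from a stream processing application.
            String[][] tuples = {{"s1", "42.0"}, {"s2", "17.5"}};
            for (String[] t : tuples) {
                insert.setString(1, t[0]);
                insert.setDouble(2, Double.parseDouble(t[1]));
                insert.executeUpdate();
            }
        }
    }
}
```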
Integration and interaction of BigInsights and InfoSphere Streams
Companies that generate large volumes of information try to solve the data analysis problem for two important reasons:
- the need to recognize and respond to emerging events in a timely manner
- the need to forecast actions on the basis of accumulated information.
These reasons create the need for several capabilities: transparent work with data in motion (continuous data), analysis of data at rest (accumulated data), and work with huge volumes of diverse data in motion. Integrating IBM InfoSphere Streams (data in motion) with BigInsights (data at rest) suits the following scenarios:
Scalable data ingest: continuous transfer of data from Streams into BigInsights. For example, unstructured text data from social networks (such as Twitter and Facebook) is typically processed to learn about opinions or trends. In this case it is far more effective to extract the needed data as it arrives and to discard unneeded data (for example, spam) at the earliest stages. Such integration lets companies avoid the excess cost of storing huge volumes of unneeded information.
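A minimal plain-Java sketch of this filter-before-store pattern, assuming a trivial keyword-based spam check and a local file standing in for the BigInsights store (the messages, file name, and class name are invented for the example):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class FilteredIngestSketch {

    // Crude stand-in for a spam detector running on data in motion.
    static boolean isSpam(String message) {
        return message.toLowerCase().contains("buy now");
    }

    public static void main(String[] args) throws IOException {
        List<String> incoming = List.of(
                "great keynote today",
                "BUY NOW cheap watches",
                "traffic is terrible downtown");

        Path archive = Path.of("ingested.txt");   // stand-in for the at-rest store
        for (String message : incoming) {
            if (isSpam(message)) {
                continue;                          // discarded early, never stored
            }
            Files.writeString(archive, message + System.lineSeparator(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```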
Enhancement and enrichment: historical context generated by BigInsights is used for advanced analysis and enrichment of the data entering Streams. BigInsights can analyze combined data obtained from different dynamic and static sources over a long time frame; the results of this analysis provide the content for various real-time analysis methods and can be brought up to date as required. Returning to social networks: a Twitter message contains only the identifier of the person who wrote it, but accumulated historical data can supplement this information with additional attributes (for example, the likely cause of the message), making it possible to analyze the data at a finer level and to react appropriately to the user's mood. Adaptive analytic models are models generated in BigInsights during analysis (these can be data mining, machine learning, or statistical models). Such models can serve as the basis for analyzing the data entering Streams and can be updated from observations made in real time.
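A minimal plain-Java sketch of this enrichment idea, assuming the historical context has already been reduced to a simple lookup table of user attributes (the identifiers, profile strings, and class name are invented for the example):

```java
import java.util.Map;

public class EnrichmentSketch {

    public static void main(String[] args) {
        // Historical context, as it might be precomputed by a batch platform.
        Map<String, String> userProfiles = Map.of(
                "user42", "frequent complainer about delivery delays",
                "user77", "long-time fan of the product line");

        // Incoming messages carry only the author's identifier and the text.
        String[][] incoming = {
                {"user42", "my parcel is late again"},
                {"user99", "first time trying this service"}};

        for (String[] message : incoming) {
            String profile = userProfiles.getOrDefault(message[0], "no history");
            // The enriched tuple now carries context for downstream analysis.
            System.out.println(message[1] + " [author context: " + profile + "]");
        }
    }
}
```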
Data "in the movement" and data "at rest" - a part of the IBM Big Data platform, it is possible to integrate using three main types of components:
- Common analytics: the same data analysis routines can be used both in Streams and in BigInsights.
- Common data formats: Streams formatting operators can transform data from the Streams tuple format into the formats used in BigInsights (a minimal conversion sketch follows this list).
- Data exchange adapters: adapters can be used to exchange data with BigInsights.
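As a rough illustration of such a format conversion (not the actual Streams formatting operators), here is a plain-Java sketch that flattens a structured tuple into a CSV line suitable for a file-based store; the attribute names and values are invented for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class TupleToCsvSketch {

    // Convert one structured tuple into a flat CSV line for the at-rest store.
    static String toCsv(Map<String, Object> tuple) {
        return tuple.values().stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("sensorId", "s1");
        tuple.put("timestamp", 1700000000L);
        tuple.put("value", 42.0);

        System.out.println(toCsv(tuple));   // -> s1,1700000000,42.0
    }
}
```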