2010/12/20 18:06:05

The problems of working with extremely large data sets

In this PwC interview, John Parkinson discusses data-processing problems that will become relevant to a growing number of companies in 2011-2015. Parkinson serves as CIO of TransUnion, is chairman of the board and owner of Parkwood Advisors, and previously worked as CIO at Capgemini. In the interview, Parkinson describes TransUnion's need to analyze data with a low level of structure and highlights a number of related technology problems that, in his view, many other companies will face in the near future.

PwC: At TransUnion you have tried many technologies for processing large volumes of data. What do you think of Hadoop and MapReduce?

JP: MapReduce is a very attractive technology for a certain class of computing tasks. If you work with that class of tasks, it makes sense to consider using MapReduce. The main problem with the system, however, is that the number of people who really understand the mathematics underlying MapReduce is far smaller than the number of people trying to figure out what to do with it. The technology has not yet evolved to the point where a typical enterprise technologist can use it with ease.

PwC: What class of tasks do you mean?

JP: MapReduce works best when you need careful fuzzy matching and categorization across large volumes of semistructured data. At TransUnion we spend a lot of time searching through tens and hundreds of billions of data fragments, looking for elements that approximately match a template. For some of our pattern-search algorithms, MapReduce is more efficient than many of the other filters we use. At least in its theoretical formulation, the system supports a high degree of parallelism in task execution, which cannot be said of the other filtering algorithms we apply.
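The kind of approximate template matching described above can be sketched in miniature as a map/reduce pair. This is an illustrative Python sketch only, not TransUnion's code or Hadoop's API: the sample records, the similarity threshold, and the bucketing scheme are all invented for the example.

```python
from difflib import SequenceMatcher
from collections import defaultdict

TEMPLATE = "joe smith main street"  # the pattern we search for, illustrative

def map_phase(records):
    """Map step: score each record against the template, emit (bucket, record)."""
    for rec in records:
        score = SequenceMatcher(None, rec.lower(), TEMPLATE).ratio()
        if score >= 0.6:                 # keep only approximate matches
            yield round(score, 1), rec   # bucket by rounded similarity

def reduce_phase(pairs):
    """Reduce step: group the matched records by similarity bucket."""
    buckets = defaultdict(list)
    for score, rec in pairs:
        buckets[score].append(rec)
    return dict(buckets)

records = [
    "Joe Smith, Main Street 13",
    "Jo Smyth, Main St 31",
    "Alice Jones, Oak Avenue 7",
]
print(reduce_phase(map_phase(records)))
```

Because each record is scored independently in the map step, the work partitions naturally across machines, which is the property the interview credits to MapReduce.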

The open-source stack is fine for experiments, but the problem, as we see it, is that Hadoop is not Google at all; it is only an attempt by a group of clever people to replicate Google's technologies. They have done good work, but, like most open-source software, the system is only about 80 percent finished, and the missing 20 percent is the hardest part. In our experiments we made considerable progress and proved that the computational formulas underlying MapReduce really do work, but the software available today is very unreliable and difficult to operate. There are unfixed bugs, and the software does not run well in production. Moreover, it has a number of mysterious built-in limitations that show up as you scale up the volume and speed of computation.

We ran into a number of problems using the HDFS/Hadoop/HBase stack for tasks that, according to the available documentation, it should have handled. In practice, the limitations built into the code caused failures long before what we would consider a reasonable theoretical limit. Of course, having the source code is a plus. But it is also a minus. You need the source code to work with software like this, yet that is not at all what we want to spend our daily effort on. I have many good engineers, but I do not want them spending all their time supporting a product that should fit into our architecture in finished form. Yes, the product has real potential, but a lot of time will pass before it becomes stable enough for me to bet on it.

PwC: Over the last few years, the price of storage hardware has fallen considerably. If the data is not mission-critical, how can a company be sure it is not spending more on storage than it needs to?

JP: We are perhaps not a typical example, since data analysis is precisely our business. We are prepared to pay almost any price for the ability to get more accurate and faster answers, because we build those costs into the price of our services. The problem today is that the latest tools do not always work as they should. That applies to hardware and software alike. Many vendors stop testing their products at 80 or 85 percent of theoretical readiness. In production we load them to 110 percent of theoretical capacity, and they fail. I am not worried about tactical spending on technologies I expect to replace quickly; such costs arise constantly. But if I pay money, I expect the thing to work, and too often it turns out that it does not.

PwC: Are you forced to use only proven technologies out of concern about going beyond the limits of reliability?

JP: My dilemma is that technologies that are already proven usually cannot support the scale we need, in terms of speed or data-processing volume. I am forced to invest time, energy, and dollars in technologies that are not yet proven but that, from an architectural standpoint, can deliver sufficient performance. If the option I choose does not work out or fails, I can replace it fairly easily with something else. That is why we prefer appliances. As long as they work well at the network layer and have a standard, well-understood interface, it does not matter much if in a year and a half or two I have to abandon one of them in favor of something new. I cannot do this with every element, but I can afford to do it in the areas where there is no established commercial alternative.

PwC: Do you use something instead of Hadoop?

JP: In essence we apply a search method. We use Ab Initio, which is a very good system for parallelizing search tasks. Ab Initio's parallelization has particular properties, taking, transforming, and executing, that let me shatter a task into pieces.
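The partition/transform/execute idea behind shattering a task can be sketched as follows: split the records by a key, run the same filter over each partition independently, then merge the results. The function names and data here are illustrative, not Ab Initio's API.

```python
from collections import defaultdict

def partition(records, key):
    """Shatter the input: group records by a partitioning key."""
    parts = defaultdict(list)
    for r in records:
        parts[key(r)].append(r)
    return parts

def transform(part, predicate):
    """Run the same filter over one partition; partitions are independent."""
    return [r for r in part if predicate(r)]

records = ["alpha-1", "beta-2", "alpha-3", "beta-4"]
parts = partition(records, key=lambda r: r.split("-")[0])

hits = []
for name, part in parts.items():   # each iteration could run on its own node
    hits.extend(transform(part, lambda r: r.endswith(("1", "4"))))
print(sorted(hits))                # → ['alpha-1', 'beta-4']
```

Because no partition reads another partition's data, the transform step scales out horizontally, which is the property the interview attributes to Ab Initio's parallelization.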

PwC: Most of the data you work with relates to transactions. Is it all structured data, or do you also have to parse text?

JP: In fact, we deal with three types of data. There is accounts-receivable data from lending organizations; this is data on clients' personal spending. There is public data from government bodies, for example data on bankruptcies, court records, and lien records, which is partially structured text. And there is data carrying various supplementary information, usually aggregated around well-known sets of identifiers. The cost of that data is almost zero; we do not pay for it. But it is very noisy, so we spend computing power on understanding whether the data is useful to us and on finding its place in the working arrays we maintain.

TransUnion receives 100 million credit-file updates a year. We update a large data warehouse containing all the financial and related information. In addition, every day we generate from 1 to 20 temporary stores, on which our actual work is based. Our products integrate what we call indicative data, the information that identifies a specific person; structured data derived from transactional records; and unstructured data tied to descriptors. We keep storing these information products as we work, because the data can change every day, sometimes several times a day.

One of our tasks is to determine precisely where each fragment of data belongs. For example, we have a Joe Smith living at 13 Main Street and a Joe Smith living at 31 Main Street. Are these two different Joe Smiths, or is it just a typo? We have to make such decisions about 100 million times a day, using a number of specialized template-search and probabilistic algorithms.
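A toy version of such a probabilistic decision can be written in a few lines: score the similarity of the identifying fields and turn the combined score into a verdict. This is a sketch under invented assumptions; the equal field weights and the 0.9/0.7 thresholds are illustrative, not TransUnion's algorithm.

```python
from difflib import SequenceMatcher

def match_probability(rec_a, rec_b):
    """Crude match score: average string similarity of name and address fields."""
    name_sim = SequenceMatcher(None, rec_a["name"].lower(),
                               rec_b["name"].lower()).ratio()
    addr_sim = SequenceMatcher(None, rec_a["address"].lower(),
                               rec_b["address"].lower()).ratio()
    return 0.5 * name_sim + 0.5 * addr_sim  # assumed equal weights

a = {"name": "Joe Smith", "address": "13 Main Street"}
b = {"name": "Joe Smith", "address": "31 Main Street"}

p = match_probability(a, b)
verdict = ("likely same person" if p > 0.9
           else "needs review" if p > 0.7
           else "different people")
print(round(p, 2), verdict)
```

Here the identical names and the near-identical addresses (a single transposed digit) push the score above the "likely same person" threshold, which matches the intuition that 13 versus 31 is probably a typo.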

PwC: Which of these three data types is the hardest to work with?

JP: We face two kinds of difficulty. The first arises purely from the scale of our work. We add roughly half a terabyte a month to the credit-data file. Everything we do runs into difficulties of volume, update rate, speed, or database performance. For hardware and software vendors we are both a gift and a curse. We are now where the whole industry is heading, where every company will be in two to five years. We are a good signpost for the industry's direction, but at the same time we constantly push vendors' hardware and software to failure. The second difficulty is the constantly growing share of unstructured data.

PwC: Unstructured data is harder to work with because it arrives from many different sources and in many different formats, isn't it?

JP: Yes. We have 83,000 data sources. Not all of them deliver data every day. The data arrives in roughly 4,000 formats, despite the fact that we have our own information-exchange standards. To process the data quickly enough, we have to convert all of it into the single interchange format we use within the company. All of that involves complex computational problems.
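The conversion step described above, many inbound formats funneled into one canonical interchange format, is commonly implemented as a registry of per-format normalizers. A minimal sketch, with a hypothetical canonical record and two invented inbound formats standing in for the thousands the interview mentions:

```python
import csv
import io
import json

# Hypothetical canonical record: every inbound format is reduced to these fields.
CANONICAL_FIELDS = {"source_id", "name", "amount"}

def from_json_v1(raw):
    """Normalizer for an invented JSON feed."""
    rec = json.loads(raw)
    return {"source_id": rec["src"], "name": rec["customer"],
            "amount": float(rec["amt"])}

def from_csv_v2(raw):
    """Normalizer for an invented CSV feed: source_id,name,amount."""
    row = next(csv.reader(io.StringIO(raw)))
    return {"source_id": row[0], "name": row[1], "amount": float(row[2])}

# Registry: one normalizer per inbound format (in reality, thousands).
NORMALIZERS = {"json_v1": from_json_v1, "csv_v2": from_csv_v2}

def normalize(fmt, raw):
    """Convert one raw record into the canonical interchange format."""
    rec = NORMALIZERS[fmt](raw)
    assert set(rec) == CANONICAL_FIELDS, "normalizer broke the canonical schema"
    return rec

print(normalize("json_v1", '{"src": "s1", "customer": "Joe Smith", "amt": 10.5}'))
print(normalize("csv_v2", "s2,Ann Lee,3.0"))
```

The design point is that downstream code only ever sees the canonical schema; adding a 4,001st source format means writing one more normalizer, not touching the pipeline.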

PwC: Are these the data-processing problems that companies in other industries will face in three to five years?

JP: I think so.

PwC: What other problems do you expect to become widespread?

JP: There are several simple practical examples. In total we manage about 8.5 petabytes of data. When your data volumes considerably exceed 100 terabytes, you have to replace the storage devices every four to five years. Moving 100 terabytes of data is a huge physical task that takes a lot of time. Rising connection speeds help a little, but arrays can move data only as fast as it can be read and written, and that exchange rate cannot be increased. Companies below us on the complexity scale cannot imagine that a data turnover cycle can take a month. Granted, the turnover cycle for the computers themselves can take months, but each individual piece takes only a couple of hours. When I move data from one array to another, I can stop only when the process is fully complete. And on top of that I have to deal with bugs and stability problems.
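The physical limit described here is easy to see with back-of-the-envelope arithmetic. The throughput figure below is an assumption for illustration, not a number from the interview:

```python
TB = 10**12  # bytes in a terabyte (decimal)

def transfer_days(volume_tb, throughput_mb_s):
    """Days of nonstop copying to move volume_tb at a sustained MB/s rate."""
    seconds = volume_tb * TB / (throughput_mb_s * 10**6)
    return seconds / 86400

# Assuming 200 MB/s of sustained read+write, moving 100 TB takes
# roughly 5.8 days of uninterrupted copying.
print(round(transfer_days(100, 200), 1))  # → 5.8
```

And since the copy cannot safely stop mid-way, as the interview notes, those days are one indivisible operational window, which is why migrations at this scale stretch into month-long turnover cycles.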

Today TransUnion does not face a data-backup problem, because we continuously back up each new layer of data. What we do face is a data-recovery problem. Restoring substantial volumes of data, which we have to do periodically, can take up to several days, because the physical constraints of the technologies we use do not allow it to be done faster. The average IT department does not face such problems. But take the volume of data an average IT department manages, multiply it many times over, and recovery becomes a vital issue.

We would like to see compression algorithms that are more computationally efficient, because my two main cost groups are data storage and data movement. Today I have no problem with computing power, but if I cannot change the growth trend of storage and movement costs, in a few years I will. To complete computations in the desired time, I have to run them in parallel. But beyond a certain limit the parallelization stops, because I cannot move the data any further.
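One way to see the limit Parkinson describes is through an Amdahl-style model: if some fraction of the job, here the data movement, cannot be parallelized, the speedup flattens out no matter how many workers are added. The 5 percent serial fraction below is purely illustrative, not a TransUnion figure:

```python
def speedup(workers, serial_fraction):
    """Amdahl's law: overall speedup with a fixed non-parallelizable fraction."""
    return 1 / (serial_fraction + (1 - serial_fraction) / workers)

# With 5% of the work stuck moving data serially, speedup caps near 20x:
for n in (10, 100, 1000):
    print(n, round(speedup(n, 0.05), 1))
```

Going from 100 workers to 1,000 barely helps, which is the practical meaning of "beyond a certain limit the parallelization stops."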

PwC: Cloudera [a developer of Hadoop-based solutions] would suggest moving the computation to the data.

JP: That suits only certain types of data. We already do more and more of our distributed computing on top of a file system rather than databases. We also spend compute cycles compressing data so that we move fewer bits: we extract the data, compute on it, and compress it again to save storage space. Operating the world's fourth-largest commercial GPFS cluster [General Parallel File System, IBM's distributed file system], we found that beyond a certain size the parallelization management tools simply stop working. That is why I maintain that Google does not run on Hadoop. Perhaps the Google team has solved this problem, but if so, they are not going to tell us how they did it.