Customer: Sberbank, Moscow; Financial Services, Investments and Auditing
Product: Apache Hadoop
Project dates: 2016/07 - 2020/12
Number of licenses: 12,500,000
2019
Sberbank Big Data Platform: the bank's platform for working with data
Business challenges addressed by the Data Factory
- Real-time management reporting
- Regulatory and tax reporting
- Mass personalization, secondary sales, AML
- AI-based transaction scoring in consumer lending
- Batch execution of models, a graph platform, hypothesis testing
- A unified view of data quality across the Group
Sberbank selects the developers of its Data Supermarket
At the end of October 2019, Sberbank announced the results of a tender for software development for the Data Cloud system and for data quality control procedures.[1] The total initial value of the contracts was 280 million rubles; following the tender, the bank reduced this amount by roughly 8 million rubles.
The purchase consisted of three lots. The first covered the development of data quality control procedures for the Data Cloud and Analytical Data Store systems (the work was valued at 120 million rubles). The second and third lots covered back-end and front-end development of the Data Supermarket portal (60 and 100 million rubles respectively).
The first lot was won by Techno Diasoft, which offered to fulfill the contract for 112.8 million rubles; after negotiations the total cost of the work was reduced by about 1%, to 111.7 million rubles. I-Teco won the second lot and Technology ADV the third; the total cost of work under these lots changed only slightly.
Development and modification of the application software will be carried out for the Data Management Department (SberData) of the Technologies block. The Analytical Data Store (ADS) system is technically based on the Teradata and Informatica platforms, while the Data Cloud is built on Hadoop technologies (Hive, Spark, Cassandra, HBase, ElasticSearch, Titan, FastGraph, RapidMiner, Impala and others).
As part of developing the data quality control procedures, the contractor will have to define data quality requirements, including quality criteria, data quality metrics and the methods for calculating them. It will also need to analyze IT requirements for data quality with the data warehouse architecture in mind, develop algorithms and implement data quality checks. The quality requirements and checks must be agreed with business customers and data suppliers.
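The tender documentation does not spell out what such checks look like. Purely as an illustration, a completeness metric and a uniqueness metric over a Hive table could be computed with PySpark roughly as follows; the table, column names and threshold are all invented for the example:

```python
# Minimal sketch of two data quality metrics (completeness, uniqueness)
# computed with PySpark. Table, columns and threshold are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("dq-checks")
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("raw.client_accounts")  # hypothetical source table
total = df.count()

metrics = {
    # completeness: share of rows where a mandatory field is filled
    "inn_completeness": df.filter(F.col("inn").isNotNull()).count() / total,
    # uniqueness: share of distinct values in a key column
    "account_uniqueness": df.select("account_id").distinct().count() / total,
}

# reject the load if any metric falls below the agreed threshold
for name, value in metrics.items():
    assert value >= 0.99, f"data quality check failed: {name}={value:.4f}"
```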
The second and third lots cover development of the Data Supermarket portal: the back-end part in one case and the front-end in the other. The rest of the scope is the same for both: analyzing user requirements for new portal functionality, organizing and running workshops and design thinking sessions, developing the portal's business logic, developing test scripts and automated unit tests, and prototyping and developing the UI design.
All work must be completed by the end of 2020.
2018: First results of the Big Data project
In May 2018, Sberbank Technologies experts Boris Rabinovich, Ilya Pozdnyakov and Valery Vybornov prepared an article for TAdviser on how the bank stores and uses its fast-growing data.
Requirements define solutions
The key to success in any business is the ability to define client needs as precisely as possible, and in 2018 there is no longer any need to explain how Big Data helps with this. The volume of data in Sberbank is comparable to that of IT companies such as Twitter, Skype and Snapchat. Our key difference, however, is the complexity of financial products and of the data structures in which they are stored, plus, of course, heightened security requirements: we know that money and personal data are very sensitive topics for our clients.
To solve its various business challenges, Sberbank processes data from a hundred information systems, and the question of using that data more effectively has long been discussed in the bank. In 2011, Sberbank launched a project to build a data warehouse based on the Teradata solution, which made it possible to produce management, financial and regulatory reporting, individual offers for clients, and so on.
Mass penetration of the Internet and the growing popularity of social networks opened up new sources of information and, accordingly, new opportunities for using data. Enriching Sberbank's internal data with external data clearly makes it possible to understand customer needs better, optimize internal processes and much more.
The data warehouse architecture based on the Teradata solution no longer met these new challenges. Above all, it had to support data growth of 10 PB a year and the implementation of tasks based on artificial intelligence. Two years ago the decision was therefore made to change the data warehouse architecture, and the Data Factory program was launched, within which the Data Cloud and Data Laboratory systems are being built on Hadoop.
Why Hadoop?
Hadoop is a technology that can process huge volumes of data and is optimal in terms of price/quality. It is used by the largest companies in the world: Barclays, Lloyds Banking Group, Citi, Deutsche Bank, Google, Amazon and others. In essence, the solution is a construction kit from which a data warehouse can be assembled to fit the needs of the business.
In Sberbank's new architecture, the key requirements for this technology were daily incremental updates from data sources, a unified integrated data model, applied business solutions in the Data Cloud, and an industrial environment for executing AI models. In 2016, building the new architecture was a difficult and ambitious task for the whole Data Factory team, given the data volumes, the maturity of Hadoop at the time and the shortage of specialists.
The Data Cloud as Sberbank's platform for working with data
At SberTech, the Big Data Competence Center is responsible for building the Data Cloud and the Data Laboratory. The Data Cloud, a Big Data cluster running Apache Hadoop on the Cloudera distribution, was created in just two years. First of all, we developed the core infrastructure services covering security, audit, logging and so on. We then solved one of the Data Factory's key tasks: we built tools for incremental data loading from heavily loaded banking systems into the Data Cloud, which allow several tens of TB to be loaded per day.
The Data Cloud: integration
Loading that volume of data and keeping it up to date was a difficult technical task. To solve it, we developed our own product, the Stork replicator, on top of Hadoop ecosystem tools (Apache Spark, Sqoop and others). The idea of the replicator evolved from analyzing the paths by which data is loaded from different sources. At the moment, more than 30 of the bank's key systems are loaded into the Data Cloud through this replicator.
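The article does not describe Stork's internals. As a sketch of the general pattern it implies (an incremental pull driven by a watermark column), a simplified Spark job might look like this; the connection string, table, columns and paths are all invented:

```python
# Simplified sketch of incremental replication from a relational source
# into HDFS with Spark: pull only rows changed since the last load.
# Connection details, table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# watermark of the previous successful load; normally kept in load metadata
last_watermark = "2018-05-01 00:00:00"

increment = (spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//source-db:1521/CORE")
    .option("dbtable",
            f"(SELECT * FROM accounts "
            f"WHERE updated_at > TIMESTAMP '{last_watermark}') t")
    .option("user", "replicator")
    .option("password", "***")
    .load())

# append the new slice to the raw layer, partitioned by load date
(increment
 .withColumn("load_date", F.current_date())
 .write.mode("append")
 .partitionBy("load_date")
 .parquet("hdfs:///data/raw/accounts"))
```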
Besides loading data from classical relational DBMSs, the Data Cloud is integrated with the Platform for Supporting Business Development (PSBD), another large-scale Sberbank program building a new-generation enterprise back-office platform. Unlike the relational source systems, data from PSBD is transferred to the Data Cloud as a continuous stream, implemented with a combination of the stream processing tools Apache Kafka and Spark Streaming.
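As an illustration of such a flow (shown here with Spark's newer Structured Streaming API rather than the DStream API the article most likely refers to; brokers, topic and paths are invented):

```python
# Sketch of a continuous Kafka -> HDFS flow with Spark Structured Streaming.
# Broker addresses, topic name and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("psbd-stream").getOrCreate()

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092")
    .option("subscribe", "psbd.transactions")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

# land the raw events in the cluster; checkpointing gives at-least-once delivery
query = (events.writeStream.format("parquet")
    .option("path", "hdfs:///data/stream/transactions")
    .option("checkpointLocation", "hdfs:///checkpoints/transactions")
    .start())

query.awaitTermination()
```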
Data also reaches the Cloud as plain text files loaded from various sources; this is how the Big Data cluster is enriched with external data.
As of May 2018, the Data Cloud and the flows into it can be described by the following figures:
- 2 PB: total volume of data in the cluster;
- 10 PB: planned growth in data volume during 2018;
- 15.5 TB per day: the daily data increment from current sources;
- 2,000-5,000 transactions per second in the incoming stream;
- 170 cluster nodes are used to process the incoming volume;
- 400 parallel data loading tasks;
- 200 TB: the volume of information updated daily in the replicas of source systems.
A daily data refresh takes about 6 hours, but we aim to bring updates close to real time.
To keep the Data Cloud from turning into a "data swamp", the target architecture provides for a unified integrated data model, which Sberbank calls the Unified Semantic Layer (USL). It is built on the Corporate Data Model, the bank's logical data model. The job of this layer is to shield consumers from "raw" data and to give them access to integrated, consistent, high-quality data; this removes the duplication of data integration work across the bank's divisions and also allows the USL to be used for reporting.
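As an illustration of the idea (not the bank's actual model), a semantic-layer object can be published as a Hive view over the raw layer, so consumers never query raw tables directly; all schema, table and column names below are invented:

```python
# Sketch: publishing a semantic-layer object as a Hive view over raw data,
# so consumers query the integrated layer instead of raw tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usl").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE OR REPLACE VIEW usl.client AS
    SELECT c.client_id,
           c.full_name,
           s.segment_code,          -- reconciled across source systems
           a.active_account_count
    FROM raw.crm_clients c
    JOIN raw.segments s ON s.client_id = c.client_id
    JOIN (SELECT client_id, COUNT(*) AS active_account_count
          FROM raw.accounts WHERE status = 'OPEN'
          GROUP BY client_id) a ON a.client_id = c.client_id
""")
```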
Artificial intelligence
AI transformation is one of the key directions for optimizing Sberbank's processes. Historically, AI and machine learning grew up in the Risks block; by 2018, AI-based solutions are used in practically every area of Sberbank's work, and 259 AI projects had been launched as of the beginning of the year. A few examples of models we created together with colleagues from the bank's divisions:
- predicting the peak days and hours of visits to Sberbank branches, which reduces clients' waiting time in queues and optimizes operators' work schedules;
- predicting cash balances in ATMs, which saves Sberbank considerable money that would otherwise sit in ATMs as dead weight;
- predicting customer churn in Sberbank's corporate block, which makes it possible to minimize churn through proactive actions by client managers (a sketch of such a model follows this list).
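The article gives no details of the churn model. A minimal sketch of how such a model could be trained with Spark MLlib, assuming a hypothetical feature table with invented column names:

```python
# Sketch of a churn-prediction model with Spark MLlib (logistic regression).
# The feature table and its columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("churn").enableHiveSupport().getOrCreate()

data = spark.table("lab.corporate_churn_features")  # label + feature columns
features = ["txn_count_90d", "balance_trend", "products_held", "support_tickets"]

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=features, outputCol="features"),
    LogisticRegression(labelCol="churned", featuresCol="features"),
])

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# validate on the held-out sample before handing the model to production
auc = BinaryClassificationEvaluator(labelCol="churned").evaluate(model.transform(test))
print(f"test AUC = {auc:.3f}")

model.write().overwrite().save("hdfs:///models/corporate_churn")
```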
A proving ground for model execution
To run production models in the Data Cloud, an industrial environment for executing AI models has been created: the batch model execution subsystem (PIM) and libraries of machine learning models. Together, these subsystems make it possible to execute models in the industrial environment on real data within the required SLA. In the near future, together with other initiatives of Sberbank's Data Science community, we will take modeling in Sberbank to a qualitatively new level.
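The article does not describe PIM's internals. As a generic illustration of batch model execution with monitoring output, a scoring job might load a saved pipeline and score a daily snapshot (all paths and table names are invented):

```python
# Sketch of batch model execution: load a saved MLlib pipeline,
# score one day's data and log the run. All names are hypothetical.
import logging
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pim")

spark = SparkSession.builder.appName("pim-batch").enableHiveSupport().getOrCreate()

model = PipelineModel.load("hdfs:///models/corporate_churn")
batch = spark.table("usl.corporate_clients_daily")  # today's feature snapshot

scored = model.transform(batch).select("client_id", "prediction")
scored.write.mode("overwrite").saveAsTable("scores.corporate_churn_daily")

log.info("scored %d clients", scored.count())  # run metric for monitoring/SLA
```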
Before becoming industrial, a model must be developed and validated. For this research, the Data Factory includes the Data Laboratory, based on Hadoop and Cloudera, where data scientists take a model through every stage of its lifecycle. The Data Laboratory works with a subset of data from the Cloud and other sources that is sufficient for experiments; this is where models are developed and analyzed: what data exists, what its quality is, what lies at the intersection of user groups, and so on. The Laboratory occupies one of the key places in Sberbank's technology landscape; in essence, it is a research center in which machine learning models and artificial intelligence systems are born.
As is well known, there is a significant gap between laboratory development and industrial execution of models. In our architecture, PIM closes that gap: the subsystem transfers models from the Data Laboratory into the Data Cloud and meets the operational requirements of model owners and support services for monitoring, logging and execution management. Our next plans include transferring the "model environment" as well, so that a data scientist can move to the industrial environment not only the model itself but also the data structures and data streams it needs to work correctly.
About the future
We plan to develop the Data Cloud subsystems already in place and to tackle new problems. For example, how do we provide replication and data access in near real time? As a rule, changes arriving in the Cloud from source systems are processed periodically and in large portions ("packets"), so until the next processing run, changes in a source system are not reflected in the Cloud. Near-real-time replication and access would cut the delay between a change in a source system and its appearance in the Cloud to a few minutes. Very large graphs are also of special interest to the business: interactive analysis of graph data with billions of links can efficiently solve many problems, from finding affiliated individuals and organizations to product recommendations. We have already built a working prototype and will certainly report on its further development.
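The article names Titan and FastGraph among the Data Cloud's graph technologies but gives no example of the affiliate search. Purely as an illustration, here is how connected groups of parties could be found with the GraphFrames package on Spark; the package choice, tables and columns are all assumptions, not the bank's actual stack:

```python
# Sketch of graph analysis for finding affiliated parties, illustrated with
# the GraphFrames package. Table and column names are hypothetical.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("affiliates").enableHiveSupport().getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///checkpoints/graph")

vertices = spark.table("usl.parties").withColumnRenamed("party_id", "id")
edges = (spark.table("usl.ownership_links")      # who owns / manages whom
         .selectExpr("owner_id AS src", "owned_id AS dst", "share"))

g = GraphFrame(vertices, edges)

# groups of mutually connected parties = candidate affiliation clusters
components = g.connectedComponents()
components.groupBy("component").count().orderBy("count", ascending=False).show(10)
```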
2016: Contractor selection and start of implementation
On June 6, 2016, Sberbank published the list of bids in a tender for the delivery of a distributed system for storing and processing big data on the Hadoop platform.[2]
Against a starting price of 12.5 million rubles, Teradata offered its services for 12.24 million, Glowbyte Consulting for 11.56 million, Huawei for 11.27 million, and AT Consulting for 11.17 million.
The minimum price, 0 rubles, was offered to Sberbank by IBM. The bid review protocol clarifies that the necessary licenses had already been purchased and paid for by the bank under a November 2014 agreement with IBM; at the exchange rate on June 5, 2016, the licenses were worth 3.275 million rubles.
Following the bid review, Teradata was declared the winner, and in October 2016 a contract worth 8.41 million rubles was signed with the company.
What Sberbank is buying
The Hadoop platform has been chosen as Sberbank's standard, according to the tender documentation.
Hadoop is a freely distributed set of software for developing and executing distributed programs that run on clusters of hundreds or thousands of nodes. The system guards against node failures by maintaining several working copies of the data, and it is built on the principle of parallel data processing, which increases processing speed; the volumes of processed information are measured in petabytes. The platform is written in Java.
According to the terms of reference, the system ordered by Sberbank must meet the requirements imposed on solutions for Big Data tasks. It must include the following open-source components for storing and processing data (a short sketch of how several of them combine follows the list):
- HDFS: a distributed fault-tolerant file system;
- YARN: a resource management and computation framework;
- MapReduce: a model for distributed parallel computation;
- Apache Spark: distributed in-memory computation;
- Kafka: guaranteed message delivery;
- Sqoop: data loading between databases and Hadoop;
- Apache Hive: SQL access to data in HDFS;
- Hue: a browser-based interface for data analysis on Hadoop;
- Pig: a platform for analyzing large volumes of data on Hadoop;
- Oozie: a task scheduler.
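To illustrate how several of these components work together (a hedged sketch only; the paths, database and table names are invented), a Spark job can read a file from HDFS, aggregate it in parallel across the cluster and expose the result to SQL users through Hive:

```python
# Sketch: HDFS + Spark + Hive in combination. All names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").enableHiveSupport().getOrCreate()

# HDFS stores the raw file; Spark reads and aggregates it in parallel
ops = spark.read.option("header", True).csv("hdfs:///data/in/operations.csv")
daily = ops.groupBy("op_date").agg(F.sum("amount").alias("turnover"))

# Hive makes the result available to any SQL client
daily.write.mode("overwrite").saveAsTable("demo.daily_turnover")
```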
Under the tender, the bank is purchasing a 61-node system with warranty support. The contracted organization must provide a non-exclusive Hadoop license for a term of at least one year. The winner will sign a framework license agreement with Sberbank, under which the bank may determine the volumes and timing of system purchases at its own discretion.
The tender requirements also include existing implementations of such a system in Russia, at least two implementations in major world banks from the Forbes 500 or Fortune 500 rankings, and at least 200 implementations of the system in total since its market release.
Why Sberbank needs Hadoop
Sberbank told TAdviser that it needs the platform to store and process data of large volume and varied structure. Sberbank representatives explained the choice of Hadoop as the standard by the fact that it organically complements the data storage and processing technologies the bank already has (MPP and relational DBMSs) and brings important capabilities of its own, among them a low total cost of ownership per 1 TB of stored data, thanks to commodity hardware, and the ability to run machine learning across the entire body of stored data.
"This is not about replacement but rather about expanding our technological capabilities. Alongside the MPP and relational DBMSs we already have, we will soon begin using Hadoop together with Spark and other Big Data tools," Sberbank explained to TAdviser.
The tender covers only the Hadoop platform itself, which implements the mechanisms for distributing data across computing nodes, processing it in parallel and assembling the results; all the working logic and the specific analytical algorithms remain to be developed, notes Vladimir Dubinkin, head of the network solutions department at IBS.
"It is only a tool for developers, not an analytical system in itself," the IBS representative says. "The scale of the purchased system is quite considerable: more than 60 nodes, which, given the appropriate hardware resources, allows it to process petabytes of data."
Sberbank representatives have been talking about the advantages of applying Hadoop to banking tasks at industry conferences for about three years, Roman Baranov, head of business intelligence at Croc, reminded TAdviser.
According to Baranov, the functionality of systems built with Hadoop tools can be similar to what is implemented with Cloudera/MapR/HW: clustering of data and producing an optimal offer for the client (Next Best Offer) based on characteristics such as completed purchases, the client's profile and the behavior of similar clients.
Also relevant are credit risk assessment, optimization of cash balances in branches and ATM networks, prediction of ATM failures, and other tasks, he adds.
The field of banking applications for Hadoop is extremely broad, says Andrey Lysenko, marketing director of Aykumen IBS. It supports such key areas as building effective models for assessing individual client and partner risks, detecting fraudulent schemes in transaction and billing channels, and high-precision segmentation of the entire customer base to generate sharply targeted commercial offers and optimize marketing communications.
Beyond these applied uses, bank analysts actively use Hadoop clusters to build test environments for researching new data types, whose variety is growing exponentially, he says.
Expected difficulties in developing Hadoop-based solutions
When implementing Hadoop, Sberbank believes, the bank may face the traditional set of difficulties that come with any new technology: the need to build up competencies, to embed the technology in internal processes, and to integrate it with the existing IT landscape.
Technically, deploying the Hadoop platform is simple and comes down to installing standard modules on the servers of a computing cluster, the experts surveyed by TAdviser note, especially since in this case it is not a bare open-source solution but a specific producer's system backed by vendor support. The tender also sets high requirements for the availability of training courses and for an already fairly large pool of certified specialists in Russia.
For Dubinkin of IBS, the main complexity lies in the subsequent software development for Sberbank's analytical tasks, including optimizing program code to use the platform's hardware resources efficiently. In addition, analyzing large volumes of data has its own specifics and requires specialized professionals, the so-called data scientists, who are still in short supply in Russia.
For now, Hadoop adoption is still fairly limited, and the attendance of every Big Data conference only confirms it, adds Baranov of Croc; the expert could count only about ten open installations.
In the experience of Aykumen IBS, the main deployment difficulties lie in organizing the management, upgrading and health monitoring of Hadoop clusters once the hardware and software complex grows to tens of machines. For example, the free edition of Cloudera Manager no longer allows effective management of a system of more than 30 machines and requires additional expense in the form of paid licensing, Andrey Lysenko says.