Raiffeisenbank found an original way to sharply increase the reliability of its IT systems
The growth in the number of clients and the penetration of mobile banking on the one hand, and the growth in data volumes and the variety of ways they are used on the other, inevitably increase the load on systems and, as a result, create the need to scale the bank's IT infrastructure. Mikhail Antonov, head of development of the applied Core Banking platforms at Raiffeisenbank, talks about the development of the bank's platforms and about ways to reduce load and improve reliability.
The bank's centralized automated banking system (ABS) lives in server racks taller than a person. The servers and banking systems must be high-performing, reliable and fault-tolerant: this is one of the main conditions of the bank's survival in the market. Reducing load and scaling our systems is therefore a constant priority for us. But to understand how to protect a system, we first need to understand the nature of the most frequent causes of failures.
Types of load, and what happens when they are combined incorrectly
Let us take as an example one of the banking systems that is part of the ABS complex: it allows the bank to operate 24/7, create transactions, calculate account balances, process debit card transactions in real time and much more.
But before talking about the system itself, we need to take a step back to the basics. There are two main types of database load: OLTP (online transaction processing) and OLAP (online analytical processing).
OLTP is transaction processing in real time; it is characterized by a large number of transactions, processed quickly and written quickly to the database. Common examples of OLTP activity: an account balance request, a card authorization, posting a movement on an account. For reference, the OLTP system we are considering can process 1.5 thousand balance requests and 1.7 thousand card authorizations per second.
OLAP, online analytical processing, is a data processing technology that produces summary (aggregated) information from large volumes of data. These are slow, complex queries that touch a large amount of data.
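To make the contrast concrete, here is a small illustration; the table and column names below are invented for this example and have nothing to do with the bank's actual schema. A typical OLTP query touches a single row by key, while an OLAP query scans and aggregates a large volume of rows.

```java
// Illustrative only: table and column names are hypothetical, not the bank's schema.
public final class QueryExamples {

    // OLTP: a point lookup by primary key, touches a single row and is
    // expected to return in milliseconds even under heavy concurrency.
    static final String OLTP_BALANCE_REQUEST =
        "SELECT balance FROM accounts WHERE account_id = ?";

    // OLAP: an analytical aggregation that scans months of history,
    // reads a large share of the table and keeps the CPU and disks busy.
    static final String OLAP_MONTHLY_REPORT =
        "SELECT client_segment, SUM(amount) AS turnover " +
        "FROM transactions " +
        "WHERE booked_at >= DATE '2020-01-01' " +
        "GROUP BY client_segment";

    private QueryExamples() { }
}
```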
Combining different types of load on one server is a sign of poor application design, because analytical queries easily drive up CPU and disk consumption, often to the point of complete saturation. Meanwhile, the server must always have a power reserve, and its response time must be predictable and minimal, because it determines how quickly a client can make a transaction or get the information they need from a banking system.
A typical case: some non-urgent analytical report is being prepared, significantly raising the load on the system. Because of it, the database stops answering OLTP requests for a period of time. As a result, the mobile bank, for example, may give up waiting for the answer and start sending repeated requests, loading the system even further, or process the information incorrectly, so that the data shown on the mobile application screen is wrong. Services can thus become unavailable to clients for some time.
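The amplification effect can be sketched roughly as follows; the client below is purely hypothetical and is not the bank's implementation. A short timeout combined with blind retries turns every slow answer into several additional requests at exactly the moment the database is already overloaded.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Hypothetical mobile-bank client, for illustration only.
public final class NaiveBalanceClient {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // A short timeout plus blind retries: every slow answer from the backend
    // turns one user action into several extra requests, raising the load
    // exactly when the database is already struggling with an analytical query.
    static String requestBalance(String accountId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://example.invalid/balance/" + accountId))
            .timeout(Duration.ofMillis(500))
            .build();

        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (HttpTimeoutException e) {
                // No backoff, no retry budget: the load on the backend is amplified.
            }
        }
        throw new IllegalStateException("Balance service did not answer in time");
    }

    private NaiveBalanceClient() { }
}
```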
It must also be taken into account that as users move to digital channels, the load on the bank's systems grows constantly. For example, since the beginning of 2020 alone, the number of requests to the core banking subsystems from Raiffeisen Online has grown by a third. That is why reducing load and scaling our systems is a constant priority for us.
What to do?
There are several ways to improve performance and reduce the load on systems or platforms:
1. Buy more resources, more processors. This is the simplest way, but it can hardly be called optimal if the company relies on it alone: each server costs hundreds of thousands of dollars.
2. Optimize the code.
3. Move OLTP load to other, cheaper servers.
4. Move OLAP to a separate server.
At Raiffeisenbank we are, of course, moving in all four directions, but the second and third are the most interesting, because the bank does not incur significant expenses on them, unlike buying new powerful processors, while the efficiency of the existing hardware increases many times over.
Two principles of code optimization
When it comes to improving system performance, the Pareto principle applies: 20% of the effort yields 80% of the result, and the remaining 80% of the effort only 20% of the result. Constantly searching for and fixing weak spots in processes and systems through code optimization is an extremely effective method, and it is exactly those 20% of the effort. A hardware upgrade, i.e. buying new servers, does not always give even a proportional performance improvement, which is why code optimization matters so much.
For example, we noticed unexpected delays in our system when processing card balance requests. These delays in turn affected resource usage, considerably increasing consumption. Searching for a solution together with the vendor and the professional community did not help us cope with the problem, and often even led to worse performance or other errors. Then we tried rewriting 20% of the code, switching from SQL in our programs to the system's proprietary language. The new version immediately halved the CPU load generated by mobile bank requests.
There is also a second rule that should apply if you decide to optimize the code: before putting all your eggs in one basket, make sure you have built a really reliable basket. In other words, you need a reliably working, unoptimized version of the code that will insure you against a collapse if you failed to account for something in your improvements. It must always be on standby and must be enabled by a simple configuration change. Our program was guaranteed to work well, so we could experiment safely.
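A minimal sketch of such an insurance policy, assuming a hypothetical configuration property and interfaces (none of the names come from the bank's code): both code paths stay deployed, and a single configuration change decides which one handles traffic.

```java
// Illustrative sketch: the property name and calculator interfaces are hypothetical.
public final class BalanceService {

    interface BalanceCalculator {
        long balanceFor(String accountId);
    }

    private final BalanceCalculator legacyCalculator;     // proven, unoptimized version
    private final BalanceCalculator optimizedCalculator;  // new, faster version

    BalanceService(BalanceCalculator legacy, BalanceCalculator optimized) {
        this.legacyCalculator = legacy;
        this.optimizedCalculator = optimized;
    }

    long balanceFor(String accountId) {
        // One configuration property decides which path is used, so rolling back
        // to the reliable version is a configuration change, not a redeployment.
        boolean useOptimized = Boolean.parseBoolean(
            System.getProperty("balance.useOptimizedPath", "false"));
        BalanceCalculator active = useOptimized ? optimizedCalculator : legacyCalculator;
        return active.balanceFor(accountId);
    }
}
```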
Faster, more reliable, cheaper: how to cache OLTP requests
The solution we will discuss is directly connected with the access latencies whose table of values every programmer ought to know.
Reading from the processor cache | 7 nanoseconds
Reading 1 MB of data from RAM | 250 microseconds (0.25 milliseconds)
Reading 1 MB of data from an SSD | 1 millisecond
Reading 1 MB from a hard drive | 20 milliseconds
Nor should one forget about network delays: a data packet sent from Moscow to Vladivostok and back travels across the network even longer, about 150-200 milliseconds.
At one time, Raiffeisen Online sent all balance requests directly to the core banking subsystems responsible for calculating balances; they computed the balance, which is stored on hard drives and SSDs, and returned the result. Meanwhile, a client may open the mobile application several times a day, and every time it must show the current balance on cards and accounts. The number of such requests is growing quickly, because the number of clients and the penetration of mobile banking are steadily increasing.
We analyzed the load and realized that more than half of the requests coming to us from the online bank are repeated. This is related to the client's behavior model in the system: active clients log into the online bank several times a day and often switch between screens with lists of accounts.
To support business growth and reduce the load on the disks and CPU of the ABS, we created a new caching layer and placed it between Raiffeisen Online and the system. The cache is an intermediate buffer with fast access to data, located in RAM and containing the information most likely to be requested.
When developing the caching layer we used free open source technologies, the open source distributed database Apache Ignite, and inexpensive hardware. The solution was designed so that it could be scaled further. How does it work?
When a bank client opens the mobile application for the first time in a day, Raiffeisen Online creates a request for the balance of all available accounts and cards. The request goes to the cache, which checks whether this balance is already in fast RAM. Having made sure it is not, it sends a request to the system, which calculates the balance; the cache saves it in memory and returns it to the client. The whole route takes only about 10 milliseconds.
When the client opens the application again, the balance request comes to the cache, the system finds it in RAM and returns it to the client at once. Such a route takes only 1 millisecond. At the same time, the system monitors whether there has been any movement on the account, and if there has, the old balance is removed from memory and the balance request is sent to the backend again. We counted and confirmed that only 15% of requests take the long route, while the other 85% are served from the cache. After launching this solution, the load on the core banking subsystems from Raiffeisen Online fell by roughly a third.
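A minimal sketch of such a caching layer on top of Apache Ignite; the cache name, key format and the core banking client interface are assumptions for illustration, not the bank's actual service. On a miss the balance is fetched from the backend and stored in RAM, on a hit it is returned immediately, and a movement on the account evicts the cached value.

```java
import java.math.BigDecimal;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

// Illustrative sketch only: cache name, key format and the CoreBankingClient
// interface are assumptions, not the bank's actual code.
public final class BalanceCacheLayer {

    interface CoreBankingClient {
        BigDecimal calculateBalance(String accountId); // slow path: hits the backend disks
    }

    private final IgniteCache<String, BigDecimal> balances;
    private final CoreBankingClient coreBanking;

    BalanceCacheLayer(CoreBankingClient coreBanking) {
        Ignite ignite = Ignition.start(); // node with default configuration
        CacheConfiguration<String, BigDecimal> cfg = new CacheConfiguration<>("balances");
        cfg.setCacheMode(CacheMode.PARTITIONED); // data is spread across nodes, so the layer scales out
        cfg.setBackups(1);                       // one backup copy per entry for fault tolerance
        this.balances = ignite.getOrCreateCache(cfg);
        this.coreBanking = coreBanking;
    }

    // Fast path: the value is already in RAM. Slow path: ask the core banking
    // system once and remember the answer for subsequent requests.
    BigDecimal balanceFor(String accountId) {
        BigDecimal cached = balances.get(accountId);
        if (cached != null) {
            return cached;                                           // ~1 ms route
        }
        BigDecimal fresh = coreBanking.calculateBalance(accountId);  // ~10 ms route
        balances.put(accountId, fresh);
        return fresh;
    }

    // Called when a movement on the account is detected: the stale balance is
    // dropped, so the next request goes to the backend again.
    void onAccountMovement(String accountId) {
        balances.remove(accountId);
    }
}
```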
From the system to the user experience
Thus, by creating an additional layer for OLTP load, we reduced the load on the system and gained the ability to scale the new service horizontally. Response speed increased roughly 15-fold, and the cost of the solution, compared with simply expanding the backend infrastructure, was several times lower. Importantly, we also increased the reliability of the whole solution, since lowering the load on the backend provides an additional safety margin.
By reducing the load on banking applications, optimizing the code and moving part of the data into RAM, we can process requests faster, which means the bank's clients see the results of their requests immediately and their user experience improves. Of course, we do not intend to stop here: our plans include further expansion of the caching layer's functionality, moving OLAP load to a separate server, upgrading to more modern hardware, and continuous optimization of our applications.