Customer: Hotels.ru
Product: IT-Grad Cloud IaaS
Based on: VMware Cloud Director
Second product: Zabbix, a system for monitoring networks and applications
Project dates: 2014/03 - 2014/09
Choosing a cloud future
The Hotels.ru database grows by roughly 50% a year, along with the number of unique users. That means we regularly have to add performance, which is hard from an administrative point of view. Since we were choosing a new "home" for the service, it made sense to make scaling flexibility the main selection criterion.
Field tests showed that comfortable access speed and recalculation of options for the user require about 1.5k IOPS from the disk system. Moreover, that figure has to be guaranteed in order to keep the site's response stable. At this stage almost half of the offers were eliminated, because the providers tried to avoid specific guarantees on input-output operations.
The list of candidate clouds was also narrowed by geography. We were not prepared to sacrifice speed for the prospect of hosting abroad (and, as recent trends have shown, that was the right call). As for reliability, calculating the possible losses showed that an SLA of 99.5% or higher would be optimal for our service.
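To make the 99.5% figure more tangible, here is a minimal sketch that converts an availability percentage into a downtime budget. The function name and the 365-day year are assumptions made for illustration; the loss calculation mentioned above is not reproduced here.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored (an assumption)


def allowed_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime, in hours, that an availability SLA permits over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * (100 - sla_percent) / 100


# Compare a few common SLA levels side by side.
for sla in (99.0, 99.5, 99.9):
    print(f"{sla}% -> {allowed_downtime_hours(sla):.1f} h/year")
```

At 99.5% this works out to roughly 43.8 hours of permitted downtime per year, or about 3.6 hours in a 30-day month.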
How the service is built
We built Hotels.ru for Russian users, so in terms of response time and availability Russia had priority. The farther the web servers are from the end user, the longer the user has to wait for pages to load. We therefore chose a Russian hosting platform with access to the region's backbone channels. In the end the choice fell on a cloud data center with a direct connection to the M9 traffic exchange point. This minimized the impact of network overhead on the user experience.
When building a high-load system, it makes sense to choose a simple configuration with no unnecessary add-ons or complications. The service consists of two main parts: the web application and the database. We settled on the Linux platform and a PHP application with a MySQL database.
Of course, hundreds of thousands of simultaneous connections from across the country generate a serious load. The application is therefore deployed on a virtual farm of web servers, with the load balanced by means of vSphere. The database gets a separate large virtual machine, protected against failures by vSphere High Availability. Responsibility for the reliability of the IT infrastructure rests entirely with the cloud provider under the corresponding SLA.
We thought long and hard about the disaster tolerance of the solution; it is a popular topic right now. We decided not to "do what everyone does" but to approach the question with a clear head and a calculator. After estimating the losses from a hypothetical disaster, we concluded that they become noticeable for the company only after a day of unavailability. With an RTO of one day there is no need for replication systems or anything similar, so we limited ourselves to backups and a reserve site.
The scheme needs a little explanation here. Backups are made several times a day, and the reserve site exists in the form of an agreement with another data center to transfer the data there within a reasonable time. As for network connectivity, we rely on BGP, so a mismatch between the domain name and the IP address after a move will not affect us. In five years of operation the service has not had a real large-scale disaster (knock on wood), but in tests the scheme has proved quite reliable. During the latest such rehearsal the whole process took less than 16 hours.
Colleagues often ask what it is like to work with an IaaS cloud, and whether it is uncomfortable that the data is "with someone out there and not close at hand". Yes, we once thought about that and worried too, but the issue was resolved by encrypting the truly critical information. Moreover, we simply do not store credit card numbers or other "hot" data. When everything possible has been done to eliminate a risk, there is no point in worrying. So we sleep peacefully and do not suffer from nightmares.
But the real charm of the cloud shows when traffic to the site increases tenfold before the holidays and capacity has to be raised urgently. In the early days of the service this caused natural panic and an all-hands emergency; now the issue is resolved by adding virtual processors and IOPS for the disk system. Once the peak has passed and the forecasts do not predict a repeat, we release the temporarily allocated capacity so as not to overpay.
A little reflection and IT philosophy
Our business is built entirely on IT, and that is not just a figure of speech. More than that, we need to use everything that is new and promising to remain visible and useful to the user. The point is that the quality bar for a popular web service is now extremely high. A good-looking design used to be enough, but today's users can tell good interfaces from bad ones, convenient menus from awkward ones, and so on. Given the strong competition in the market, we have to not just meet the demands of the time but exceed them.
In practice this turns into full-time work on the frontend and a search for ways to make the service even better and more efficient. Prototypes of future versions and trials of new features often require separate test environments of comparable performance, and we shudder to remember the self-assembled "servers" and the night shifts spent getting them to work. Progress has, of course, spoiled us here: click the mouse and you get a whole farm for tests. There is also no need to go to management for a budget for test hardware, and later to wonder where to put the obsolete, written-off equipment. Beautiful.
We spend a lot of time and heated argument on discussing a new idea or users' wishes. A curious idea from one of the testers can easily spark a long discussion and a desire to try it out. Fortunately, clever people invented A/B testing, and we often roll out a new feature to half of our users. Then we study the statistics charts and draw conclusions.
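One common way to show a feature to half of the users is to assign each user to a group deterministically from a hash of their identifier. The sketch below illustrates that general technique only; the article does not say how Hotels.ru splits its audience, and the function name, the experiment label and the user ID format are all hypothetical.

```python
import hashlib


def ab_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    The same (experiment, user_id) pair always lands in the same group,
    so a returning user keeps seeing the same variant.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if value < treatment_share else "control"


# Hypothetical experiment name and user IDs, for illustration only.
groups = [ab_bucket(f"user-{i}", "new-search-form") for i in range(10_000)]
print(groups.count("treatment"), groups.count("control"))
```

Including the experiment name in the hash keeps the split independent between experiments, so the same users are not always the ones who get the new variant.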
On unhealthy competition
Even in an age of technology and the rule of law, underhanded methods of fighting competitors have not lost their relevance; you surely see regular reports of DDoS attacks in the media. The growing popularity of Hotels.ru attracted the attention of unscrupulous parties, and this recently led to a large-scale attack on us.
The DDoS in our case aimed to overload the system with a huge number of requests, even though the traffic volume was small. Remember that Hotels.ru analyzes competing prices and shows users aggregated data? The attackers decided to exploit exactly this, and it "took down" our database. Requests poured in from a botnet spanning four countries, and the service began to struggle almost immediately.
Cloud monitoring saved us: it reported the problems within the first 10 minutes of the attack. Our servers are monitored by the provider's Zabbix, which watches the network at the same time. Timely detection saved us both from reputational damage and from more serious consequences. We had to respond to the threat quickly, so the decision was made to temporarily block incoming traffic from the attacking countries; fortunately, none of them was a really large one for us. In parallel we increased the number of connections and breathed a sigh of relief.
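The idea of picking out the attacking countries can be sketched as a comparison of per-country request counts against a normal baseline. This is only an illustration under stated assumptions: it is not a Zabbix configuration and not the procedure actually used during the incident, and the function name, the thresholds and the placeholder country codes (as they might come from a GeoIP lookup) are all hypothetical.

```python
from collections import Counter


def suspicious_countries(window_requests, baseline_per_country, factor=10.0, min_requests=1000):
    """Return country codes whose request count in the current window is far above baseline.

    window_requests: iterable of country codes, one per request in the window.
    baseline_per_country: typical request count per country for a window of the same length.
    """
    counts = Counter(window_requests)
    flagged = []
    for country, n in counts.items():
        baseline = baseline_per_country.get(country, 0)
        # Flag only large, abnormal spikes; small absolute volumes are ignored.
        if n >= min_requests and n > factor * max(baseline, 1):
            flagged.append(country)
    return sorted(flagged)


# Hypothetical data: "XA" and "XB" are placeholder codes, not real attack sources.
window = ["RU"] * 5000 + ["XA"] * 20000 + ["XB"] * 50
baseline = {"RU": 4000, "XA": 100, "XB": 40}
print(suspicious_countries(window, baseline))
```

Comparing against a baseline rather than a fixed limit keeps the home market from being flagged merely because it normally sends the most requests.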
Right after this incident we started thinking about a dedicated anti-DDoS system, because dealing with an attack by hand is neither easy nor quick. There is always something to improve, right? But for now the question is still being worked out; there are many nuances.
Plans for the future
The development of Hotels.ru is following its course, and the steady growth in the number of users confirms that we are moving in the right direction. The absence of any noticeable difficulties with the server infrastructure lets us think about development rather than about buying hardware and hiring new administrators. For example, the whole team is now enthusiastically working on a big update: the interface will change significantly and new features will appear. I will not go into the details here; let's leave room for a surprise. I will only say that there will also be changes in the "brains" of the service: we are moving to HTML5, CSS3 and other modern technologies.
In parallel we are pursuing additional certification under the PCI DSS standard, which governs the security of card payments. Previously we only used PCI DSS hosting services at the level of equipment placement and met the remaining requirements ourselves, which created an excessive workload. Now we plan to hand this work over to the provider, which requires a somewhat different level of PCI DSS certification.
For this purpose we assembled a dedicated group of specialists and are working together with IT-GRAD. The point is that a cloud service provider can fulfill most of the standard's requirements, but to do so it must hold a PCI DSS certification whose scope covers not only leasing the equipment but also administering it.
In our case we plan to transfer the following tasks to the IaaS provider's area of responsibility:
- configuration management of network equipment (firewalls, routers);
- control of password expiry for server and virtualization systems;
- development of configuration standards for system components, and change management;
- encryption of the channel for administrative access to the equipment and virtualization systems;
- control over the use of system accounts;
- scanning for and detection of vulnerabilities in the equipment and virtualization software;
- and others.
This certification is complex and quite lengthy. Why do we need it at all, you may ask? To be able to store customers' card details and not make them enter the same payment information every time they book. We had already implemented payment for bookings through our own site, without redirecting to an intermediary. Users immediately appreciated the convenience of this scheme (all actions on one site, with a familiar interface), and it was good for the company's image (a dubious company would not take on such difficulties). Now we have decided to go a step further and store all the necessary information in the user profile.
Today in Russia the IaaS operators holding a PCI DSS certificate whose scope covers administration, and not just equipment placement, can be counted on the fingers of one hand. In our opinion, sooner or later business will add this item to the extensive list of "cloud requirements".