2017/11/14 12:46:40

Memory disaggregation promises significant speedups for popular applications


In article "Revolution in Tsodostroyeniya Approaches" need for disaggregation of resources at the level of the server and the level of DPC in general was proved.

But there is an obvious barrier on the way to full disaggregation. It stems from the still-existing limits on bandwidth and latency when data is transferred between the processor, memory and storage. A theoretical analysis of this problem is given in the paper "Network Requirements for Resource Disaggregation"[1]. The work comes from a highly authoritative group of authors at the University of California, Berkeley; among them are Peter Gao and Scott Shenker, the latter one of the recognized fathers of the software-defined approach to networking (SDN). By 2017 SDN had become a classic, even though as recently as 2010 the phrase "software-defined" still sounded exotic[2].

The Berkeley paper emphasizes:

"Networks will become the key factor that either blocks or enables disaggregation, because separating processors from memory and disks will demand from inter-resource communication a performance no lower than what used to exist inside servers. Network infrastructure must provide bandwidth and latency comparable to those found within a server. This will require silicon photonics with the corresponding switches, PCIe switches and much more. These technologies are new, and they will have to compete with existing ones, which, thanks to mass production, are offered at affordable prices."

For more on silicon photonics, see the separate TAdviser article "Why silicon photonics is considered a source of the next information revolution".

Fig. 1. Global disaggregation

If the network performance problem is solved, complete disaggregation becomes feasible, and the updated data center will look not like a collection of servers, as shown in Fig. 1(a), but like three autonomous pools of resources (b). However, given the state of network technology in 2017, complete disaggregation is very hard to achieve in the near future. Nevertheless, it has been chosen as the ambitious goal of the European dReDBox project; no other such projects are known yet. We will return to dReDBox at the end of the article.

At the same time, there are no major problems with a truncated variant in which disaggregation is limited to two pools: one formed by processors integrated with memory, the other combining drives of various types. This approach is entirely pragmatic and can be implemented in the near future, and several companies are already on this path. It is embodied in the projects Intel RSD, HP The Machine, Facebook Disaggregated Rack, Huawei DC3.0 and Ericsson HDS 8000. They are at different stages of completeness; we will discuss them in the next article.

Here we will consider what is called memory disaggregation. Disaggregation of this type makes it possible to overcome the so-called memory wall[3]. The memory wall has two manifestations. The first: memory always lags behind processors in speed, and, moreover, processors accelerate faster than memory does. The second: the rate of data exchange between processor and memory is limited (the bandwidth wall).
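
To make the bandwidth wall concrete, here is a back-of-the-envelope calculation. Every number in it is an illustrative assumption (a 24-core socket, 8 GB/s of demand per core, six channels of DDR4-3200 at a theoretical peak of 25.6 GB/s each), not a measurement of any particular system:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative assumptions, not measurements of a real machine. */
    double cores           = 24;    /* cores per socket                    */
    double demand_per_core = 8.0;   /* GB/s each core can usefully consume */
    double channels        = 6;     /* DDR4 channels per socket            */
    double bw_per_channel  = 25.6;  /* GB/s, DDR4-3200 theoretical peak    */

    double demand = cores * demand_per_core;    /* what the CPU wants   */
    double supply = channels * bw_per_channel;  /* what DRAM delivers   */

    printf("demanded %.1f GB/s, delivered %.1f GB/s (%.0f%% of demand)\n",
           demand, supply, 100.0 * supply / demand);
    return 0;
}
```

Adding cores raises the demand line, while the supply line grows only with new memory generations and extra channels; that widening gap is the bandwidth wall.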

For now, disaggregation at the CPU-memory boundary remains more a scientific problem than an engineering one, and it is mostly the academic community that works on it. Let us hope that their work finds practical application in the not-too-distant future. Most likely, absolute CPU-memory disaggregation is impossible: there will always be some relatively small fragment of memory sitting next to the processor and playing the role of a cache, while all the rest of the memory can be combined into a common pool.

Solutions to the memory disaggregation problem differ in their degree of radicalism: they can be purely software-based, using existing hardware; they can be an evolution of traditional hardware; or they can be something entirely new.

The demand from "memory-hungry" modern technologies such as VoltDB, Memcached, PowerGraph, GraphX and Apache Spark is real, but as long as there is no technological capability to gather large volumes of memory into a common pool by means of disaggregation, one has to resort to artificial techniques for increasing the available memory, which may be called pseudo-disaggregation. These half-forgotten techniques have been known for a long time: one of them is swapping, another is NUMA.

Memory-to-memory swapping

Swapping is a memory virtualization mechanism that moves inactive memory contents to an external medium, a disk or flash, and brings data back into memory when it becomes relevant again. Swapping is based on the paged organization of memory. It was first implemented in the British Atlas supercomputer, commissioned in 1962 and one of the first computers in the world built on germanium transistors. Swapping was later used actively in DEC's PDP minicomputers. Memory was extremely expensive at the time (in Atlas and the first PDPs it was built on ferrite cores), so the reasons for swapping were economic. Today, when there is no longer any reason to economize on memory, swapping instead serves to increase the amount of memory available to the processor, and therefore memory itself acts as the swap space. Two options are possible: turning to the unused memory of other servers, or building a separate memory array shared by the servers.
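
The page-level mechanics that swapping relies on can be observed from user space on Linux. Below is a minimal sketch, assuming a Linux 5.4+ kernel (where the MADV_PAGEOUT hint exists) and a configured swap device, which in a pseudo-disaggregation setup could itself be memory-backed, e.g. a zram device:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION (256UL * 1024 * 1024)   /* 256 MiB of anonymous memory */

int main(void) {
    /* Anonymous mapping: exactly the kind of memory the kernel swaps. */
    char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memset(buf, 0xAB, REGION);   /* touch every page: all resident now */

    /* Hint the kernel to reclaim these "cold" pages immediately; with
       swap enabled they are written out (to disk, or to a memory-backed
       device such as zram). Requires Linux >= 5.4. */
    if (madvise(buf, REGION, MADV_PAGEOUT) != 0)
        perror("madvise(MADV_PAGEOUT)");

    /* The next access causes page faults that pull the data back in. */
    printf("first byte after pageout hint: 0x%02x\n",
           (unsigned char)buf[0]);

    munmap(buf, REGION);
    return 0;
}
```

Whether the evicted pages land on a local SSD or, as in the systems discussed below, in another machine's DRAM is invisible to the application; that transparency is what makes swap-based pseudo-disaggregation attractive.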

The first path was taken by the authors of the Infiniswap system, developed at the University of Michigan. Its essence is expressed in its name: unlimited swapping[4].

Infiniswap is an open-source software solution that builds a large pool of memory available to applications by means of Remote Direct Memory Access (RDMA). The protocol was originally created for HPC and InfiniBand, but it has already been ported to Ethernet as well. RDMA allows data to be moved between servers directly, from the memory of one application into the memory of another, without involving the central processors. On the experimental platform, Infiniswap ran under Linux 3.13 on a cluster of 32 servers connected by 56 Gbit/s Mellanox ConnectX-3 InfiniBand adapters.
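
The one-sided semantics that make this possible can be sketched with the libibverbs API. The fragment below is only an illustration, not Infiniswap's actual code: it assumes a queue pair qp has already been created and connected (e.g. via rdma_cm, omitted here), and that the remote side has advertised the address and rkey of the region it exports:

```c
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Fetch 'len' bytes of remote memory into a local buffer. One-sided:
   the remote CPU takes no part in serving the request. */
static int fetch_remote(struct ibv_qp *qp, struct ibv_pd *pd,
                        void *local_buf, size_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    /* Register the local buffer so the NIC may DMA into it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_READ;  /* one-sided read     */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;       /* advertised by peer */
    wr.wr.rdma.rkey        = rkey;              /* advertised by peer */

    /* Completion must later be reaped from the QP's completion queue. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

A swapped-out page fetched this way costs a few microseconds over InfiniBand instead of the milliseconds of a disk seek, which is the whole premise of Infiniswap.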

Infiniswap can also run over RDMA over Converged Ethernet (RoCE), an efficient method of low-latency data transfer over lossless Ethernet networks.

On average, Infiniswap improved memory utilization by 42%, while the data exchange rate increased by as much as 16 times.

The second approach to swapping is demonstrated by a joint solution from specialists at HP, AMD and the University of Michigan[5]. It uses standard, unmodified blade servers in combination with additional specialized memory blades created by the authors of the project.

The authors of this solution, which is more hardware than software, aimed to achieve highly efficient use of the memory gathered into the pool while relying mainly on off-the-shelf components available on the market. The highlight here is an FPGA-based controller implementing the swapping algorithms. The paper presents verification results on 12 standard benchmarks: on average, performance rises severalfold, and performance per invested dollar rises by 87%.

Fig. 2. Memory-to-memory swapping

Developing the traditional solution: NUMA

The Scale-Out NUMA (soNUMA) project of the University of Edinburgh also builds on the experience of the past[6]. Creating a common pool of memory is not something entirely new: the first attempts can be found in classical symmetric multiprocessing systems (Symmetric Multiprocessing, SMP), which implement the UMA (Uniform Memory Access) scheme, giving all processors equal rights of access to the shared memory. But such systems assume that memory and processors are physically close together and cannot be scaled, so they cannot be considered a prototype.

The need for cheaper horizontal scaling led to the scheme with distributed shared memory (Distributed Shared Memory, DSM): each processor has its own local memory, part of which is open to remote access by all other nodes, and the set of all public memory regions forms a distributed shared memory in which access time depends on how far away the corresponding node is.

The compromise between UMA and DSM was the NUMA (Non-Uniform Memory Access) architecture, in which memory access time is determined by the memory's position relative to the processor. In its pure form, NUMA never gained noticeable adoption. More practical was a modification of NUMA with a coherent cache, known as ccNUMA, but it too was rather complex and was used only in certain high-performance computers.
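
The non-uniformity that gives NUMA its name is directly visible to software. Here is a minimal sketch using Linux's libnuma (the assumptions: a machine with at least two NUMA nodes and the library installed; build with -lnuma):

```c
#include <stdio.h>
#include <string.h>
#include <numa.h>

#define REGION (64UL * 1024 * 1024)   /* 64 MiB test buffer */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int last = numa_max_node();  /* highest node number, e.g. 1 on 2 sockets */

    /* Place one buffer on node 0 and one on the last node, which is
       remote relative to a thread running on socket 0. */
    char *near_buf = numa_alloc_onnode(REGION, 0);
    char *far_buf  = numa_alloc_onnode(REGION, last);
    if (!near_buf || !far_buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touching the pages forces actual placement; timing these two
       memsets would expose the local-versus-remote latency gap. */
    memset(near_buf, 1, REGION);
    memset(far_buf,  1, REGION);

    printf("allocated %lu MiB on node 0 and on node %d\n",
           REGION >> 20, last);

    numa_free(near_buf, REGION);
    numa_free(far_buf,  REGION);
    return 0;
}
```

soNUMA pushes the same idea across machine boundaries: remote memory is reached not over a shared bus but through the network, and the "distance" penalty grows accordingly.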

soNUMA implements an upgraded NUMA in which the horizontal scaling demanded by modern applications, unusual for the original model, is supported by remote memory controllers (RMC, Remote Memory Controller).

Fig. 3. The two-level memory model of soNUMA[7]


The dReDBox project

All three solutions described above are palliatives: although they allow some amount of memory to be treated as a common pool, the pool is only logical, not physical. The dReDBox project[8] differs in that here everything is done "honestly": there is a pool of processors, there is a pool of memory, and there are links between them. More precisely, there will be, since the project is still at the development stage; it is generously funded within the European Horizon 2020 program[9].

More than ten large European universities participate in dReDBox. The project is clearly non-commercial and is therefore excellently documented: several dozen full papers can be found on the website www.dredbox.eu[10]. A general idea of the project can be gained from the overview paper "Rack-scale Disaggregated cloud data centers: The dReDBox project vision".

dReDBox envisions two levels of disaggregation: at the rack level and at the data center level. The qualitative novelty of dReDBox lies in abandoning traditional motherboards, that is, the static mainboard-as-a-unit design paradigm, in favor of a flexible, software-defined paradigm built around block-as-a-unit, figuratively called a brick. Bricks are assembled into racks using trays, and the data center is built up from the individual bricks.

Fig. 4. dReDBox bricks

The two types of bricks are interconnected by a reconfigurable Optical Circuit Switch (OCS). A compute brick, in addition to the processor, carries an FPGA that handles addressing into memory. A memory brick receives the address from the FPGA and serves the data exchange in packet mode. This approach differs from the one accepted today, in which data travels along a fixed path with a fixed capacity. Packetization makes it possible to choose an optimal path, as in any packet-switched network, and capacity becomes flexible, determined by the available links and their number.
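
There is no public programming interface for this yet; purely as a thought experiment, the difference from fixed-circuit access can be modeled as below. Every type, field and function here is hypothetical, invented for illustration only:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of a packetized memory request; the real dReDBox
   formats are not reproduced here. */
struct mem_request {
    uint64_t global_addr;  /* address within the disaggregated pool */
    uint32_t length;       /* bytes requested                       */
    uint16_t src_brick;    /* issuing compute brick                 */
    uint16_t dst_brick;    /* memory brick owning the address       */
};

/* In a circuit-switched design the route is fixed when the circuit is
   set up; with packetization each request can pick the best path. */
static int pick_link(const uint32_t load[], int n_links) {
    int best = 0;
    for (int i = 1; i < n_links; i++)
        if (load[i] < load[best])
            best = i;            /* least-loaded link wins */
    return best;
}

int main(void) {
    struct mem_request req = { 0x10000000UL, 4096, 1, 7 };
    uint32_t link_load[4] = { 30, 5, 80, 42 };   /* bytes in flight */
    printf("request for %u bytes at 0x%llx routed via link %d\n",
           (unsigned)req.length, (unsigned long long)req.global_addr,
           pick_link(link_load, 4));
    return 0;
}
```

The point of the sketch is only the degree of freedom: with a fixed circuit, the choice made in pick_link would not exist at all.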

In the scale of its ambition, dReDBox is reminiscent of the decision, made in the 1970s, to build competitive European airliners. If its stated goals are achieved, a united Europe will become a real competitor to the current market leaders.

Read Also

Intel against all: the war over the network interconnect between processor and memory has begun.

Notes