
A historic milestone: how the principles of building the largest supercomputers have changed

A new generation of supercomputers focused on big data and artificial intelligence is being built on principles that differ significantly from those underlying high-performance cluster systems up to 2017.


The advent of AI supercomputers

China's hegemony in the Top500 list of the world's supercomputers will last only until June 2018, when the 51st edition of the list is published at the European ISC18 conference and IBM's Summit computer, with a record planned performance of 300 petaflops, takes the stage. This statement can be made with confidence: installation of the first Summit racks already began in 2017 at Oak Ridge National Laboratory, and the remaining time is enough to bring the supercomputer to full readiness.

In parallel with Summit, IBM will commission the Sierra computer at another national laboratory, Lawrence Livermore. Its planned performance is lower, 100 petaflops. And in 2021 the first exaflop-scale computer, Aurora, built by Intel, is expected to appear.

Although the IBM and Intel products have little in common with each other, they are united by a common new HPC paradigm: it turned out that future supercomputers with the exa- prefix can be created only by the largest vendors that own the entire range of necessary technologies, that is, those that produce their own processors and possess enormous systems-engineering capacity. From now on, the road to the top of the Top500 is closed to products assembled from commercially available (commodity) components.

Summit and Sierra grew out of a 2014 decision by the US Department of Energy to create two supercomputers, related in design and architecture, for its nuclear research centers, known as national laboratories. Under the program, both computers were planned to launch in 2017, but the launch was then postponed to 2018, since by the original deadline the three project participants - IBM, Nvidia and Mellanox - had not managed to bring all the necessary components together: the CPU, the GPU and the network equipment.

The intended scope of Summit and Sierra is much wider than that of most existing supercomputers. They are capable of showing record performance not only on traditional tasks such as modeling and simulation; they are also aimed at tasks that have emerged only recently, as computers cease to be mere calculating devices and turn into universal tools for working with data.

This transformation has affected the entire computer industry, and HPC is no exception. The laboratories intend to use Summit and Sierra for tasks related to weak artificial intelligence, mainly machine learning, which is why they are called AI supercomputers. Broadening the range of applications is also rational from an economic point of view: it increases utilization, which improves the return on investment in machines that cost hundreds of millions of dollars.

Summit and Sierra are not just an order of magnitude more powerful than their predecessors, such as Titan, Mira and Sequoia. Their appearance is also significant in that they will be built not on the trivial cluster scheme that dominates HPC today, but on a new, more advanced architecture distinguished by two main features:

  • Implementation of two alternative approaches to scaling - horizontal and vertical
  • Focus on large data volumes

To gain these qualities, two limitations inherent in clusters had to be overcome - limitations that have persisted since the first computers of this type, known as Beowulf, which are recognized as the progenitors of the cluster architecture.

Limitations of classic supercomputers

Both limitations stem from the very thing that makes clusters attractive: the ability to assemble powerful and cheap configurations from simple modules. The restrictions had to be tolerated because, at formally equal performance, clusters cost an order of magnitude less than "real" computers - what is called big iron. At first, assembling clusters was largely a student pastime; they looked roughly like the figure below, usually many PCs standing on racks, combined into a network with the load somehow distributed between them. Later, with the advent of thin servers and high-speed interconnects, clusters became the mainstream of HPC.

Let us return to the limitations. The first is that all current Beowulf-descended clusters are homogeneous and scale only by increasing the number of nodes. The second limitation follows from the first: each cluster node can work only with its own memory. Such a memory organization is called distributed. Parallelizing computation across many simple nodes of a distributed-memory system (horizontal scaling, scale-out, SO) is much cheaper than increasing the power of shared-memory systems (vertical scaling, scale-up, SU).
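To make the contrast between the two scaling paths concrete, here is a minimal shared-memory (scale-up) sketch in C with OpenMP; the file name and array size are illustrative and not taken from any of the systems described. In the scale-out case the same sum would be computed with the array partitioned across nodes, each node reducing its local slice and exchanging only partial results over the interconnect (for example with MPI_Allreduce).

```c
/* Shared-memory (scale-up) summation: every thread sees the same array.
 * Build (illustrative): gcc -O2 -fopenmp sum_shared.c -o sum_shared */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long long n = 10000000LL;      /* 10 million doubles, ~80 MB */
    double *a = malloc((size_t)n * sizeof *a);
    if (!a)
        return 1;

    for (long long i = 0; i < n; i++)
        a[i] = 1.0;                      /* fill with a known value */

    double sum = 0.0;
    /* One address space: threads simply split the loop, nothing is copied. */
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < n; i++)
        sum += a[i];

    printf("threads=%d sum=%.0f\n", omp_get_max_threads(), sum);

    /* On a distributed-memory (scale-out) cluster each node would hold only
     * n/P elements, and the partial sums would be combined over the
     * interconnect (e.g. with MPI_Allreduce); that exchange is exactly the
     * data movement the cluster approach has to pay for. */
    free(a);
    return 0;
}
```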

Nevertheless, clusters still dominate the Top500, and the explanation is extremely simple: such systems are cheap - relatively, of course - and they can be assembled by companies that have neither a serious production base nor a high level of engineering expertise.

Nothing prevents a company with a few dozen employees from assembling a machine out of components available on the market, giving it a grand name - preferably that of some academician - and declaring itself a supercomputer firm.

How it was - a Beowulf cluster assembled from PCs

And yet the cluster is a forced engineering compromise, and like any compromise it has a downside. Its essence was formulated long before the advent of clusters by the computer designer Gene Amdahl, in the law that now bears his name. In 1967 he postulated an insurmountable limit on performance growth when parallelizing computation:

"If a task is divided into several parts, the total time of its execution on a parallel system cannot be less than the execution time of the longest fragment."

Hence the corollary: simple clusters show good numbers on specially prepared benchmarks, or on problems where the model divides into equal parts (a typical example is the finite element method in structural mechanics). But if one node carries a heavier load than all the others, then no matter how many nodes there are, the rest will have to stand idle.
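The article does not state the formula itself, but the standard textbook form of Amdahl's law makes the ceiling explicit: if a fraction p of the work can be parallelized across N nodes, the achievable speedup is

```latex
% Amdahl's law: the serial fraction (1 - p) bounds the achievable speedup
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

Even with 95% of the work parallelizable (p = 0.95), the speedup can never exceed 20, no matter how many cluster nodes are added.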

A new approach in response to new challenges

Times change, and so do the workloads placed on supercomputers. Clusters well suited to parallel computation turned out to be poorly suited to working with large volumes of data, because of the need to shuttle data between nodes. Modern tasks require a different architecture, one that can be called data-centric, that is, oriented not only toward computation speed but also toward efficient work with large volumes of data.

The idea behind the value of data is quite simple: ultimately it comes down to placing data near the processors that process it, in order to reduce the amount of data moved. The processors themselves must have access to large amounts of memory and be equipped with high-speed data channels. The advent of computers that perform not only calculations but a wider range of operations on data reflects the general trend of modern computing.
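As a toy illustration of the "move the code to the data" principle (the block size here is purely hypothetical and not tied to any of the systems described), compare how many bytes must cross the interconnect when a remote block of samples is shipped to the requester for averaging, versus when it is averaged where it lives and only the result is returned:

```c
/* Toy comparison of data movement in the two designs (illustrative numbers). */
#include <stdio.h>

#define BLOCK_ELEMS (1000u * 1000u)       /* hypothetical block: 1M doubles */

int main(void)
{
    /* Compute-centric: the raw block is shipped to the node that asked,
     * which then averages it locally. */
    size_t compute_centric_bytes = BLOCK_ELEMS * sizeof(double);

    /* Data-centric: the node that stores the block averages it in place
     * and returns a single number. */
    size_t data_centric_bytes = sizeof(double);

    printf("ship the data to the code: %zu bytes moved\n", compute_centric_bytes);
    printf("ship the code to the data: %zu bytes moved\n", data_centric_bytes);
    return 0;
}
```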

The appearance of Summit and Sierra is worth treating as a landmark event, because it breaks the evolutionary pattern that has held in HPC since Beowulf. It would, of course, be a mistake to set the two models - the new data-centric one and the traditional compute-centric one - directly against each other: both have the right to exist, and the boundary between them is blurred.

The cluster idea persists, but more powerful cluster nodes joined by high-speed links give it new meaning. The appearance of exaflop-scale data-centric systems can be seen as the highest stage of the evolutionary process under way in computing.

Evolutionary process in computing

IBM researchers have formulated the main prerequisites for the creation of data-centric systems:

  • Data volumes keep growing, and moving data within systems is becoming ever more expensive
  • Hardware and software should therefore be developed so that data is processed as close as possible to where it is stored
  • System design should be subordinated to the requirements of real applications
  • To improve the efficiency of data analytics, modeling and simulation, their compatibility must be ensured at the system level
  • When evaluating systems, take into account not only algorithmic efficiency but also the quality of workflows


The figure below compares existing computing systems with what they should ideally be in the future.

Comparison of traditional and data-centric approaches


The task of creating systems in which computation is distributed across the different tiers of data storage is enormous, and it will take a very long time to solve; it is not a matter of even the near future. For as long as the storage and processing hierarchy remains as it is, data-centric systems should be built using the obvious approaches: NUMA node architectures capable of vertical scaling, and high-performance communication channels between the CPU and GPU-based accelerators. The figure below shows the step-by-step movement of Mellanox, Nvidia and IBM in this direction.
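As a small illustration of the NUMA-oriented side of that approach (a sketch only: it is not specific to Power9, Summit or Sierra and assumes an ordinary Linux host with libnuma installed), the code below allocates a buffer on the NUMA node nearest to the executing core, i.e. it places the data next to the processor that will work on it.

```c
/* Keep the working set on the NUMA node closest to the executing core.
 * Minimal sketch; build: gcc -O2 numa_place.c -lnuma -o numa_place */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Find out which NUMA node the calling CPU belongs to. */
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);
    printf("running on CPU %d, NUMA node %d (total nodes: %d)\n",
           cpu, node, numa_max_node() + 1);

    /* Allocate 64 MiB physically located on that node, so the data sits
     * in the memory closest to the cores that will process it. */
    size_t size = 64u << 20;
    double *buf = numa_alloc_onnode(size, node);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    for (size_t i = 0; i < size / sizeof(double); i++)
        buf[i] = 0.0;                     /* touch the pages locally */

    numa_free(buf, size);
    return 0;
}
```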

Although as of 2017 the design of Summit and Sierra is no secret, it makes sense to return to it when these computers are launched and their detailed technical characteristics are formally announced. For now it can be said that Summit and Sierra are the sum of technologies from three companies - IBM processors, Nvidia GPUs and Mellanox communications equipment - of which the key one is the Power9 processor.

Coevolution of Mellanox, Nvidia and IBM products in this direction

Power9 processors are produced in two versions, which makes it possible to implement both scaling methods, SO and SU. In the Power9 SO version the processor connects directly to DDR4 memory, roughly as in the Intel Xeon E5, while the Power9 SU version connects to memory through a buffer, which speeds up exchange and increases the amount of memory that can be shared by several processors, much as in the Intel Xeon E7. Air-cooled servers with two Power9 CPUs and four Nvidia Tesla V100 (Volta) GPUs can serve as nodes. In early 2018 a water-cooled IBM Power System S922LC server with six V100 GPUs will appear.

POWERAccel Data Exchange Acceleration Technologies


The figure above shows the interconnection of technologies that speed up data exchange. The common name POWERAccel has been proposed for the interface and acceleration protocols of the Power9 microarchitecture. The key element of POWERAccel is the Coherent Accelerator Processor Interface (CAPI), first offered with the Power8. The qualitative novelty of this interface is that it opens up the possibility of building heterogeneous systems at the board level, complementing the CPU with accelerators - primarily Nvidia GPUs, as well as FPGAs.

In the original version the CAPI interface was implemented in the Power8 processor, where it ran on top of PCIe Gen 3. Since then two new versions have appeared: CAPI 2.0, which runs on top of PCIe Gen 4, and New CAPI (OpenCAPI), which works in combination with the new 25G Link standard. The same standard supports interaction with GPU-based accelerators via NVLink. POWERAccel also includes tools for cryptography and data compression.

The creation of Summit and Sierra is in a sense an event of historic scale: it returns to supercomputers the role of technological locomotive that they played from the 1960s to the 1980s and have since lost. Now, guided by the same principles on which these most powerful machines are built, data-centric solutions can be extended to smaller systems created within the OpenPower initiative.

Chronicle

2023: A supercomputer for working with neural networks is unveiled, with a performance of 1 Eflops

At the end of May 2023, Nvidia introduced the DGX GH200 supercomputer, which, according to its developers, is unique of its kind, since it is tailored to generative artificial intelligence models (the kind that underlie the popular ChatGPT neural network). More details here.
