Developers: | Nvidia |
First Release Date: | 2017/05 |
Last Release Date: | 2021/06/28 |
Technology: | Cloud Computing, Server Platforms, Data Centers - Data Center Technologies |
2021: Nvidia A100 80G PCIe, Nvidia NDR 400G InfiniBand, Nvidia Magnum IO
On June 28, 2021, NVIDIA announced an expansion of the NVIDIA HGX AI supercomputing platform with new technologies that fuse AI with high-performance computing, making high-performance computing accessible to even more industries.
According to the company, to bring industrial AI and HPC applications closer to reality, NVIDIA has added three key technologies to the HGX platform: the NVIDIA A100 80GB PCIe GPU, NVIDIA NDR 400G InfiniBand networking, and NVIDIA Magnum IO software. Together, they deliver extreme performance for industrial innovation.
As of June 2021, Atos, Dell Technologies, Hewlett Packard Enterprise (HPE), Lenovo, Microsoft Azure, NetApp and dozens of other partners were using the NVIDIA HGX platform to build systems and solutions.
One user of the HGX platform is General Electric, a company specializing in high technology for industry: GE applies HPC advances to computational fluid dynamics (CFD) simulations used in the design of large gas turbines and jet engines. On the HGX platform, CFD methods in the GE GENESIS code have seen an order-of-magnitude acceleration. The code uses large eddy simulation to study the effects of turbulent flows inside turbines, which consist of hundreds of individual blades with complex geometry.
The HGX platform also optimizes scientific HPC systems worldwide, including a next-generation supercomputer at the University of Edinburgh, which was also announced on June 28, 2021.
NVIDIA A100 Tensor Core GPUs deliver HPC acceleration for complex AI, data analytics, model training, and industrial simulation. The A100 80GB PCIe GPU offers 25% more memory bandwidth than the A100 40GB - up to 2TB/s - and carries 80GB of high-speed HBM2e memory.
The A100 80GB PCIe's memory capacity and high bandwidth allow more data and larger neural networks to be held in memory, minimizing inter-node communication and reducing power consumption.
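As an aside, a minimal sketch (assuming PyTorch with CUDA support, which the article does not mention) of how one might confirm the memory capacity of the installed GPU before sizing a workload:

```python
# Minimal sketch: query the installed GPU's name and memory capacity.
# On an A100 80GB PCIe, total_memory should report roughly 80 GiB.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"Device: {props.name}")
    print(f"Total memory: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA device visible")
```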
The A100 80GB PCIe is based on the NVIDIA Ampere architecture, which supports Multi-Instance GPU (MIG) technology to accelerate smaller workloads such as inference. MIG allows HPC systems to scale compute and memory down for such jobs with guaranteed quality of service. In addition to PCIe, there are four- and eight-GPU NVIDIA HGX A100 configurations.
NVIDIA partners offering A100 80GB PCIe systems include Atos, Cisco, Dell Technologies, Fujitsu, H3C, HPE, Inspur, Lenovo, Penguin Computing, QCT and Supermicro. The HGX platform based on A100 GPUs with NVLink interconnect is also available through cloud services from Amazon Web Services, Microsoft Azure and Oracle Cloud Infrastructure.
HPC systems that demand high data throughput are reinforced by NVIDIA InfiniBand, a fully offloadable interconnect with In-Network Computing. NDR InfiniBand scales performance to tackle complex problems on industrial and scientific HPC systems. NVIDIA Quantum-2 fixed-configuration switches provide 64 ports of NDR 400Gb/s InfiniBand (or 128 ports of NDR200).
NVIDIA Quantum-2 modular switches scale up to 2048 ports of NDR 400Gb/s InfiniBand (or 4096 ports of NDR200) with a total bidirectional bandwidth of 1.64 petabits per second - 5 times that of the previous generation. A 2048-port switch offers 6.5 times the scalability of its predecessor and can connect more than a million nodes in three hops using a DragonFly+ network topology.
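The quoted aggregate figure can be sanity-checked with simple arithmetic (our own check, not from the announcement):

```python
# Sanity check of the quoted aggregate: 2048 ports x 400 Gb/s,
# counted in both directions. 1 Pb/s = 1e6 Gb/s.
ports = 2048
rate_gbps = 400                       # NDR InfiniBand per-port rate
total_gbps = 2 * ports * rate_gbps    # both directions
print(f"{total_gbps / 1e6:.2f} Pb/s")  # -> 1.64 Pb/s, matching the spec
```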
The third generation of NVIDIA SHARP In-Network Computing technology performs data reduction inside the network, improving the performance of industrial and scientific applications with 32 times higher AI acceleration than the previous generation.
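For context, SHARP offloads collective reductions - such as the all-reduce at the heart of data-parallel training - into the switch fabric. A minimal sketch of that collective, assuming a multi-process NCCL setup launched with torchrun (our assumption, not part of the announcement); any SHARP acceleration in the fabric is transparent to this code:

```python
# Minimal all-reduce sketch: the collective that in-network reduction
# accelerates. Launch with e.g. `torchrun --nproc_per_node=2 this_file.py`.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())
t = torch.full((4,), float(rank + 1), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum the tensor across all ranks
print(f"rank {rank}: {t.tolist()}")
dist.destroy_process_group()
```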
Management features include network self-healing and NVIDIA In-Network Computing acceleration engines. Data center downtime is further reduced by the NVIDIA UFM Cyber-AI platform.
Built on industry standards, the NVIDIA Quantum-2 switches, shipping by the end of 2021, are backward and forward compatible, allowing easy migration and expansion of existing systems and software.
Infrastructure manufacturers, including Atos, DDN, Dell Technologies, Excelero, GIGABYTE, HPE, Lenovo, Penguin, QCT, Supermicro, VAST and WekaIO, plan to integrate Quantum-2 NDR 400Gb/s InfiniBand switches into their enterprise HPC and storage systems. Cloud service providers, including Azure, also use InfiniBand technology.
Magnum IO GPUDirect Storage provides a direct path between GPU memory and storage. Direct access reduces application latency, uses the full bandwidth of the network adapters, lowers CPU load, and manages the impact of growing data consumption.
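As an illustration, NVIDIA's kvikio library exposes GPUDirect Storage (cuFile) from Python. The sketch below is assumption-laden - it presumes kvikio and CuPy are installed and uses a hypothetical file path - and is not code from the announcement:

```python
# Hedged sketch: read a file directly into GPU memory via cuFile.
# Requires the kvikio and cupy packages; the path is hypothetical.
import cupy
import kvikio

buf = cupy.empty(1024 * 1024, dtype=cupy.uint8)  # destination buffer in GPU memory
f = kvikio.CuFile("/data/sample.bin", "r")       # hypothetical file
f.read(buf)   # DMA from storage to the GPU, bypassing a CPU bounce buffer
f.close()
```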
2018: Announcement of Nvidia HGX-2
On May 30, 2018, Nvidia introduced the Nvidia HGX-2, a unified platform for high-performance computing and artificial intelligence. HGX-2 is part of the Nvidia family of GPU-accelerated server platforms - an ecosystem of certified servers designed to deliver optimal performance across a wide range of AI, HPC and accelerated computing workloads.
The multi-precision HGX-2 platform provides the flexibility needed for the future of computing. It enables high-precision FP64 and FP32 calculations for scientific research and simulation, and supports FP16 and Int8 for AI training and inference. This versatility meets the requirements of a growing number of applications that combine HPC and AI, the company explained.
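To illustrate the multi-precision idea (our sketch, not Nvidia's), PyTorch's autocast runs matrix multiplies in FP16 on Tensor Cores while the model's weights stay in FP32:

```python
# Illustrative mixed-precision sketch: FP32 weights, FP16 compute.
import torch

model = torch.nn.Linear(512, 512).cuda()   # weights remain FP32
x = torch.randn(64, 512, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)                           # matmul runs in FP16 on Tensor Cores
print(y.dtype)                             # torch.float16
```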
"The world of computing has changed," said Jensen Huang, founder and CEO of Nvidia. "The effect of CPU scaling has slowed markedly, while demand for computing is growing at a dramatic pace. The Nvidia HGX-2 platform, built on GPUs with Tensor Cores, offers powerful universal capabilities for both HPC and AI computing to address pressing global challenges."
According to the developers, AI training speed on the HGX-2 platform reaches 15,500 images per second in the ResNet-50 benchmark, allowing one HGX-2 to replace up to 300 CPU-only servers.
The platform supports advanced features such as the Nvidia NVSwitch fabric, which combines 16 Nvidia Tesla V100 Tensor Core GPUs into a single giant graphics processor delivering 2 petaflops of AI computing. The first system built on the HGX-2 platform was the recently announced Nvidia DGX-2.
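A small sketch (assuming PyTorch on a multi-GPU node, our assumption) of the property NVSwitch extends to all 16 GPUs: every GPU can directly address every other GPU's memory:

```python
# Check direct peer access between all GPU pairs in a node; on an
# NVSwitch-based system such as DGX-2, every pair should report "yes".
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```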
Nvidia expects HGX-2 to become a key component of manufacturers' advanced computing systems for HPC and AI. Four server manufacturers - Lenovo, QCT, Supermicro and Wiwynn - have announced plans to release their own HGX-2-based systems in 2018. In addition, four global ODM manufacturers - Foxconn, Inventec, Quanta and Wistron - are preparing HGX-2-based systems designed for the largest cloud data centers.
2017: Launch of Nvidia HGX
In May 2017, Nvidia launched a partner program with leading ODM manufacturers - Foxconn, Inventec, Quanta and Wistron - to quickly meet market demand for cloud computing in artificial intelligence (AI) tasks.
As part of the Nvidia HGX Partner Program, Nvidia gives each ODM manufacturer early access to the Nvidia HGX reference architecture, GPU computing technologies, and design guides. The HGX reference design is the same one used in Microsoft Project Olympus, Facebook Big Basin systems, and the Nvidia DGX-1 AI supercomputer.
HGX is a reference architecture for cloud providers that want to adopt the Nvidia GPU Cloud platform. Nvidia GPU Cloud simplifies access to fully integrated and optimized deep learning frameworks, including Caffe2, Cognitive Toolkit, MXNet, and TensorFlow.
Using HGX as a foundation, ODM partners working with Nvidia can quickly design and bring to market a range of GPU-accelerated systems for hyperscale data centers. As part of the program, Nvidia engineers help ODM manufacturers reduce both design and deployment time.
With the new Nvidia Volta architecture-based GPUs that deliver triple the performance of the previous architecture, ODMs can meet market demand by launching new products based on the latest Nvidia technologies.
Flexible, upgradable design
Nvidia created the HGX reference architecture to provide the performance, efficiency, and scalability required for hyperscale cloud environments. HGX supports a wide range of workload-optimized configurations, combining GPUs and CPUs in different proportions for high-performance computing, deep learning training, and inference.
The standard HGX architecture includes eight Nvidia Tesla accelerators in the SXM2 form factor, connected in a cube mesh topology using the high-speed Nvidia NVLink interconnect and optimized PCIe topologies. Thanks to its modular design, HGX systems can be installed in existing data centers around the world, paired with hyperscale CPU nodes as needed.
Both Nvidia Tesla P100 and V100 accelerators are compatible with HGX, so HGX-based systems can be upgraded as soon as V100 GPUs become available.