The Supercomputing Conference 2023 (SC23) took place in Denver, Colorado, from November 12 to 17. We submitted a paper to the Women in HPC workshop focused on benchmarking different accelerators for AI. The paper was accepted, and I was invited to give a lightning talk presenting the work, a spin-off of our OpenGPT-X project.

Given the multitude of AI accelerators available on the market, our objective was to start establishing a framework for benchmarking these accelerators systematically. The currently selected benchmarks enable an evaluation of performance and energy efficiency in two key AI domains: Computer Vision and Natural Language Processing (NLP). As an initial step in assessing hardware capabilities, we evaluated two benchmarks using resources available in the JURECA DC supercomputer, specifically the JURECA DC Evaluation Platform. We evaluated

  • NVIDIA A100 GPUs (40 GB memory, SXM),
  • NVIDIA H100 GPUs (80 GB memory, PCIe),
  • AMD MI250 GPUs (64 GB memory), and
  • Graphcore GC200 IPUs (IPU-M2000 POD-4, ≈260 GB memory).

ResNet Benchmark

The first benchmark uses TensorFlow to train ResNet-50, providing insights into the general machine learning capabilities of the hardware. The second benchmark, derived from the OpenGPT-X fork of Megatron-LM, offers insights into large language model (LLM) training. For the NVIDIA and AMD GPUs, we ran a fork of the ResNet-50 benchmark maintained by our colleagues from Helmholtz AI. For the Graphcore IPUs, a device-optimized version provided by the vendor was used, as an out-of-the-box TensorFlow setup is not compatible with the IPU architecture.
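For illustration, a minimal data-parallel throughput measurement of this kind could look as follows in TensorFlow, using synthetic ImageNet-like data and tf.distribute.MirroredStrategy; this is a sketch under our own assumptions, not the actual Helmholtz AI benchmark code:

    # Minimal sketch: data-parallel ResNet-50 throughput in images/sec.
    import time

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()   # one replica per visible GPU
    num_devices = strategy.num_replicas_in_sync
    global_batch_size = 256                       # swept over in the actual benchmark

    with strategy.scope():
        model = tf.keras.applications.ResNet50(weights=None)
        model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

    # Synthetic data keeps the focus on compute throughput rather than I/O.
    images = tf.random.uniform((global_batch_size, 224, 224, 3))
    labels = tf.random.uniform((global_batch_size,), maxval=1000, dtype=tf.int32)
    dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
               .repeat()
               .batch(global_batch_size))

    model.fit(dataset, steps_per_epoch=10, epochs=1)   # warm-up

    steps = 50
    start = time.perf_counter()
    model.fit(dataset, steps_per_epoch=steps, epochs=1)
    elapsed = time.perf_counter() - start
    print(f"{steps * global_batch_size / elapsed:.1f} images/sec "
          f"on {num_devices} device(s)")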

The results of the ResNet-50 benchmark are displayed as heat maps below, with the number of devices on the y axis and the global batch size on the x axis. The color map depicts the throughput, measured in images/sec, which scales with both the global batch size and the number of devices used. Since the ResNet-50 model fits into the memory of a single device on all tested hardware, pure data parallelism is used, and the degree of data parallelism equals the number of devices employed.
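Under pure data parallelism, the quantities shown in the heat maps are related as follows (a standard formulation with symbols of our choosing, not notation from the paper), where $n_\text{steps}$ training steps take the wall time $t$:

    $$b_\text{device} = \frac{B_\text{global}}{N_\text{devices}}, \qquad \text{throughput} = \frac{B_\text{global} \cdot n_\text{steps}}{t} \;\; [\text{images/s}]$$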

Considering the NVIDIA A100 GPU as a reference point, several insights can be inferred. Specifically, comparing the last row of the heat maps for the NVIDIA A100 and H100 reveals a ≈1.4–2× increase in throughput. This increase aligns with expectations for the latest generation of hardware. As seen in the last five columns of all the heat maps, NVIDIA GPUs appear to deliver the highest throughput for larger batch sizes (>128).

Conversely, for Graphcore, the first four columns, corresponding to batch sizes below 256, demonstrate superior performance relative to the other accelerators. This observation suggests that the unique memory architecture of the Graphcore IPUs, with SRAM distributed across an organized set of small, independent memory units, provides a large amount of fast in-processor memory. This is particularly beneficial in combination with the MIMD (Multiple Instruction, Multiple Data) architecture, especially when batches fit entirely into in-processor memory. For example, the ResNet-50 weights (≈25.6 million parameters, ≈100 MB in FP32) fit comfortably into the ≈900 MB of in-processor memory of a GC200 IPU, leaving room for activations at small batch sizes.

Regarding the AMD MI250, we faced challenges in reaching the GPUs’ peak performance and intend to investigate this further.

NLP Benchmark

Shifting our focus to the second benchmark, we trained an 800 million parameter GPT model using the OpenGPT-X fork of Megatron-LM on single nodes (4 GPUs each) with NVIDIA A100 and H100 GPUs. The following bar graph shows the compute performance per GPU on the y axis across different batch sizes, with a data parallelism degree of 4. Similar to the ResNet results, we observed a ≈1.5× increase in performance for the H100.
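As a rough illustration of how per-GPU compute performance can be derived from training throughput, one can use the common approximation of ≈6 FLOP per model parameter per processed token for a combined forward and backward pass. The sketch below uses a made-up throughput number and is not the exact accounting from the paper:

    # Estimate per-GPU training compute via the common ~6*N FLOP/token rule.
    # All numbers besides the model size are illustrative placeholders.
    n_params = 800e6          # 800 million parameter GPT model
    tokens_per_sec = 50_000   # assumed measured training throughput (tokens/s)
    n_gpus = 4                # single node, data parallelism degree of 4

    flops_per_token = 6 * n_params                  # forward + backward pass
    flops_per_gpu = flops_per_token * tokens_per_sec / n_gpus
    print(f"{flops_per_gpu / 1e12:.1f} TFLOP/s per GPU")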

We also provide insight into energy efficiency, an aspect of growing importance in modern computing. To study energy consumption, we trained the model described above for 1 h to obtain representative results. The total energy consumed per device in a node was calculated from power values logged with nvidia-smi. H100 GPUs consume less energy than A100 GPUs, with an 18% decrease for a global batch size of 16. We also added an energy-efficiency metric to the figure, dividing performance (in FLOP/s) by the consumed energy (in Wh).
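A minimal sketch of such a measurement, sampling nvidia-smi once per second and integrating the power draw into Wh, could look as follows (the sampling interval and post-processing are our assumptions, not the exact logging setup used):

    # Sample the power draw of all GPUs once per second via nvidia-smi
    # and integrate it into a per-GPU energy total in Wh.
    import subprocess
    import time

    interval_s = 1.0
    samples_w = []                # one list of per-GPU power readings (W) per sample
    t_end = time.time() + 3600    # log for 1 h, as in the benchmark

    while time.time() < t_end:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True).stdout
        samples_w.append([float(v) for v in out.split()])
        time.sleep(interval_s)

    # Each sample approximates interval_s seconds of consumption:
    # energy [Wh] = sum(power [W] * interval [s]) / 3600.
    per_gpu_wh = [sum(gpu) * interval_s / 3600 for gpu in zip(*samples_w)]
    print(per_gpu_wh)

Dividing a FLOP/s figure, such as the estimate from the previous sketch, by the measured Wh value then gives an energy-efficiency metric analogous to the one added to the figure.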

Conclusion

In conclusion, our results show that the IPU architecture excels with small batch sizes that fit efficiently into in-processor memory, delivering remarkable performance gains in these scenarios. Conversely, the GPU architecture demonstrates its strength in accommodating large batch sizes and in scaling, where its parallel processing capabilities shine. As expected, the H100 provides better performance than the A100, especially when energy consumption is taken into account.

OpenGPT-X is funded by the Federal Ministry for Economic Affairs and Climate Action (BMWK) of Germany for the period 2022-2024.