From November 17th to 22nd, 2024, HPC professionals and researchers gathered in Atlanta, Georgia, for the Supercomputing Conference 2024 (SC24). At the 2024 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC), we presented a paper introducing CARAML, a reproducible AI benchmarking framework, and jpwr, a custom energy assessment module. The presentation slides are embedded at the bottom.

As AI models become more pervasive and computationally intensive, the need to benchmark their performance across different hardware platforms has become essential. Such benchmarks enable researchers and practitioners to understand trade-offs between performance, energy efficiency, and portability as well as guide hardware and software optimization efforts while ensuring reproducibility in AI workloads. However, benchmarking in AI is often hindered by fragmentation across tools, inconsistent methodologies, software and hardware dependencies, and challenges in integrating energy measurements.

Addressing this, we introduce CARAML, the Compact Automated Reproducible Assessment of Machine Learning workloads on accelerators. CARAML provides training benchmarks from two mainstream fields, Natural Language Processing (NLP) and Computer Vision (CV), using state-of-the-art models on various accelerators, with energy assessment integrated via the jpwr module. Both components are published on GitHub. The benchmark setup is curated with JUBE and deployed via Apptainer containers, making it compact and reproducible. More details can be found in the paper (preprint on arXiv, PDF at IEEE).
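The core idea behind jpwr is to attribute energy to exactly the code region being benchmarked. The snippet below is a minimal illustrative sketch of that idea using pynvml directly on an NVIDIA GPU; it does not reproduce jpwr's actual API or its vendor-specific backends, and the names used here are hypothetical.

```python
import time
import threading
from contextlib import contextmanager

import pynvml  # NVIDIA Management Library bindings (NVIDIA GPUs only)


@contextmanager
def gpu_energy(device_index=0, interval_s=0.1):
    """Sample GPU power while the wrapped block runs and integrate it to energy.

    Illustrative sketch only -- jpwr itself abstracts this kind of measurement
    over several vendor interfaces and multiple devices.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []           # (timestamp, power in watts)
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            samples.append((time.time(), power_w))
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    result = {}
    try:
        yield result
    finally:
        stop.set()
        thread.join()
        pynvml.nvmlShutdown()
        # Trapezoidal integration of power over time gives energy in joules.
        result["energy_joules"] = sum(
            0.5 * (p0 + p1) * (t1 - t0)
            for (t0, p0), (t1, p1) in zip(samples, samples[1:])
        )


# Usage: wrap the code region whose energy should be attributed.
# with gpu_energy() as measurement:
#     run_training_step()          # hypothetical workload
# print(measurement["energy_joules"])
```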

To demonstrate the capabilities of CARAML and jpwr, we conducted a benchmarking study across seven different accelerators. These devices differ in generation (newer vs. older hardware), vendor (e.g., NVIDIA, AMD), architecture (GPU vs. IPU), and system integration (interconnects, memory architecture, and GPU form factor). See the table on the right for an overview of the studied devices (Table 1 in the paper).

The results highlight significant variations in performance, energy efficiency, and scalability across different architectures and configurations. Newer GPU generations, such as NVIDIA GH200 nodes, demonstrate exceptional performance, with the NVIDIA H100 (SXM) surpassing its PCIe counterpart due to its enhanced NVLink bandwidth and optimized design. For example, in training an 800M GPT model using Megatron-LM, the NVIDIA GH200 nodes lead with 2.4× the performance of the NVIDIA A100. While the NVIDIA H100 (SXM) outperforms the H100 (PCIe) in raw performance, the PCIe variant excels in energy efficiency.
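A device can lead in raw throughput yet trail in energy efficiency once power draw is taken into account. The short sketch below shows how both metrics can be derived from the same run; the function name and all numbers are purely illustrative assumptions, not measurements from the paper.

```python
def efficiency_metrics(tokens_processed, runtime_s, energy_joules):
    """Derive throughput and energy efficiency from a single benchmark run."""
    throughput = tokens_processed / runtime_s          # tokens per second
    tokens_per_joule = tokens_processed / energy_joules
    avg_power_w = energy_joules / runtime_s            # average power draw
    return throughput, tokens_per_joule, avg_power_w


# Made-up illustrative numbers, NOT results from the paper: a faster device
# can still be less energy efficient if it draws substantially more power.
fast_gpu = efficiency_metrics(tokens_processed=1_000_000, runtime_s=100, energy_joules=70_000)
slow_gpu = efficiency_metrics(tokens_processed=1_000_000, runtime_s=130, energy_joules=45_000)
print(f"fast: {fast_gpu[0]:.0f} tok/s, {fast_gpu[1]:.1f} tok/J")
print(f"slow: {slow_gpu[0]:.0f} tok/s, {slow_gpu[1]:.1f} tok/J")
```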

Benchmarking across diverse AI accelerators remains complex due to architectural differences and inconsistent vendor optimizations. The CARAML benchmarks therefore strive for consistency without losing characteristic hardware-specific optimizations. Additionally, CARAML benchmarks can be used for hyper-parameter optimization towards efficient training. Looking ahead, our focus is on expanding support for emerging accelerators and incorporating a broader range of AI workloads, including inference and communication benchmarks, to enhance the framework’s versatility and applicability.

This work was conducted in the OpenGPT-X project (funding: BMWK) and the MAELSTROM project (funding: EuroHPC/BMBF).