ISC23 Project Poster: OpenGPT-X – Training Large Language Models on HPC Systems

Poster publication: http://hdl.handle.net/2128/34532

The ISC High Performance Conference 2023 was held in Hamburg, Germany, from 21 May to 25 May. At the conference, we presented a project poster on OpenGPT-X, outlining the project's progress and initial exploration results. The poster was even featured in HPCWire’s May 24 recap of ISC within the AI segment!

The poster gives a high-level overview of the structure and goals of the project, along with some background on training large language models. It is embedded at the end of the post.

To train and scale large language models based on the transformer architecture, different parallelization schemes can be used: Data Parallelism (DP), Pipeline Parallelism (PP), and Tensor Parallelism (TP).
In data parallelism, the full model is replicated and the training data is distributed in batches across the replicas; the gradients are regularly averaged across the replicas via allreduce, so that every replica applies the same update. In pipeline parallelism, the model layers are distributed across ranks, with asynchronous pipeline scheduling for gradient accumulation. In tensor parallelism, the individual tensor operations are partitioned across ranks.
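
As a rough illustration of the data-parallel update described above, here is a minimal pure-Python sketch. It is not the project's training code: the one-parameter model y = w * x, the function names, and the list-based stand-in for allreduce are all assumptions made for this example.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one replica's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for an allreduce: every rank receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, shards, lr):
    """One step: local gradients, averaged via allreduce, same update everywhere."""
    grads = [local_gradient(w, shard) for shard in shards]
    averaged = allreduce_mean(grads)
    # Each replica applies the identical averaged gradient, so weights stay in sync.
    return [w - lr * g for g in averaged]


# Two replicas, each holding an equally sized shard of (x, y) pairs.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
new_weights = data_parallel_step(0.0, shards, lr=0.01)
```

With equally sized shards, the averaged gradient matches the gradient over the combined batch, which is why the replicas behave like a single model trained on the full batch.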

Finding the right combination of the three parallelization schemes is important for efficient training and good utilization of the hardware.
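
To give a feel for the search space, the sketch below lists every way to combine the three schemes on a given number of GPUs. It assumes the common convention that the product of the data, pipeline, and tensor parallel degrees equals the GPU count; the function name `parallel_layouts` is made up for this example, and picking the fastest layout still requires benchmarking.

```python
def parallel_layouts(total_gpus):
    """Yield every (dp, pp, tp) triple whose product equals total_gpus."""
    for pp in range(1, total_gpus + 1):
        if total_gpus % pp:
            continue  # pipeline degree must divide the GPU count
        rest = total_gpus // pp
        for tp in range(1, rest + 1):
            if rest % tp == 0:
                # Whatever remains after pipeline and tensor splits is data parallel.
                yield rest // tp, pp, tp


# Candidate layouts for a small, hypothetical 8-GPU job.
layouts = list(parallel_layouts(8))
for dp, pp, tp in layouts:
    print(f"DP={dp} PP={pp} TP={tp}")
```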

The poster shows an example of a 13.6 billion parameter model trained on English-German data. Using a fork of the Megatron-DeepSpeed library, the memory footprint of the model amounts to 56 GB (parameters + gradients + optimizer states, using ZeRO Stage 1). To fit into the memory of an NVIDIA A100 (40 GB), the model is partitioned with a pipeline parallelism degree of 2. The full model is then split across 2 GPUs, occupying 28 GB on each.
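
The arithmetic behind this choice can be sketched as follows. This is a simplification under stated assumptions: the footprint is split evenly across pipeline stages and activation memory is ignored; the function names are hypothetical.

```python
import math


def min_pipeline_stages(total_gb, gpu_memory_gb):
    """Smallest number of equal stages so that each stage fits on one GPU."""
    return math.ceil(total_gb / gpu_memory_gb)


def per_gpu_footprint_gb(total_gb, pipeline_stages):
    """Footprint per GPU, assuming an even split across pipeline stages."""
    return total_gb / pipeline_stages


# Values from the text: 56 GB footprint, 40 GB of memory on an A100.
stages = min_pipeline_stages(56, 40)
per_gpu = per_gpu_footprint_gb(56, stages)
print(stages, per_gpu)
```

In practice the layers rarely split perfectly evenly and activations need room as well, so the real headroom is smaller than this estimate suggests.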

With a data parallelism degree of 80 and a global batch size of 960 on 160 GPUs (40 nodes) of JUWELS Booster, we see 96% average GPU utilization with close to full memory usage. Furthermore, the steadily decreasing model loss indicates that the training is progressing well.
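
These numbers are consistent with each other, as the short check below shows. It assumes a tensor parallel degree of 1 and 4 GPUs per node, both inferred from 80 x 2 = 160 GPUs on 40 nodes rather than stated explicitly; the function name is hypothetical. How the samples of each replica are divided into micro-batches is not given here, so the sketch stops at the per-replica batch.

```python
def layout_summary(dp, pp, tp, global_batch, gpus_per_node):
    """Derive GPU count, node count, and per-replica batch from a layout."""
    total_gpus = dp * pp * tp
    if global_batch % dp:
        raise ValueError("global batch must be divisible by the DP degree")
    if total_gpus % gpus_per_node:
        raise ValueError("GPU count must fill whole nodes")
    return {
        "total_gpus": total_gpus,
        "nodes": total_gpus // gpus_per_node,
        "batch_per_replica": global_batch // dp,
    }


# DP=80 and PP=2 from the text; TP=1 and 4 GPUs per node are assumptions.
summary = layout_summary(dp=80, pp=2, tp=1, global_batch=960, gpus_per_node=4)
print(summary)
```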

The poster also shows results from benchmarking novel hardware architectures with a ResNet-50 TensorFlow benchmark. Three different accelerators are evaluated: NVIDIA A100 GPU, AMD MI250 GPU, and Graphcore GC200 IPU. All are part of JURECA DC or the JURECA DC Evaluation Platform.

The column graph shows the performance in images per second for global batch sizes varying from 16 to 2048 on a single node of each accelerator. Small batch sizes suit the hardware architecture and the on-tile memory of the Graphcore IPUs. For larger batch sizes, the A100’s potential can be seen. Further optimization and evaluation are required to address the unexpectedly slow performance of the MI250. The three heatmaps shown on the poster, which represent the same data as the column plot, are shown to the right.

Recent advancements such as Sequence Parallelism and FlashAttention could further improve the training performance, and we are currently evaluating them. Finally, the poster also outlines some challenges faced in the areas of data, hardware robustness, energy, and model biases.

OpenGPT-X is funded by the Federal Ministry for Economic Affairs and Climate Action (BMWK) of Germany for the period 2022-2024. Compute time on the GCS Supercomputer JUWELS Booster at JSC is provided through the Gauss Centre for Supercomputing e.V.