1. MSA Concept
  2. MSA Software Building Blocks
  3. Workshop
    1. Exercises
      1. 1: Hello World!
      2. 2: GPU Hello World!
      3. 3: CPU-GPU Ping Pong
    2. Slides

On May 29, together with project partners from ParTec, we held a workshop about using the Modular Supercomputing Architecture (MSA). The audience consisted of collaborators from two projects funded in the same BMBF SCALEXA call: MExMeMo and IF(CES)2.

MSA Concept

Slide from the MSA Workshop slide deck showing JSC's supercomputer strategy (page 6)

While CPU-based systems are versatile and can be used for various workloads, accelerated systems (think GPUs, FPGAs, AI chips) can be very efficient, but are only suited for a subset of workloads. The MSA concept foresees a combination of HPC systems with distinct hardware (modules) into one combined super-system that a program can use as a whole. This enables heterogeneous workloads in which each part maps to the best-suited module. The MSA concept was developed to production grade in the EU-funded DEEP projects; the associated DEEP system is still being used to develop further features. By now, many production systems are built in an MSA fashion, like JURECA and JUWELS at JSC, but also, for example, Leonardo at CINECA.

Slide from the MSA Workshop slide deck showing, amongst other things, the JUWELS network topology.

In MExMeMo, the project we are involved in, we want to use MSA for a heterogeneous multi-scale physics simulation: two applications cooperatively simulate a soft matter process – one covers the large scales on CPUs, while the other simulates certain parts in great detail on GPUs and verifies the large-scale model.

MSA Software Building Blocks

Slide from the MSA Workshop slide deck with an example srun call.

From a software perspective, MSA utilizes many pre-existing components. We usually use the heterogeneous job features of Slurm to launch HPC jobs crossing module boundaries, and we use MPI to communicate between the components of the job. In addition, it can be useful to employ tools like xenv, env, or even renv to manipulate the module-specific environments in an ad-hoc fashion. Through this setup, application parts, or even entirely disparate applications, can be executed on the respective modules and exchange messages with each other, solving a scientific problem collaboratively. Advanced MSA features are implemented in ParaStationMPI, like MPI collectives that take module affinity into consideration.
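
To give a first impression before the hands-on part below, here is a minimal sketch of what such a cross-module launch can look like directly on the command line; the partition names and binary names are placeholders and depend on the system and application at hand:

# Two-component heterogeneous launch (placeholder partitions and binaries);
# the colon separates the components, and the options before each xenv call
# as well as the loaded software environment apply only to that component.
srun --nodes=1 --partition=<cpu-partition> \
  xenv -P -L GCC -L ParaStationMPI ./app.cpu.out : \
  --nodes=1 --partition=<gpu-partition> \
  xenv -P -L GCC -L ParaStationMPI ./app.gpu.out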

Workshop

Just like this blog post, the workshop started with a theoretical introduction to the topic. In their slides, ParTec shared details about the MSA concept and its usage with Slurm¹. Following that, we took over to present our MSA setup at JSC – both globally and the specific software environment. Find the slides embedded at the end!

The second part of the workshop consisted of actual hands-on exercises in fill-in-the-blank style. Augmented with TODO comments and descriptions in READMEs, the participants were steered towards solving MSA software challenges of increasing complexity.

Exercises

We have released the exercise material on Zenodo, but you can also find it in our JSC GitLab repository. In the following, a quick run-through of the three exercises.

1: Hello World!

The first exercise is a simple MPI hello world program which prints the hostname of the node it is executed on. See the following (incomplete) listing for a sketch:

MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Get_processor_name(processor_name, &name_len);
printf("Hello world from processor %s, rank %d out of %d processors\n",
        processor_name, world_rank, world_size);
MPI_Finalize();

The source code is no different from a standard MPI hello world. However, this task should teach how to launch an MPI application across two modules. That is why the focus is on the job script. Slurm's heterogeneous job support enables specifying requirements for each component of a job using the : (colon) syntax. In addition, we are using xenv to load the required software modules, which might differ between the individual components.

An example batch job definition reads:

#!/bin/bash -x
# First job component: one task on a CPU node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=cpu
# Separator between the heterogeneous job components
#SBATCH hetjob
# Second job component: one task on a GPU node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=gpu

# One srun for both components; the colon separates them
srun \
  xenv -P -L GCC -L ParaStationMPI ./mpi_hello_world.cpu.out : \
  xenv -P -L GCC -L ParaStationMPI ./mpi_hello_world.gpu.out

When executing the script with the example application on the Cluster and Booster modules of JUWELS, the two different machines can easily be identified in the output:

$ cat slurm-out.7577116 
Hello world from processor jwc00n024.juwels, rank 0 out of 2 processors
Hello world from processor jwb0001.juwels, rank 1 out of 2 processors

Of course, we are not limited to only two job components in MSA; we can add as many as the modular application needs!
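
Because each further component is just another block in the job definition, a three-component job could look like the following sketch; the partitions mirror the example above and the binary names (part_a/b/c) are placeholders we made up:

#!/bin/bash -x
# Component 1: CPU part
#SBATCH --nodes=1
#SBATCH --partition=cpu
#SBATCH hetjob
# Component 2: first GPU part
#SBATCH --nodes=1
#SBATCH --partition=gpu
#SBATCH hetjob
# Component 3: second GPU part
#SBATCH --nodes=1
#SBATCH --partition=gpu

srun \
  xenv -P -L GCC -L ParaStationMPI ./part_a.out : \
  xenv -P -L GCC -L ParaStationMPI ./part_b.out : \
  xenv -P -L GCC -L ParaStationMPI ./part_c.out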

Now that we have learned how we can launch an MSA job, we can continue and sprinkle a little GPU on top.

2: GPU Hello World!

In the second exercise, a message is sent from a CPU node directly to a GPU buffer on a GPU-equipped node:

// Allocate a device buffer and copy the initial payload onto the GPU
cudaMalloc((void**)&d_payload, 6);
cudaMemcpy(d_payload, payload, 6, cudaMemcpyHostToDevice);

hello<<<1, 1>>>(d_payload);  // contains a `printf()`

// The GPU rank receives directly into the device buffer (CUDA-aware MPI)
if (rank == 1)
    MPI_Recv(d_payload, 6, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

hello<<<1, 1>>>(d_payload);  // prints the buffer again, now with the received message

The job script from task 1 needs to be slightly extended by using GPU-specific modules for the second job component:

srun \
  xenv -P -L GCC -L ParaStationMPI ./hello.cpu.out : \
  xenv -P -L GCC -L ParaStationMPI -L MPI-settings/CUDA ./hello.gpu.out

The output then shows the combined message from both parts:

$ cat slurm-out.7916615 
hello world!

3: CPU-GPU Ping Pong

Finally, the third exercise does not just send a single “world” to a GPU buffer; instead, a continuous exchange between a CPU buffer and a GPU buffer is implemented across the module boundary (i.e. JUWELS Cluster ↔ JUWELS Booster). The message size is increased step-wise and the resulting bandwidth is printed:

Transfer size (B):          8, Time (s): 0.000002580, Bandwidth (GB/s):  0.002888259
Transfer size (B):         16, Time (s): 0.000002594, Bandwidth (GB/s):  0.005745580
...
Transfer size (B): 1073741824, Time (s): 0.086975408, Bandwidth (GB/s): 11.497502804
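
Under the hood, such numbers come from a timed ping-pong loop. The following is a minimal sketch of that idea – not the exercise solution: rank 0 (CPU module) uses a host buffer, while rank 1 (GPU module) hands a cudaMalloc'd device pointer directly to MPI and relies on CUDA-aware ParaStationMPI; the variable names and the repetition count are our own choices.

/* Sketch of a cross-module CPU<->GPU ping-pong bandwidth measurement. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 20;
    for (long size = 8; size <= (1L << 30); size *= 2) {
        char *buf;
        if (rank == 0)
            buf = (char *)malloc(size);        /* host buffer on the CPU module   */
        else
            cudaMalloc((void **)&buf, size);   /* device buffer on the GPU module */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {                   /* ping to the GPU rank, pong back */
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* average one-way time */

        if (rank == 0) {
            printf("Transfer size (B): %10ld, Time (s): %.9f, Bandwidth (GB/s): %12.9f\n",
                   size, t, (double)size / t / 1e9);
            free(buf);
        } else {
            cudaFree(buf);
        }
    }

    MPI_Finalize();
    return 0;
}

As in exercise 2, the device pointer never has to be staged through host memory explicitly; the MPI library takes care of the GPU transfer, and the averaged one-way time per message size is converted into the bandwidth shown above.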

Slides

Our slides were presented at the beginning of the workshop and were accompanied by slides from ParTec focusing on the perspective of an MSA solutions provider; the ParTec team introduced the benefits of the MSA concept and details about the MSA features of ParaStationMPI.

Download our slides or browse through them here:

  1. ParTec allowed us to upload the slides as well, please find them here