Poster in institute repository: http://dx.doi.org/10.34734/FZJ-2023-03437
During the RISC-V Summit Europe 2023 in Barcelona we presented our work on generating highly optimized RISC-V and ARM GEMM microkernels for BLIS with a custom software tool [1]. We showed results on the Fujitsu A64FX processor, on the in-development RISC-V VEC processor from the EUPILOT project via an FPGA SDV (RVV 0.7.1, later RVV 1.0.0) [2], and on a commercially available, non-HPC Allwinner D1 processor (RVV 0.7.1) [3].
One reason for choosing a code-generation approach is that analytical modeling is enough for high-performance BLIS: we assume that the parameters shaping a compute kernel can be derived from the micro-architectural parameters of the system it will run on, resulting in (nearly) optimal performance.
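To make that idea concrete, here is a deliberately tiny Python sketch of how a microkernel shape could fall out of machine parameters alone. This is not our tool's actual model: the constraints, the objective, and all default parameter values are simplifying assumptions for illustration only.

```python
# Toy model (NOT the actual analytical model of BLIS or our tool): pick a
# vector-length-agnostic microkernel shape (m_v vector registers tall,
# n_r elements wide) purely from machine parameters.
# All default values below are illustrative assumptions.

def pick_microkernel_shape(num_vec_regs=32, fma_latency=4, fma_pipes=2):
    """Maximize FLOPs per operand moved, under two constraints: the tile
    plus one A column and one B broadcast must fit in the register file,
    and enough independent accumulators must exist to hide FMA latency."""
    best_key, best = None, None
    for m_v in range(1, num_vec_regs):
        for n_r in range(1, num_vec_regs):
            if m_v * n_r + m_v + 1 > num_vec_regs:   # register budget
                continue
            if m_v * n_r < fma_latency * fma_pipes:  # latency hiding
                continue
            key = m_v * n_r / (m_v + n_r)            # FLOPs per operand
            if best_key is None or key > best_key:
                best_key, best = key, (m_v, n_r)
    return best

print(pick_microkernel_shape())  # a balanced tile, (5, 5) with these defaults
```

With this toy objective the search prefers balanced tiles; the real model weighs vector loads, broadcasts, and latencies differently and therefore produces the taller/wider shapes discussed below.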
The generator accepts the aforementioned kernel parameters and outputs the O(n³) inner part of the BLIS GEMM microkernel, for which we chose the term nanokernel. With some boilerplate code on top, a full BLIS microkernel can be created. Our tool can generate assembly code for multiple ISAs, such as different x86 flavors, ARM NEON and SVE, and RISC-V RVV (0.7.1/1.0). It also accepts parameters for tuning the implementation, for example selecting between different instruction forms, as shown below for a RISC-V DGEMM example.
[Figure: Part of the RISC-V assembly from a generated DGEMM kernel, shown in vv-FMA and vf-FMA form]
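As an illustration of the vv/vf tuning knob, the following Python sketch emits the two forms of the inner FMA update. It is not our actual generator: the register allocation is a simplifying assumption, and a real nanokernel also needs loads, pointer updates, and the surrounding k-loop. Only the RVV instructions themselves (`vfmacc.vv`, `vfmacc.vf`, `vfmv.v.f`) are real.

```python
# Toy emitter sketch (not our actual tool): produce the innermost FMA
# update of an m_v x n_r DGEMM nanokernel in vv- or vf-FMA form.
# Register assignment below is an illustrative assumption.

def emit_fma_update(m_v, n_r, form="vf"):
    asm = []
    for j in range(n_r):                      # b[j] assumed in FP reg fa{j}
        if form == "vv":
            # vv form: broadcast b[j] into a vector register first
            asm.append(f"vfmv.v.f v31, fa{j}")
        for i in range(m_v):
            acc = f"v{m_v + i + j * m_v}"     # accumulator for C tile (i, j)
            a_col = f"v{i}"                   # column slice of A, loaded earlier
            if form == "vf":
                # vf form: multiply directly by the scalar FP register
                asm.append(f"vfmacc.vf {acc}, fa{j}, {a_col}")
            else:
                asm.append(f"vfmacc.vv {acc}, v31, {a_col}")
    return asm

for line in emit_fma_update(2, 3, form="vf"):
    print(line)
```

The vf form saves the broadcast instruction per B element, which is exactly the kind of trade-off the generator parameters let us explore per target.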
In the poster, we focused on RISC-V RVV and ARM SVE. Both are newer ISAs that share the feature of being vector-length-agnostic (VLA), i.e. the vector/SIMD width is not known at compile time, and both promise exciting applications in future HPC-grade chips.
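What vector-length-agnostic means for a loop can be sketched in a few lines: the hardware vector length is only discovered at run time, so the loop processes "however many elements fit" per step (cf. RVV's `vsetvli` instruction or SVE's predicated loops). The `vl` parameter here stands in for the hardware vector length, which real VLA code never hard-codes.

```python
# VLA strip-mining sketch: 'vl' models the hardware vector length,
# which is only known at run time on RVV/SVE machines.

def vla_axpy(alpha, x, y, vl):
    """y += alpha * x, processed in hardware-sized chunks of vl elements."""
    i, n = 0, len(x)
    while i < n:
        step = min(vl, n - i)          # what vsetvli would hand back
        for j in range(i, i + step):   # one vector instruction's worth
            y[j] += alpha * x[j]
        i += step
    return y

print(vla_axpy(2.0, [1, 2, 3, 4, 5], [0, 0, 0, 0, 0], vl=4))
```

The same binary thus runs correctly on hardware with any vector width, which is why kernel sizes below are given in vector registers rather than fixed element counts.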
To assess the quality of the generated code, the tool can also generate standalone benchmarks that measure per-clock-cycle performance. The benchmarks are configurable as well: the code can run entirely from the L1 cache, redirect all memory accesses to the same memory location, or eliminate memory accesses completely to investigate compute performance independently of the memory architecture.
While multiple kernel parameters are supported, in this work we focus on the microkernel size: the size of the microtile of the C matrix that the microkernel works on. Since we focus on VLA ISAs, the size is given as a number of vector registers times a number of elements. For example, 2Vx10 on the A64FX (SIMD width of 512 bit) means that the SGEMM microkernel works on a C microtile 32 elements tall and 10 elements wide, as 32 elements at 32 bit each (32 × 32 = 1024 bit) amount to two SIMD vectors (2 × 512 = 1024 bit).
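The naming convention can be checked with a few lines of Python (the helper name is ours, purely for illustration):

```python
# "mVxn" names a C microtile whose height is m vector registers' worth of
# elements and whose width is n elements.

def microtile_shape(v_regs, n, simd_bits, elem_bits):
    elems_per_vec = simd_bits // elem_bits
    return (v_regs * elems_per_vec, n)   # (M, N) in elements

# 2Vx10 SGEMM on the A64FX: 512-bit SIMD, 32-bit elements -> (32, 10)
print(microtile_shape(2, 10, simd_bits=512, elem_bits=32))
# the same 2Vx10 for DGEMM (64-bit elements) would be (16, 10)
print(microtile_shape(2, 10, simd_bits=512, elem_bits=64))
```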
You can see the results of our benchmarks in the following plots:
The graphs show the performance of various microkernel sizes in FLOP/cycle. Different line colours stand for different M dimensions of the microkernel, while the N dimension is on the x axis. The generated code is excellent on the A64FX, reaching over 99% efficiency; on the VEC processor we get above 90% and on the D1 above 80%.
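For reference, efficiency here means measured FLOP/cycle divided by the theoretical peak FLOP/cycle of a core. The sketch below computes that peak from the number of FMA pipes and the SIMD width, assuming the A64FX's two 512-bit FMA pipelines; the measured value used is purely illustrative, not one of our results.

```python
# Efficiency = measured FLOP/cycle / theoretical peak FLOP/cycle.
# Peak per cycle = FMA units * SIMD lanes * 2 (an FMA counts as mul + add).

def peak_flop_per_cycle(fma_units, simd_bits, elem_bits):
    return fma_units * (simd_bits // elem_bits) * 2

def efficiency(measured, fma_units, simd_bits, elem_bits):
    return measured / peak_flop_per_cycle(fma_units, simd_bits, elem_bits)

# A64FX, double precision: two 512-bit FMA pipes -> 32 FLOP/cycle peak.
print(peak_flop_per_cycle(2, 512, 64))
# An illustrative (not measured) 31.8 FLOP/cycle would be ~99.4% efficiency.
print(round(efficiency(31.8, 2, 512, 64), 3))
```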
Since the Summit, we have managed to improve performance on the VEC processor by another FLOP/cycle and, in collaboration with the chip developers, have also helped uncover hints of a potential performance bug.
Work on the tool is ongoing, and we are interested in expanding its scope to other compute kernels and libraries. One of our next targets is the FFT, which, like BLAS, is quite important for the work in EUPILOT.
We also plan to release a BLIS version with our generated microkernels. Stay tuned!
The European PILOT project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No.101034126. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Spain, Italy, Switzerland, Germany, France, Greece, Sweden, Croatia and Turkey.