I present, the GPU Vendor/Programming Model Compatibility Table!
Read below for some caveats and technical background! There is also a PDF and an SVG version available.
| | CUDA C | CUDA F | HIP C | HIP F | SYCL C | SYCL F | OpenACC C | OpenACC F | OpenMP C | OpenMP F | Standard C | Standard F | Kokkos C | Kokkos F | ALPAKA C | ALPAKA F | Python |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA | ^{1} | ^{2} | ^{3} | ^{4} | ^{5} | ^{6} | ^{7} | ^{8} | ^{9} | ^{10} | ^{11} | ^{12} | ^{13} | ^{14} | ^{15} | ^{16} | ^{17} |
| AMD | ^{18} | ^{19} | ^{20} | ^{4} | ^{21} | ^{6} | ^{22} | ^{23} | ^{24} | ^{24} | ^{25} | ^{26} | ^{27} | ^{14} | ^{28} | ^{16} | ^{29} |
| Intel | ^{30} | ^{31} | ^{32} | ^{33} | ^{34} | ^{6} | ^{35} | ^{35} | ^{36} | ^{36} | ^{37} | ^{38} | ^{39} | ^{14} | ^{40} | ^{16} | ^{41} |
Although the table and its descriptions do a decent job of summarizing the state of the art (I think), there are some caveats going along with it.
As the origin of the table is in slides (which I, of course, create with LaTeX), but I also want to present it here (in HTML form), I looked for a way to generate one from the other. Nothing really worked perfectly – LaTeXML looks great, but is still a little complicated. So, I did what any reasonable programmer would do and spent way too much time scripting my way out of things.
I recreated the table as a machine-readable YAML file which is transformed to TeX and HTML by using respective templates with Jinja. Jinja is really amazing and I’m a huge fan. All the data, all files, and all scripts are in a GitHub repository: https://github.com/AndiH/gpu-lang-compat. Feel free to remix, it’s MIT!
Added hipfort; fixed a wrong symbol for ALPAKA; added OpenMP for ALPAKA (commit for both).

In MAELSTROM, we connect three areas of science: 🌍 Weather and climate simulation with 🤖 Machine Learning methods and workflows using 📈 HPC techniques and resources. A few days ago, halfway into the project, we held a boot camp at JSC to teach this Venn diagram to a group of students. Some were ML experts but had never used an HPC system. Others came from climate science but had never applied ML methods to their problems. Using the applications of MAELSTROM as examples, participants of the boot camp could learn about all these cool things hands-on – at once. In addition, to give participants some context, lectures were held to introduce weather and climate simulations, ML methods (especially focusing on large scales), and HPC. Guess what I presented? Right! HPC!
As I’ve never had the opportunity to introduce the general field of HPC (I’m usually doing just the GPU stuff), I needed to create a presentation from scratch. It was quite some work, but I’m really happy with the result. There is much more to teach about HPC, but one can only do so much in 60 minutes.
As a hook, I tried using a definition of HPC I came up with: High Performance Computing is computing with a powerful machine using the available resources efficiently. It might be a little contrived for the talk at hand, but I wanted to focus both on the powerful machines themselves and on using them efficiently. The latter part is sometimes forgotten, but ever so important, especially in times of sky-rocketing energy prices. The slides start by comparing personal computers with HPC computers, getting interactive feedback from the audience along the way and assessing their experience with HPC. Then, I focus on a few historically important supercomputers, making my way to our JSC machines and finally to Frontier. The latter I use as an example to explain a little about GPUs. To focus on the software side of things (using resources efficiently), I came up with a weird, inverted pyramid of resource utilization: 1) exploit all capabilities of a processing entity, 2) parallelize, 3) distribute. For each point, the slides show an example of how to achieve it and the important technologies involved.
Just as usual, I made the slides with LaTeX Beamer, which I particularly enjoy when I’m able to use \foreach to create little boxes and repeating graphics – and there are plenty of those in this deck. TikZ is an amazing package which I use more and more^{1}, to the detriment of typesetting durations… overlay, remember picture is basically in my muscle memory by now. For the first time, I also used tikzexternalize to save the diagram of an HPC node to a file and re-use it afterwards; LaTeX wouldn’t want to generate it 96 times (boooh), so I inserted a hidden slide before, generated the image with tikzexternalize there, and then re-used it with an \includegraphics 96 times – with \foreach, of course.
Find the slides embedded below^{2} and in referable form as hdl.handle.net/2128/32001 at our library.
It makes placing things free-floating on a slide so much easier. ↩
This is actually a minified version of the slides, using low-res versions of the images; add this minify-pdf function to your shell!

minify-pdf () {
    in="$1"
    out="${1%.*}"
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$out--minified.pdf" "$in"
}

↩
Poster publication: http://hdl.handle.net/2128/32006
The 14th JLESC workshop (JLESC: Joint Laboratory for Extreme-Scale Computing) was hosted by the National Center for Supercomputing Applications (NCSA) in Urbana, Illinois from 28th September to 30th September.
We had the opportunity to present the OpenGPT-X project in the form of a poster. On it, you can find information about the project partners within OpenGPT-X, its goals, and the use cases of large language models it explores. Various available language models are also presented.
Recent breakthroughs became possible due to the novel neural network architecture called transformer, based on so-called self-attention layers. These allow for the parallel processing of input using highly efficient matrix products.
In order to scale to a full supercomputer, three dimensions of parallelism are intertwined: data parallelism, pipeline parallelism, and tensor parallelism. The total number of GPU devices used is given by multiplying these three parallel degrees.
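As a small sketch of this bookkeeping (the parallel degrees below are made-up example values, not an actual OpenGPT-X configuration):

```python
# Sketch of 3D-parallelism accounting; the degrees are illustrative only.
def total_gpus(data_parallel, pipeline_parallel, tensor_parallel):
    """Total number of GPU devices used by a 3D-parallel training run."""
    return data_parallel * pipeline_parallel * tensor_parallel

# e.g. 8 data-parallel replicas, each a 2-stage pipeline of 2-way
# tensor-parallel groups:
print(total_gpus(data_parallel=8, pipeline_parallel=2, tensor_parallel=2))  # 32
```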
Novel AI architectures explored at the Jülich Supercomputing Centre include AMD Instinct MI250 GPUs and Graphcore IPUs.
Using the Megatron-DeepSpeed training framework, one can easily achieve about 50% of peak performance on NVIDIA A100 GPUs. In our tests, using 32 GPUs on 8 nodes of JUWELS Booster, the highest throughput (in terms of TFLOP/s) is achieved when the focus is given to data parallelism, and pipeline parallelism is used to reduce the memory footprint of the model.
Project challenges include hardware related spurious errors and energy consumption.
Our runs in OpenGPT-X are made possible through compute time on JUWELS Booster, given by the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) through the John von Neumann Institute for Computing at JSC.
In the spirit of Open Science, wouldn’t it be great to acknowledge these little bits of science blog posts and have the option to refer to them in a scientifically sound way? Like… with DOIs, Digital Object Identifiers, the gold standard for referring to scientific work? Well, 🥁, the posts in this blog have DOIs now, including their metadata stored in a metadata repository!
Thanks to help from our Forschungszentrum library, we are able to use DataCite as DOI provider. I built a little Python tool which uses the DataCite API to register metadata of a blog post and mint a DOI. The DOI is shown in the header of each post, next to the license of the post (which is also new); the first part of the suffix of the DOI always contains xdvblg. I have released the Python tool as Open Source software as well, with a Zenodo DOI attached. It should be suitable for any other Jekyll-based blog as well!
I created DOIs retroactively for all previous blog posts, allowing us to link and refer to them a little more properly in scientific contexts^{2} from now on and have the metadata discoverable. Let’s see if it sticks!
Read on for some technical details and design decisions.
DataCite is a service to store metadata of publications and create an optional DOI for it. Metadata can be viewed through their website or queried via APIs.
The Python tool, which I call doi-jekyll, is hosted on GitHub and released under an MIT license. Snapshots at Zenodo and corresponding DOIs are automatically created for every GitHub release via the Zenodo GitHub integration. With this blog post, I released v1.0!

doi-jekyll is a command line application which can be installed via pip^{3}. It parses metadata from different locations within a Jekyll blog tree structure, assembles it into a validating instance of the DataCite Metadata Schema, submits the metadata to DataCite, and registers an auto-generated DOI. To build the metadata, data from an individual blog post (like title, license, abstract), from an author file (like name, ORCID iD), and from the blog itself (like blog title, blog DOI, but also API endpoint) are collected. The blog DOI^{4} is given as Collection metadata which every blog post inherits, creating a relationship between the blog posts and the blog itself. The latest schema version is only available via an XML API (the JSON API is stuck on an older version which doesn’t support the cool relational info). Because of this, the metadata is assembled in doi-jekyll as a Python dictionary and then internally converted to XML via xmltodict; a little bit of extra effort, but working with Python dictionaries is so much easier than working with XML^{5}.
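The dict-to-XML step can be sketched with the standard library alone (doi-jekyll itself uses xmltodict; the field names below are hypothetical placeholders, not the actual DataCite schema):

```python
import xml.etree.ElementTree as ET

def dict_to_xml(tag, d):
    """Recursively convert a nested dict into an ElementTree element."""
    elem = ET.Element(tag)
    for key, value in d.items():
        if isinstance(value, dict):
            elem.append(dict_to_xml(key, value))
        else:
            child = ET.SubElement(elem, key)
            child.text = str(value)
    return elem

# Hypothetical, schema-like metadata -- not the actual DataCite fields.
metadata = {"titles": {"title": "My blog post"}, "publicationYear": 2022}
xml_string = ET.tostring(dict_to_xml("resource", metadata), encoding="unicode")
print(xml_string)
```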
It took me a little bit of trial and error to assemble a validating metadata package conforming to the DataCite Metadata Schema; luckily, DataCite has a test instance (called Fabrica Test) to fiddle around with. While the interface of doi-jekyll is made for this blog, it should work for any Jekyll blog; it has plenty of command line (and other) options to configure usage. A few examples: for testing, --dry-run skips communication with DataCite but does all the rest; --skip-url registers metadata at DataCite but does not mint a DOI. With --additional-metadata, further metadata can be specified to integrate into the to-be-uploaded metadata; a corresponding key doi-additional-metadata in the YAML front matter of the post is available. To document the tool and show some examples, the GitHub repository features an example Jekyll blog with examples of all necessary files.
Let’s see where this weird journey of sciencifying blog posts leads us. The first person to place an xdvblg DOI reference in a paper gets a cupcake!
Alternative title of the post: DOIs “R” Us. It did not make the cut. ↩
Or as fancy shortlinks with attached metadata! ↩
Not from the Python Package Index, yet; one needs to use the GitHub URL directly. ↩
The parent DOI of the blog itself is https://doi.org/10.34732/xdvblg-mn; mn like main, you know? ↩
Sigh, XML, amirite? ↩
We’ve deployed the nodes somewhat silently in the spring and are polishing them and getting to know them ever since. Starting off with a pre-GA software stack, by now we run the publicly available ROCm 5.2. There are still some minor issues with the nodes, but the GPUs themselves are running reasonably well to finally show some very basic benchmarks!
Still on pre-GA software, we also held an AMD Porting Workshop, in which we worked together with application developers and AMD to enable first users for the system. Despite the unfinished, preliminary software environment, we could achieve some interesting results. Check them out on the workshop’s Indico!
But now, let’s understand the devices better by looking at the OSU bandwidth micro-benchmark and a GPU variant of the STREAM benchmark. Plenty of graphs follow, click on them to enlarge. Find some technical details at the end.
First off, the one-directional bandwidth micro-benchmark from the OSU microbenchmark suite, osu_bw. It is usually used for testing MPI connections, but can also be abused to get a glimpse of inter-device bandwidths. See the dedicated section at the end for technical details.
The picture shows bandwidth data for two message sizes, large (64 MiB, left) and small (4 MiB, right). Each color-coded box contains the bandwidth of a message going from GPU with certain ID to another GPU with a certain ID. Also included are messages going from the GPU to itself – for example from GPU 0 to GPU 0^{2}.
One immediately sees that there are not four GPU IDs but eight. That is a feature of the MI250 GPUs: each MI250 is built as a multi-chip module (MCM) with two GPU dies contained in each MI250 device package. Each GPU die is very similar to an AMD Instinct MI100 GPU, and it has access to half (64 GB) of the total memory. From a software perspective, each MI250 GPU is actually displayed as two GPUs and needs to be used as such. For most practical purposes, it is much simpler to think of the system with four MI250 GPUs as a system of eight MI250 GPUlets. The proper name for GPUlet is Graphics Compute Die (GCD), which is displayed in the picture.
Even on a birds-eye view one can immediately see the clusters of two GCDs which belong together and form a GPU; like GPU 0 and 1, displayed in a blue 2-by-2 box, and GPU 2 and 3, etc., all on the main diagonal. The reason: GCDs on one GPU are connected well to each other with many links and have great bandwidths; for the large message size usually around 155 GiB/s.
Implicitly, the clusters tell us even more about the inter-GPU connections: There are not only blue 2-by-2 boxes, but also green and yellow boxes. Focusing on the first row with bandwidths from GCD 0 to other GCDs, one can see that to GCD 2+3 and GCD 6+7 the bandwidths are each around 40 GiB/s, and to GCD 4+5 the bandwidths are around 80 GiB/s.
The entire structure is the result of the complex connection topology of the GPUs. Each GCD has eight Infinity Fabric ports, with each Infinity Fabric link having a peak bandwidth of 50 GB/s^{3} in one direction. On a GPU, the two GCDs are connected with four Infinity Links, amounting to a peak bandwidth of 200 GB/s (or 400 GB/s, if you add up both directions). Going out of the MCM, things are a bit more convoluted. There are GCDs which are connected to other GCDs with two direct links (like GCD 1 → GCD 4) and GCDs connected to other GCDs with one direct link (like GCD 0 → GCD 2). Through their respective partner GCD, there might be other indirect links. And in addition, there are Infinity Fabric links going to the PCIe switch and then to the network or CPU. If you look closely, you can also see the indirect connections in the bandwidth pattern of the picture (like GCD 0 → GCD 4 being slightly faster than GCD 0 → GCD 5, although 4 and 5 are part of the same package).
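To keep the link counts straight, here is a small sketch tabulating the per-path peaks implied by the numbers above; note the GiB-vs-GB distinction when computing utilization against these peaks:

```python
# Peak one-directional bandwidths implied by the link counts described above
# (50 GB/s per Infinity Fabric link and direction; values taken from the text).
LINK_GB_S = 50.0

peak_gb_s = {
    "intra-GPU (4 links)":  4 * LINK_GB_S,  # e.g. GCD 0 -> GCD 1
    "dual link (2 links)":  2 * LINK_GB_S,  # e.g. GCD 1 -> GCD 4
    "single link (1 link)": 1 * LINK_GB_S,  # e.g. GCD 0 -> GCD 2
}

def utilization(measured_gib_s, peak):
    """Benchmarks report GiB/s, peaks are quoted in GB/s -- convert first."""
    return measured_gib_s * 2**30 / 1e9 / peak

for name, bw in peak_gb_s.items():
    print(f"{name}: {bw:.0f} GB/s peak per direction")
```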
All in all, it’s a hell of a complex pattern and I’m curious about the load imbalances of future Multi-GCD applications…
Now that we know how the patterns come to be, we can look at bandwidth usage relative to the various peaks. Enable relative numbers by clicking on the “Relative” toggle below the picture up top. We can see that there’s good utilization around 90% for the direct connections, and 80% for the indirect connections. For the smaller message size it’s somewhat similar compared to the larger message size, albeit 20 percentage points (pp) lower for the direct connections (indirect: 10 pp).
I also ran the micro-benchmark in the same fashion on a usual GPU node of JURECA DC with four NVIDIA A100s.
The first thing to notice is the uniformity of the connections. In the node design we deploy on JURECA DC, there are always four NVLink 3 connections between each pair of GPUs – 87 GiB/s for all possible connections (for large message sizes). Using the memory on the same GPU, 592 GiB/s are reached; roughly 130 GiB/s more than on an MI250 GCD. In terms of relative performance – which can be viewed by flipping the switch below the picture – the links to other GPUs can be utilized to 93%, the own-memory accesses to 41%.
Time will tell if there is more software tuning room available for the MI250s or if the difference is part of the architectural choices. Noteworthy: One MI250 (i.e. two GCDs) has a TDP of 560 W, while one A100 has 400 W.
Another simple benchmark to test certain aspects of a device’s memory is the STREAM benchmark, of which I ran my own GPU variant on the MI250s. I used an old CUDA code which I HIPified with the hipify-perl tool; it ran without a single further change. Quite amazing.
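For reference, the four STREAM kernels are simple vector operations; a plain-Python sketch of what each one computes (behavior only – the actual benchmark runs these as GPU kernels on large arrays and measures memory bandwidth):

```python
# The four STREAM kernels as plain Python loops.
def copy(a, b):        # a[i] = b[i]
    for i in range(len(a)):
        a[i] = b[i]

def scale(a, b, s):    # a[i] = s * b[i]
    for i in range(len(a)):
        a[i] = s * b[i]

def add(a, b, c):      # a[i] = b[i] + c[i]
    for i in range(len(a)):
        a[i] = b[i] + c[i]

def triad(a, b, c, s): # a[i] = b[i] + s * c[i]
    for i in range(len(a)):
        a[i] = b[i] + s * c[i]

a, b, c = [0.0] * 4, [1.0] * 4, [2.0] * 4
triad(a, b, c, 2.0)
print(a)  # [5.0, 5.0, 5.0, 5.0]
```

Copy and scale touch two memory streams per element, add and triad three – which is why their bandwidth behavior is discussed separately below.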
One GCD reaches around 1.42 TB/s for the copy kernel and about 1.34 TB/s for the triad kernel when the message size is large enough, as the inset view of the above linear plot shows (left). For triad, this is about 82% of the theoretically available peak. The double-logarithmic plot (right) shows well that the increase to the maximum bandwidth is regular (and according to a power law) and that the maximum is reached around \(2^{26}\) Byte (64 MiB).
Below the plot, there’s a switch to show results for A100. The GPU has a lower peak bandwidth compared to MI250, but reaches nearly identical values for copy (1.42 TB/s) and triad (1.35 TB/s) kernels of the benchmark – resulting in utilization of 87% of the available peak. The data point at \(2^{23}\) Byte (8 MiB) is a weird, systematic outlier which reaches the peak (or even beyond).
It is interesting, how closely a MI250 GCD matches the performance of an A100 GPU. In the following plot, I compare the triad bandwidth behaviors directly.
Especially in the double-log plot one can see that the A100 is always a tiny amount faster. Beyond the weird outlier, it matches the MI250 GCD bandwidth much more closely. Still, for the final value, the A100 is about 3.6% faster than the MI250 GCD.
To understand how well the memory can be accessed depending on the number of threads per block (work items in a workgroup, in AMD terminology), the picture above shows four plots – one for each of the STREAM kernels. On the x axis, three data sizes are shown: 0.5 GiB, 2 GiB, and 8 GiB – values on the larger side of things and on the plateau of the previous STREAM plots. On the y axis, four semi-typical values for threads per block are chosen.

It appears that 256 threads per block is always a good choice, so that’s going to be my go-to default for the future. You can view relative values for the link usage by flipping the switch below the picture – the usage is between 76% and 88%. It’s worthwhile to run a simple test like this once for your actual application, as the number of threads can in most cases be chosen somewhat freely and may offer improvements of up to 7 pp (see the add kernel for 2 GiB).
At first glimpse, the behavior of the A100 looks very similar. And – as expected – it is able to achieve higher bandwidths and higher relative usage. Note the different color scales: the lower bound for the A100 is 1270 GiB/s, not the 1140 GiB/s of the MI250. On second look, there seems to be a different underlying trend in the behavior of the A100. For the one-vector kernels (copy, scale), the A100 seems to prefer fewer threads and larger messages. For the two-vector kernels (add, triad), the last column for 8 GiB is interesting, as the bandwidth drops by 20 GiB/s when going from 128 threads to more threads. All of this is probably not very relevant for real-world applications, but fun to see!
The AMD Instinct MI250, the GPU design which breaks the Exascale barrier in Frontier^{4}, is quite a powerful GPU, featuring up to 90 TFLOP/s of FP64 performance. We deployed two nodes with four MI250s each in JURECA DC as part of an Evaluation Platform at the beginning of 2022. After some setup time, the nodes can now be used for tests. Results from an early porting workshop can be found online, and Moritz Lehmann has just published a paper with results obtained on the machine.
I used the bandwidth experiment of the OSU Microbenchmarks to study connections between the GPUs of a node with MPI. One can see that each MI250 consists of two Graphics Compute Dies (GCD) which are basically two individual GPUlets on a GPU. The obtainable bandwidths are diverse, due to the complex connection matrix between the GCDs. Bandwidths between GCDs on the same GPU are usually about 150 GiB/s, and between GCDs of different GPUs between 80 GiB/s and 40 GiB/s. I also showed results for A100 GPUs which have much more homogeneous connections, with always 87 GiB/s between the GPUs.
As a second experiment, I ran a CUDA variant of the STREAM benchmark, which I HIPified easily for AMD. When increasing the data size, one can see that the memory bus is saturated at around 64 MiB data sizes, and eventually a 1.42 TB/s bandwidth is reached – about 87% of the available peak of the GCD^{5}. Looking at different numbers of threads per block, 256 threads seems to be a good choice, memory-wise. In comparison to A100 GPUs, one sees that the obtained bandwidth is surprisingly similar (the A100 is slightly faster, though) – but with the peak bandwidth a little lower for the A100.
Each GCD seems to be similar to an A100 in many ways. For the connection-targeted benchmarks shown, a MI250 GCD is usually a little slower and less efficient than the A100. But using 30% less power. Quite interesting devices.
Benchmarks were performed on the AMD Instinct MI250 nodes of JURECA DC’s Evaluation Platform. While the systems run publicly available software and firmware versions, the benchmarks were run while we still got to know the systems. Please let me know if you discover errors or have significantly different results on another machine. The evaluation notebooks are linked below.
The following software and versions were used:

- UCX transports selected via UCX_TLS=rc_x,self,sm,rocm_copy,rocm_ipc for ROCm and UCX_TLS=rc_x,self,sm,cuda_ipc,gdr_copy,cuda_copy for CUDA
- OSU Microbenchmarks version 5.9; compiled as per the official OpenUCX instructions:
./configure --enable-rocm --with-rocm=/opt/rocm CC=$(which mpicc) CXX=$(which mpicxx) LDFLAGS="-L$EBROOTOPENMPI/lib/ -lmpi -L/opt/rocm/lib $(hipconfig -C)" CPPFLAGS="-std=c++11"
Run by setting HIP_VISIBLE_DEVICES=A,B, like:
HIP_VISIBLE_DEVICES=0,1 \
srun -n 2 mpi/pt2pt/osu_bw -d rocm -m 4194304:4194304 D D
Base code from my GitHub – github.com/AndiH/CUDA-Cpp-STREAM – then compiled as follows for AMD:
hipify-perl CUDA-Cpp-STREAM/stream.cu > stream.cu.hip
HIP_PLATFORM=amd hipcc --offload-arch=gfx90a -o hip-stream stream.cu.hip
Run by looping through data sizes:
./stream -n $((2**0)) -t --csv -f | tee file.csv && \
for i in {1..28}; do \
./stream -n $((2**$i)) --csv -f; \
done | tee -a file.csv
The graphs presented here are created in Jupyter Notebooks with Pandas, Matplotlib, and Seaborn. Find the Notebooks here for reference, including the evaluation and raw data.
Since publication of this blog post, the following edits were made
Actually, the Evaluation Platform was created together with the AMD nodes! ↩
The data rate on each GPU itself gives only a rough idea about the memory bandwidth; it’s not a proper memory benchmark because of the implementation and indirections – STREAM is much better suited for that. For STREAM, see further down in the text. ↩
Infinity Fabric is also called xGMI. One xGMI lane can do 25 Gbit/s, and there seem to be 16 lanes per link. So, one Infinity Fabric connection can do 50 GB/s. ↩
Actually, Frontier does not deploy MI250s but MI250Xs. The difference is mainly in the number of compute units: MI250X has 220 and MI250 has 208. There are performance difference because of this (like 95.7 TFLOP/s peak vs. 90.5 TFLOP/s), but no direct differences relating to memory. An additional difference in the design of Frontier is relating to the CPU: The GPUs are directly connected via a coherent Infinity Link to a single CPU – not PCIe, no two CPU sockets. ↩
The advertised 3276.8 GB/s peak memory bandwidth are actually for the full GPU. I divided by two to get the per-GCD bandwidth; 1638 GB/s. ↩
OPTIMA is an EU-funded project whose goal is to prove that several HPC applications can take advantage of the future highly heterogeneous FPGA-populated HPC systems. In addition, by using newly introduced tools and runtimes, application porting/development can be almost as simple as developing software for conventional HPC systems incorporating GPUs.
Deliverable 3.5 is the first version of an Open-Source library called OOPS (Optima Open Source) for FPGA-based HPC systems. This library contains a set of optimised software routines for industrial and scientific applications, taking advantage of OPTIMA hardware platforms.
The OOPS library follows a standard C-based application programming interface (API) and supports the latest Xilinx Alveo FPGA cards, such as the U55C and U280. This first version of OOPS contains the following kernels. Initial tests show similar or better performance of a single compute unit in comparison to single-threaded CPU versions for most of the kernels. In fact, as shown in detail in the deliverable, this first version uses just a fraction of the FPGA resources.

The library will continue to receive updates and bug fixes in the future, the immediate ones focusing on optimisation to achieve excellent energy-performance ratios. Later updates will include adding device-specific implementations such as utilising High Bandwidth Memory, adding more solvers such as a Jacobi preconditioner, and allowing massively parallel processing using more compute units.
More details about OPTIMA and the deliverable are available in PDF form here.
This blog post is based on a presentation I held at the “New Trends in Computational Science in Engineering and Industrial Mathematics” workshop in Magdeburg on 01/07/2022. My goal is to give a brief introduction to the state of current large language models, the OpenGPT-X project, and the transformer neural network architecture for people unfamiliar with the subject.
The audience at the workshop had a mathematics background and is assumed to have a good understanding of linear algebra, but not necessarily of neural networks. Basically, the target audience is past me from before I started working on this project with the goal of understanding the math behind transformers. The questions I want to answer are:
If you find any mistakes or unclear points, feel free to let me know so I can improve this post.
Natural language processing deals with making the human language accessible for computations.^{1} ^{2} Having a computer understand what you say can help in many situations. Applications of NLP include intelligent speakers, chatbots, translation, text generation, summarization and much more.
A language model forms the backbone of these applications. A language model is just a probability distribution: given a sequence of words \(w_{1:(t-1)}=(w_1,\dots,w_{t-1})\), a language model gives the probability of each word in your vocabulary \(V\) to follow this sequence,
\[P(w_t| w_{1:(t-1)}),\qquad w_1,\dots,w_{t-1},w_{t}\in V.\]

With such a language model one can generate new text: start with a sentence, then choose the word with the highest probability (or sample according to the probabilities) and feed the newly appended sequence back into the model to generate the next word. The language model can also be used to assign a probability to a sentence (using the chain rule of conditional probabilities) as
\[P(w_{1:n}) = \prod_{i=1}^{n} P(w_i|w_{1:(i-1)}).\]

One can imagine this being helpful in grammar correction, for example.
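The chain rule above can be sketched with a toy model (all probabilities below are made up purely for illustration; a real language model would come from a neural network):

```python
# A toy "language model": hand-made conditional probabilities.
TABLE = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("the", "cat"): {"sleeps": 0.9, "runs": 0.1},
}

def p_next(word, context):
    """P(word | context), looked up in the toy table."""
    return TABLE.get(tuple(context), {}).get(word, 0.0)

def sequence_probability(words):
    """P(w_1..n) = prod_i P(w_i | w_1..i-1), via the chain rule."""
    p = 1.0
    for i, w in enumerate(words):
        p *= p_next(w, words[:i])
    return p

print(round(sequence_probability(["the", "cat", "sleeps"]), 4))  # 0.18
```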
There are different ways to arrive at such a language model. One could think about putting all rules of grammar and the meaning of words into a computer program. However, this is extremely difficult to do. The approach that caught on in recent years and produced very impressive language models does not require encoding explicit grammar or world knowledge. Instead, neural networks are trained on huge amounts of text and learn to form proper sentences just from the data they see.
In order to understand the broader context of the transformer architecture in NLP applications, we clarify some terms related to training and application of large language models.
The learning methodology described by the first two steps (pre-training followed by fine-tuning) is called sequential transfer learning.^{3}
All these steps need computing resources. The computational device of choice is typically the GPU due to the massive parallelism it provides and hardware features that make it extremely efficient in performing matrix multiplications. We will see below (in the section Attention please!) how matrix multiplications form the core of training the model. Pre-training of large models is the most computationally demanding step and happens on a supercomputer such as JUWELS at Forschungszentrum Jülich using lots (hundreds) of GPUs in parallel. Fine-tuning and inference may happen on server systems with a handful of GPUs.
Neural networks are everywhere. You might be familiar with the basic ideas. There are many great resources to learn the foundations.^{4} ^{5} The goal of training a neural network is to learn input-output relations from data. When a neural network is well-trained, a vector representing input data is fed to an input layer. In illustrations this is on the left (like the one to the right by Dake & Mysid on Wikimedia Commons). Then it is processed by passing several hidden layers until it reaches an output layer. Moving from one layer to the next means multiplying the vector with a matrix, adding another vector and applying a non-linear activation function. This is called a forward-pass or forward-propagation.
The elements of the matrices are called weights, the elements of the additive vector are called biases. Weights and biases are the parameters that are learned during training. For your training data, the output given by the network should closely match the real desired output, i.e. the loss function (a measure of the difference between the network’s output and the desired output) should be minimal. If this is not yet the case, we change the parameters to achieve a smaller loss. This is done using gradient descent: the gradient of the loss function with respect to the parameters is computed, and the parameters are updated by subtracting the gradient multiplied by a step size (called the learning rate). The actual computation of the gradients uses the chain rule from calculus and involves starting at the output layer and moving backwards through the network. This is why computing the gradients is called backward propagation.
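This loop can be boiled down to a one-parameter sketch (a hypothetical toy example, not any real network): gradient descent on a single weight for a single training example.

```python
# Gradient descent on one weight: minimize loss(w) = (w*x - y)^2 for a single
# training example (x=2, y=6; the desired output is y, so the optimum is w=3).
x, y = 2.0, 6.0
w = 0.0                 # initial weight
learning_rate = 0.1

for step in range(100):
    output = w * x               # forward pass
    grad = 2 * (output - y) * x  # backward pass: d/dw (w*x - y)^2
    w -= learning_rate * grad    # parameter update (subtract the gradient)

print(round(w, 3))  # 3.0
```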
In practice, more useful heuristics are added to this process, and it works very well for many tasks. However, it is difficult to use the fully-connected neural network for NLP tasks. One problem is that the input size is fixed, and we would like to process longer as well as shorter word sequences as input. In general, a dense neural network does not represent the nature of language very well.
Luckily, this standard feed-forward neural network is only the most basic neural network architecture of many that were devised over the years for various applications.
In the field of NLP and language modelling, until recently, sequential models were the state of the art. These include recurrent neural networks (RNNs) and long short-term memory (LSTM) networks.^{6}
RNNs apply the same neural network (with learned parameters) to every word in a sequence of words. Additionally, this neural network takes an internal state as input, which comes as output from the neural network associated with the previous word. This way the network can learn to use information from earlier words in the sequence. When one writes down the gradient of the loss function with respect to the parameters using the chain rule, one can see that the newest word has the most influence; the influence of the previous words diminishes exponentially. Intuitively, this makes sense: for choosing the next word, the most recent word is on average more important than a word further in the past. However, in practice, language is more nuanced. Some specific words in the past can be very important for choosing future words, and a smart neural network should know how to look for them. Just think of a very long relative clause, for example. Older words having less influence on the gradients is therefore more of a bug than a feature, and this is called the vanishing gradients problem.
LSTMs alleviate this issue by introducing an extra cell state (serving as “memory”) whose exact influence is determined by gates that are defined by more learnable parameters.
One drawback remains: Both RNNs and LSTMs process their input data sequentially. Consider the forward pass: In order to apply the neural network (a series of matrix multiplications) to an input word vector \(x_i\), we also need the result of applying the network to the previous word vector \(x_{i-1}\). We cannot stack the word vectors together in a matrix and apply the neural network all at once.
Formulating algorithms to use matrix-matrix products as their main computational element is a good step towards the efficient use of modern compute hardware. This is true from the small scale of a single processor up to the large scale of supercomputers using thousands of GPUs. Matrix-matrix products are the key.
Realizing this need, researchers started “having intuitions” about neural network architectures that employ these operations to learn to pay attention to other relevant words.
The so-called attention mechanism had been employed in the context of sequence models to give the model the opportunity to learn which words are relevant for the next word. The landmark paper “Attention is all you need” (2017) ^{7} showed that you do not need a recurrent network structure, and that the attention mechanism (together with some other tricks like positional encoding) is powerful enough for impressive results. The resulting neural network architecture is called a transformer.
In the following we describe a forward-pass through a (self-)attention layer, which forms the central element of a transformer block. A neural network architecture is called a transformer when it consists of several transformer blocks. Backpropagation is taken care of by using the automatic differentiation engines of frameworks such as PyTorch or TensorFlow.
Consider a sequence of input tokens \(x_1,\dots, x_n\in\mathbb{R}^{n_\text{model}}\) represented by vectors. Tokens are the smallest building blocks into which word sequences are divided for processing. The process of getting a sequence of tokens (represented as a series of integers referring to a vocabulary) from a text string is called tokenization. The vector representation of a token is called an embedding and spatially encodes the meaning of tokens and their relationship towards each other. In the case of transformers, word embeddings are also learned during pre-training. You can think of this as a matrix with learned entries being multiplied with a one-hot vector, i.e. choosing row \(i\) when the token is encoded as integer \(i\). A one-hot vector is called a (standard) unit vector in numerical linear algebra.
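The one-hot view of an embedding lookup can be checked in a few lines; the vocabulary size, the embedding dimension, and the embedding matrix here are made up for illustration:

```python
import numpy as np

# Embedding lookup as a one-hot product: multiplying a (standard) unit
# vector with the learned embedding matrix E selects one of its rows.
vocab_size, d_model = 10, 4
E = np.arange(vocab_size * d_model, dtype=float).reshape(vocab_size, d_model)

token_id = 3                       # token encoded as integer 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ E           # matrix product with the unit vector ...
via_lookup = E[token_id]           # ... equals a direct row lookup

print(np.array_equal(via_matmul, via_lookup))  # True
```

In practice, frameworks implement embeddings as the direct row lookup, since multiplying with an explicit one-hot vector would waste memory and compute.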
The processing of the first three input vectors \(x_1, x_2, x_3\) to generate an output vector \(y_3\) is seen in the following diagram:^{2}
Among the learned parameters of a transformer block are three matrices \(W_k\), \(W_q\) and \(W_v\). They transform an input vector \(x_i\) to generate three vectors \(k_i\), \(q_i\) and \(v_i\). The convention is to treat the vectors as row vectors and apply the matrix from the right:
\[k_i \leftarrow x_i W_k\in\mathbb{R}^{1\times d_k},\quad q_i \leftarrow x_i W_q \in\mathbb{R}^{1\times d_k},\quad v_i \leftarrow x_i W_v \in\mathbb{R}^{1\times d_v}, \\ \text{for } i=1,\dots, n.\]The vectors \(k_i\), \(q_i\) and \(v_i\) are called keys, queries and values, respectively. There is some intuition behind these names that imagines the attention mechanism as retrieving information similar to a database. But I did not find this very helpful in understanding what is going on, so I will not go into more detail here.
To compute an output vector \(y_i\), one first computes the scalar products of the query vector \(q_i\) with all key vectors \(k_1,\dots, k_i\) up to and including position \(i\). In order to prevent the scalar products from growing too large (which would saturate the softmax), the results are scaled by \(\sqrt{d_k}^{-1}\). Then the softmax activation function is applied.
\[\alpha_{i,j} \leftarrow \frac{q_i k_j^{T}}{\sqrt{d_k}}\quad \text{for }j=1,\dots, i\\ \alpha_{i,j} \leftarrow \text{softmax}(\alpha_{i,j}) = \frac{\exp{(\alpha_{i,j})}}{\sum_{l=1}^i{\exp{(\alpha_{i,l})}}}\quad \text{for }j=1,\dots, i\]The softmax function, applied to a set of \(n\) values, returns \(n\) values between 0 and 1 that sum up to one. Larger values are mapped closer to one and smaller values are mapped closer to zero. In a “hard” max function, the largest value would be mapped to 1 and all smaller values to 0. The name “softmax” comes from it being a “softer” version of this.
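A small softmax sketch in NumPy; subtracting the maximum first is a standard implementation detail for numerical stability, not something the equations above require:

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum before exponentiating avoids overflow;
    # the result is unchanged because softmax is shift-invariant.
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 5.0])
p = softmax(z)
print(p)        # three values in (0, 1); the largest input dominates
print(p.sum())  # the values sum to one
```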
Now the output vector is given as a weighted sum of the value vectors, with the scalars \(\alpha_{i,j}\) as weights.
\[y_i \leftarrow \sum_{j=1}^i \alpha_{i,j} v_j \quad \text{for }i=1,\dots, n.\]The beauty of the attention mechanism is now that we can consider all input vectors at once by stacking them on top of each other, forming a matrix
\[X = \begin{bmatrix} - x_1 -\\ \vdots\\ - x_{n} - \end{bmatrix}\in\mathbb{R}^{n\times n_\text{model}}.\]Keys, queries and values of all input vectors are computed via matrix-matrix multiplication as
\[K= \begin{bmatrix} - k_1 -\\ \vdots\\ - k_{n} - \end{bmatrix} \leftarrow XW_k \in\mathbb{R}^{n\times d_k},\quad Q=\begin{bmatrix} - q_1 -\\ \vdots\\ - q_{n} - \end{bmatrix} \leftarrow XW_q \in\mathbb{R}^{n\times d_k}, \\ V=\begin{bmatrix} - v_1 -\\ \vdots\\ - v_{n} - \end{bmatrix} \leftarrow XW_v\in\mathbb{R}^{n\times d_v}.\]The scalars \(\alpha_{i,j}\) can now be computed as a softmax applied to the rows of a matrix-matrix product
\[A = [\alpha_{i,j}]_{i,j=1,\dots,n} \leftarrow \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \in\mathbb{R}^{n\times n}.\]The next step is the summation of the value vectors, weighted with the values \(\alpha_{i,1},\dots,\alpha_{i,n}\) (row \(i\) of \(A\)). This is realized for all vectors \(y_1,\dots,y_n\) at once by – you guessed it – another matrix-matrix product. So in total we have
\[Y = \begin{bmatrix} - y_1 -\\ \vdots\\ - y_{n} - \end{bmatrix} \leftarrow \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \in\mathbb{R}^{n\times d_v}.\]Further remarks on simplifications we made for clarity in the equations:
- In practice, the matrix-matrix products are carried out by highly optimized gemm (general matrix-matrix multiplication) routines, and many sequences are processed at once (in a batched fashion), which is not stated here explicitly.
- For the matrix form to match the per-token equations above, the entries of \(QK^T\) with \(j > i\) have to be masked out before the softmax, so that a position can only attend to itself and earlier positions.

Transformer neural networks arrange attention layers and other network layers in various configurations. A number \(h\) of attention layers (attention heads) are connected in parallel to form multi-headed attention. Every head has independent training parameters. The attention heads’ outputs (matrices of dimension \(n \times d_v\)) are concatenated, forming a matrix of dimension \(n\times h d_v\). This matrix is brought back into the right shape by multiplying it with another trained matrix \(W_O\in\mathbb{R}^{hd_v\times n_\text{model}}\):
\[Y \leftarrow \begin{bmatrix} Y_1&\cdots & Y_h\end{bmatrix} W_O \in\mathbb{R}^{n\times n_\text{model}}.\]Multi-headed attention together with normalization layers, feed-forward layers, and residual connections forms a transformer block. The input and the output of a transformer block have the same shape, so blocks can be connected in series. For example, in GPT-1 the transformer block is repeated 12 times. In order to generate a probability distribution for the next word in a sequence, one more linear transformation layer and a softmax are employed at the very end.
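Putting the pieces together, a single masked (causal) self-attention head in matrix form might look like the following NumPy sketch. All sizes and the randomly initialized parameter matrices are illustrative, and a real implementation would of course rely on a framework like PyTorch or TensorFlow:

```python
import numpy as np

# One masked self-attention head, following the matrix equations above.
rng = np.random.default_rng(42)
n, n_model, d_k, d_v = 6, 8, 4, 4

X   = rng.normal(size=(n, n_model))   # stacked input vectors
W_k = rng.normal(size=(n_model, d_k))
W_q = rng.normal(size=(n_model, d_k))
W_v = rng.normal(size=(n_model, d_v))

K, Q, V = X @ W_k, X @ W_q, X @ W_v   # one gemm each

scores = Q @ K.T / np.sqrt(d_k)       # (n, n) scaled scalar products
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                # position i may only attend to j <= i

# Row-wise softmax (max subtraction for numerical stability)
scores -= scores.max(axis=1, keepdims=True)
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)

Y = A @ V                             # weighted sums of value vectors
print(Y.shape)  # (n, d_v)
```

Multi-headed attention would run \(h\) copies of this with independent parameters, concatenate the \(Y\) matrices, and multiply with \(W_O\).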
The exact transformer architecture can vary and depends on the training objective. The original paper (Attention is all you need) considered machine translation. Here, an encoder-decoder structure makes sense: First the sentence in the original language is encoded using a stack of transformer blocks as described above. Both directions of information flow are allowed. The decoder’s structure is mostly similar except that the self-attention is masked and there is a second (multi-head) attention layer in each transformer block. In contrast to the forms of attention we discussed before, this is not self-attention, but instead attention is paid to the outputs of the encoder: The output vectors of the encoder are used to compute key and value vectors which serve as input for the decoder’s attention block.
I would suggest not to think too much about whether a network architecture is an “encoder” (BERT)^{8} or a “decoder” (GPT)^{9} and not to try to relate them to the encoder-decoder architecture from the Attention is all you need paper. They are similar in the main ideas, and details vary anyway. The main difference is the masking during training as described above. My theory is that BERT decided to call itself an encoder mainly to get an “E” for its acronym, to keep the running gag about Sesame Street characters going.
In 2018 the GPT (Generative Pre-trained Transformer) model ^{9} by the company OpenAI started an avalanche of publications describing pre-trained neural networks based on the transformer architecture. Now models could become more powerful just by throwing more compute power and data at them. Larger and larger models were trained. The BERT (Bidirectional Encoder Representations from Transformers)^{8} model by Google followed in the same year (2018). Both have similar architectures corresponding to a series of transformer blocks, making them simpler than the encoder-decoder architecture presented in Attention is all you need.
Each year, larger and more powerful models followed. GPT-2 ^{10} was published in 2019. GPT-3 ^{11} followed in 2020 and showed great powers in solving a variety of language-related tasks. Modern large language models (since GPT-3) already show impressive performance on downstream tasks even without the fine-tuning step. To achieve this, in-context learning is incorporated in the pre-training loop and at inference time. This is called meta-learning in the GPT-3 paper.^{11} Here, examples of the task and its solution (e.g. sentiment analysis) are shown as part of the input in the forward pass (in pre-training or at inference). Showing a few examples at inference time is called few-shot learning. One-shot learning shows just one example and zero-shot learning shows no example.
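To make the zero/few-shot distinction concrete, here is a made-up sentiment-analysis prompt; the wording is purely illustrative and not taken from the GPT-3 paper:

```python
# Few-shot prompting: task demonstrations are simply part of the input
# text; no model parameters are updated at inference time.
few_shot_prompt = (
    "Review: The plot was dull and predictable. Sentiment: negative\n"
    "Review: A wonderful, heartwarming film. Sentiment: positive\n"
    "Review: I could not stop laughing. Sentiment:"
)

# A zero-shot variant states only the task, with no examples:
zero_shot_prompt = (
    "Classify the sentiment of this review: I could not stop laughing."
)

print(few_shot_prompt.count("Sentiment:"))  # 3: two demonstrations plus the query
```

One-shot learning would simply keep a single demonstration line before the query.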
Even though GPT-3 was developed by a company with “Open” in its name, the trained model is not in fact open, but only accessible for a fee.
In 2022 the OpenGPT-X project, funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK), was launched with the goal of providing an independent and open large language model based in Europe and trained on English and German data. Other efforts to provide models with capabilities similar to GPT-3 more openly include the BigScience Research Workshop and OPT (Open Pretrained Transformer) by Meta.^{12}
I recently moved from numerical linear algebra, developing algorithms for solving structured eigenvalue problems, towards natural language processing with a focus on high performance computing. In my native language I would call a principal component analysis a singular value decomposition. This is why I have an instinct to look for matrices everywhere. I want to conclude by sharing some of my personal learnings from switching fields.
Coursera course by Andrew Ng: Sequence models ↩
Book by Dan Jurafsky and James H. Martin: Speech and Language Processing (3rd ed. draft) ↩ ↩^{2}
Presentation by Thomas Wolf: An Introduction to Transfer Learning in NLP and HuggingFace ↩
Lecture series by Sebastian Raschka: Deep learning lecture videos by Sebastian Raschka, in particular lecture L19: Self-attention and transformer networks ↩
Lecture series by MIT: Introduction to Deep Learning, in particular lecture 2 by Ava Soleimany: Deep Sequence Modeling ↩
Blog post by Christopher Olah: Understanding LSTMs ↩
Original transformer paper: Attention is all you need, 2017 ↩
BERT paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018 ↩ ↩^{2}
GPT-1 paper: Improving Language Understanding by Generative Pre-Training, 2018 ↩ ↩^{2}
GPT-2 paper: Language Models are Unsupervised Multitask Learners, 2019 ↩
GPT-3 paper: Language Models are Few-Shot Learners, 2020 ↩ ↩^{2}
Paper: OPT: Open Pre-trained Transformer Language Models, 2022 ↩
We invited a set of application owners with whom we worked together during that time to present past developments, recent challenges, and future plans. On top of that, we had two other GPU-focused talks: Markus Hrywniak from NVIDIA gave a presentation about some distinct features of NVIDIA’s next-generation GPU (Hopper H100) and how they can be used by applications. And Damian Alvarez presented the current state of JUWELS Booster and highlighted the work done in the lab to identify issues and shortcomings of the machine, seen and analyzed in close collaboration with specific users.
I also held a presentation – the opening presentation about all the things we did in the last ten years within the lab. Among other things, I counted 32 trainings held – with 11 additional trainings at conferences – and 18 workshops. I did not dare to count the optimized applications, in fear of forgetting one… Browsing through old material, I found a report about the creation of the lab in the GCS InSiDE magazine Spring 2013 (link to PDF)^{1}. An interesting snippet: “For many applications, using a single GPU is not sufficient, either because more computing power is required, or because the problem size is too large to fit into the memory of a single device. This forces application developers to not only consider parallelization at device level, but also to manage an additional level of parallelism.” – it seems to be a universal fact, still true today.
From application developers, we heard about quantum computer simulators – general simulators (Hans de Raedt) and simulators targeting specific aspects (Dennis Willsch) – which all have their own challenges, be it limited memory and extensive communication, or complicated communication patterns. Alexander Debus presented recent developments of PIConGPU, a plasma physics simulator capable of scaling to various large machines (including JUWELS Booster, of course) by using many sophisticated abstractions under the hood. In two talks held virtually from North America, we heard about current work on brain image classification (Christian Schiffer) and about simulations of polymeric systems (Ludwig Schneider). Christian presented a whole set of applications, which all work towards the goal of enabling semi-automatic, partly live classification of brain regions. Ludwig Schneider presented SOMA, which uses OpenACC for acceleration and was recently augmented with guidance functionality through Machine Learning. In a talk about our fresh^{2} OpenGPT-X project, Stefan Kesselheim highlighted the importance of large-scale language models and the exciting plans we have for using JUWELS Booster to train open models.
On the second day, a group of talks about weather and climate (W&C) simulations started with a talk about MPTRAC by Lars Hoffmann. MPTRAC is another OpenACC application we worked on in the past, which was recently augmented with sophisticated ideas to deal with the large amount of input data. Another W&C code is MESSy – or rather a whole infrastructure for simulations – with which we have been working extensively for quite some time now; there are many pieces to this GPU puzzle, as shown in the talk by Kerstin Hartung. Our ParFlow GPU work was presented by Jaro Hokkannen, who now works at CSC in Finland but was kind enough to share his past developments remotely. ParFlow uses a custom embedded DSL to hide a specific backend behind pre-processor macros; with that, targeting different accelerators is comparably easy. Finally, two talks shared experiences with handling Lattice-Boltzmann (LB) algorithms. For one, Fabio Schifano presented D2Q37, an LB application which has a long history with GPUs but ventures into FPGAs right now. Funnily, Fabio and the D2Q37 code were already part of the very first Kick-Off Workshop 10 years ago! And as the last presentation, we heard about M-AIA (previously known as ZFS) and the efforts to port the application to GPUs using the parallel STL, by Miro Gondrum and Moritz Waldmann; it was quite interesting to hear their views on portability.
All in all, it was an amazing workshop, seeing the fruits of many years of work on applications and how developers progressed after the various forms of collaborations we had with them.
Let’s do that again!
For the lab creation, also a press release was made. It contains a pretty cool picture, which Jiri Kraus (lab member of day 1) reminded me of. ↩
Actually not so fresh anymore. Time passes by so quickly! ↩
Check it out at
→ fz-juelich.de/en/ias/jsc/about-us/structure/atml/atml-x-dev
The talk presented highlights of some of the results obtained in the MAELSTROM Deliverable D3.4 and followed a short presentation by Daniele Gregori (E4), who introduced the systems at E4 and JSC that were used to gather the benchmark results.
We have seen that the applications have quite different performance behaviors and have identified multiple areas of interest for further and deeper investigation. We are looking forward to the months to come in MAELSTROM!