SIMD

All materials posted here are for personal use only. Material will be added incrementally throughout the Spring 2022 semester.

SWAR Architecture & Concepts

The following links overview SIMD Within A Register (SWAR).

Multimedia Extensions For Microprocessors: SIMD Within A Register (HTML, PDF)
One of the first talks on the concepts of SWAR... originally presented in February 1997 at Purdue University. The HTML is a little ugly, but this is the original HTML, and the server it was on supported different server-side processing....
Compiling for SIMD within a Register (PDF)
One of the best generic descriptions of the concepts of SWAR. The above link is direct from Springer-Verlag.
@inproceedings{663771,
 author = {Randall J. Fisher and Henry G. Dietz},
 title = {Compiling for SIMD Within a Register},
 booktitle = {LCPC '98: Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing},
 year = {1999},
 isbn = {3-540-66426-2},
 pages = {290--304},
 publisher = {Springer-Verlag},
 address = {London, UK},
 }

Basic SIMD Architecture & Concepts

The following links overview some of the key ideas behind traditional SIMD architecture.

Architecture of a massively parallel processor (PDF)
This paper describes Ken Batcher's SIMD MPP design at Goodyear Aerospace.
@inproceedings{285977,
 author = {Kenneth E. Batcher},
 title = {Architecture of a massively parallel processor},
 booktitle = {ISCA '98: 25 years of the international symposia on Computer architecture (selected papers)},
 year = {1998},
 isbn = {1-58113-058-9},
 pages = {174--179},
 location = {Barcelona, Spain},
 doi = {http://doi.acm.org/10.1145/285930.285977},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }
DAP -- a distributed array processor (PDF)
This paper describes the ICL DAP, another early SIMD machine.
@inproceedings{803971,
 author = {S. F. Reddaway},
 title = {DAP -- a distributed array processor},
 booktitle = {ISCA '73: Proceedings of the 1st annual symposium on Computer architecture},
 year = {1973},
 pages = {61--65},
 doi = {http://doi.acm.org/10.1145/800123.803971},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }
Thinking Machines CM-2 (PDF)
A (relatively late) version of the "Connection Machine Model CM-2 Technical Summary, Version 6.0, November 1990." This includes a description of the (CM-200) floating-point hardware added to the design.
Activity Counter Implementation Of Enable Logic (PDF)
This paper describes a clever method for tracking nested SIMD enable/disable state without use of a bit stack.
@inproceedings{ keryell93activity,
    author = "Roman Keryell and Nicolas Paris",
    title = "Activity Counter: New Optimization for the Dynamic Scheduling of {SIMD} Control",
    booktitle = "Proceedings of the 1993 International Conference on Parallel Processing",
    volume = "II - Software",
    publisher = "CRC Press",
    address = "Boca Raton, FL",
    pages = "II--184--II--187",
    year = "1993",
    url = "citeseer.ist.psu.edu/keryell93activity.html" }

Basic GPU Architecture & Concepts

The NVIDIA Developer CUDA education site has many nice links, including this set of slides from Mark Harris.
Lots of good stuff there. I'm using the above slides from Mark Harris to introduce CUDA C/C++, starting with the October 30, 2020 lecture.
An Introduction to Modern GPU Architecture
A very nice set of oldish slides from NVIDIA....
Introduction to the CUDA Platform
Very minimal overview slides from NVIDIA, but points at everything....

GPU Programming Tricks

Our MIMD On GPU work. The 2009 paper giving the details isn't freely available, but for this course, here's an unofficial copy and here are slides for it. An interesting little bit to look at is mog.cu, which is a later version of the MOG interpreter core.

Synchronization across multiple little SIMD engines within a GPU is described in our Magic Algorithms page

The latest (CUDA 9) CUDA Warp-Level Primitives are described here.

The atomic primitives are described in this section of the CUDA-C programming guide. Here are slides from NVIDIA overviewing their use.

Cooperative Groups: Flexible CUDA Thread Programming is an API for groups within a block.

Mark Harris 2007 slides on reduction optimization
It is useful to note that there is now even better efficiency possible using warp shuffle, and lots of optimized functions are now available using CUB

NVIDIA's developer site on using OpenCL

Here is a nice summary of OpenCL support in GPUs/CPUs (not FPGAs)

Intel's FPGA SDK FOR OPENCL (remember, Altera is now part of Intel)

OpenACC (and OpenMP for GPUs)

Both these sets of directives (pragmas) allow you to get code running on a GPU without much fuss, but that doesn't mean they're simple. Pragmas are part of the C/C++ languages, but they're not really integrated. The rule is that a program should still work if compiled ignoring all pragmas, and that's mostly true for OpenACC and OpenMP programs in C/C++.

That said, both sets of pragmas are supported by GCC. There are lots of similarities with strikingly unnecessary differences. For example, what OpenACC calls a "gang" is pretty much what OpenMP calls a "team" -- although there are lots of differences, both roughly correspond to what NVIDIA calls a "block". In any case, tools like nvprof still work with the code they generate... because it all ends up being kernels to run on NVIDIA GPUs. Of course, both OpenMP and OpenACC are intended to run code on Intel and AMD GPUs too, but those targets are currently less well supported by the free implementations.

Dr. Dobb's Easy GPU Parallelism with OpenACC

OpenACC (yeah, it should really be OpenAcc, but that's not what they call themselves); here's their reference card (which isn't too bad, really)

OpenMP was really designed for shared-memory, multi-core processors... but now includes support similar to OpenACC; here is a little summary of the OpenMP 5 support for GPUs.

Graphics and OpenGL

There are lots of overview slides out there. These slides by Daniel Aliaga at Purdue CS are about as good an overview as I've found of both history and the basic graphics pipeline.

Learn OpenGL is a website with a nice intro tutorial

What Every CUDA Programmer Should Know About OpenGL

The Open-Source OpenGL Utility Toolkit, better known as freeglut

OpenGL / GLUT Program Sample Code... which isn't explicitly using CUDA

MIMD On SIMD/GPU

Lots of stuff at MOG (MIMD on GPU)

H. G. Dietz and F. Roberts, "Execution Of MIMD MIPSEL Assembly Programs Within CUDA/OpenCL GPUs," 2012 (PDF)


EE599/699 GPU and Multi-Core Computing