References: EE599/699 GPU Computing

All materials posted here are for personal use only. This material is being massively restructured for Fall 2018.

Basic SIMD Architecture & Concepts

Papers describing basic (pronounced "old") SIMD architecture are linked here. Notice that traditional SIMD is often bit-serial and extremely simple per processing element.

SWAR Architecture & Concepts

The next step after big SIMD machines was SIMD Within A Register (SWAR). This is used in nearly all modern processors. References are linked here.

GPU Architecture & Concepts

Modern GPU Architecture

Inside the Volta GPU Architecture and CUDA 9

Kepler GK110/210 white paper

First CUDA program video and course materials
The first CUDA program there uses host memory mapping; here's a version that doesn't

GPU Programming Tricks

Our MIMD On GPU work. The 2009 paper giving the details isn't freely available, but for this course, here's an unofficial copy and here are slides for it. An interesting little bit to look at is mog.cu, which is a later version of the MOG interpreter core.

Synchronization across multiple little SIMD engines within a GPU is described in our Magic Algorithms page

The latest (CUDA 9) CUDA Warp-Level Primitives are described here.

The atomic primitives are described in this section of the CUDA-C programming guide. Here are slides from NVIDIA overviewing their use.
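As a tiny illustration of why you'd want them (my sketch; the kernel and bin count are made up, but atomicAdd is the real CUDA primitive), here is a histogram kernel where many threads can safely increment the same bin:

```cuda
// Hypothetical histogram kernel: atomicAdd makes each increment an
// indivisible read-modify-write, so concurrent updates to the same
// bin are not lost the way "bins[i]++" updates would be.
__global__ void histogram16(const unsigned char *data, int n,
                            unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] % 16], 1u);
}
```

The catch, as the NVIDIA slides discuss, is that atomics to the same address serialize, so heavily contended counters can become a bottleneck.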

Cooperative Groups: Flexible CUDA Thread Programming is an API for groups within a block.

Mark Harris 2007 slides on reduction optimization
It is useful to note that even better efficiency is now possible using warp shuffle, and lots of optimized functions are now available using CUB
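For the curious, the warp-shuffle version of a sum looks roughly like this (a sketch using the CUDA 9+ _sync variants, not Harris's original code):

```cuda
// Warp-level sum sketch: each of the 32 lanes starts with one value;
// after five shuffle steps, lane 0 holds the warp's total -- no
// shared memory and no __syncthreads() needed, because shuffles move
// data directly between lane registers.
__inline__ __device__ int warpReduceSum(int val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;   // result is valid in lane 0
}
```

This replaces the shared-memory inner loop of the 2007 slides; CUB's BlockReduce wraps this kind of code so you rarely need to write it by hand.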

NVIDIA's developer site on using OpenCL

Here is a nice summary of OpenCL support in GPUs/CPUs (not FPGAs)

Intel's FPGA SDK for OpenCL (remember, Altera is now part of Intel)

OpenACC and OpenMP

Both these sets of directives (pragmas) allow you to get code running on a GPU without much fuss, but that doesn't mean they're simple. Pragmas are part of the C/C++ languages, but they're not really integrated. The rule is that a program should still work if compiled ignoring all pragmas, and that's mostly true for OpenACC and OpenMP programs in C/C++.

That said, both sets of pragmas are supported by GCC. There are lots of similarities, along with some strikingly unnecessary differences. For example, what OpenACC calls a "gang" is pretty much what OpenMP calls a "team" -- although there are lots of differences, both roughly correspond to what NVIDIA calls a "block". In any case, tools like nvprof still work with the code they generate... because it all ends up as kernels running on NVIDIA GPUs. Of course, both OpenMP and OpenACC are also intended to run code on Intel and AMD GPUs, but those targets are currently less well supported by the free implementations.

Dr. Dobb's Easy GPU Parallelism with OpenACC

OpenACC (yeah, it should really be OpenAcc, but that's not what they call themselves) and here's their reference card (which isn't too bad, really)

OpenMP was really designed for shared-memory, multi-core processors... but it now includes offloading support similar to OpenACC's; here's their 12-page reference card

Graphics and OpenGL

There are lots of overview slides out there. These slides by Daniel Aliaga at Purdue CS are about as good an overview as I've found of both history and the basic graphics pipeline.

Learn OpenGL is a website with a nice intro tutorial

What Every CUDA Programmer Should Know About OpenGL


Not-yet-updated stuff follows....

GPU Computing In General

GPGPU (HTML)

This site contains a variety of news, paper links, etc., about use of GPUs (Graphic Processing Units) for General-Purpose computing -- commonly known as GPGPU. Note that general-purpose is a misnomer; it is really about programming GPUs for tasks that are not entirely graphical.

A Performance-Oriented Data Parallel Virtual Machine for GPUs (PDF)

The first paper on ATI's CTM (Close To the Metal) software interface to GPUs (Graphics Processing Units) for general-purpose computing. Referenced directly from ATI's site, which is now part of AMD's site. There are also slides and a full manual at the ATI/AMD site.

GPU Programming Support

We'll be starting with NVIDIA's CUDA environment. The latest version is 4.0. Note that the version numbers are different for the various components of the CUDA system, and do not have any obvious relationship to the Compute Capability levels that are supported. However, version numbers are consistent across the supported platforms.

