All materials posted here are for personal use only. Material will be added incrementally throughout the Spring 2022 semester.

Shared Memory Programming

From very low level to very high level....

A little collection of different MIMD code to compute Pi
OpenMP (aka, OMP)
Here are my OpenMP overview slides, as presented in class. OMP pragmas are understood by recent GCC releases (GOMP is built-in), but must be enabled by giving -fopenmp on the gcc command line; no other special options are needed. My Pi computation example for OMP is mppi.c. Normally, environment variables are used to control things like how many threads to create.
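As a rough illustration of the pragma style (this is a minimal sketch, not the contents of mppi.c), the function below estimates Pi by midpoint-rule integration of 4/(1+x*x) over [0,1]. The names omp_pi and intervals are hypothetical, chosen only for this example.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi = integral from 0 to 1 of 4/(1+x*x) dx.
   omp_pi and intervals are hypothetical names, not taken from mppi.c. */
double omp_pi(long intervals)
{
    assert(intervals > 0);
    const double width = 1.0 / (double)intervals;
    double sum = 0.0;

    /* Each thread accumulates into a private copy of sum; the reduction
       clause combines the copies when the loop ends. Without -fopenmp
       the pragma is ignored and the loop simply runs serially. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < intervals; ++i) {
        const double x = ((double)i + 0.5) * width;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * width;
}
```

Built with gcc -fopenmp, the thread count can typically be set from the environment, for example by setting OMP_NUM_THREADS=4 before running the program.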
Mutex (exclusive lock) vs. Semaphore (signaling mechanism)
Don't yet have a great reference for this, but they're everywhere. Basic Mutex operations are lock(m) and unlock(m), with many implementations. Basic Semaphore operations are classically called P and V (wait and signal). The simplest counting semaphore would be something like void p(semaphore *s) { while (*s<=0); --*s; } and void v(semaphore *s) { ++*s; }, provided that the test-and-decrement in p and the increment in v are each performed atomically.
Many short, yet still confusing, descriptions of Futexes are available and here's probably the best early overview (PDF); the catch is that various Linux kernels have different futex() implementations with 4, 5, or 6 arguments
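The P and V one-liners above only work if the test and the decrement happen as one atomic step; otherwise two waiters can both see a positive count and both decrement it. A minimal sketch of a spinning counting semaphore, assuming C11 <stdatomic.h> is available (the choice of a compare-and-swap loop is mine, not something the references above prescribe):

```c
#include <assert.h>
#include <stdatomic.h>

typedef atomic_int semaphore;

/* P (wait): spin until the count is positive, then claim one unit.
   The compare-and-swap makes the test and the decrement a single
   atomic step, so two waiters can never claim the same unit. */
void p(semaphore *s)
{
    for (;;) {
        int seen = atomic_load(s);
        if (seen > 0 && atomic_compare_exchange_weak(s, &seen, seen - 1))
            return;
    }
}

/* V (signal): release one unit. */
void v(semaphore *s)
{
    int prev = atomic_fetch_add(s, 1);
    assert(prev >= 0);  /* the count should never have been negative */
    (void)prev;
}
```

A semaphore initialized to 1 with atomic_init behaves like a simple spin lock; a futex-based version would put waiters to sleep instead of spinning.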
Barrier synchronization
There are various atomic counter algorithms; alternatively, here is the GPU SyncBlocks algorithm from my Magic Algorithms page
That's basically the same algorithm as used in The Aggregate Function API: It's Not Just For PAPERS Anymore
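For the atomic counter family, one common textbook form is the sense-reversing barrier. The sketch below assumes C11 atomics and busy-waiting; it is a generic counter barrier, not the SyncBlocks algorithm or the AFAPI implementation, and the names counter_barrier, barrier_init, and barrier_wait are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_int count;   /* arrivals still expected in this episode */
    atomic_int sense;   /* flips each time the barrier opens */
    int nthreads;       /* number of participants */
} counter_barrier;

void barrier_init(counter_barrier *b, int nthreads)
{
    assert(nthreads > 0);
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, 0);
    b->nthreads = nthreads;
}

void barrier_wait(counter_barrier *b)
{
    /* The value sense will take once this episode completes. */
    const int my_sense = !atomic_load(&b->sense);

    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arrival: reset the counter first, then release everyone. */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, my_sense);
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;  /* spin until the last arrival flips sense */
    }
}
```

Resetting the counter before flipping sense is what lets the same barrier object be reused immediately for the next episode.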
Direct use of System V shared memory
My System V shared memory version of the Pi computation is shmpi.c -- note that this version uses raw assembly code to implement a lock, which has far less overhead than using the System V OS calls (unless you're counting on the OS to schedule based on who's waiting for what)
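To give a feel for the System V calls involved, here is a minimal sketch (not the contents of shmpi.c) in which forked processes add partial sums into one shared segment. It assumes a Linux-like system and substitutes a C11 atomic_flag spin lock for the raw assembly lock mentioned above; sysv_pi, shared_state, and the parameter names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    atomic_flag lock;  /* spin lock living inside the shared segment */
    double sum;        /* protected by lock */
} shared_state;

/* Returns an estimate of pi, or -1.0 if a system call fails. */
double sysv_pi(int nprocs, long intervals)
{
    assert(nprocs > 0 && intervals > 0);
    int id = shmget(IPC_PRIVATE, sizeof(shared_state), IPC_CREAT | 0600);
    if (id < 0) return -1.0;
    shared_state *sh = shmat(id, NULL, 0);
    /* Mark for removal now; the segment persists until the last detach. */
    shmctl(id, IPC_RMID, NULL);
    if (sh == (void *)-1) return -1.0;
    atomic_flag_clear(&sh->lock);
    sh->sum = 0.0;

    const double width = 1.0 / (double)intervals;
    int forked = 0, failed = 0;
    for (int p = 0; p < nprocs; ++p) {
        pid_t pid = fork();
        if (pid < 0) { failed = 1; break; }
        if (pid == 0) {
            /* Child: sum every nprocs-th interval, starting at p. */
            double local = 0.0;
            for (long i = p; i < intervals; i += nprocs) {
                const double x = ((double)i + 0.5) * width;
                local += 4.0 / (1.0 + x * x);
            }
            while (atomic_flag_test_and_set(&sh->lock))
                ;  /* spin until the lock is free */
            sh->sum += local;
            atomic_flag_clear(&sh->lock);
            _exit(0);  /* children inherit the attachment across fork() */
        }
        ++forked;
    }
    for (int c = 0; c < forked; ++c)
        wait(NULL);
    const double pi = failed ? -1.0 : sh->sum * width;
    shmdt(sh);
    return pi;
}
```

Each process takes the lock only once, after finishing its local sum, so contention on the spin lock stays low.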
POSIX Threads
POSIX Threads (pthreads) is now a standard library included in most C/C++ compilation environments, and linked as the -lpthread library under Linux GCC; my Pi computation example for pthreads is pthreadspi.c
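A minimal pthreads sketch of the same Pi integration follows (it is not the contents of pthreadspi.c). It assumes each thread writes its partial result into its own slot, so no mutex is needed; pthreads_pi, work_t, and MAX_THREADS are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 64  /* hypothetical upper bound for this sketch */

typedef struct {
    int id, nthreads;   /* which thread this is, out of how many */
    long intervals;     /* total number of integration intervals */
    double partial;     /* this thread's share of the result */
} work_t;

static void *worker(void *arg)
{
    work_t *w = arg;
    const double width = 1.0 / (double)w->intervals;
    double local = 0.0;
    /* Take every nthreads-th interval, starting at this thread's id. */
    for (long i = w->id; i < w->intervals; i += w->nthreads) {
        const double x = ((double)i + 0.5) * width;
        local += 4.0 / (1.0 + x * x);
    }
    w->partial = local * width;
    return NULL;
}

/* Returns an estimate of pi, or -1.0 if a thread could not be created. */
double pthreads_pi(int nthreads, long intervals)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS && intervals > 0);
    pthread_t tid[MAX_THREADS];
    work_t work[MAX_THREADS];
    int started = 0;
    for (int t = 0; t < nthreads; ++t) {
        work[t] = (work_t){ t, nthreads, intervals, 0.0 };
        if (pthread_create(&tid[t], NULL, worker, &work[t]) != 0)
            break;
        ++started;
    }
    double pi = 0.0;
    for (int t = 0; t < started; ++t) {
        pthread_join(tid[t], NULL);  /* join makes partial safe to read */
        pi += work[t].partial;
    }
    return started == nthreads ? pi : -1.0;
}
```

An alternative design would have every thread add into one shared total under a pthread_mutex_t; the per-thread slots used here avoid that lock entirely.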
UPC (Unified Parallel C)
UPC (Unified Parallel C) is an extension of the C language, and hence requires a special compiler. There are several UPC compilers; the fork of GCC called GUPC must be installed as described at the project homepage (on my systems, it is installed at /usr/local/gupc/bin/gupc). My Pi computation example for UPC is upcpi.upc; compilation is straightforward, but the executable produced processes some command line arguments as UPC controls. For example, -n is used to specify the number of processes to create.

Basic MIMD Architecture & Concepts

A little about how this has evolved historically...

Fetch-&-Add in the NYU Ultracomputer
A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer -- Designing an MIMD Shared Memory Parallel Computer," in IEEE Transactions on Computers, vol. C-32, no. 2, pp. 175-189, 1983. doi: 10.1109/TC.1983.1676201 (URL, local copy)
"An Overview of the NYU Ultracomputer Project (1986)" (PDF) is a better, but more obscure, reference
Explanation of the "Hot Spot" problem for RP3
G. F. Pfister and V. A. Norton, "'Hot spot' contention and combining in multistage interconnection networks," in IEEE Transactions on Computers, vol. C-34, no. 10, pp. 943-948, Oct. 1985. (URL, local copy)
Memory consistency models
"Shared Memory Consistency Models: A Tutorial" (PDF) -- Sarita Adve has done quite a few versions of this sort of description
Modern atomic memory access instructions
AMD64 atomic instructions
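The Fetch-&-Add primitive from the NYU Ultracomputer survives in modern instruction sets as an atomic read-modify-write (on x86-64, LOCK XADD), which C11 exposes as atomic_fetch_add. A minimal sketch of its classic use, handing out unique work indices to competing threads without a lock, is shown here; work_queue, queue_init, and claim_index are hypothetical names, and nothing in the sketch models the Ultracomputer's combining network.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_long next;  /* next unclaimed index */
    long limit;        /* number of work items available */
} work_queue;

void queue_init(work_queue *q, long limit)
{
    assert(limit >= 0);
    atomic_init(&q->next, 0);
    q->limit = limit;
}

/* Returns a unique index in [0, limit), or -1 once all are claimed.
   atomic_fetch_add returns the old value, so no two callers can ever
   receive the same index, even when they call simultaneously. */
long claim_index(work_queue *q)
{
    const long i = atomic_fetch_add(&q->next, 1);
    return i < q->limit ? i : -1;
}
```

Each worker would loop on claim_index until it returns -1, which gives simple dynamic load balancing across threads.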
Transactional memory
Transactional Memory has been a hot idea for quite a while. Intel's Haswell processors incorporate a hardware implementation described in chapter 8 of this PDF (locally, PDF); but there were (still are) problems.
Wikipedia has a nice summary of software support for transactional memory.
There is a version of software transactional memory implemented in GCC.
Replicated/Distributed Shared Memory
A very odd one is implemented in AFAPI as Replicated Shared Memory
The best known is Treadmarks, out of Rice University
One of the latest is DEX: Scaling Applications Beyond Machine Boundaries, which is part of Popcorn Linux

Distributed Memory Programming

One-page MPI reference card
This one-page reference card I wrote isn't everything you need to know about MPI, but it'll do for most things....
MPICH (MPI over CHameleon)
One of the earliest complete MPI implementations, MPICH was layered on top of another library, and hence had some performance issues. The latest versions are highly tuned and no longer suffer significant layering costs.
Open MPI
This is one of many MPI implementations. It grew out of LAM MPI, which was more efficient than MPICH, and arguably still is.

EE599/699 GPU and Multi-Core Computing