Assignment 1: Traveling Salesperson Genetic Algorithm

The goal of this CUDA project is to write a simple GA to solve the traveling salesperson problem. It's really not a very hard problem to understand... but solving it exactly takes factorial effort. So how bad is that? Well, 5! is 120, 10! is 3,628,800, 15! is 1,307,674,368,000, and 20! is 2,432,902,008,176,640,000. Your program is to handle up to 64 locations....

To Begin

This project involves parallelizing a Genetic Algorithm (GA). Here's a tgz of some searching code leading up to GAs. GAs give a way to write code that can solve problems you don't have an algorithm to solve; they are known as a powerful form of machine learning... or maybe just a useful tool for efficiently searching for an answer? Either way, we're going to have you parallelize one here. The serial GA code in C is ts.c. It's pretty straightforward, as discussed in class. The only differences are that the cross() and mutate() functions are a little more complex in order to give somewhat better performance... including cross() converting into a mutate() if the crossover wouldn't change anything. The #ifdef VERBOSE stuff is just for the serial version; to get lots of tracing output, compile it with:

cc ts.c -o ts -lm -DVERBOSE

The command line arguments are simple enough. The first is the population size, with a maximum of 65536 (64K). The second and third arguments determine the relative frequency of crossover vs. mutation. The fourth argument is the number of individuals (steady-state "generations") to execute. The final argument is the number of locations for the traveling salesperson problem. A positive number causes random locations to be created; a negative number creates a fixed set of locations with an obvious optimal solution. A good test case is:

./ts 64 2 1 10000 -10

Get used to how this code works....

Functional Requirements

Your task is simply to get some speedup by re-writing this code to use CUDA. Start by making a new directory called NVIDIA_CUDA-9.2_Samples/0_Simple/tsga and copy the scaling files into it. Change all the references to "scaling" to "tsga", recompile using make, and run the newly-compiled ./tsga. Ok. Now put a copy of ts.c into that directory. It's time to start changing everything....

How do you parallelize this GA code? Well, you simply run many copies of it on different GPU PEs (or SIMD engines). Each one should have its own population -- what is known as an "island model." The population data structure is too big to fit in local memory, so you'll have a global memory data structure something like:

typedef struct {
    place_t p[MAXISLAND][MAXPOP][MAXPLACE];
    float   m[MAXISLAND][MAXPOP];
} alldata_t;

In other words, you're adding a MAXISLAND dimension (not necessarily as the first dimension -- that choice should match the memory banking to how you parallelized). Note, however, that the global data structure is not coherently shared. Basically, each island will access only global data within its island dimension -- there is no communication between islands inside the GPU.
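As one illustration of the banking point (this layout is an assumption for illustration, not the required one), putting the island index in the fastest-varying position means that when every island touches the same field at the same time, the accesses land in consecutive words:

```c
#include <stdint.h>

/* Illustrative sizes and types -- ts.c's actual values may differ. */
#define MAXISLAND 32
#define MAXPOP    64
#define MAXPLACE  64
typedef uint8_t place_t;   /* assumption: a place index fits in a byte */

/* Island index last: island i and island i+1 access adjacent words
   when they read the same individual/location in lockstep. */
typedef struct {
    place_t p[MAXPOP][MAXPLACE][MAXISLAND];
    float   m[MAXPOP][MAXISLAND];
} alldata_t;
```

Whatever ordering you pick, each island should index only its own slice, matching the "no communication between islands" rule above.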

As stated in class, the code run inside each GPU PE can essentially be identical to what would be run sequentially. However, there is a slight complication: you must declare the functions so that they are available to code running on the GPU. For the kernel entry function, that means you declare it as __global__ void. The __global__ part makes the function visible to both the host and the GPU; the void part is required because GPU code invoked by the host cannot directly return a value. If a function will only be run in the GPU, called only by GPU code, you should declare it as __device__, and it can return a value. Although this looks quite normal, you should be aware that GPUs have very limited support for actual function calling, so recursion is either limited to a few levels or completely unsupported: GPU functions should generally be declared as inline __device__. Note that you can also explicitly mark functions as running only on the host using __host__, although that is essentially the default behavior anyway. Combining __host__ __device__ can also make sense, declaring a function as really being two copies: one for the host and another for the GPU.
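Putting those qualifiers together, a fragment like the following shows the pattern (all names here are made up for illustration, not taken from ts.c):

```cuda
/* GPU-only helper: callable only from device code, may return a value.
   Declared inline because GPU function-call support is limited. */
inline __device__ float dist(float ax, float ay, float bx, float by)
{
    float dx = ax - bx, dy = ay - by;
    return sqrtf(dx * dx + dy * dy);
}

/* Compiled twice -- one copy for the host, one for the GPU. */
__host__ __device__ int wrap(int i, int n)
{
    return (i + 1) % n;
}

/* Kernel entry point: visible to the host, must return void. */
__global__ void run_island(alldata_t *d, int steps)
{
    int isl = blockIdx.x;   /* e.g., one island per block */
    /* ... run the serial GA loop on island isl for `steps` births ... */
}
```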

Simply run for a while within each island on the GPU, copy the data back to the host, note the global best and perhaps print it, copy some data between islands, and repeat. There are many ways to communicate between islands. A very simple way is to simply copy the global best into each island's population. Other alternatives involve copying a "good" individual from each island to some other -- perhaps the island to the right.
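The host-side loop just described might be sketched like this (a hedged outline only: error checking is omitted, and run_island, find_global_best, and inject_best are hypothetical names, not functions from ts.c):

```cuda
alldata_t *dev;
cudaMalloc(&dev, sizeof(alldata_t));
cudaMemcpy(dev, &host_data, sizeof(alldata_t), cudaMemcpyHostToDevice);

for (int round = 0; round < ROUNDS; ++round) {
    run_island<<<NISLAND, 1>>>(dev, STEPS);   /* evolve each island */
    cudaMemcpy(&host_data, dev, sizeof(alldata_t),
               cudaMemcpyDeviceToHost);
    int best = find_global_best(&host_data);  /* scan all islands */
    /* migrate: e.g., overwrite one individual per island with best */
    inject_best(&host_data, best);
    cudaMemcpy(dev, &host_data, sizeof(alldata_t),
               cudaMemcpyHostToDevice);
}
```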

Island model GAs usually work better than a GA with a single population of the same total size because the islands preserve diversity while allowing fast convergence within each island. It doesn't really make much difference how often islands interact or exactly how they interact, so pick a method that is efficient for you to implement.

Details

There is one little trick here: rand(). You need a good quality one on the host for the initial data, but you also need one that can give a decent random sequence that is different on each PE. Fortunately, the parallel one doesn't need to be as high quality. There is fairly high quality random number generation support in the CUDA libraries, documented here, but you can easily make your own that is faster and good enough. You can use a very simple generator of the form:

rand = (a * rand) % m

If m is 2 to the 32nd power, the modulus is a free operation, since a 32-bit integer value is naturally clipped that way. In such a case, a decent choice of the value for a is ((8 * 16807) + 5), i.e., 134461. Of course, the bottom few bits don't look very random, so you'll probably want to ignore some of the least significant bits rather than using the rand value directly.
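A minimal version of that generator (plain C; the seed handling is an assumption) might look like:

```c
#include <stdint.h>

/* Sketch of the generator described above: m = 2^32 comes for free
   from uint32_t wraparound; a = (8 * 16807) + 5 = 134461. */
static uint32_t rand_state = 1;     /* seed; must never be 0 */

static uint32_t next_rand(void)
{
    rand_state = 134461u * rand_state;  /* (a * rand) mod 2^32 */
    return rand_state >> 8;             /* drop the weak low bits */
}
```

Shifting off the low byte before use is what the last sentence above is getting at: the high bits of an LCG of this form are far more random than the low ones.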

That generator by itself would give the same value sequence for every processing element. Instead, you can use the same seed for all processing elements, but also use the trick that for NPROC virtual processing elements, you start each processing element with

rand = (b * seed) % m

where b is a to the IPROC power. Each processing element then substitutes a to the NPROC power for a in the original formula. For example, b on different PEs would be 134461, 134461*134461, 134461*134461*134461, etc. Note that we have to avoid the value 0. If NPROC is 64, then the formula for a 32-bit rand in each PE would be rand = (1352027393 * rand);. It's probably better to use (rand >> 8) for your random value rather than rand itself so that the not-very-random low bits are ignored. Simple enough, right?
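The leapfrog scheme above can be sketched in host-style C as follows (a small NPROC of 4 is used purely for illustration; the struct and function names are made up). Each PE starts at its own offset into the one global LCG sequence and then strides by a^NPROC, so together the PEs cover exactly the sequence a single generator would produce:

```c
#include <stdint.h>

#define A      134461u   /* (8 * 16807) + 5 */
#define NPROC  4         /* illustration only; the text uses 64 */

/* a^e mod 2^32, via uint32_t wraparound. */
static uint32_t pow_mod32(uint32_t a, uint32_t e)
{
    uint32_t r = 1;
    while (e--) r *= a;
    return r;
}

typedef struct { uint32_t state, stride; } pe_rand_t;

/* PE iproc (0-based) seeds with b * seed, b = a^(iproc+1),
   and thereafter strides by a^NPROC. */
void pe_rand_init(pe_rand_t *r, uint32_t seed, int iproc)
{
    r->state  = pow_mod32(A, (uint32_t)iproc + 1) * seed;
    r->stride = pow_mod32(A, NPROC);
}

uint32_t pe_rand_next(pe_rand_t *r)
{
    uint32_t v = r->state >> 8;   /* ignore the weak low bits */
    r->state *= r->stride;        /* leapfrog NPROC steps ahead */
    return v;
}
```

With NPROC = 64 the stride pow_mod32(A, 64) works out to the 1352027393 quoted above.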

Due Dates

The recommended due date for this assignment is before class, Thursday, October 18, 2018. This submission window will close when class begins on Thursday, October 25, 2018. You may submit as many times as you wish, but only the last submission that you make before class begins on Thursday, October 25, 2018 will be counted toward your course grade.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

For each project, you will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf. Your implementor's notes for this project must include your observations about the execution time as the parameters are changed; your observations can be stated in paragraph form, as a table, or as a graph of an appropriate type. For this particular project, place everything in the tsga directory and submit a tarball of that directory.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).


GPU Computing