Assignment 4: Coherently TACKY

There was a (minor) error in the Verilog code for slowmem64, slowmem64.v. I had sloppily used `LINE for both the number of bits in a line and the size of a line address. This meant 14-bit addresses (of lines containing 4 16-bit words each) were being stored in 64-bit registers. Here's a fixed version: slowmem64fix.v.

Remember the pipelined implementation of TACKY you did for Assignment 3? Yeah, this one. Well, in this project you're going to implement at least two of them -- each with its own coherent data cache.

How do you make a multiprocessor system? It's pretty simple. What the previous projects suggested you call a processor, we now suggest you call a core. You must instantiate at least two cores. However, the one data memory cannot be defined inside either core; it must be a single instance of a slowmem64 module with an appropriate interface to both cores defined in the processor module. Thus, your new processor module will look something like:

module processor(halted, reset, clk);
...
slowmem64 DATAMEM(mfc, rdata, addr, wdata, rnotw, strobe, clk);
core PE0(halt0, reset, clk, ...);
core PE1(halt1, reset, clk, ...);
...
endmodule

in which the "..." stuff needs to provide appropriately arbitrated access to DATAMEM from within both PE0 and PE1. Arbitration? Yup. The interface to DATAMEM doesn't allow multiple cores to access the memory simultaneously, so you'll need code in processor that determines who gets to do what when. Incidentally, one core might halt before the other does; the simulation shouldn't stop until both cores have halted.
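One way to think about the arbitration is as a small state machine that hands DATAMEM to one core at a time and alternates when both ask at once. The following is only a sketch under stated assumptions: the module name arbiter2 and the signals req0, req1, grant0, grant1, and done are hypothetical (they are not part of slowmem64.v), and done is assumed to be pulsed by your own logic when the granted core's memory transaction has finished.

```verilog
// Hypothetical two-core arbiter sketch; all names here are illustrative.
module arbiter2(grant0, grant1, req0, req1, done, reset, clk);
output grant0, grant1;
input req0, req1, done, reset, clk;
reg active;  // a transaction currently owns DATAMEM
reg owner;   // which core owns it (0 = PE0, 1 = PE1)
reg last;    // most recent owner, used to alternate on ties

assign grant0 = active && (owner == 1'b0);
assign grant1 = active && (owner == 1'b1);

always @(posedge clk) begin
  if (reset) begin
    active <= 1'b0;
    owner <= 1'b0;
    last <= 1'b1;          // so that PE0 wins the first tie
  end else if (active) begin
    if (done) begin        // assumed: transaction finished, release memory
      active <= 1'b0;
      last <= owner;
    end
  end else if (req0 || req1) begin
    active <= 1'b1;
    if (req0 && req1) owner <= ~last;  // tie: pick the core that did not go last
    else owner <= req1;                // otherwise pick whoever is asking
  end
end
endmodule
```

The grant signals would then select which core's addr, wdata, rnotw, and strobe reach DATAMEM. The halting rule described above could be as simple as assign halted = halt0 & halt1; in processor, assuming halt0 and halt1 stay asserted once a core has halted.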

Before we dive into how the slowmem64 works, let's make it clear that each core can contain its own copy of instruction memory, which you may access directly within a single clock cycle by indexing a local register array. It is up to you if these instruction memories have identical contents in all cores. If they do (which I recommend), then you'll need some way to ensure that the cores can be executing different paths through the code: the easiest is to initialize one or more registers with a different value for each core. Here's a table summarizing what I'd suggest you initialize registers to:

Register Number   Register Name   PE0 Initial Value   PE1 Initial Value
-                 PC              16'h0000            16'h8000
$7                $sp             16'hffff            16'hbfff
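If the two cores share one module definition, parameters are one convenient way to give each instance its own initial values. This is a minimal sketch, not a full core: the module core_init_sketch, its ports, and the parameter names PCINIT and SPINIT are all hypothetical, and only the reset-time initialization from the table is shown.

```verilog
// Hypothetical stub showing only per-core reset values; not a real core.
module core_init_sketch(pc, sp, reset, clk);
parameter PCINIT = 16'h0000;
parameter SPINIT = 16'hffff;
output [15:0] pc, sp;
input reset, clk;
reg [15:0] pc, sp;

always @(posedge clk) begin
  if (reset) begin
    pc <= PCINIT;  // starting PC for this core
    sp <= SPINIT;  // initial $sp (register $7) for this core
  end
end
endmodule

// Two instances with the values suggested in the table.
module init_demo(reset, clk);
input reset, clk;
wire [15:0] pc0, sp0, pc1, sp1;
core_init_sketch #(.PCINIT(16'h0000), .SPINIT(16'hffff)) PE0(pc0, sp0, reset, clk);
core_init_sketch #(.PCINIT(16'h8000), .SPINIT(16'hbfff)) PE1(pc1, sp1, reset, clk);
endmodule
```

With this arrangement both instruction memories can hold identical contents, and the cores diverge only because they start at different PC values.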

The Slow Data Memory

The Verilog code for the slowmem64 module is slowmem64.v. It uses line-based addressing with a 64-bit word size, for a total of 16,384 64-bit lines of memory. It takes MEMDELAY (4 by default) clock cycles to complete a memory line read. The interface is pretty much the same interface that the memory had back in EE380, but with separate data in and out busses:

This slow memory module counts clk cycles to delay completion of a memory read for MEMDELAY cycles. During that time, the memory will not accept another read request. However, it will accept write requests and perform them immediately. In fact, if a memory write to the same address being loaded happens while waiting for the load to complete, the load will immediately complete and return the value being written.

Note also that the slow memory is designed so that you do not need to wait for a read to complete before issuing another load -- but the newer load request will abort the earlier one. This could be useful for aborting a prefetch if a real cache miss occurs, but I don't expect you to be doing prefetch in this project.

Also note that you are allowed (and expected) to insert memory initialization code in the slowmem64 model. That's how you'll get initialized data into your multi-core processor.
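Memory initialization of this kind usually amounts to an initial block that fills the storage array. The sketch below is a stand-in, not the real module: the name meminit_sketch and the array name m are hypothetical, the actual declaration inside slowmem64.v may differ, and the data values are made up purely for illustration.

```verilog
// Hypothetical stand-in for the storage inside slowmem64; the real array
// name and declaration in slowmem64.v may differ.
module meminit_sketch;
reg [63:0] m [0:16383];  // 16,384 lines of 64 bits each
integer i;

initial begin
  // Clear everything first so no line starts out undefined.
  for (i = 0; i < 16384; i = i + 1) m[i] = 64'h0;
  // Then place some illustrative data in the first two lines.
  m[0] = 64'h0004_0003_0002_0001;
  m[1] = 64'h0000_0000_0000_002a;
  // An alternative is to load a file, for example:
  //   $readmemh("data.vmem", m);   (the file name is illustrative)
end
endmodule
```

How the four 16-bit words are ordered within each 64-bit line is something to check against your own cache design; the sketch does not assume any particular ordering.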

The Caches

As stated above, I don't expect you to use the slow memory module for instructions; you can simply have a copy of the instruction memory as a register array inside each core. However, each of the cores must contain an L1 cache for data. Here are the rules for organizing the data caches:

That's all there is to it.

Well, almost all. In truth, you have a lot to figure out in terms of timing of cache operations and slow data memory access across the two cores. For example, suppose both PE0 and PE1 have the line from address 0 in their caches and PE0 tries to write into memory location 0 while PE1 tries to write into memory location 1? Suppose both want to write into memory location 3 at the same time... with potentially different values? You'll want to work all that out before you start writing the Verilog code.
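To reason about those cases it helps to see which line each word address falls in. The sketch below assumes that cores issue 16-bit word addresses, that the upper 14 bits select the line, and that the lower 2 bits select one of the 4 words in it; that mapping and the module names linesplit and sameline are assumptions for illustration, not something slowmem64.v dictates.

```verilog
// Assumed mapping: 16-bit word address = {14-bit line address, 2-bit word offset}.
module linesplit(line, offset, waddr);
output [13:0] line;
output [1:0] offset;
input [15:0] waddr;
assign line = waddr[15:2];    // which 64-bit line
assign offset = waddr[1:0];   // which 16-bit word within that line
endmodule

// Reports whether two word addresses fall in the same line.
module sameline(same, a, b);
output same;
input [15:0] a, b;
assign same = (a[15:2] == b[15:2]);
endmodule
```

Under that assumption, locations 0, 1, and 3 all live in line 0. PE0 writing location 0 while PE1 writes location 1 touches different words of the same line (false sharing), whereas both cores writing location 3 touches the very same word (true sharing).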

Oh yeah. Just one more thing. You need to be able to disable the caches. Why? Well, I could say it was for touching I/O devices, but this is actually for a much cruder purpose here: I want you to be able to compare performance directly accessing slow data memory vs. using the caches. Be sure to comment on how performance changes with the caches enabled vs. disabled.
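For the comparison itself, a simple cycle counter is probably enough: run the same test program with the caches enabled and then disabled, and compare the counts. The module below is a hypothetical helper (the name cyclecount and its ports are not part of the assignment) that counts clock cycles from reset until halted goes high.

```verilog
// Hypothetical helper: counts clock cycles while the processor is running.
module cyclecount(cycles, halted, reset, clk);
output [31:0] cycles;
input halted, reset, clk;
reg [31:0] cycles;

always @(posedge clk) begin
  if (reset) cycles <= 32'd0;
  else if (!halted) cycles <= cycles + 32'd1;  // stop counting once halted
end
endmodule
```

A testbench could print cycles with $display once halted is asserted, giving one number per configuration to report in the notes.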

Test Plan

Yes, you still need one. Your project needs to include a test plan (best described in your Implementor's Notes) as well as a testbench implementing the planned test procedure. The key difference here: your test plan should clearly demonstrate how both false and true sharing are handled.

Keep in mind that you only need to write one test program with one code and one data segment, but have the two cores execute different paths through the code. Given how I suggested you should initialize the registers, you can write code like:

	.text
	.origin 0
PE0:	code for PE0

	.text
	.origin 0x8000
PE1:	code for PE1

	.data
DATA:	data for both

Of course, your testing should also include disabling the caches and showing that the code works either way, while also measuring the performance gained by using the caches.

Due Dates

The due date for this assignment is before the final exam, Tuesday, April 30, 2019. You may submit as many times as you wish, but only the last submission that you make will be counted toward your course grade.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

You should submit a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, the tarball should include the Verilog and AIK source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf. It also may include test cases (e.g., source TACKY code and .VMEM files), sample output, a Makefile, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). Be sure to make it obvious which files are which; for example, if the Verilog source file isn't tacky.v or the AIK file isn't tacky.aik, you should say where these things are in your implementor's notes.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).



EE480 Advanced Computer Architecture.