Assignment 4: Coherently TACKY

There was a (minor) error in the Verilog code for slowmem64, slowmem64.v. I had sloppily used `LINE for both the number of bits in a line and the size of a line address. This meant 14-bit addresses (of lines containing 4 16-bit words each) were being stored in 64-bit registers. Here's a fixed version: slowmem64fix.v.

Remember the pipelined implementation of TACKY you did for Assignment 3? Yeah, this one. Well, in this project you're going to implement at least two of them -- each with its own coherent data cache.

How do you make a multiprocessor system? It's pretty simple. What the previous projects suggested you call a processor, we now suggest you call a core. You must instantiate at least two cores. However, the one data memory cannot be defined inside either core; it must be a single instance of a slowmem64 module with an appropriate interface to both cores defined in the processor module. Thus, your new processor module will look something like:

module processor(halted, reset, clk);
...
slowmem64 DATAMEM(mfc, rdata, addr, wdata, rnotw, strobe, clk);
core PE0(halt0, reset, clk, ...);
core PE1(halt1, reset, clk, ...);
...
endmodule

in which the "..." stuff needs to provide appropriately arbitrated access to DATAMEM from within both PE0 and PE1. Arbitration? Yup. The interface to DATAMEM doesn't allow multiple cores to access the memory simultaneously, so you'll need code in processor that determines who gets to do what when. Incidentally, one core might halt before the other does; the simulation shouldn't stop until both cores have halted.
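As a rough illustration of the arbitration idea, here is a minimal sketch of a two-requester arbiter that alternates on ties. Everything in it is an assumption made for illustration: the module name memarb, the req0/req1 request lines, the grant0/grant1 outputs, and the busy input (meant to say that DATAMEM cannot take a new request this cycle) are hypothetical and are not part of slowmem64.v or the required interface.

```verilog
// Sketch only: memarb and all of its port names are hypothetical.
module memarb(grant0, grant1, req0, req1, busy, reset, clk);
output grant0, grant1;
input req0, req1, busy, reset, clk;
reg last;  // which core was granted most recently (0 = PE0, 1 = PE1)

// On a tie, pick the core that was NOT served last.
wire pick1 = req1 & (~req0 | ~last);

// At most one grant per cycle, and none while memory is busy.
assign grant0 = ~busy & req0 & ~pick1;
assign grant1 = ~busy & pick1;

always @(posedge clk) begin
  if (reset) last <= 1;        // so PE0 wins the first tie
  else if (grant0) last <= 0;
  else if (grant1) last <= 1;
end
endmodule
```

In processor, the grants would then select which core's addr, wdata, rnotw, and strobe reach DATAMEM. Alternating on ties keeps either core from starving the other; a fixed priority would be simpler but could. Similarly, since the simulation shouldn't stop until both cores have halted, halted could be driven as the AND of halt0 and halt1.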

Before we dive into how the slowmem64 works, let's make it clear that each core can contain its own copy of instruction memory, which you may access directly within a single clock cycle by indexing a local register array. It is up to you if these instruction memories have identical contents in all cores. If they do (which I recommend), then you'll need some way to ensure that the cores can be executing different paths through the code: the easiest is to initialize one or more registers with a different value for each core. Here's a table summarizing what I'd suggest you initialize registers to:

Register Number	Register Name	PE0 Initial Value	PE1 Initial Value
	`PC`	`16'h0000`	`16'h8000`
`$7`	`$sp`	`16'hffff`	`16'hbfff`
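One possible way to give each core different initial values while keeping identical instruction memory is to use module parameters. The sketch below assumes names that are not in the assignment: INITPC, INITSP, pc, and regfile are hypothetical, the eight-entry register file is an assumption, and the memory and coherence ports are left out entirely. Your Assignment 3 core will have its own names and structure.

```verilog
// Sketch only: INITPC, INITSP, pc, and regfile are hypothetical names,
// and the memory/coherence ports of the real core are omitted.
module core(halt, reset, clk);
parameter INITPC = 16'h0000;
parameter INITSP = 16'hffff;
output reg halt;
input reset, clk;
reg [15:0] pc;
reg [15:0] regfile [0:7];  // size assumed for illustration

always @(posedge clk) begin
  if (reset) begin
    halt <= 0;
    pc <= INITPC;
    regfile[7] <= INITSP;  // $7 is $sp
  end
  // the rest of the pipeline would go here
end
endmodule

// Instantiation inside processor, overriding the defaults per core:
//   core #(.INITPC(16'h0000), .INITSP(16'hffff)) PE0(halt0, reset, clk);
//   core #(.INITPC(16'h8000), .INITSP(16'hbfff)) PE1(halt1, reset, clk);
```

Parameters keep a single core definition for both instances, so the only difference between PE0 and PE1 is visible at the point of instantiation.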

The Slow Data Memory

The Verilog code for the slowmem64 module is slowmem64.v. It uses line-based addressing with a 64-bit word size, for a total of 16,384 64-bit lines of memory. It takes MEMDELAY (4 by default) clock cycles to complete a memory line read. The interface is pretty much the same interface that the memory had back in EE380, but with separate data in and out busses:

mfc signals when memory fetch is complete
rdata is the data value you can read when mfc is 1
addr is the address to read or write from
wdata is the data to be written
rnotw is the signal that requests memory to do something; it is 1 to request reading from memory, 0 for writing to memory
strobe needs to be 1 for rnotw to be examined; a value of 0 says neither reading nor writing is initiated this cycle
clk needs to toggle with the processor clock; memory events are triggered by the positive edge of the clock

This slow memory module counts clk cycles to delay completion of a memory read for MEMDELAY cycles. During that time, the memory will not accept another read request. However, it will accept write requests and perform them immediately. In fact, if a memory write to the same address being loaded happens while waiting for the load to complete, the load will immediately complete and return the value being written.

Note also that the slow memory is designed so that you do not need to wait for a read to complete before issuing another load -- but the newer load request will abort the earlier one. This could be useful for aborting a prefetch if a real cache miss occurs, but I don't expect you to be doing prefetch in this project.

Also note that you are allowed (and expected) to insert memory initialization code in the slowmem64 model. That's how you'll get initialized data into your multi-core processor.

The Caches

As stated above, I don't expect you to use the slow memory module for instructions; you can simply have a copy of the instruction memory as a register array inside each core. However, each of the cores must contain an L1 cache for data. Here are the rules for organizing the data caches:

The cache line size should be 64 bits. That means each line contains four 16-bit words... although the core accesses only a single 16-bit word at a time. This implies that a word write does not provide a full line of data, so it might be necessary to read a line before writing a word into it in cache. On the other hand, it will only take one line transfer to copy a line between cache and slow data memory. It is up to you whether you should implement any performance optimizations involving handling of individual words; you don't have to in order to get full credit for the project. You might find it useful to treat each cache line as the [63:0] structure interfaced with slow data memory rather than an array of four [15:0], but it depends on how you organize things, and that choice is entirely up to you.
The total cache size within a core should be 16 lines. This means the data cache in each core holds at most 64 16-bit words of data.
The cache mapping and associativity is up to you. I would strongly recommend starting out using direct mapping (set size of one) to keep things simple. That also keeps replacement policy trivial.
How do you handle a write? Well, that's up to you. However, the caches in the two cores must behave as a coherent system in the presence of false or true sharing. In other words, if PE0 sees the line starting at address 0 contains {16'h0123, 16'hffff, 16'h4567, 16'h0000} and PE1 writes 16'h89ab to address 2, PE0 and PE1 should from then on agree that the line at address 0 contains {16'h0123, 16'h89ab, 16'h4567, 16'h0000}. In general, the coherence can be handled either by write invalidate or by write update. Invalidating simply tells the other cache that the line with the specified address is no longer valid, and must be fetched from slow data memory. Updating, which is potentially more efficient, notifies the other cache not only that the specified address's value has been changed, but also tells it the new value. Note that the caches inside each core can't talk directly to each other; they must pass signals through the processor, which is why I left those "..." operands in the description earlier.
In your TACKY pipeline, the data memory access actually had two clock cycles to work without slowing things down. Feel free to take advantage of this.
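To make the rules above a bit more concrete, here is a minimal sketch of the lookup side of a direct-mapped, 16-line cache with write-invalidate snooping. It is only one possible organization, and every name in it (dcache_lookup, snoop, snoopline, fill, fillline, filldata) is hypothetical. It assumes a 16-bit word address split into a 10-bit tag, a 4-bit index, and a 2-bit word-in-line offset, with word 0 in the low 16 bits of the line, which matches the example above where a write to address 2 replaces the second word from the left. The miss handling and the core's own write path are omitted.

```verilog
// Sketch only: module, port, and signal names are hypothetical.
module dcache_lookup(hit, word, addr, snoop, snoopline,
                     fill, fillline, filldata, reset, clk);
output hit;
output [15:0] word;
input [15:0] addr;        // word address from the core
input snoop;              // the other core wrote somewhere in a line
input [13:0] snoopline;   // line address that was written
input fill;               // install a line fetched from slow data memory
input [13:0] fillline;
input [63:0] filldata;
input reset, clk;

reg [63:0] data [0:15];
reg [9:0] tag [0:15];
reg [15:0] valid;

// addr = {tag[9:0], index[3:0], word-in-line[1:0]}
wire [3:0] index = addr[5:2];
wire [63:0] line = data[index];
wire [3:0] sidx = snoopline[3:0];
wire [9:0] stag = snoopline[13:4];

assign hit = valid[index] & (tag[index] == addr[15:6]);
assign word = line[16*addr[1:0] +: 16];  // word 0 is bits [15:0]

always @(posedge clk) begin
  if (reset) valid <= 0;
  else begin
    // Drop a fill that races with the other core writing that same line.
    if (fill && !(snoop && (snoopline == fillline))) begin
      data[fillline[3:0]] <= filldata;
      tag[fillline[3:0]] <= fillline[13:4];
      valid[fillline[3:0]] <= 1;
    end
    // Invalidate only a line we actually hold. If this hits the index
    // being filled in the same cycle, the invalidate wins (conservative).
    if (snoop && valid[sidx] && (tag[sidx] == stag)) valid[sidx] <= 0;
  end
end
endmodule
```

Write update would carry the new word along with the snooped address and merge it into the matching line instead of clearing the valid bit.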

That's all there is to it.

Well, almost all. In truth, you have a lot to figure out in terms of timing of cache operations and slow data memory access across the two cores. For example, suppose both PE0 and PE1 have the line from address 0 in their caches and PE0 tries to write into memory location 0 while PE1 tries to write into memory location 1? Suppose both want to write into memory location 3 at the same time... with potentially different values? You'll want to work all that out before you start writing the Verilog code....

Oh yeah. Just one more thing. You need to be able to disable the caches. Why? Well, I could say it was for touching I/O devices, but this is actually for a much cruder purpose here: I want you to be able to compare performance directly accessing slow data memory vs. using the caches. Be sure to comment on how performance changes with the caches enabled vs. disabled.
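For the performance comparison, one simple approach is to count clock cycles until the processor halts, once with the caches enabled and once with them disabled. The counter below is just a sketch; the module name perfcount and the cycles output are hypothetical, and how you switch the caches off (a parameter, a `define, or an input) is up to you.

```verilog
// Sketch only: perfcount and cycles are hypothetical names.
// Counts clock cycles from the end of reset until halted goes high.
module perfcount(cycles, halted, reset, clk);
output reg [31:0] cycles;
input halted, reset, clk;

always @(posedge clk) begin
  if (reset) cycles <= 0;
  else if (!halted) cycles <= cycles + 1;
end
endmodule
```

Running the same test program both ways and reporting the two cycle counts gives a direct number to comment on in your notes.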

Test Plan

Yes, you still need one. Your project needs to include a test plan (best described in your Implementor's Notes) as well as a testbench implementing the planned test procedure. The key difference here: your test plan should clearly demonstrate how both false and true sharing are handled.

Keep in mind that you only need to write one test program with one code and one data segment, but have the two cores execute different paths through the code. Given how I suggested you should initialize the registers, you can write code like:

	.text
	.origin 0
PE0:	code for PE0

	.text
	.origin 0x8000
PE1:	code for PE1

	.data
DATA:	data for both

Of course, your testing should also include disabling the caches and showing that the code works either way, while also measuring the performance gained by using cache.

Due Dates

The due date for this assignment is before the final exam, Tuesday, April 30, 2019. You may submit as many times as you wish, but only the last submission that you make will be counted toward your course grade.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

You should submit a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, the tarball should include the Verilog and AIK source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf. It also may include test cases (e.g., source TACKY code and .VMEM files), sample output, a Makefile, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). Be sure to make it obvious which files are which; for example, if the Verilog source file isn't tacky.v or the AIK file isn't tacky.aik, you should say where these things are in your implementor's notes.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

Advanced Computer Architecture.