Assignment 5: Transactional Memory For Pipelined AXA

This project is not really about AXA, but about building a memory system that can provide the necessary support for transactional memory. The AXA processor you'll be using is actually just a forward-only pipelined version like you built in Assignment 3. You're not building the support for reverse execution; that's being done by the teams assigned Assignment 4, and you are encouraged to talk with them about interfacing. You also will want to read Assignment 4, because it has the best explanation of what the instructions are supposed to do with respect to handling signals... like SIGTMV, which is what your project is all about.

How Reverse Execution Works

Let me say it again: you're not building support for reverse execution. However, you are building a data memory system that can generate memory access stalls and also raise a SIGTMV. When a signal is raised for which there is no handler -- and for your machine that's any signal -- the correct action is essentially to treat the signal like a sys instruction and halt. In other words, in addition to halting when fail raises a signal, your forward-only processor should still detect SIGILL and SIGTMV and respond to them by halting. To do that, you'll have to recognize when a jerr installs a handler for SIGTMV and when com uninstalls that handler.
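One way to keep track of whether a SIGTMV handler is currently installed is a single flag set by jerr and cleared by com. The sketch below is illustrative only; every signal name (is_jerr, targets_sigtmv, is_com, sigill_raised, sigtmv_raised) is a placeholder for whatever your own decode logic produces:

```verilog
// Hypothetical sketch: track whether a SIGTMV handler is installed.
// All signal names here are assumptions -- adapt to your own decode logic.
reg tmv_handler; // 1 while a jerr-installed SIGTMV handler is active

always @(posedge clk) begin
  if (reset)
    tmv_handler <= 0;
  else if (is_jerr && targets_sigtmv) // jerr installing a SIGTMV handler
    tmv_handler <= 1;
  else if (is_com)                    // com uninstalls the handler
    tmv_handler <= 0;
end

// With no real handler installed, any raised signal halts this core
always @(posedge clk)
  if (sigill_raised || sigtmv_raised)
    halted <= 1;
```

The tmv_handler flag is also exactly what the cache needs to see to know which memory references belong to a transaction.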

Welcome To Multi-Core Processing

The catch is that SIGTMV is about failing a memory transaction, and that's something that only happens when there is at least one other thing touching memory: you need to build a multi-core processor.

That's not really all that bad. You mostly just need to make at least two instances of your processor module. Incidentally, one core might halt before the other does; the simulation shouldn't stop until both cores have halted. Thus, what was your processor module becomes a module called core and your new processor module will look something like:

module processor(halted, reset, clk);
...
slowmem16 DATAMEM(mfc, rdata, addr, wdata, rnotw, strobe, clk);
core PE0(halt0, reset, clk, ...);
core PE1(halt1, reset, clk, ...);
...
endmodule

in which the "..." stuff needs to provide appropriately arbitrated access to DATAMEM from within both PE0 and PE1. You'll need to have a little arbitration logic to determine who gets to touch DATAMEM when, because there's only one of it and there are two cores that could simultaneously want access.
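A minimal way to do that arbitration is a fixed-priority scheme: PE0 wins ties, and whichever core is granted the memory keeps it until mfc signals completion. This is just one possible sketch; the module and port names below are assumptions, not a required interface:

```verilog
// Sketch of a simple fixed-priority arbiter for two cores sharing DATAMEM.
// PE0 wins ties; the winner holds the memory until mfc (memory function
// complete) comes back. All names here are illustrative.
module memarb(input clk, input reset,
              input req0, input req1,   // per-core memory request lines
              input mfc,                // completion strobe from slowmem16
              output reg grant0, output reg grant1);
  reg busy;
  always @(posedge clk) begin
    if (reset) begin
      busy <= 0; grant0 <= 0; grant1 <= 0;
    end else if (!busy) begin
      if (req0)      begin busy <= 1; grant0 <= 1; end
      else if (req1) begin busy <= 1; grant1 <= 1; end
    end else if (mfc) begin
      busy <= 0; grant0 <= 0; grant1 <= 0;
    end
  end
endmodule
```

The grant lines would then steer each core's addr, wdata, rnotw, and strobe onto the single DATAMEM port, and route mfc and rdata back to the granted core.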

Before we dive into how the slowmem16 works, let's make it clear that each core can contain its own copy of instruction memory, which you may access directly within a single clock cycle by indexing a local register array. It is up to you if these instruction memories have identical contents in all cores. If they do (which I recommend), then you'll need some way to ensure that the cores can be executing different paths through the code: the easiest is to initialize one or more registers with a different value for each core.

I'd suggest initializing the PC in PE0 to 16'h0000 and the PC in PE1 to 16'h8000. That way, they can easily be executing different programs despite having identical copies of instruction memory. Similarly, I'd suggest $sp be initialized respectively to 16'hbfff and 16'hffff... which essentially reserves 1/4 of the memory space for each program's stack. It wouldn't be good for them to hit each other's stacks, would it? Similarly, I'd suggest starting static structures in data memory at 16'h0000 and 16'h4000 respectively. Keep in mind that both processors can freely access anything in data memory as long as they know its address, and both programs can easily know all addresses if you run the code for both together through your AIK-generated assembler.
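One tidy way to get those per-core differences is a module parameter, so both cores come from one core module but reset to different PC and $sp values. This sketch assumes $sp lives at a particular register number and a particular register file size, both of which are illustrative only:

```verilog
// Illustrative: per-core PC and $sp initial values via a CORENUM parameter.
// The register file size and the register number used for $sp are
// assumptions -- substitute your own.
module core #(parameter CORENUM = 0)
             (output reg halted, input reset, input clk
              /* ... memory-interface ports ... */);
  reg [15:0] pc;
  reg [15:0] regfile [0:63];   // size illustrative
  always @(posedge clk) begin
    if (reset) begin
      halted <= 0;
      pc <= (CORENUM == 0) ? 16'h0000 : 16'h8000;
      regfile[1] <= (CORENUM == 0) ? 16'hbfff : 16'hffff; // assuming $sp is reg 1
    end
  end
endmodule
```

Instantiation would then look like core #(.CORENUM(1)) PE1(halt1, reset, clk, ...), keeping the two cores textually identical.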

The Slow Data Memory

The Verilog code for the slowmem16 module is slowmem16.v. It uses line-based addressing with a 16-bit word size, for a total of 65,536 16-bit lines of memory. Yes, that's basically a single word per line. It takes `MEMDELAY (4 by default) clock cycles to complete a memory line read. The interface is pretty much the same interface that the memory had back in EE380, but with a few tweaks:

This slow memory module counts clk cycles to delay completion of a memory read (and potential write) for `MEMDELAY cycles -- by default, 4 cycles. During that time, the memory will not accept another request. (Note that this is very different from the slowmem64.v used in a previous semester, and thus requires very different handling.)

Also note that you are allowed (and expected) to insert memory initialization code in the slowmem16 model. That's how you'll get initialized data into your multi-core processor.
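To make the timing behavior concrete, here is a rough behavioral sketch of a module with the port list shown earlier (mfc, rdata, addr, wdata, rnotw, strobe, clk). This is NOT the provided slowmem16.v -- consult the real file for the authoritative behavior -- but it illustrates the count-down delay and where an initial block for data initialization would go:

```verilog
// Rough behavioral sketch of a slowmem16-style module. This is an
// approximation for discussion, not the provided slowmem16.v.
`define MEMDELAY 4
module slowmem16sketch(output reg mfc = 0, output reg [15:0] rdata,
                       input [15:0] addr, input [15:0] wdata,
                       input rnotw, input strobe, input clk);
  reg [15:0] m [0:65535];   // 65,536 16-bit lines
  reg [2:0] count = 0;      // cycles remaining on current request
  reg [15:0] heldaddr, helddata;
  reg heldrnotw;

  initial begin
    // Memory initialization goes here; address and value are illustrative.
    m[16'h4000] = 16'h1234;
  end

  always @(posedge clk) begin
    mfc <= 0;
    if (count != 0) begin
      count <= count - 1;
      if (count == 1) begin           // request completes this cycle
        if (heldrnotw) rdata <= m[heldaddr];
        else m[heldaddr] <= helddata;
        mfc <= 1;
      end
    end else if (strobe) begin        // accept a new request when idle
      heldaddr <= addr; helddata <= wdata; heldrnotw <= rnotw;
      count <= `MEMDELAY;
    end
  end
endmodule
```

Note how the module ignores strobe while count is nonzero, which is exactly why your arbitration logic must hold off the second core's request.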

The Caches

This project is primarily about the caches and data memory access control logic... so let's talk about the caches:

Of course, the caches will help performance in the usual ways, and each core may be accessing its cache simultaneously. The catch is that a cache miss will require at least 4 clock cycles (more if the slow memory interface is busy), so your pipelined core must be able to stall waiting for a memory reference to complete.
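The stall itself can be as simple as gating the pipeline register updates. The names below (mem_pending, mem_mfc, the pipeline latch names) are placeholders for your own signals:

```verilog
// Illustrative stall: pipeline latches advance only when no data memory
// access is outstanding. All names are assumptions.
wire stall = mem_pending & ~mem_mfc;

always @(posedge clk) begin
  if (!stall) begin
    if_id <= if_id_next;   // advance the pipeline latches as usual
    id_ex <= id_ex_next;
    ex_wb <= ex_wb_next;
  end
  // while stalled, the latches simply hold, freezing the whole pipe
end
```

Freezing the whole pipe is the simplest correct policy here; since the miss penalty is only a few cycles, fancier partial-stall schemes buy little.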

Transactional Memory Request Handling

A memory transaction is defined as any data memory references that happen while a handler is installed for SIGTMV. Thus, memory references that happen before a jerr establishing a handler for SIGTMV are not transactions, nor are ones that happen after a com. Here's how memory references during a transaction work:

I know that seems very complex... and it sort of is. However, this is trying to accomplish something very researchy -- hardware recognition of transactional memory is not something very many people have successfully implemented. I think the above scheme works, but if I'm wrong, that's definitely something you'll want to point out in your Implementor's Notes. After all:

Basically, you're really just building a pair of coherent, fully-associative, approximate LRU caches. The magic that makes it possible to detect transactional failures is that the combination of features ensures parts of a transaction don't get kicked out of cache prematurely... which would be a huge problem in that the cache entries are really how the hardware ensures the transaction is accomplished without conflicts.

Executive Summary

You are to instantiate two cores, each with its own data cache, but maintained coherent as they vie for access to the slow data memory... which might cause a memory access to stall the core. The slow memory access is mediated by arbitration logic you will build. Each line in each cache needs to have a 16-bit address (used for fully-associative matching), a 16-bit line of data, a 1-bit approximate LRU timestamp, a dirty bit, and a bit recording if it is part of a transaction (or perhaps just one such mark for the whole cache?). You also need to have the cores tell their caches when a handler for SIGTMV is installed (by jerr) or removed (by com) so the cache can detect transactional memory violations. However, that's really all you're doing... in other words, this project is not really about building a processor, but just a somewhat fancy memory interface.
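The per-line state listed above maps directly onto a handful of parallel register arrays. The line count and names below are illustrative choices, not requirements:

```verilog
// One possible layout for the per-line cache state listed above
// (fully associative; the line count and names are illustrative).
parameter LINES = 8;                 // number of cache lines -- your choice
reg [15:0] tag     [0:LINES-1];      // full 16-bit address, matched associatively
reg [15:0] data    [0:LINES-1];      // one 16-bit line (= one word) of data
reg        lru     [0:LINES-1];      // 1-bit approximate-LRU mark
reg        dirty   [0:LINES-1];      // line modified, needs write-back?
reg        valid   [0:LINES-1];      // line holds anything at all?
reg        intrans [0:LINES-1];      // touched during the current transaction?
```

A valid bit isn't mentioned in the list above but you'll almost certainly want one (or a reserved tag value) to distinguish empty lines from real entries.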

Testing

The test coverage plan and testbench from Assignment 3 are not all that close to what you want here. You can't reuse the old test program. In fact, you'll need more than one test program.

Why? Well, you really need to have two separate programs for the two cores. However, beyond that, whenever a transaction fails, that processor doesn't really have a handler (that's what the Assignment 4 teams are making), so it will halt. Since there is more than one case you need to check for a transaction failing, you'll need to run multiple test programs.

On the bright side, I'm not requiring you to test all the instructions in the cores. This project is about the cache and transaction handling, so you just need to test the instructions you need to use to ensure that your caches work as they are supposed to. In other words, you want coverage of the cache operations and transaction handling, not the cores per se. Thus, you'll need more than one test program, but they each can be quite simple.

Due Dates

The recommended due date is Friday, December 13, 2019. By that time, you should definitely have at least submitted something that includes the assembler specification (axa.aik), and Implementor's Notes including an overview of the structure of your intended design. That overview could be in the form of a diagram, or it could be a list of top-level modules, but it is important in that it ensures you are on the right track. Final submissions will be accepted up to just before the final exam at 8AM on Thursday, December 19, 2019.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

For each project, you will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf. It also may include test cases, sample output, a Makefile, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). Be sure to make it obvious which files are which; for example, if the Verilog source file isn't axa.v or the AIK file isn't axa.aik, you should be saying where these things are in your implementor's notes.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

EE480 Advanced Computer Architecture.