Assignment 5: Transactional Memory For Pipelined AXA

This project is not really about AXA, but about building a memory system that can provide the necessary support for transactional memory. The AXA processor you'll be using is actually just a forward-only pipelined version like you built in Assignment 3. You're not building the support for reverse execution; that's being done by the teams assigned Assignment 4, and you are encouraged to talk with them about interfacing. You also will want to read Assignment 4, because it has the best explanation of what the instructions are supposed to do with respect to handling signals... like SIGTMV, which is what your project is all about.

How Reverse Execution Works

Let me say it again: you're not building support for reverse execution. However, you are building a data memory system that can generate memory access stalls and also raise a SIGTMV. When a signal is raised for which there is no handler -- and for your machine that's any signal -- the correct action is essentially to treat the signal like a sys instruction and halt. In other words, in addition to halting when fail raises a signal, your forward-only processor should still detect SIGILL and SIGTMV and respond to them by halting. To do that, you'll have to recognize when a jerr installs a handler for SIGTMV and when com uninstalls that handler.
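One way to keep track of whether a SIGTMV handler is currently installed is a single flag set by jerr and cleared by com. The sketch below is illustrative only; every signal name (is_jerr, targets_sigtmv, is_com, sigill_raised, sigtmv_raised) is a placeholder for whatever your own decode logic produces:

```verilog
// Hypothetical sketch: track whether a SIGTMV handler is installed.
// All signal names here are assumptions -- adapt to your own decode logic.
reg tmv_handler; // 1 while a jerr-installed SIGTMV handler is active

always @(posedge clk) begin
  if (reset)
    tmv_handler <= 0;
  else if (is_jerr && targets_sigtmv) // jerr installing a SIGTMV handler
    tmv_handler <= 1;
  else if (is_com)                    // com uninstalls the handler
    tmv_handler <= 0;
end

// With no real handler installed, any raised signal halts this core
always @(posedge clk)
  if (sigill_raised || sigtmv_raised)
    halted <= 1;
```

The tmv_handler flag is also exactly what the cache needs to see to know which memory references belong to a transaction.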

Welcome To Multi-Core Processing

The catch is that SIGTMV is about failing a memory transaction, and that's something that only happens when there is at least one other thing touching memory: you need to build a multi-core processor.

That's not really all that bad. You mostly just need to make at least two instances of your processor module. Incidentally, one core might halt before the other does; the simulation shouldn't stop until both cores have halted. Thus, what was your processor module becomes a module called core and your new processor module will look something like:

module processor(halted, reset, clk);
...
slowmem16 DATAMEM(mfc, rdata, addr, wdata, rnotw, strobe, clk);
core PE0(halt0, reset, clk, ...);
core PE1(halt1, reset, clk, ...);
...
endmodule

in which the "..." stuff needs to provide appropriately arbitrated access to DATAMEM from within both PE0 and PE1. You'll need to have a little arbitration logic to determine who gets to touch DATAMEM when, because there's only one of it and there are two cores that could simultaneously want access.
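A minimal way to do that arbitration is a fixed-priority scheme: PE0 wins ties, and whichever core is granted the memory keeps it until mfc signals completion. This is just one possible sketch; the module and port names below are assumptions, not a required interface:

```verilog
// Sketch of a simple fixed-priority arbiter for two cores sharing DATAMEM.
// PE0 wins ties; the winner holds the memory until mfc (memory function
// complete) comes back. All names here are illustrative.
module memarb(input clk, input reset,
              input req0, input req1,   // per-core memory request lines
              input mfc,                // completion strobe from slowmem16
              output reg grant0, output reg grant1);
  reg busy;
  always @(posedge clk) begin
    if (reset) begin
      busy <= 0; grant0 <= 0; grant1 <= 0;
    end else if (!busy) begin
      if (req0)      begin busy <= 1; grant0 <= 1; end
      else if (req1) begin busy <= 1; grant1 <= 1; end
    end else if (mfc) begin
      busy <= 0; grant0 <= 0; grant1 <= 0;
    end
  end
endmodule
```

The grant lines would then steer each core's addr, wdata, rnotw, and strobe onto the single DATAMEM port, and route mfc and rdata back to the granted core.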

Before we dive into how the slowmem16 works, let's make it clear that each core can contain its own copy of instruction memory, which you may access directly within a single clock cycle by indexing a local register array. It is up to you if these instruction memories have identical contents in all cores. If they do (which I recommend), then you'll need some way to ensure that the cores can be executing different paths through the code: the easiest is to initialize one or more registers with a different value for each core.

I'd suggest initializing the PC in PE0 to 16'h0000 and the PC in PE1 to 16'h8000. That way, they can easily be executing different programs despite having identical copies of instruction memory. Similarly, I'd suggest $sp be initialized respectively to 16'hbfff and 16'hffff... which essentially reserves 1/4 of the memory space for each program's stack. It wouldn't be good for them to hit each other's stacks, would it? Similarly, I'd suggest starting static structures in data memory at 16'h0000 and 16'h4000 respectively. Keep in mind that both processors can freely access anything in data memory as long as they know its address, and both programs can easily know all addresses if you run the code for both together through your AIK-generated assembler.
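One tidy way to get those per-core differences is a module parameter, so both cores come from one core module but reset to different PC and $sp values. This sketch assumes $sp lives at a particular register number and a particular register file size, both of which are illustrative only:

```verilog
// Illustrative: per-core PC and $sp initial values via a CORENUM parameter.
// The register file size and the register number used for $sp are
// assumptions -- substitute your own.
module core #(parameter CORENUM = 0)
             (output reg halted, input reset, input clk
              /* ... memory-interface ports ... */);
  reg [15:0] pc;
  reg [15:0] regfile [0:63];   // size illustrative
  always @(posedge clk) begin
    if (reset) begin
      halted <= 0;
      pc <= (CORENUM == 0) ? 16'h0000 : 16'h8000;
      regfile[1] <= (CORENUM == 0) ? 16'hbfff : 16'hffff; // assuming $sp is reg 1
    end
  end
endmodule
```

Instantiation would then look like core #(.CORENUM(1)) PE1(halt1, reset, clk, ...), keeping the two cores textually identical.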

The Slow Data Memory

The Verilog code for the slowmem16 module is slowmem16.v. It uses line-based addressing with a 16-bit word size, for a total of 65,536 16-bit lines of memory. Yes, that's basically a single word per line. It takes `MEMDELAY (4 by default) clock cycles to complete a memory line read. The interface is pretty much the same interface that the memory had back in EE380, but with a few tweaks:

This slow memory module counts clk cycles to delay completion of a memory read (and potential write) for `MEMDELAY cycles -- by default, 4 cycles. During that time, the memory will not accept another request. (Note that this is very different from the slowmem64.v used in a previous semester, and thus requires very different handling.)

Also note that you are allowed (and expected) to insert memory initialization code in the slowmem16 model. That's how you'll get initialized data into your multi-core processor.
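To make the timing behavior concrete, here is a rough behavioral sketch of a module with the port list shown earlier (mfc, rdata, addr, wdata, rnotw, strobe, clk). This is NOT the provided slowmem16.v -- consult the real file for the authoritative behavior -- but it illustrates the count-down delay and where an initial block for data initialization would go:

```verilog
// Rough behavioral sketch of a slowmem16-style module. This is an
// approximation for discussion, not the provided slowmem16.v.
`define MEMDELAY 4
module slowmem16sketch(output reg mfc = 0, output reg [15:0] rdata,
                       input [15:0] addr, input [15:0] wdata,
                       input rnotw, input strobe, input clk);
  reg [15:0] m [0:65535];   // 65,536 16-bit lines
  reg [2:0] count = 0;      // cycles remaining on current request
  reg [15:0] heldaddr, helddata;
  reg heldrnotw;

  initial begin
    // Memory initialization goes here; address and value are illustrative.
    m[16'h4000] = 16'h1234;
  end

  always @(posedge clk) begin
    mfc <= 0;
    if (count != 0) begin
      count <= count - 1;
      if (count == 1) begin           // request completes this cycle
        if (heldrnotw) rdata <= m[heldaddr];
        else m[heldaddr] <= helddata;
        mfc <= 1;
      end
    end else if (strobe) begin        // accept a new request when idle
      heldaddr <= addr; helddata <= wdata; heldrnotw <= rnotw;
      count <= `MEMDELAY;
    end
  end
endmodule
```

Note how the module ignores strobe while count is nonzero, which is exactly why your arbitration logic must hold off the second core's request.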

The Caches

This project is primarily about the caches and data memory access control logic... so let's talk about the caches:

Of course, the caches will help performance in the usual ways, and each core may be accessing its cache simultaneously. The catch is that a cache miss will require at least 4 clock cycles (more if the slow memory interface is busy), so your pipelined core must be able to stall waiting for a memory reference to complete.
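The stall itself can be as simple as gating the pipeline register updates. The names below (mem_pending, mem_mfc, the pipeline latch names) are placeholders for your own signals:

```verilog
// Illustrative stall: pipeline latches advance only when no data memory
// access is outstanding. All names are assumptions.
wire stall = mem_pending & ~mem_mfc;

always @(posedge clk) begin
  if (!stall) begin
    if_id <= if_id_next;   // advance the pipeline latches as usual
    id_ex <= id_ex_next;
    ex_wb <= ex_wb_next;
  end
  // while stalled, the latches simply hold, freezing the whole pipe
end
```

Freezing the whole pipe is the simplest correct policy here; since the miss penalty is only a few cycles, fancier partial-stall schemes buy little.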

Transactional Memory Request Handling

A memory transaction is defined as any data memory references that happen while a handler is installed for SIGTMV. Thus, memory references that happen before a jerr establishing a handler for SIGTMV are not transactions, nor are ones that happen after a com. Here's how memory references during a transaction work:

I know that seems very complex... and it sort of is. However, this is trying to accomplish something very researchy -- hardware recognition of transactional memory is not something very many people have successfully implemented. I think the above scheme works, but if I'm wrong, that's definitely something you'll want to point out in your Implementor's Notes. After all:

Basically, you're really just building a pair of coherent, fully-associative, approximate LRU caches. The magic that makes it possible to detect transactional failures is that the combination of features ensures parts of a transaction don't get kicked out of cache prematurely... which would be a huge problem in that the cache entries are really how the hardware ensures the transaction is accomplished without conflicts.

Executive Summary

You are to instantiate two cores, each with its own data cache, but maintained coherent as they vie for access to the slow data memory... which might cause a memory access to stall the core. The slow memory access is mediated by arbitration logic you will build. Each line in each cache needs to have a 16-bit address (used for fully-associative matching), a 16-bit line of data, a 1-bit approximate LRU timestamp, a dirty bit, and a bit recording if it is part of a transaction (or perhaps just one such mark for the whole cache?). You also need to have the cores tell their caches when a handler for SIGTMV is installed (by jerr) or removed (by com) so the cache can detect transactional memory violations. However, that's really all you're doing... in other words, this project is not really about building a processor, but just a somewhat fancy memory interface.
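The per-line state listed above maps directly onto a handful of parallel register arrays. The line count and names below are illustrative choices, not requirements:

```verilog
// One possible layout for the per-line cache state listed above
// (fully associative; the line count and names are illustrative).
parameter LINES = 8;                 // number of cache lines -- your choice
reg [15:0] tag     [0:LINES-1];      // full 16-bit address, matched associatively
reg [15:0] data    [0:LINES-1];      // one 16-bit line (= one word) of data
reg        lru     [0:LINES-1];      // 1-bit approximate-LRU mark
reg        dirty   [0:LINES-1];      // line modified, needs write-back?
reg        valid   [0:LINES-1];      // line holds anything at all?
reg        intrans [0:LINES-1];      // touched during the current transaction?
```

A valid bit isn't mentioned in the list above but you'll almost certainly want one (or a reserved tag value) to distinguish empty lines from real entries.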

Testing

The test coverage plan and testbench from Assignment 3 are not all that close to what you want here. You can't reuse the old test program. In fact, you'll need more than one test program.

Why? Well, you really need to have two separate programs for the two cores. However, beyond that, whenever a transaction fails, that processor doesn't really have a handler (that's what the Assignment 4 teams are making), so it will halt. Since there is more than one case you need to check for a transaction failing, you'll need to run multiple test programs.

On the bright side, I'm not requiring you to test all the instructions in the cores. This project is about the cache and transaction handling, so you just need to test the instructions you need to use to ensure that your caches work as they are supposed to. In other words, you want coverage of the cache operations and transaction handling, not the cores per se. Thus, you'll need more than one test program, but they each can be quite simple.

Due Dates

The recommended due date is Friday, December 13, 2019. By that time, you should definitely have at least submitted something that includes the assembler specification (axa.aik), and Implementor's Notes including an overview of the structure of your intended design. That overview could be in the form of a diagram, or it could be a list of top-level modules, but it is important in that it ensures you are on the right track. Final submissions will be accepted up to just before the final exam at 8AM on Thursday, December 19, 2019.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

For each project, you will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf. It also may include test cases, sample output, a Makefile, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). Be sure to make it obvious which files are which; for example, if the Verilog source file isn't axa.v or the AIK file isn't axa.aik, you should be saying where these things are in your implementor's notes.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

EE480 Advanced Computer Architecture.