Assignment 4: Pipelined Float + Posit Gr8BOnd

Note: when I presented this in lecture on April 8, I forgot about the float/integer and posit/integer conversion operations. Ooops. Those need to be there too. As of later that evening, I believe the omission has been corrected.

In this project, your team is going to build a much more complete pipelined implementation of Gr8BOnd, including both 16-bit floating point scalar and 8-bit Posit SWAR (two-field SIMD Within A Register parallel execution). The basic instruction set is the same as used in previous projects, but there are several improvements. Because of those, you cannot directly reuse any of the old assemblers you built in previous projects, but the general structure of both the assembler and pipelined implementation can probably remain the same.

One Little Fix

Completely unrelated to the float and posit arithmetic, I've added a dup instruction. Copying the value of a register to another took two instructions without this, and that made the compiled code inefficient. It's a very easy instruction to implement, basically an ALU operation that ignores the %d value and returns the value from %s.
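To make the idea concrete, here is a minimal C sketch of how dup might fall out of an existing ALU datapath. The opcode names and ALU interface here are illustrative assumptions, not the actual Gr8BOnd encoding:

```c
#include <stdint.h>

/* Hypothetical software model of the ALU result mux; OP_ADD stands in
   for any ordinary two-operand integer operation. */
typedef enum { OP_ADD, OP_DUP } alu_op_t;

uint16_t alu(alu_op_t op, uint16_t d, uint16_t s) {
    switch (op) {
    case OP_ADD: return (uint16_t)(d + s);  /* normal %d op %s result */
    case OP_DUP: return s;                  /* ignore %d, pass %s through */
    }
    return 0;
}
```

In hardware this is just one more case in the ALU's output selection, so it adds essentially no logic.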

16-bit Floating Point

As is now shown in red in the Gr8BOnd instruction set reference, one major change is that you'll now be implementing 16-bit floating point arithmetic. The 16-bit form you'll use is not the IEEE standard for "half precision," but is the rather more useful top 16 bits of an IEEE 754 compliant 32-bit single-precision float. Aside from preserving the dynamic range of a 32-bit float, this 16-bit format is convenient because most modern computer processors do not understand IEEE's half-precision format, but this format can be converted to IEEE 32-bit by simply adding 16 zero bits at the end. Similarly, converting from 32-bit to this 16-bit format is simply a matter of ignoring the last 16 bits of the 32-bit float bit pattern representation.
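The truncate-and-widen relationship described above can be sketched in a few lines of C; this is just a software model of the bit-level rule, not project code:

```c
#include <stdint.h>
#include <string.h>

/* Truncate a 32-bit IEEE 754 float to the 16-bit "top half" format. */
uint16_t f32_to_f16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret bits without UB */
    return (uint16_t)(bits >> 16);    /* keep sign, exponent, top mantissa */
}

/* Widen back to 32-bit by appending 16 zero bits. */
float f16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, 1.0f is 0x3F800000 as a 32-bit float, so its 16-bit form is 0x3F80.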

Before getting into the details of these floating point operations, there is a trivial change we need to make in support of 16-bit values being floating point instead of posits (which was the original plan). The integer negate instruction, negi would work fine for posits, but floating point uses a sign+magnitude notation, so negating a float is simply flipping the sign bit. Thus, we had to add a negf instruction to do that -- and it's as easy for you to implement as it sounds.
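As a sanity check on just how easy negf is, here is the whole operation as a one-line C model (bit 15 is the sign bit in this 16-bit float format):

```c
#include <stdint.h>

/* negf: in a sign+magnitude format, negation is flipping the sign bit.
   Note this differs from negi, which computes a two's-complement negate. */
uint16_t negf(uint16_t f) {
    return (uint16_t)(f ^ 0x8000);
}
```

In Verilog this is a single XOR (or bit concatenation) on the top bit of the operand.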

As for the other floating point arithmetic, although we've discussed it fairly well in class, and on the floating point reference page at the course website, it's still not easy to implement. Thus, I've given you this Verilog code implementing a variety of floating-point operations. The code isn't perfect; for example, it omits rounding modes. However, it is good enough for this project, and you can use any portion of it in this project (with appropriate citation in your Implementor's Notes). Most importantly, in part by not implementing rounding and a few other "details" of the IEEE 754 standard, this Verilog code implements all the operations you need as reasonably fast combinatorial logic.

Do not try to reuse portions of your integer ALU from the previous project in implementing the 16-bit float arithmetic: the operations are different enough that very little is shareable. However, like the integer ALU, the ALU implementing the Gr8BOnd floating-point operations can easily operate as a combinatorial logic circuit performing any operation within a single clock cycle. Thus, the integer and floating-point ALUs can be separate function units in the same pipeline stage.

Of course, since 16-bit floating-point has replaced 16-bit Posit arithmetic in the instruction set, several instructions need to change their names to allow for "f" rather than "p." To be precise, addp becomes addf, mulp becomes mulf, and invp becomes invf. There are also two new type conversion operations, pp2f and f2pp. Also keep in mind that there is no support for floating-point constants in AIK -- you'll need to enter any float constant as a hexadecimal bit pattern, and you can translate values into hexadecimal using this WWW form script.

8-bit Posit SWAR

Posit arithmetic is very similar to floating point arithmetic. This similarity becomes obvious if you examine the C/C++ source code for the bfp - Beyond Floating Point library, which is essentially the reference implementation of posit arithmetic. However, the unpacking and packing are a bit messy, and in the most obvious translation to Verilog hardware specification would probably result in a multi-cycle posit ALU. We don't want that. Although you may decide to implement the posit arithmetic any way you wish, you are encouraged to do it using simple lookup tables.

The 16-bit floating-point reciprocal algorithm given does use a 128-byte lookup table, but 16-bit float operations need to be implemented by random logic algorithms because direct table lookup of the results would require too large a table: for example, addition of two 16-bit floats would require a table with 4,294,967,296 16-bit entries -- 8GB of memory! However, that's not a problem for 8-bit posits, because operations like addition only need 65,536 8-bit entries. Another benefit is that by using the bfp library to create the tables, we ensure that the implementation produces exactly the same behavior as that reference implementation. In fact, there's even one more benefit: if we wish to change from, say, Posit(8,0) to Posit(8,1) or even Posit(8,3), all we need to change is the lookup table initialization. However, we're not talking about just one lookup table -- there are conceptually 7:

1. posit addition (16-bit index, 8-bit result)
2. posit multiplication (16-bit index, 8-bit result)
3. posit reciprocal (8-bit index, 8-bit result)
4. 16-bit float to 8-bit posit conversion, f2pp (16-bit index, 8-bit result)
5. 8-bit posit to 16-bit float conversion, pp2f (8-bit index, 16-bit result)
6. posit to integer conversion (8-bit index, 8-bit result)
7. integer to posit conversion (8-bit index, 8-bit result)

That's all fine, except it's getting to be a lot of hardware if you build it like that. Why? Well, in a single clock cycle you actually need to be able to do two of any of the first three and last two posit operations above at a time, or one of the conversions to/from 16-bit float, so an obvious implementation would use a total of 12 lookup tables! Don't do that. You really only need two tables for all that posit arithmetic. (In fact, you could combine those two, and the 128-entry float reciprocal lookup, into a single table with variable-size cells... but Verilog makes that rather awkward.) All three 16-bit-index tables can be combined to form a table that has 24-bit entries: 8-bit addition result, 8-bit multiplication result, and 8-bit posit converted from a 16-bit float. Similarly, the four 8-bit index tables can be combined to form a single table that has 40-bit entries: 8-bit reciprocal result, 16-bit float converted from 8-bit posit, 8-bit posit to integer, and 8-bit integer to posit. The final trick is that each of those two tables needs to have two address decoders so that two posit addition, multiplication, reciprocal, and posit/integer conversion operations can be done in a single clock cycle. Easy peasy table squeezy. ;-)
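A C sketch of the combined 16-bit-index table may help. Note that the byte order of the packed fields here is an assumption for illustration; check the actual posit1624.vmem layout (and mkposittables.cpp) for the real field order:

```c
#include <stdint.h>

/* Hypothetical packed 24-bit entry: low byte = addition result,
   middle byte = multiplication result, high byte = float-to-posit
   conversion (the f2pp case indexes the table with the float's 16
   bits rather than a posit pair). */
#define ADD_OF(e)  ((uint8_t)( (e)        & 0xFF))
#define MUL_OF(e)  ((uint8_t)(((e) >> 8)  & 0xFF))
#define F2PP_OF(e) ((uint8_t)(((e) >> 16) & 0xFF))

/* One lookup fetches all three results; the consumer picks the field
   it needs, so add/mul/f2pp share a single 65,536-entry table. */
uint32_t entry16(const uint32_t *tbl, uint8_t a, uint8_t b) {
    return tbl[((uint16_t)a << 8) | b];   /* 16-bit index = {a, b} */
}
```

In the Verilog version, giving this one table two read ports (two address decoders) is what lets the SWAR pipeline do two posit operations per clock.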

So, where are the tables? Well, the 16-bit index one is posit1624.vmem and the 8-bit index one is posit840.vmem. In the interest of full disclosure, here's the code I used with the bfp library to compute those tables: mkposittables.cpp. From the top-level bfp library directory, it can be compiled using:

g++ -o mkposittables -std=c++11 -Ilib -Itest -O2 -Wall -g mkposittables.cpp lib/libbfp.a

Is it normal to be implementing posits this way? No. Then again, there really isn't a normal -- as far as I know, Gr8BOnd is literally the first general-purpose processor to implement posit arithmetic. I'm also not really thrilled with these tables. Despite being of a workable size, I'd really like to see them compressed to a much smaller size and have been trying a variety of scary, researchy methods to do that... including a genetic algorithm that tries to implement smaller tables by computing an addressing hash function that allows multiple 16-bit index values that map into the same value to share the same table entry. If you think about it, each of the 16-bit index operations is really selecting which one out of just 256 possible outputs to produce, so it is possible that each of those tables could have as few as 256 entries instead of 65,536. One of my PhD graduates developed and used this idea of compressive hashing in her 2010 thesis: Muthulakshmi Muthukumarasamy, Extraction And Prediction Of System Properties Using Variable-N-Gram Modeling And Compressive Hashing. In any case, thus far, the genetic programming system I wrote (specifically in support of this project) to find cheap compressive hashing circuits hasn't found anything worth using instead of the obvious 16-bit indexing, but I'll let you know if it comes up with a better option....

In any case, only the 16-bit-index table is big enough to be any concern, and it's a size that would still fit in block RAM of most FPGAs. I was originally going to force you to write the Verilog code in a style that would be guaranteed to generate a block RAM implementation (e.g., like this or this) rather than using logic cells to construct the memory, but that could add a clock cycle and the lookup tables would really be ROM in a custom VLSI implementation anyway, so I've decided to let you access an array of registers to directly implement each table. Just structure your code so that each lookup table is indexed by no more than two unique expressions. In other words, if the 16-bit table is indexed by a in one place and b in another, to ensure no more than two ports are created, it should never be indexed by anything other than a or b.

Pipeline Structure

You can and should have a pipeline structure very similar to that for the previous project. The catch is that you now have a lot more code implementing the ALU functionality, with correspondingly more wiring and multiplexors to shuffle data where it needs to go. However, none of this is really qualitatively different... it's just more of the same, fitting into the same four-or-more-stage pipeline you used for the previous project. Incidentally, that's also how you should start: by reviewing the previous projects from each team member to create the best possible pipeline framework for adding the 16-bit float and 8-bit posit operations. To be precise, I'd suggest figuring out how to encode/decode the new instructions, building the AIK assembler, and then making sure that your selected Assignment-3-based pipeline structure still works for all the old instructions before you start adding support for executing the new instructions.

You are not allowed to use anything from another Assignment 4 team nor from an Assignment 3 team that none of your Assignment 4 team members were on. You can use things done by any of your Assignment 4 team members, including things their teams did on Assignment 3, and things provided as part of this assignment. If you find other materials, for example solutions posted from previous semesters, useful, you may borrow ideas from them, but should generally not literally copy code and you must cite the sources you borrowed ideas from in your Implementor's Notes. Although you might be able to do many things exactly as you did in the previous project, you should carefully consider everything from the instruction encoding on, because there are some new instructions in this project.

Testing

Again, the test coverage plan and testbench from Assignment 3 are probably very close to what you want. However, there are some new instructions and functionality, so you'll need to test that. Although it would be technically feasible, I do not expect you to do exhaustive testing of the new arithmetic, but again, don't wait until "everything" is done to start testing! Do incremental testing as you add support for each type of instruction to your pipeline....

Just to be clear, I do not expect you to incorporate any design for testability features in your Verilog design.

Since there have been issues with the Covered coverage analysis tool (probably relating to trace files becoming too large for the WWW form interfaced tool), it is acceptable to not run Covered to confirm line coverage. Simply don't generate a trace. However, I would expect that your test cases still obtain very close to 100% line coverage. Manually confirming line coverage isn't really hard; just do a little code review confirming that each line should be covered by one or more of your test cases.

Due Dates

I know that COVID-19 disrupting everyone's life in many ways, and having only virtual meetings, will make it difficult to stick to any schedule. However, we can't push things back any further than the date of the final exam: May 4, 2020. This gives you close to a month, but please don't think of it that way. It is strongly recommended that you treat the project as though the deadline were April 24, which still gives you more than two weeks to work on it, so that it need not interfere with end-of-semester things for your other courses.

First priority should be for you to get your assembler specification (gr8bond.aik) and Implementor's Notes together, including an overview of the structure of your intended design. That overview could be in the form of a diagram, or it could be a list of top-level modules, but it is important in that it should demonstrate that you are on the right track and thus allow assigning significant partial credit even if you don't get much farther. There will be a lot more Verilog code in this project, but it isn't really significantly more difficult than the previous pipelined implementation. After all, none of the changes to the instruction set and arithmetic need to cause any new pipeline interlocks, etc. Just keep in mind that debugging this larger Verilog code could easily take a lot longer than for the previous project.

Submission Procedure

For each project, your team (NOT each person individually) will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementors notes" document as a PDF named notes.pdf. It also may include test cases, sample output, a make file, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). Be sure to make it obvious which files are which; for example, if the Verilog source file isn't gr8bond.v or the AIK file isn't gr8bond.aik, you should be saying where these things are in your implementor's notes.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).



EE480 Advanced Computer Architecture.