Why the Modern ISA is moving towards a RISC-V ISA

Ever since the RISC and CISC wars that raged in the late 1990s, people have claimed that the distinction between RISC and CISC doesn’t matter anymore. Many will claim that instruction-sets are irrelevant.

But instruction-sets matter. They put limits on what kind of optimizations you can easily add to a microprocessor.

I have lately been learning more about the RISC-V instruction-set architecture (ISA) and here are some of the things which really impress me about the RISC-V ISA:

  1. It is a RISC instruction set which is small and easy to learn (47 instructions in the base set). Very favorable to anyone interested in learning about microprocessors. RISC-V Cheat Sheet.
  2. Dominant architecture used for teaching and in universities: Why Universities Want RISC-V.
  3. It is cleverly designed to allow CPU builders to create high performance microprocessors using a RISC-V ISA.
  4. With no license fees and being designed to allow simple hardware implementations, a dedicated hobbyist could in principle make his own RISC-V CPU design in reasonable time.
  5. Open Source designs readily available to modify and play with: The Berkeley Out-of-Order (BOOM) RISC-V Processor.

The Revenge of RISC

As I have begun to understand RISC-V better, I realize that RISC-V is a radical shift back to what many thought was a bygone era of computing. In terms of design, RISC-V is almost like taking a time machine back to the classic Reduced Instruction Set Computer (RISC) of the early 80s and 90s.

Many have remarked in recent years that the RISC and CISC distinction no longer matters because RISC CPUs such as ARM have added so many instructions, many fairly complex, that they are more of a hybrid today than pure RISC CPUs. Similar sentiments are uttered about other RISC CPUs such as PowerPC.

RISC-V in contrast is really hard core about being a RISC CPU. In fact if you read discussions online about RISC-V you will find people claiming RISC-V has been made by some old school RISC radicals who refuse to get with the times.

A former ARM engineer, Erin Shepherd, wrote an interesting criticism of RISC-V some years ago:

The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and number of instructions.

Let me give some quick context. Keeping code small is advantageous to performance because it makes it easier to keep the code you are running inside high speed CPU cache.

The criticism here is that the RISC-V designers have been too concerned with having a small instruction-set. That is after all one of the original RISC goals.

The claimed consequence of this is that a realistic program will require far more instructions to get stuff done, and thus consume more space in memory.

The conventional wisdom for many years has been that RISC processors should add more instructions and become more CISC-like. The idea is that more specialized instructions can replace the use of multiple generic instructions.

Compressed Instructions and Macro-Operation Fusion

However there are two innovations in CPU design in particular which in many ways render this strategy of adding more complex instructions redundant:

  • Compressed instructions: Instructions are compressed in memory and decompressed in the first stage of the CPU.
  • Macro-operation Fusion: Two or more simple instructions read by the CPU are fused into one complex instruction.

ARM actually employs both of these strategies already and x86 CPUs utilize the latter, so this isn’t a new trick RISC-V is pulling.

However here is the kicker: RISC-V gets far more mileage out of these strategies for two important reasons:

  1. Compressed instructions were included from the start. The Thumb2 compressed instruction format used on ARM had to be retrofitted by adding it as a separate ISA. This requires an internal mode switch and a separate decoder to handle. RISC-V compressed instructions can be added to a CPU with a minuscule 400 extra logic gates (AND, OR, NOR, NAND gates).
  2. The RISC obsession with keeping the number of unique instructions low pays off. There is simply more room to fit compressed instructions.

Instruction Encoding

The latter point needs a bit of elaboration. Instructions are typically 32 bits wide on RISC architectures. These bits need to be used to encode different information. E.g. say you have an instruction such as this (a hash mark starts a comment):

ADD x1, x4, x8    # x1 ← x4 + x8

This adds the contents of registers x4 and x8 and stores the result in x1. How many bits we need to encode this depends on the number of registers we have. RISC-V and ARM64 have 32 registers. A choice among 32 registers can be expressed with 5 bits:

2⁵ = 32

Since we have 3 different registers to specify, we need a total of 15 bits (3 × 5) to encode the operands (inputs to the add operation).
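
To make the bit budget concrete, here is a minimal Python sketch that packs ADD x1, x4, x8 into a 32-bit word. The field positions follow the R-type layout in the RISC-V specification; the function name encode_r_type and the constant names are my own, made up for illustration.

```python
# Field layout of the RISC-V R-type format (register-to-register operations).
OPCODE_OP = 0b0110011    # major opcode shared by ADD, SUB, AND, OR, ...
FUNCT3_ADD = 0b000       # together with funct7, selects ADD within that group
FUNCT7_ADD = 0b0000000


def encode_r_type(rd, rs1, rs2, funct3=FUNCT3_ADD, funct7=FUNCT7_ADD,
                  opcode=OPCODE_OP):
    """Pack an R-type instruction into a 32-bit word (hypothetical helper)."""
    for reg in (rd, rs1, rs2):
        if not 0 <= reg < 32:
            raise ValueError("register numbers must fit in 5 bits")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


word = encode_r_type(rd=1, rs1=4, rs2=8)   # ADD x1, x4, x8
operand_bits = 3 * 5                       # three 5-bit register fields
remaining_bits = 32 - operand_bits         # left for opcode, funct3, funct7
print(f"{word:#010x}, {remaining_bits} bits left to name the operation")
```

The three register fields eat 15 of the 32 bits, and the remaining 17 bits are split between the opcode and the two function fields.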

Thus the more stuff we want to support in our instruction-set, the more bits we consume out of our available 32 bits. Sure we could go to 64-bit instructions, but that would consume far too much memory and thus kill performance.

By aggressively keeping the number of instructions low, RISC-V leaves more room to add bits that express that we are using compressed instructions. If the CPU sees that certain bits in the instruction are set, it knows that it should be interpreted as a compressed instruction.
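
As a rough illustration of that check: in the standard RISC-V encoding the two lowest bits of an instruction are 11 for a 32-bit instruction and anything else for a 16-bit one. The sketch below assumes only those two lengths exist (it ignores the longer formats the specification reserves), and the function names are hypothetical.

```python
def is_compressed(parcel):
    """Return True if this 16-bit parcel is a compressed instruction.

    The two lowest bits are 0b11 for 32-bit instructions and anything
    else for 16-bit ones.
    """
    return (parcel & 0b11) != 0b11


def instruction_length(parcel):
    """Length in bytes, assuming only 16-bit and 32-bit encodings are used."""
    return 2 if is_compressed(parcel) else 4


low_half_of_add = 0x008200B3 & 0xFFFF   # ADD x1, x4, x8 (a 32-bit instruction)
c_add = 0x9222                          # C.ADD x4, x8 (a 16-bit instruction)
print(instruction_length(low_half_of_add), instruction_length(c_add))
```

The decoder only has to look at two bits to know how long the instruction is, which is a big part of why the extra hardware stays small.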

Compressed Instructions — Two in One

That means instead of fitting one instruction inside a 32-bit word we can fit two instructions which are 16 bits wide each. Naturally not all RISC-V instructions can be expressed in the 16-bit format. Thus a subset of the 32-bit instructions is picked based on their utility and frequency of use. While the uncompressed instructions can take 3 operands (inputs), the compressed instructions can only take 2 operands. Thus a compressed ADD instruction would look like this:

C.ADD x4, x8     # x4 ← x4 + x8

RISC-V assembly uses the C. prefix to indicate that an instruction should be converted to a compressed instruction by the assembler. But actually you don’t need to write this. A RISC-V assembler will be able to pick compressed instructions over uncompressed ones when applicable.

Basically compressed instructions reduce the number of operands. Three register operands would have consumed 15 bits, leaving us with only 1 bit to specify the operation! By using two operands instead, we spend 10 bits on registers and have 6 bits left to specify the opcode (operation to perform).
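
Here is a small sketch of how those 16 bits are divided for C.ADD. It follows the CR (compressed register) format from the RISC-V "C" extension: a 4-bit function code, two 5-bit registers and a 2-bit quadrant code. The helper name encode_c_add is made up for this example.

```python
def encode_c_add(rd, rs2):
    """Pack C.ADD rd, rs2 into a 16-bit halfword (CR format)."""
    if not (0 < rd < 32 and 0 < rs2 < 32):
        raise ValueError("C.ADD needs two registers in the range x1..x31")
    funct4 = 0b1001   # picks C.ADD among the instructions in this quadrant
    op = 0b10         # quadrant 2; any value except 0b11 means "16-bit"
    return (funct4 << 12) | (rd << 7) | (rs2 << 2) | op


halfword = encode_c_add(rd=4, rs2=8)          # C.ADD x4, x8
bits_for_operands = 2 * 5                     # two 5-bit register fields
bits_for_operation = 16 - bits_for_operands   # 4-bit funct4 + 2-bit op
print(f"{halfword:#06x}, {bits_for_operation} bits name the operation")
```

The 6 bits that remain are exactly the 4-bit function code plus the 2-bit quadrant code.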

This is in fact close to how x86 assembly works, where not enough bits were reserved to have 3 register operands. Instead x86 spends bits to allow e.g. an ADD instruction to read input from both memory and registers.

Macro-operation Fusion — One to Two

However it is when we combine instruction compression with Macro-operation fusion where we see the real payoff. You see, if the CPU gets a 32-bit word containing two compressed 16-bit instructions, it can fuse these into a single complex instruction.

That sounds like nonsense, aren’t we just back to the start then? Aren’t we back to a CISC style CPU, which is what we are trying to avoid?

Nope, because we avoid filling up the ISA specification with lots of complex instructions, the x86 and ARM strategy. Instead we are basically expressing a whole host of complex instructions indirectly through various combinations of simple instructions.

Under normal circumstances there is a problem with macro-fusion: While two instructions can be replaced by one, they still consume twice as much space in memory. But with instruction compression we are not consuming any more space. We get the best of both worlds.

Let us look at one of the examples from Erin Shepherd. In her criticism of the RISC-V ISA, she shows a simple C function. I am taking some liberty to rewrite for clarity:

int get_index(int *array, int i) {
    return array[i];
}

On x86 this compiles into:

mov eax, [rdi+rsi*4]
ret

When you call a function in a programming language, the arguments are typically passed to the function in registers according to an established convention, which will depend on the instruction-set you are using. On x86, the first argument is placed in register rdi, the second in rsi. By convention return values have to be placed in register eax.

The first instruction multiplies the content of rsi by 4. It contains our i variable. Why multiply? Since the array is made up of integer elements, they are spaced 4 bytes apart. Thus the element at index 3 in the array is actually at byte offset 3 × 4 = 12.

Afterwards we add this to rdi which contains the base address of array. This gives us the final address of the ith element of array. We read the content of the memory cell at that address and store it in eax: Mission accomplished.
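
The same address arithmetic can be sketched in a few lines of Python, treating memory as a flat byte string. The memory layout and the base address 8 are invented for the example, and a C int is assumed to be 4 bytes stored little-endian.

```python
import struct

INT_SIZE = 4  # bytes per C int (assumption, true for the targets discussed)


def get_index(memory, base, i):
    """Mimic mov eax, [rdi+rsi*4]: load a 32-bit int from base + i * 4."""
    address = base + i * INT_SIZE
    (value,) = struct.unpack_from("<i", memory, address)
    return value


# Toy memory image: 8 unrelated bytes, then the array {10, 20, 30, 40}.
memory = bytes(8) + struct.pack("<4i", 10, 20, 30, 40)
array_base = 8

print(get_index(memory, array_base, 3))   # element at byte offset 3 * 4 = 12
```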

On ARM it is quite similar:

LDR r0, [r0, r1, lsl #2]
BX lr ; return

Here we are not multiplying by 4, but shifting register r1 2 bits to the left, which is equivalent to multiplying by 4. This is probably a more faithful representation of what happens in the x86 code as well. The x86 addressing mode only allows you to scale by 1, 2, 4 or 8, all powers of 2, since general multiplication is a fairly complex operation. Shifting is cheap and simple.

Anyway you can pretty much guess the rest from my x86 description. Now let us get to RISC-V, where the real fun begins! (hash starts comments)

SLLI a1, a1, 2     # a1 ← a1 << 2
ADD  a0, a0, a1    # a0 ← a0 + a1
LW   a0, 0(a0)     # a0 ← [a0 + 0]
RET

On RISC-V registers a0 and a1 are just aliases for x10 and x11. These are where the first and second argument of a function call are placed. RET is a pseudo instruction (shorthand):

JALR x0, 0(ra)     # pc ← ra + 0
                   # x0 ← old pc + 4, result ignored

JALR performs a jump to the address in ra, which holds the return address. ra is an alias for x1.
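
If it helps to see the RISC-V sequence step by step, here is a toy Python walk-through that uses a dict as the register file. It is only a sketch: it assumes 64-bit registers and a little-endian flat memory, and the memory contents are invented for the example.

```python
import struct

XLEN_MASK = (1 << 64) - 1   # assuming a 64-bit RISC-V core


def run_get_index(memory, array_base, i):
    """Execute SLLI, ADD and LW by hand, one register update per instruction."""
    regs = {"a0": array_base, "a1": i}                    # arguments in a0, a1
    regs["a1"] = (regs["a1"] << 2) & XLEN_MASK            # SLLI a1, a1, 2
    regs["a0"] = (regs["a0"] + regs["a1"]) & XLEN_MASK    # ADD  a0, a0, a1
    (regs["a0"],) = struct.unpack_from("<i", memory, regs["a0"] + 0)  # LW a0, 0(a0)
    return regs["a0"]                                     # RET: result is in a0


# Toy memory image: 8 unrelated bytes, then the array {10, 20, 30, 40}.
memory = bytes(8) + struct.pack("<4i", 10, 20, 30, 40)
print(run_get_index(memory, array_base=8, i=2))
```

Each line maps to one instruction, which is what makes the sequence so easy to decode, and also why it is longer than the x86 and ARM versions.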

Anyway, this just looks absolutely terrible right? Twice as many instructions for such a simple and common operation as doing an index based lookup in a table and returning the result.

It does look bad indeed. That is why Erin Shepherd was highly critical of the design choices made by the RISC-V guys. She writes:

RISC-V’s simplifications make the decoder (i.e. CPU frontend) easier, at the expense of executing more instructions. However, scaling the width of a pipeline is a hard problem, while the decoding of slightly (or highly) irregular instructions is well understood (the primary difficulty arises when determining the length of an instruction is nontrivial — x86 is a particularly bad case of this with its’ numerous prefixes).

However thanks to instruction compression and macro-op fusion we can turn this around.

C.SLLI a1, 2       # a1 ← a1 << 2
C.ADD  a0, a1      # a0 ← a0 + a1
C.LW   a0, 0(a0)   # a0 ← [a0 + 0]
C.JR   ra

Now this takes exactly the same amount of space in memory as the ARM example: four 16-bit instructions occupy 8 bytes, just like two 32-bit ARM instructions.

Okay, next let us do some Macro-op fusion!

One of the rules in RISC-V for allowing operations to be fused into one is that the destination register is the same. That is the case for the ADD and LW (load word) instructions. Thus the CPU will turn these into one instruction.

If this had been the case for SLLI as well we could have fused all three instructions into one. Thus the CPU would have seen something akin to the more complex ARM instruction:

LDR r0, [r0, r1, lsl #2]
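
To give a feel for what the decoder is doing, here is a simplified Python sketch of that fusion rule. It is not how any real RISC-V core is implemented: the Instr type, the function fuse_add_lw and the internal micro-op name lw_indexed are all invented for illustration. It only fuses an ADD followed by an LW that loads through, and overwrites, the ADD's destination register.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str        # mnemonic such as "slli", "add", "lw"
    rd: str        # destination register
    rs1: str       # first source register (base register for lw)
    rs2: str = ""  # second source register, if any
    imm: int = 0   # immediate: shift amount or load offset


def fuse_add_lw(instrs):
    """Fuse `add rd, rs1, rs2` + `lw rd, 0(rd)` into one indexed-load micro-op."""
    out = []
    k = 0
    while k < len(instrs):
        cur = instrs[k]
        nxt = instrs[k + 1] if k + 1 < len(instrs) else None
        if (nxt is not None
                and cur.op == "add" and nxt.op == "lw"
                and nxt.rd == cur.rd and nxt.rs1 == cur.rd and nxt.imm == 0):
            # Safe to fuse: the load overwrites rd, so the sum is never seen.
            out.append(Instr("lw_indexed", cur.rd, cur.rs1, cur.rs2))
            k += 2
        else:
            out.append(cur)
            k += 1
    return out


program = [
    Instr("slli", "a1", "a1", imm=2),
    Instr("add", "a0", "a0", "a1"),
    Instr("lw", "a0", "a0", imm=0),
]
fused = fuse_add_lw(program)
for instr in fused:
    print(instr)
```

The three instructions come out as two operations: the shift, followed by a single load from a0 + a1.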

Why can we not write this complex macro-operation directly in our code?

Because our ISA does not contain support for it! Remember we have limited number of bits available. Why not make the instructions longer?! Nope, that would consume too much memory, and fill up precious CPU cache faster.

However if instead we manufacture these long semi-complex instructions inside the CPU, there are no worries. The CPU never has more than a few hundred instructions floating around at any time. So wasting say 128 bits on each instruction is no big deal. There is plenty of silicon to go around for everyone.

Thus when the decoder gets a normal instruction it usually turns it into one or more micro-operations. These micro-operations are the instructions the CPU actually deals with. These can be really wide and contain lots of extra useful information. That they are called “micro” may seem ironic, given that they are wide. However “micro” refers to the fact that they do a limited number of tasks.

Goldilocks Instruction Complexity

Macro-operation fusing turns what the decoder does a bit on its head: Instead of turning one instruction into multiple micro-ops, we take multiple operations and turn them into one micro-operation.

Thus what is going on in a modern CPU may look rather odd:

  1. First it is combining two instructions into one through compression.
  2. Then it splits it into two through decompression.
  3. Then it combines them back into one operation through macro-op fusion.

Other instructions in contrast may end up getting split into multiple micro-ops rather than getting fused. Why do some get fused and others split? Is there a system to the madness?

The key thing is to end up with micro-operations of the right level of complexity:

  • Not too complex, because otherwise it cannot finish in the fixed number of clock cycles allocated for each instruction.
  • Not too simple, because then we are just wasting CPU resources. Executing two micro-ops will take twice as long as executing just one.

This all began with CISC processors. Intel began splitting their complex CISC instructions into micro-operations, so they could more easily fit into their pipelines, like RISC instructions. However in later designs they realized many of the CISC instructions were so simple that they could easily be fused into one moderately complex instruction. If you have fewer instructions to execute, you finish sooner.

Benefits Gained

Okay, this was a lot of details and maybe it is hard to get a handle on what exactly the point is. Why do all this compression and fusion? It sounds like a lot of extra work.

First of all, instruction compression is nothing like zip-compression. The word “compression” is a bit of a misnomer as it is completely straightforward to instantly decompress a compressed instruction. There is no time lost doing this. And remember for RISC-V it is simple. With a mere 400 logic gates, you can perform decompression.
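
To show just how mechanical this is, here is a sketch that decompresses a single instruction, C.ADD, into its 32-bit ADD equivalent. It handles only this one instruction and the function name is my own; a real decoder does the same kind of bit shuffling for every compressed instruction, in hardware.

```python
def decompress_c_add(halfword):
    """Expand a 16-bit C.ADD rd, rs2 into the 32-bit ADD rd, rd, rs2."""
    funct4 = (halfword >> 12) & 0xF
    rd = (halfword >> 7) & 0x1F
    rs2 = (halfword >> 2) & 0x1F
    op = halfword & 0b11
    if not (funct4 == 0b1001 and op == 0b10 and rd != 0 and rs2 != 0):
        raise ValueError("not a C.ADD encoding")
    # ADD is R-type with funct7 = 0 and funct3 = 0; rd doubles as rs1.
    return (rs2 << 20) | (rd << 15) | (rd << 7) | 0b0110011


expanded = decompress_c_add(0x9222)   # C.ADD x4, x8
print(f"{expanded:#010x}")            # the word for ADD x4, x4, x8
```

There is no dictionary and no search involved: the fields are simply moved to new positions and a few constant bits are filled in.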

The same goes for macro-operation fusion. While this seems complex, these approaches are already used in modern microprocessors. Thus the cost of this complexity has already been paid.

However unlike the ARM, MIPS and x86 designers, the RISC-V designers knew about instruction compression and macro-op fusion when they began designing their ISA. Through various tests with the first minimal instruction-set, they made two important discoveries:

  1. RISC-V programs would typically take close to or less space in memory than programs for any other CPU architecture, including x86, which was supposed to be space efficient given that it is a CISC ISA.
  2. A RISC-V CPU needed to execute fewer micro-operations than CPUs for other ISAs.

Basically by designing the base instruction-set with fusion in mind, they were able to fuse enough instructions that the CPU for any given program had to execute fewer micro-operations than the competition.

This has made the RISC-V team double down on macro-operation fusion as a core strategy for RISC-V. You can see in the RISC-V manual a lot of notes on what operations can be fused. You can also see revisions which have been made to instructions to make it easier to fuse instructions appearing in common patterns.

Keeping the ISA small means it is easier to learn for a student. And it means it is easier for a student learning about CPU architecture to actually construct a CPU which runs RISC-V instructions.

RISC-V has a small core instruction set which everybody must implement. However all other instructions exist as part of extensions. Compressed instructions are simply an optional extension. Thus for simple designs they can be omitted.

Macro-op fusion is simply an optimization. It does not change the overall behavior and hence you are not required to implement it in your particular RISC-V processor.

For ARM and x86 in contrast a lot of the complexity is not optional. The whole instruction-set and all the complex instructions have to be implemented even if you try to create a minimal simple CPU core.

RISC-V Design Strategy

RISC-V has taken what we know about modern CPUs today and let that inform their choices in designing an ISA. For instance we know:

  • CPU cores have advanced branch predictors today. Their predictions are correct over 90% of the time.
  • CPU cores are superscalar, meaning they execute multiple instructions in parallel.
  • They use Out-of-Order execution to be superscalar.
  • They are pipelined.

This means that things such as conditional execution, as supported by ARM, are no longer needed. Supporting that on ARM eats up bits in the instruction format. RISC-V can save those bits.

The original purpose of conditional execution was to avoid branches, because they were bad for pipelines. For a CPU to run fast it would typically prefetch the next instructions so it can pick the next one quickly once the previous one is done with its first stage.

But with a conditional branch, you don’t know where the next instruction will be when you begin filling up your pipeline. However a superscalar CPU could simply execute both branches in parallel.

It is also why RISC-V doesn’t have status registers. They create dependencies between instructions. The more independent each instruction is, the easier it is to run it in parallel with another instruction.

The RISC-V strategy is basically this: how can we make the ISA as simple as possible, and a minimal implementation of a RISC-V CPU as simple as possible, without making design decisions which make it impossible to build a high performance CPU?
