How does a CPU use the registers?
This is the first part of what I hope will be a long series of short articles explaining some of the basics of the internal design of CPUs.
In this first part we’ll look at the registers, or more specifically, how the registers are used.

Registers?
A CPU usually contains a register file. A register file consists of a number of registers plus status bits that give information about the registers, for instance whether an overflow occurred. Registers are needed because a CPU cannot work directly on data stored in memory: if the CPU wants to work with data, it has to copy that data into the registers and copy the results back afterwards. When a CPU executes an instruction, this usually takes 5 steps:

Instruction Fetch: the instruction that needs to be executed is fetched from memory, or more often from the (instruction) cache.
Instruction Decode: the instruction is translated into the internal commands that are specific to the CPU.
Instruction Execute: the instruction is executed.
Memory Access: if needed (only for load, store and branch instructions), data is read from or written to memory or the cache.
Write Back: if needed (only for instructions whose result must be stored in a register), the result of the execute step is written into a register.
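To make these steps a little more concrete, here is a minimal sketch in C of how one could model them for a toy CPU executing a single made-up ADD instruction. The instruction encoding, register count and memory size are invented purely for illustration and do not correspond to any real architecture.

#include <stdint.h>
#include <stdio.h>

#define NUM_REGS 8

uint32_t regs[NUM_REGS];   /* the register file           */
uint32_t memory[256];      /* main memory, word addressed */
uint32_t pc = 0;           /* program counter             */

void step(void)
{
    /* 1. Instruction Fetch: read the instruction word at the PC. */
    uint32_t instr = memory[pc++];

    /* 2. Instruction Decode: split the word into opcode and register fields. */
    uint32_t opcode = (instr >> 24) & 0xFF;
    uint32_t rd     = (instr >> 16) & 0xFF;
    uint32_t rs1    = (instr >>  8) & 0xFF;
    uint32_t rs2    =  instr        & 0xFF;

    /* 3. Instruction Execute: operate on values from the register file. */
    uint32_t result = 0;
    if (opcode == 0x01)                   /* 0x01 = ADD (made up) */
        result = regs[rs1] + regs[rs2];

    /* 4. Memory Access: nothing to do for ADD; a load or store would
          read or write memory[] in this step.                          */

    /* 5. Write Back: store the result into the destination register.   */
    regs[rd] = result;
}

int main(void)
{
    regs[2] = 20;
    regs[3] = 22;
    memory[0] = (0x01u << 24) | (1u << 16) | (2u << 8) | 3u;  /* ADD r1, r2, r3 */
    step();
    printf("r1 = %u\n", (unsigned)regs[1]);                   /* prints 42      */
    return 0;
}

Note how the ADD instruction never touches data memory: it only reads and writes the register file, which is exactly why the data has to be copied into the registers first.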
The question now is: how do the registers fit in here? In general there are four ways to use the registers. I'll illustrate this with a simple calculation:
C = A + B
This calculation can be done with an accumulator, stack, register-memory or register-register style of register usage.

Accumulator

With an accumulator the calculation requires three steps:

Load A
Add B
Store C

Only one register is used: the accumulator. First, the memory value A is placed into this register. Then the memory value B is added to the contents of the register, so the register now holds the result of A+B. Finally, the result is written to memory location C. This is the simplest way of using a register, but of course also a very limited one, since complex operations that need several different values are much harder to code. Another problem is that every single step accesses memory. Memory is much slower than register space, and therefore the CPU clock is limited by the memory speed. For that reason this way of register usage is often only used in microcontrollers.
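As a small illustration, here is a sketch in C of an accumulator-style machine executing the Load A / Add B / Store C sequence. The opcode names and the memory layout are made up for this example; the point to notice is that every single instruction has to access memory.

#include <stdio.h>

int memory[16];            /* data memory; A, B and C live here */
int acc;                   /* the single accumulator register   */

enum { LOAD, ADD, STORE };

void run(int op, int addr)
{
    switch (op) {
    case LOAD:  acc = memory[addr];        break;   /* memory -> accumulator */
    case ADD:   acc = acc + memory[addr];  break;   /* add memory to acc     */
    case STORE: memory[addr] = acc;        break;   /* accumulator -> memory */
    }
}

int main(void)
{
    enum { A = 0, B = 1, C = 2 };    /* addresses of A, B and C */
    memory[A] = 20;
    memory[B] = 22;

    run(LOAD,  A);    /* Load A  */
    run(ADD,   B);    /* Add B   */
    run(STORE, C);    /* Store C */

    printf("C = %d\n", memory[C]);   /* prints 42 */
    return 0;
}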
Stack

A more efficient way of using registers is stack-based:

Push A
Push B
Add
Pop C

The easiest way to explain this is with a picture of the register stack, here one with three registers. After the first instruction, Push A, the value of A is placed on top of the stack. After the second instruction, Push B, the value A is pushed one position down into the stack and the value B is placed on top. The third instruction, Add, takes the two topmost values in the stack, A and B, adds them, and replaces them with the single result A+B, which now sits at the topmost position. The fourth and last instruction, Pop C, pops this A+B value off the stack and writes the result to memory location C.

As one can see, this is an improvement over the accumulator: more than one register is available for calculation, so less memory access is required. This way of register usage is used, for instance, in x87-based FPUs, better known as the floating-point units of PC processors such as the Intel Pentium III and AMD Athlon. Modern Java processors such as the picoJava I also use stack-based registers. Of course the downside is obvious: only the topmost registers can be used, which limits the freedom of register usage. This is also one of the reasons why even the powerful FPUs of the Pentium III and AMD Athlon are often no match for those of RISC CPUs like the Alpha 21164 or MIPS R12000.
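For comparison, here is a similar sketch in C of a stack-based machine executing Push A / Push B / Add / Pop C. The three-entry register stack mirrors the example above; the memory layout is again invented for illustration.

#include <stdio.h>

int memory[16];                  /* data memory; A, B and C live here         */
int stack[3];                    /* the register stack, stack[top] is the top */
int top = -1;

void push(int addr) { stack[++top] = memory[addr]; }   /* memory -> top of stack */
void pop (int addr) { memory[addr] = stack[top--]; }   /* top of stack -> memory */

/* Add replaces the two topmost stack entries with their sum. */
void add(void)
{
    int b = stack[top--];        /* pop B                      */
    int a = stack[top];          /* A is now the topmost entry */
    stack[top] = a + b;          /* overwrite it with A + B    */
}

int main(void)
{
    enum { A = 0, B = 1, C = 2 };    /* addresses of A, B and C */
    memory[A] = 20;
    memory[B] = 22;

    push(A);     /* Push A */
    push(B);     /* Push B */
    add();       /* Add    */
    pop(C);      /* Pop C  */

    printf("C = %d\n", memory[C]);   /* prints 42 */
    return 0;
}

Only the Push and Pop instructions touch memory; the Add works entirely inside the register stack, which is why fewer memory accesses are needed than in the accumulator case.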
Register-Memory

A more efficient way of using registers is the register-memory style:

Load reg1, A
Add reg1, B
Store C, reg1

In this simple example only one register is used, but it is not difficult to see that there is more freedom here than in the stack-based case: each register in the register file can be addressed directly, not just the topmost one (a small sketch of this style follows the historical note below). This technique is often used in CISC-based CPUs, and the reason lies in the past. The glory days of CISC ran from roughly 1960 to 1975, a period in which memory was extremely expensive, so systems were equipped with relatively slow memory. To understand why this matters we must look back at the 5 steps required for executing an instruction. Step 1, the instruction fetch, is determined by the speed of the memory (including the cache memory). All following steps (except perhaps a store back to memory) are dominated by the CPU speed. CPUs of that time were only partially pipelined, but to make things easier to understand we will assume they were fully pipelined. We would then get the situation shown in Figure 2.

Figure 2: Fetch is the dominating step.

Here we see a few cycles. As the figure shows, the decode step takes less time than the fetch. However, because a new fetch of another instruction already takes place during the decode step - remember the chip is assumed to be pipelined - the time saved cannot be used. So with those chips, getting maximum performance was simply a matter of doing as much work as possible in as few instructions as possible. No matter how complex an instruction is, it will still take less time than a fetch. This explains why CISC has many complex instructions and why only a few instructions are needed for a task. Of course the downside is that increasing the clock speed of such a complex chip is difficult, but that was no issue since the (cache) memory would not have been able to keep up anyway. This changed around 1975, when cheap and fast memory arrived.
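Coming back to the register-memory sequence shown above, here is a minimal sketch of that style in C: one operand of each instruction is a register, the other a memory address. The register count, the addresses and the helper names are again invented for illustration.

#include <stdio.h>

int memory[16];     /* data memory; A, B and C live here */
int regs[4];        /* a small register file             */

void load (int rd, int addr) { regs[rd] = memory[addr]; }             /* memory -> register */
void add_m(int rd, int addr) { regs[rd] = regs[rd] + memory[addr]; }  /* register += memory */
void store(int addr, int rs) { memory[addr] = regs[rs]; }             /* register -> memory */

int main(void)
{
    enum { A = 0, B = 1, C = 2 };    /* addresses of A, B and C */
    memory[A] = 20;
    memory[B] = 22;

    load (1, A);     /* Load reg1, A  */
    add_m(1, B);     /* Add reg1, B   */
    store(C, 1);     /* Store C, reg1 */

    printf("C = %d\n", memory[C]);   /* prints 42 */
    return 0;
}

Any of the four registers could have been used instead of reg1, which is exactly the freedom that the stack-based approach lacks.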
Register-Register

Once fast memory chips were available, things changed. It no longer mattered how many cycles a task needed: if you pipeline a chip, the effective throughput will be one instruction per cycle anyway. Now it was simply a matter of getting as many MHz out of the chip as possible, and as a result register usage changed:

Load reg1, A
Load reg2, B
Add reg1, reg2
Store C, reg1

More instructions are needed than in the previous case, but each instruction performs less work. That makes it easier to increase the clock speed of the chip, which results in higher performance. This is a policy very common in RISC designs. One downside, however, is that more registers are needed. The chip design is also often kept simple in order to prevent awkward instructions, which are difficult to optimise, from becoming bottlenecks. RISC chips therefore often have, besides many registers, few instructions and deep pipelines.

Final Words

These are the four basic ways to use registers. Of course real-world CPUs will not always fit into a single category and often use combinations. For instance, the AMD Athlon is a CISC chip on the outside but RISC-like on the inside, while its FPU design is still stack-based. Still, I hope this article gave you some basic knowledge of how registers are used in a chip.