User:CTho/Out of Order Execution

This is copied from my forum post here (page 3, I think) on 9/18/2007 at 7:34pm (not sure what timezone, probably eastern or central). I edited it slightly and converted it to mediawiki markup so it would look right here.

This writeup gives an extremely simplified description of out of order execution. I wrote it to answer one specific question (I was debating what a "uop" is in AMD vs Intel terminology with someone, and some laypeople wanted to understand our argument), so it isn't very general. I arbitrarily ignored a very large number of very significant things in order to keep this understandable to people with minimal computer architecture background. If you understand the "classic 5 stage pipeline", you may end up either confused or thinking I'm stupid ;). I revised parts of this a bunch of times, so hopefully it still "flows" reasonably well.

Motivation for out of order execution

A CPU executes an instruction in a series of steps:

  1. Fetch the instruction from memory.
  2. Decode the instruction
  3. Get the inputs to the instruction.
  4. Execute the instruction in one of the execution units (let's say we have an ALU, a multiplier, a divider, a memory-read unit, and a memory-write unit)
  5. Write the result back to the register file (or memory or the screen)

If you build your machine just like this one, you can have one instruction in each of those stages and have a throughput of 1 instruction each cycle. Note that this means only 1 of the 5 execution units will ever be in use at a time. This design looks something like a line of people handing a package down the line, and in the middle of the line there's a spot where there are 5 people next to each other and you can hand the package to one of them. The multiplier and divider are slow, and they hold on to packages for a while before passing them on, and while they're holding something, everyone else has to wait (packages stay in order).

X = person, | = directions a package can be handed down to.

  X    fetch
  |
  X    decode
 /|\
X X X  execute (really 5 instead of just 3 here)
 \|/
  X    write back

Now, you'll often find that one instruction takes a while to execute and holds up the rest of the machine, which is bad for performance. For example, if you had a multiply instruction, there would be about 3 cycles that the machine spends waiting for the multiply to finish while everything else is stalled. To get around this, you can use "out of order execution".
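To put rough numbers on that stall, here is a tiny Python sketch. The latencies (1 cycle for an add, 4 for a multiply) and the function name are assumptions made up for illustration, not figures for any real CPU.

```python
# Hypothetical latencies in cycles. The 4-cycle multiply is an assumption,
# chosen so a multiply costs 3 more cycles than an add.
LATENCY = {"add": 1, "mul": 4}

def in_order_cycles(program):
    """Cycles spent executing when instructions run strictly in order,
    one at a time, and everything behind a slow instruction waits."""
    return sum(LATENCY[op] for op in program)

program = ["add", "mul", "add", "add"]
print(in_order_cycles(program))                  # 7
print(in_order_cycles(program) - len(program))   # 3 cycles lost to the multiply
```

With one-cycle instructions only, the same four instructions would take four cycles, so the difference is the time everything else spent stalled.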

Buffers required for out of order execution

That changes the story a little:

  1. Fetch the instruction from memory.
  2. Decode the instruction
  3. Send the instruction to a queue where it waits until the inputs become available
  4. Execute the instruction in one of the execution units (let's say we have an ALU, a multiplier, a divider, a memory-read unit, and a memory-write unit)
  5. Wait in a queue until the instruction is the oldest instruction
  6. Write the result back to the register file (or memory or the screen)

A quick note: an input might not be available if it came from something slow. If your input came from an add, it'll be ready the next cycle, but if it came from a multiply you'll have to wait around for a while.

Now, instead of having a line where each person holds one package at a time and passes it on with every tick of a clock, there's a buffer part of the way down. The front-end of the CPU (fetch and decode) crunches through instructions as fast as it can, and tosses them into this buffer. The back-end (execute) takes instructions as they become ready, executes them, and puts them into a second buffer. The instructions finish ("retire") when they leave the second buffer.
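The first buffer's "wait until the inputs are available" rule can be sketched in Python. This is a toy model under loud assumptions: every instruction is already sitting in the queue at cycle 0, there are always enough execution units, each register is written only once, and the names (`Instr`, `schedule`) and latencies are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dest: str        # register this instruction writes
    srcs: tuple      # registers it reads
    latency: int     # cycles in the execution unit (assumed values)

def schedule(instrs):
    """Return {name: (start_cycle, finish_cycle)} for a toy out-of-order core."""
    produced = {i.dest for i in instrs}   # registers computed by some instruction
    ready_at = {}                         # register -> cycle its value is available
    waiting = list(instrs)
    times = {}
    cycle = 0
    while waiting:
        still_waiting = []
        for ins in waiting:
            # Inputs nobody in the queue produces are assumed ready from the start.
            inputs_ready = all(
                s not in produced or ready_at.get(s, float("inf")) <= cycle
                for s in ins.srcs
            )
            if inputs_ready:
                times[ins.name] = (cycle, cycle + ins.latency)
                ready_at[ins.dest] = cycle + ins.latency
            else:
                still_waiting.append(ins)
        waiting = still_waiting
        cycle += 1
    return times

program = [
    Instr("mul", "c", ("a", "b"), 4),
    Instr("add1", "d", ("c",), 1),   # needs the multiply's result
    Instr("add2", "e", ("z",), 1),   # independent of the multiply
]
print(schedule(program))
# {'mul': (0, 4), 'add2': (0, 1), 'add1': (4, 5)}
```

The independent add starts right away even though it comes later in the program, while the dependent add sits in the queue until the multiply is done.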

You might wonder why we have to have step 5 - why do instructions have to wait? Well, the problem is that program flow doesn't always go smoothly. What happens if a computation produces a number too big to store? Let's consider this example:

  1. a = 999
  2. b = 999
  3. c = 1
  4. z = 0
  5. c = a * b
  6. d = z + 1

First interesting cycle: z becomes 0
Next cycle: the multiplication starts
Next cycle: the multiplication is still going. d becomes 1
Next cycle: the multiplication is still going. We discover that we can't hold a number that big.
Next cycle: we dump the values of all of the variables so the programmer can figure out what happened.

At this point, we have a problem. The programmer will want to see the values of all the variables to figure out what his mistake was, and he'll see a=999, b=999, c=garbage, z=0, and d=1. He'll be very confused, because even though the program crashed on the 5th instruction, the 6th instruction already executed!

To solve this, we use that second queue. The real registers aren't written out of order as their values are calculated; they're written in order. By waiting to do writeback until the instruction is the oldest in the machine, we can ensure that the "program crashed" variable dump can't show the updated value of d until the multiplication has actually finished. This queue is called the re-order buffer (ROB) because it puts the instructions back into the right order.
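Here is a minimal Python sketch of that in-order writeback, using the example above. The tuple layout and the function name `commit_in_order` are assumptions for illustration. A real ROB tracks far more state than this.

```python
def commit_in_order(registers, rob):
    """rob: list of (dest, value, faulted) tuples in program order, all of
    which have already executed. Results are copied into `registers`
    oldest-first, stopping at the first fault, so nothing younger than the
    faulting instruction ever becomes visible."""
    for dest, value, faulted in rob:
        if faulted:
            return dest          # destination of the instruction that faulted
        registers[dest] = value
    return None

# State after instructions 1-3 of the example have retired.
regs = {"a": 999, "b": 999, "c": 1}
# Instructions 4-6: z = 0, c = a * b (overflows), d = z + 1 (already executed).
rob = [("z", 0, False), ("c", None, True), ("d", 1, False)]

fault = commit_in_order(regs, rob)
print(fault, regs)   # c {'a': 999, 'b': 999, 'c': 1, 'z': 0}
```

Even though d = z + 1 finished executing early, d never shows up in the dump, which is what the programmer expects.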

Note that in reality, the mistakes that happen are caused by the branch predictor 99.99...% of the time; exceptions like divide by zero, invalid memory access, etc. happen very infrequently.

I made some serious simplifications here which may have confused you (whoever's reading this post ;)). If you had no background in this stuff, you probably got the message, but if you have just a little background you might have figured out some things that made the examples confusing. I'm going to ignore that for now. The message was, "there's a queue at the front that holds instructions until they're ready, and there's a queue at the back that holds them so they finish in order".

Buffer sizes, and uops

The sizes of the two queues are very important. Let's say a division takes 100 cycles. If the queues hold 10 entries, what happens when you encounter a division? Well, the re-order buffer can only track 10 instructions and they have to be handled in order, so once the division instruction is the oldest instruction in the machine, it probably hasn't finished yet. Over the next few cycles, the other execution units finish some more instructions and fill up the remaining 9 ROB slots. At this point, the machine has to stall until the division finishes, because if it executes any more instructions it won't know how to put them back in order. The bigger the queues are, the longer the machine can stay busy when some units are encountering long delays.
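A rough Python sketch of the division example follows. It assumes one new instruction enters the ROB per cycle, the division sits at the head of the ROB from cycle 0, and nothing can retire until it finishes. The function name and the dispatch rate are assumptions for illustration.

```python
def stall_cycles(rob_size, long_latency, dispatch_per_cycle=1):
    """Cycles in which no new instruction can enter the ROB because it is
    full behind a long-latency instruction at its head."""
    occupied = 1          # the long-latency instruction itself
    stalled = 0
    for _cycle in range(1, long_latency):
        if occupied < rob_size:
            occupied = min(rob_size, occupied + dispatch_per_cycle)
        else:
            stalled += 1
    return stalled

print(stall_cycles(10, 100))    # 90
print(stall_cycles(50, 100))    # 50
print(stall_cycles(100, 100))   # 0
```

With 10 entries the machine does useful work for only the first handful of cycles of the division. With 100 entries it never has to stop.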

The size of the first queue comes into play when reading instructions from memory occasionally takes extra long (a cache miss). If the front-end can get far enough ahead of the execution units, then even when it gets held up for a few cycles because of a cache miss, the execution units can keep crunching instructions that are waiting in that buffer. If the buffer is small, the execution units will quickly run out of work to do.

There's another constraint besides just size: how many entries you can add to the queues each cycle ("dispatch"), and how many you can remove from the second one each cycle ("retire"). There's a lot of book-keeping type work involved, and the complexity grows drastically as you increase these numbers. If you go back to that slow-division example above, real CPUs can only empty the ROB at a rate of about 3 entries per cycle, even though all of the entries are theoretically ready to retire as soon as the division finishes.
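The retire-width limit is just a ceiling division. Here it is as a small Python sketch (the function name is made up, and the width of 3 is the ballpark figure from the paragraph above):

```python
def cycles_to_drain(entries, retire_width):
    """Cycles needed to retire `entries` finished instructions when at most
    `retire_width` of them can retire per cycle (ceiling division)."""
    return (entries + retire_width - 1) // retire_width

print(cycles_to_drain(10, 3))    # 4
print(cycles_to_drain(100, 3))   # 34
```

So even once the division is done, a full 10-entry ROB needs 4 cycles to empty at 3 entries per cycle.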

Now, many real x86 instructions aren't as simple as add, multiply, read, write. A single x86 instruction, like "push", can do a combination of things ("push" writes a value to memory, and also decrements a number, the stack pointer).

Older Intel CPUs broke that "push" into 2 "microoperations" or "µops" or "uops" that handle the 2 operations, and each uop takes its own slot in the queues. If you had a 100-entry ROB, you could actually only hold 50 "push" instructions. This is pretty intuitive.

AMD's architecture, on the other hand, tracks the "push" in one slot, but each queue entry has a little extra information that indicates what operations are required by the entry (an entry holding a "push" instruction indicates that both a subtraction and a memory write are required). Both the ALU and the memory-write unit know that they'll have to do an operation for a "push". At the cost of this added complexity, the 100-entry ROB can hold 100 "push" instructions (making it effectively twice as big!). One annoying thing about this style is that the term "uop" is now ambiguous: is a uop something that takes one entry in the ROB, or something that an execution unit views as one piece of work? When someone says K8 is 3-wide, it's 3 ROB-entries wide, but more than 3 execution-unit pieces of work wide.
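The difference in queue usage between the two styles can be sketched in Python. This is a toy count only: the table and function names are assumptions, and the only figure taken from the text is that "push" needs two execution-unit operations.

```python
# Execution-unit operations per instruction ("push": 2 is from the text above;
# "add": 1 is an assumed simple case).
OPS_PER_INSTRUCTION = {"push": 2, "add": 1}

def rob_entries(program, split_into_uops):
    """ROB slots used by `program`.

    split_into_uops=True  -> one slot per execution-unit operation
    split_into_uops=False -> one slot per instruction, with the slot noting
                             which operations are still required
    """
    if split_into_uops:
        return sum(OPS_PER_INSTRUCTION[i] for i in program)
    return len(program)

pushes = ["push"] * 50
print(rob_entries(pushes, split_into_uops=True))    # 100
print(rob_entries(pushes, split_into_uops=False))   # 50
```

Fifty pushes fill a 100-entry ROB in the first style but only half of it in the second, which is the "effectively twice as big" point.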

With Pentium M, Intel picked up some of the benefit of the AMD-style architecture with "uop fusion". uop fusion allows one queue entry to represent more than one execution-unit piece of work. Core 2's "macro-op fusion" improves on this further, so there are some things that take less queue space in Core 2 and some things that take less space in K8.

The argument I was having that this writeup hopefully clarifies

We were arguing about exactly how wide K8 (Athlon 64, Opteron, some Semprons) is. He thought that some parts of the pipeline were limited to 3 execution-unit pieces of work wide, most likely because a lot of people use "uop" inconsistently when talking about AMD chips. I was arguing that the pipe was generally 3 macro-operations or "fused uops" wide and a potentially large number of uops wide.