For the last couple of years, the top x86 manufacturers have been locked in a serious battle for performance dominance in the consumer PC market. In this article, we discuss the technical background of the Intel Pentium 4 processor: how it achieves its performance levels, and how this will affect future battles for x86 supremacy. We also discuss how varying code types affected the Pentium 4's design points.
Modern x86 microprocessors have been increasing the length of their pipelines since the 486. The reason is not that a longer pipeline increases performance all by itself; taken alone, it in fact decreases performance (discussed in the next section). Pipelining actually decreases the amount of work done by each stage in a cycle, because the workload is spread out over more stages. However, pipelining allows greater throughput and higher clock speeds (because each stage is less complex, each one can run faster; think of it as an assembly line). The Pentium 4 stretches the work over a staggering 20 stages! All this in the never-ending quest for more MHz.
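The trade-off can be made concrete with a very rough model. The sketch below assumes a fixed amount of logic delay split evenly across the stages, plus a fixed per-stage latch overhead. The function names and every number in it are hypothetical and chosen only for illustration; they are not Pentium 4 figures.

```python
def cycle_time_ps(total_logic_ps, stages, latch_overhead_ps):
    """Cycle time when the logic delay is split evenly over `stages` (assumed model)."""
    return total_logic_ps / stages + latch_overhead_ps


def clock_mhz(total_logic_ps, stages, latch_overhead_ps):
    """Clock frequency in MHz; a 1 MHz clock has a period of 1,000,000 ps."""
    return 1_000_000 / cycle_time_ps(total_logic_ps, stages, latch_overhead_ps)


def instruction_latency_ps(total_logic_ps, stages, latch_overhead_ps):
    """Time for one instruction to travel through every stage."""
    return stages * cycle_time_ps(total_logic_ps, stages, latch_overhead_ps)


if __name__ == "__main__":
    # Hypothetical values: 10,000 ps of logic, 100 ps of overhead per stage.
    for stages in (5, 10, 20):
        print(stages,
              clock_mhz(10_000, stages, 100),
              instruction_latency_ps(10_000, stages, 100))
```

Under these assumptions, doubling the stage count raises the clock considerably, while each individual instruction takes slightly longer to get through the pipeline because the per-stage overhead is paid more often.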
It will be explained later why the number of stages can sometimes be as many as 28.
The greater the number of stages, the more stages have to be "flushed" (cleared out) if there is a branch mispredict. When a processor encounters a conditional statement (such as an if/else statement), rather than simply waiting for the answer to the condition, modern processors use what's called "Branch Prediction": the processor guesses which way the branch will go and starts working on the instructions along that path. If the processor guesses right, it will have saved a lot of time (the time it takes for the processor to compute the condition). However, if the processor guesses wrong, it has to flush all the work it has already done on the wrong path and start over. In a worst-case scenario, the penalty for a mispredict is 19 cycles! This is greater than the Pentium III's, but this should be no surprise, as the Pentium 4 has a longer pipeline.
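To get a feel for what a flush costs on average, here is a minimal sketch. It assumes every mispredict pays the full worst-case penalty, which overstates the real cost, and the branch frequency of one instruction in five is an illustrative assumption rather than a measured figure. The function names are made up for this example.

```python
def flush_cycles_per_branch(accuracy, penalty_cycles):
    """Average cycles lost to flushes per branch, assuming a fixed penalty per miss."""
    miss_rate = 1.0 - accuracy
    return miss_rate * penalty_cycles


def flush_cycles_per_instruction(accuracy, penalty_cycles, branch_fraction):
    """Average cycles lost per instruction, given the fraction that are branches."""
    return branch_fraction * flush_cycles_per_branch(accuracy, penalty_cycles)


if __name__ == "__main__":
    # 19 cycles is the worst-case penalty quoted in the text; the accuracies
    # and the 0.2 branch fraction are hypothetical inputs.
    for accuracy in (0.85, 0.90, 0.95):
        print(accuracy,
              flush_cycles_per_branch(accuracy, 19),
              flush_cycles_per_instruction(accuracy, 19, 0.2))
```

The cost scales with both the miss rate and the penalty, which is why a longer pipeline puts more pressure on the predictor.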
On the Pentium III, Intel had a Branch Prediction Unit with an average accuracy of about 90% when predicting branches. This seems pretty good at first glance, no? I'd love to simply be able to guess right (with prior knowledge about how my guesses turned out, of course) 90% of the time on true or false exams! However, in processors with long pipelines, 90% simply isn't good enough. Intel has stated that approximately 30% of real-world performance is thrown out the window by the times the processor guesses wrong. Given that the penalty for the Pentium 4 is potentially longer than that of the Pentium III, it should come as no surprise that Intel opted to improve the accuracy of its Branch Prediction Unit! To that end, Intel has increased the number of entries in the history table eight-fold over the Pentium III (this happens to be ½ the size of the history table of the K6-x family, though AMD has stated that it was overkill for a chip with a short pipeline)!
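Why would more history-table entries help? The sketch below uses a textbook scheme, a table of 2-bit saturating counters indexed by branch address, to show one reason: in a small table, unrelated branches can land on the same entry and spoil each other's history. This is not a description of Intel's actual predictor, and the table sizes, addresses, and class name are hypothetical.

```python
class BranchHistoryTable:
    """Textbook 2-bit saturating-counter predictor (illustrative, not Intel's design)."""

    def __init__(self, entries):
        self.entries = entries
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = [1] * entries

    def _index(self, branch_address):
        # Simplest possible mapping: the low bits of the address pick the entry.
        return branch_address % self.entries

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(table, trace):
    """Fraction of branches in `trace` that `table` predicts correctly."""
    correct = 0
    for address, taken in trace:
        if table.predict(address) == taken:
            correct += 1
        table.update(address, taken)
    return correct / len(trace)


# Two hypothetical branches: one always taken, one never taken, interleaved.
trace = [(0x10, True), (0x18, False)] * 100

if __name__ == "__main__":
    print(accuracy(BranchHistoryTable(8), trace))   # both branches share entry 0
    print(accuracy(BranchHistoryTable(64), trace))  # each branch has its own entry
```

With 8 entries the two branches collide and keep overwriting each other's counter; with 64 entries each branch trains its own counter and the predictor settles quickly. Real predictors are more elaborate, but the aliasing effect is the same in spirit.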
Intel has claimed that it has reduced the miss rate of the Branch Prediction Unit on the Pentium 4 by 30% relative to that of the Pentium III. Given that the Pentium III had an average prediction rate of about 90%, and therefore a miss rate of about 10%, this puts the Pentium 4's miss rate at about 7% and its branch prediction rate somewhere around 93% (because 30% of the 10% miss rate is 3%). Missing a branch is so costly on a processor with such a long pipeline that it was quite necessary to avoid guessing wrong as much as possible.
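The conversion from a relative miss-rate reduction to a new prediction rate is easy to get wrong by hand, so here is a small sketch of the calculation. It takes the figures quoted above as inputs; the function name is made up for this example.

```python
def improved_accuracy(old_accuracy, miss_reduction):
    """New prediction rate after cutting the miss rate by a relative fraction."""
    old_miss_rate = 1.0 - old_accuracy
    new_miss_rate = old_miss_rate * (1.0 - miss_reduction)
    return 1.0 - new_miss_rate


if __name__ == "__main__":
    # Inputs from the text: about 90% accuracy, miss rate reduced by 30%.
    print(improved_accuracy(0.90, 0.30))
```

Note that the reduction applies to the misses, not to the overall accuracy, so even a large relative improvement moves the headline prediction rate by only a few points.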
In the next section we'll see how the Pentium 4's Trace Cache helps to alleviate some of the issues with having long pipelines.