
Re-Printed From SLCentral

Intel Pentium 4: In-Depth Technical Overview
Author: Paul Mazzucco
Date Posted: August 3rd, 2001
URL: http://www.slcentral.com/articles/01/8/p4

Introduction

For the last couple of years, there has been a serious ongoing battle between the top x86 manufacturers for performance dominance in the consumer PC market. In this article, we discuss the technical background of the Intel Pentium 4 processor, how it achieves its performance levels, and how this will affect future battles for x86 supremacy. We also discuss how varying code types affected the Pentium 4's design points.

Hyper Pipelined

Modern x86 microprocessors have been increasing the length of their pipelines since the 486. The reason is not that a longer pipeline increases performance all by itself - per clock, it in fact decreases performance (discussed in the next section). Pipelining decreases the amount of work done in each cycle, because the workload is spread out over more stages. However, pipelines allow greater throughput and greater clock speeds: because each stage is less complex, each one can run faster - think of it as an assembly line. The Pentium 4 stretches the work over a staggering 20 stages, all in the never-ending quest for more MHz.
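As a back-of-the-envelope illustration (our own toy model, not Intel's numbers), the C sketch below shows why depth doesn't hurt ideal throughput: a k-stage pipeline retires N instructions in k + (N - 1) cycles, so the per-instruction cost approaches one cycle regardless of depth - the win comes from the faster clock that simpler stages permit.

    /* Toy model of ideal pipeline timing: a k-stage pipeline finishes
     * N instructions in k + (N - 1) cycles (fill the pipe once, then
     * retire one instruction per cycle).  Deeper pipes do less work
     * per stage, which is what lets the clock run faster. */
    #include <stdio.h>

    int main(void) {
        const long n = 1000000;          /* instructions, no stalls assumed */
        int depths[] = { 5, 10, 20 };    /* 486-ish, P6-ish, Pentium 4 */
        for (int i = 0; i < 3; i++) {
            int k = depths[i];
            long cycles = k + (n - 1);
            printf("%2d stages: %ld cycles (%.6f cycles/instruction)\n",
                   k, cycles, (double)cycles / n);
        }
        return 0;
    }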

It will be explained later why the number of stages can sometimes be as many as 28.

Branch Prediction

The greater the number of stages, the more stages have to be "flushed" (cleared out) on a branch mispredict. When a processor encounters a conditional statement (such as an if/else statement), rather than simply waiting for the answer to the condition, modern processors use what's called "Branch Prediction." If the processor guesses right, it saves a lot of time (the time it takes to compute the condition). However, if the processor guesses wrong, it has to flush all the work it has already done and start over. In a worst-case scenario, the penalty for a mispredict is 19 cycles! This is greater than the Pentium III's, but that should be no surprise, as the Pentium 4 has a longer pipeline.

On the Pentium III, Intel had a Branch Prediction Unit with an average accuracy of about 90% when predicting branches. This seems pretty good at first glance, no? I'd love to be able to guess right (with prior knowledge about how my guesses turned out, of course) 90% of the time on true-or-false exams! However, in processors with long pipelines, 90% simply isn't good enough. Intel has stated that approximately 30% of real-world performance is thrown out the window due to times when the processor guesses wrong. Given that the penalty for the Pentium 4 is potentially longer than that of the Pentium III, it should come as no surprise that Intel opted to improve the efficiency of its Branch Prediction Unit. As such, Intel has increased the number of entries in the branch history table eight-fold over the Pentium III (this happens to be the size of the history table of the K6-x family, though AMD has stated that it was overkill for a chip with a short pipeline)!

Intel has claimed that they have reduced the miss rate of the Branch Prediction Unit on the Pentium 4 by 30% over that of the Pentium III. Given that the Pentium III had an average prediction rate of about 90%, this puts the Pentium 4's branch prediction rate somewhere around 93% (cutting the roughly 10% miss rate by 30% removes about 3 percentage points). Missing a branch is so costly on a processor with such a long pipeline that it was quite necessary to avoid guessing wrong as much as possible.
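To put rough numbers on this, here is a minimal C sketch of how prediction accuracy feeds into wasted cycles. The one-branch-per-five-instructions figure is our own assumption for illustration only; the 19-cycle flush is the article's worst case.

    /* Rough cost model for branch mispredicts. */
    #include <stdio.h>

    static double penalty_cpi(double branch_freq, double accuracy, int flush) {
        return branch_freq * (1.0 - accuracy) * flush;
    }

    int main(void) {
        const double branches = 1.0 / 5.0;  /* assumed branch frequency */
        const int flush = 19;               /* worst-case mispredict penalty */
        printf("P3-style 90%% accuracy: %.3f extra cycles/instruction\n",
               penalty_cpi(branches, 0.90, flush));
        printf("P4-style 93%% accuracy: %.3f extra cycles/instruction\n",
               penalty_cpi(branches, 0.93, flush));
        return 0;
    }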

In the next section we'll see how the Pentium 4's Trace Cache helps to alleviate some of the issues with having long pipelines.

The P4's Caches

Since the 486, no x86 Intel processor has reduced the amount of L1 cache it implements, and each architecture has proven itself faster than its predecessor overall. Following that reasoning, it is easy to see why people often equate more L1 cache with better absolute performance. This time around, however, Intel implemented an L1 data cache half the size of its predecessor's, and it even chose to forgo the standard L1 instruction cache (not a bad thing, as we'll see)!

The L1 cache of the Pentium 4 is only 8KB, but it has very low access latencies: data can be fetched from the L1 Dcache in only 2 cycles! This matters for two reasons: first, the vast majority of computations are done from the L1 cache; second, even what looks like a tiny L1 cache has a high hit rate, in part due to its 4-way associativity.

In the past, going to the L2 cache for anything was murder on performance, so L1 caches were made larger over time. However, the Pentium 4 was designed with on-die L2 cache from the get-go (unlike the Pentium III Coppermine, which has origins in the Pentium III Katmai, itself an extension of the Pentium II, whose ancestor is the Pentium Pro - none of which had on-die L2 cache). Because going to L2 no longer hurts so much, the designers could let the L2 cache focus on hit rate and bandwidth, while the L1 cache focuses on reducing average memory latencies (since it is the most used cache in the system). Because the target frequency for the Pentium 4 is so high (radically higher than the Pentium III, potentially over twice as high on the same process technology), the only way to keep access time down was to reduce the size of the L1 cache. This is acceptable because the on-die L2 cache is there to provide the high hit rate.

However, while the L1 Dcache of the Pentium 4 takes 2 cycles for integer data loads, it doesn't use the same type of speculative loading for floating-point values; instead, floating-point L1 Dcache accesses have a latency of 6 cycles. While Intel was waging a war against memory latency with the 2-cycle integer path, longer latencies are not nearly so detrimental for floating-point code, which tends to be more stream-oriented anyway.
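The kind of code that benefits most from a fast integer L1 is dependent (pointer-chasing) loads, where each load's address comes from the previous load, so the loop can run no faster than the load-to-use latency. A minimal C sketch of such an access pattern (ours, assuming nothing about the real hardware; a real microbenchmark would shuffle the ring and time the walk):

    /* Dependent-load (pointer-chase) sketch: each load's address
     * comes from the previous load, so the loop runs at roughly one
     * L1 load latency per iteration -- the access pattern a 2-cycle
     * L1 Dcache is meant to speed up. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        enum { N = 1024 };                 /* small enough to stay in L1 */
        size_t *ring = malloc(N * sizeof *ring);
        for (size_t i = 0; i < N; i++)
            ring[i] = (i + 1) % N;         /* each slot points to the next */

        size_t idx = 0;
        for (long step = 0; step < 10000000; step++)
            idx = ring[idx];               /* each load feeds the next */

        printf("final index: %zu\n", idx); /* keeps the loop from being optimized away */
        free(ring);
        return 0;
    }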

Instead of going with the traditional L1 instruction cache as they have in the past, Intel opted for an instruction-caching paradigm that was brought about by academia (with Intel research assistance, of course). Below is an excerpt from a prior article (http://www.slcentral.com/articles/01/1/intel/page3.php):

The Trace Cache concept was patented in 1994 by Peleg and Weiser, but not necessarily with the intent of caching decoded x86 instructions. Instead, it was thought of as a way to increase "real world" instruction throughput by caching instructions that were already executed in a contiguous manner.

In The Fundamentals of Cache, two terms about the locality of programs were discussed: spatial locality and locality of reference. Locality of reference has two parts, spatial and temporal. Spatial locality was discussed, but temporal locality wasn't mentioned. Temporal locality has to do with when a program uses what instructions, not where, as two or more instructions can be in completely different sections of memory.

Among other things, a Trace Cache helps to turn temporal locality into spatial locality. Thought of another way, it is analogous to Windows 98's defragmenting program, which intentionally reorders a program on disk into the order in which it is run, rather than the order in which it was originally compiled. This is important because contiguous blocks containing temporally related instructions are a faster and more efficient way to issue instructions than the conventional Icache. In some tests, performance increased by over 25% by using a Trace Cache instead of a regular instruction cache of the same size (128KB). For more information about Trace Caches, see the bibliography at the end of the article (you'll need a PostScript viewer, one of which can be found at http://www.cs.wisc.edu/~ghost/).

To expand upon how Intel implemented their Trace Cache: it is capable of storing around 12,000 uops, grouped into trace segments of 6 "uops" (RISC-like micro-operations). However, if the part of the Pentium 4 that builds these trace segments isn't able to find enough instructions to fill a 6-uop segment, it will leave some slots empty. This means the maximum of 12,000 uops will not always be fully utilized.
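A hypothetical sketch of that packing process in C (the struct and names are ours, purely illustrative, not Intel's design):

    /* Sketch of trace-segment packing: decoded uops are appended in
     * predicted program order into fixed 6-slot segments.  When a
     * trace ends early, the remaining slots are left empty -- which
     * is why the nominal 12K-uop capacity isn't always fully used. */
    #include <stdio.h>
    #include <string.h>

    #define SLOTS 6

    struct trace_segment {
        const char *uop[SLOTS];   /* NULL = empty (padded) slot */
        int used;
    };

    static void pack(struct trace_segment *seg, const char **uops, int n) {
        memset(seg, 0, sizeof *seg);
        for (int i = 0; i < n && i < SLOTS; i++) {
            seg->uop[i] = uops[i];
            seg->used++;
        }
    }

    int main(void) {
        const char *trace[] = { "load", "add", "cmp", "jcc" }; /* trace ends at a branch */
        struct trace_segment seg;
        pack(&seg, trace, 4);
        printf("segment uses %d of %d slots (%d wasted)\n",
               seg.used, SLOTS, SLOTS - seg.used);
        return 0;
    }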

The Trace Cache is able to issue three uops per cycle, or half a trace segment. While the Pentium 4 is able to execute up to 4 simple integer instructions per clock (discussed later), the bottleneck of 3 uops is not nearly as severe as one might otherwise think: it is rare for microprocessors to average anywhere near their peak functional-unit usage.

Many people glossed over the Trace Cache, but it is vital to whatever performance the Pentium 4 achieves. While we previously discussed the "Hyper Pipelined Technology," the Trace Cache plays a vital role in the length of the pipeline, because it alleviates the need to decode so many x86 instructions into friendlier uops (except for the big and nasty x86 instructions, which could potentially decode into many uops and "pollute" the Trace Cache). Because of the Trace Cache, fewer instructions have to be decoded at a time. This is a very positive move, because without the Trace Cache there would be 8 more stages involved in processing the majority of instructions: when an instruction isn't in the Trace Cache, the pipeline length is 28 stages. If the code contained many branches and the Pentium 4 had to decode instructions every time, the penalty for branch mispredicts would be much higher than the 19 cycles it is with the Trace Cache!

However, despite the emphasis on the removal of the decoders as a main resource, the biggest benefit (potentially) is the conversion of temporal locality into spatial locality. This is very, very important, and also very overlooked by many sites (usually this concept, which isn't even spelled out, is given a one-liner). So I'll say it again: the fact that a Trace Cache turns temporal locality into spatial locality is of utmost importance, and is in fact the main premise behind the original invention of the Trace Cache. That Intel has managed to use it to mitigate the burden of the x86 instruction set is a wonderful side effect.

As mentioned above, the L2 "Advanced Transfer Cache" (ATC) is designed for both hit rate and bandwidth. Like the ATC of the Pentium III, the L2 cache is 256 bits wide, 8-way associative, and non-blocking. However, unlike the Pentium III's ATC, the Pentium 4's ATC is able to send data every clock cycle. This means that at equal clock speeds, the Pentium 4's ATC delivers twice the bandwidth of the Pentium III's. Despite its roots in the Pentium III's ATC, its access latency is greater, at up to 10 cycles compared to 4 on the Pentium III. For the Pentium 4, bandwidth is exceedingly important, while the chip has features that let it tolerate latencies better than its predecessors.

Bandwidth And The Line-Sizes

A member of our forums had this to say with regard to the Pentium 4: "I have to say that I think the whole idea of the P4 was based on the extreme memory bandwidth..." This statement is true with respect to a great number of the Pentium 4's design decisions. No desktop platform can rival the Pentium 4 in terms of sheer bandwidth, and this is a very forward-looking design decision, as we'll explain.

It's important to know the line sizes that a CPU architecture uses. Whenever a processor searches for a piece of data, it works its way down the memory hierarchy: if there is a miss in the L1 cache for an instruction or data element, it searches the L2 cache. If it finds it there, it grabs not only that data element, but also a number of physically contiguous elements. The reason is a concept called "spatial locality," which states that data and code that are physically close to each other are often needed at about the same time.

Memory latencies are detrimental to maintaining peak processing efficiency, so it makes sense to hide them as much as possible. This is why CPUs and caches fetch more than one data element or instruction at a time - the processor will then likely be able to use data that was brought in ahead of an actual request for it (note that this is still distinctly different from hardware prefetch). At the same time, complex data structures (such as nested structures in C, and large objects in C++) can have very negative effects.

Complex data structures are stored contiguously just as simple arrays are; however, their usage tends to be quite distinct, and they behave differently. It is far more common to process one element of an object or structure and then move to the next piece - especially in linked lists, where traversing the list means a pointer is accessed, some comparison is done, and the program likely moves on. Arrays, on the other hand, are more likely to have their data elements accessed one right after another (especially in strings). Where the line size fits into all of this is that the more complex the data structure, the less likely it is for all of the elements in a fetched line to be needed: a line fill grabs a bunch of extra data elements that have to be transferred, and with complex data structures, much of what is transferred is likely to do nothing more than waste space in precious caches. A sketch of the contrast follows.
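Here is a small C illustration of the two access patterns (the node layout is made up for the example):

    /* Two traversals over similar amounts of data.  The array walk
     * uses every byte of each cache line it pulls in; the list search
     * touches only the key and next fields of each ~32-byte node
     * before moving on, so most of each fetched line is wasted. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        int key;
        char payload[24];          /* fields the search never touches */
        struct node *next;
    };

    int main(void) {
        enum { N = 1000 };

        /* Contiguous case: every fetched byte is used. */
        int *arr = malloc(N * sizeof *arr);
        for (int i = 0; i < N; i++) arr[i] = i;
        long sum = 0;
        for (int i = 0; i < N; i++) sum += arr[i];

        /* Linked-list case: nodes end up scattered by the allocator,
         * and only a few bytes of each node are examined. */
        struct node *head = NULL;
        for (int i = 0; i < N; i++) {
            struct node *n = malloc(sizeof *n);
            n->key = i;
            n->next = head;
            head = n;
        }
        int found = 0;
        for (struct node *p = head; p; p = p->next)
            if (p->key == 777) { found = 1; break; }

        printf("sum=%ld found=%d\n", sum, found);
        return 0;
    }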

A graphical example is shown below:


[Figure: memory diagrams, panels A and B (32-byte lines)]

In this case we have a small section of memory, broken up into 32-bit sections (4 bytes, each represented by one square unit). Regular integers on 32-bit machines occupy 32 bits of space. Red cells represent elements whose contents are copied to the next (higher) layer of the memory hierarchy, and cells with a blue spot represent the data that is (or will shortly be) needed.

Suppose we have a processor that uses 32-byte lines (such as the Pentium III), and suppose the first element that needs to be processed is in the lower left-hand corner of 'A'. Suppose the program calls for the following 7 data elements right afterward as well (hence why they are marked with blue). In a case such as this, the 32-byte cache line works perfectly: all the data elements that need to be accessed are in an array (as an example). The processor then makes a similar grab at data and uses it in the same way. Fetching all 32 bytes at once paid off, as the processor doesn't have to go back for more data (which takes a lot of time from the CPU's perspective). Here, no bandwidth was wasted.

Now we'll look at the opposite extreme. In this case, perhaps a program is searching through a linked list: it grabs the pointer to a structure, compares a value (hence the two data elements being "used"), and moves on because it hasn't found the necessary node of the list. Here, full 32-byte lines are still being dragged to the next higher layer of the memory hierarchy, even though only 8 bytes are being used (the pointer and the integer). The search wasted a lot of bandwidth - three-quarters of it, in fact. Only a quarter of the data brought in by each cache-line fill was actually used, so much space and bandwidth went to waste.

We'll take a look at both of these scenarios with a larger line size (128 bytes), à la Pentium 4.

Perhaps the first thing that comes to mind is just how much more red these graphs have! Let's walk through this as we did above. We take the same data in the same organization in memory, but change the line size. Here, the line size is 128 bytes, which means two full columns (rather than just half a column as with a 32-byte line). While this isn't truly representative of reality (caches, as small as they are, have thousands or even millions of such "units" or cells), it does illustrate the point.


[Figure: memory diagrams, panels A and B (128-byte lines)]

In the first case, where there are two arrays whose data elements are all being used, but with some space between them, a 128-byte line gets both arrays at the same time. This means the higher levels don't have to take a cache miss before moving to the data in the second array, while the 32-byte-line design would! This has the benefit of greatly decreasing average memory access latencies for contiguously used data. Even so, half the bandwidth in this case was wasted, as half of the data transferred wasn't used.

However, the story changes when moving toward code that "jumps" around a lot, as is potentially the case with linked lists (because they are often used to grow a collection dynamically, linked lists are often not contiguously located in memory, unlike arrays). In a case such as this, a lot of spatial locality is lost, and compared to a system with merely a 32-byte line, the amount of wasted bandwidth is staggering. While the 32-byte-line system used a quarter of the data it fetched, the 128-byte-line system would use only a sixteenth - wasting 15/16 of the bandwidth! This is of course an extreme example, but it proves the point.
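The arithmetic behind the two extremes, as a tiny C calculation (8 bytes used per line, as in the scenario above):

    /* Fetch whole lines but use only 8 bytes (a pointer plus an
     * integer) per line: waste grows with the line size. */
    #include <stdio.h>

    int main(void) {
        const int used = 8;            /* bytes actually touched per line */
        int lines[] = { 32, 128 };     /* Pentium III vs Pentium 4 line size */
        for (int i = 0; i < 2; i++) {
            int line = lines[i];
            printf("%3d-byte line: %3d/%d of the fetched bytes wasted\n",
                   line, line - used, line);
        }
        return 0;
    }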

So one should easily be able to see one reason why the Pentium 4 requires so much bandwidth: in some cases, it wastes a great deal of it.

Programs are constantly becoming more complex, with more abstract, intricate, and large structures and objects, especially in applications whose dataset sizes vary. This means that sometimes only a small piece of an object or structure will be used at a time, which causes ever-increasing waste in bandwidth when large line sizes are used.

As I've said in other articles, latency and bandwidth go hand in hand. With large line sizes, the miss penalty is much greater when bandwidth is scarce, because it takes longer for the whole block (the size of a line) to be transferred - and the longer a block takes to send, the more latency the processor sees. To combat the penalties induced by such waste, a system that can send much larger amounts of data at a time has a much reduced miss penalty, because it doesn't take as long to transfer the block (the line). In this case, increasing bandwidth so much stops the bleeding that the Pentium 4 would otherwise suffer with integer and "jumpy" programs.
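As a rough illustration (the bandwidth figures below are nominal peaks, and a real line fill involves more than raw transfer time), compare how long one 128-byte fill takes at PC133 SDRAM bandwidth versus dual-channel PC800 RDRAM bandwidth:

    /* Transfer time for one line fill at different bus bandwidths:
     * more bandwidth shrinks the miss penalty a big line would
     * otherwise impose. */
    #include <stdio.h>

    int main(void) {
        const double line = 128.0;            /* bytes per fill */
        double bw[] = { 1.066e9, 3.2e9 };     /* PC133 vs dual-channel PC800 */
        const char *name[] = { "PC133 (~1.06GB/s)", "dual PC800 (3.2GB/s)" };
        for (int i = 0; i < 2; i++)
            printf("%-20s %.1f ns to move a 128-byte line\n",
                   name[i], line / bw[i] * 1e9);
        return 0;
    }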

On the other hand, with streaming data, the large bandwidth and huge line sizes don't just "stop the bleeding" - they allow greatly enhanced peak performance. Much floating-point-intensive code doesn't tend to be written with objects that are as nested or complex, and tends to access memory more contiguously. Add to that the fact that with double-precision floating point, the smallest data element is 64 bits (8 bytes), twice the size of an integer, so fewer elements fit in each line. Having more bandwidth and larger line sizes therefore increases performance dramatically (assuming the processor has a strong FPU).

So where do these design decisions come into play? There has always been a balancing act between line size and miss penalty. The more bandwidth a system has, the more the miss penalty is reduced, and the more the balancing point shifts toward a larger line size. Such is the case with the Pentium 4. This is why it is so important that the Pentium 4 have so much bandwidth at all levels of the memory hierarchy, and it is no wonder this chip was designed with RDRAM in mind, as RDRAM offers incredible bandwidth per pin (which is becoming important). Just for reference, the Athlon uses 64-byte lines at all levels of the memory hierarchy.

Hardware Prefetch

There is yet another reason, beyond the line-size of the Pentium 4, why the platform requires such enormous bandwidth: the hardware prefetch unit.

With the Pentium III, Intel introduced software prefetch instructions, which allow a programmer to load data into a cache before it's needed. While this means there will be less space available for other potentially needed data, if used wisely, the latencies to main memory can be masked: the processor stays busy working on whatever it's currently working on while something that will be needed later is loaded into the cache, so it doesn't have to experience the painful delays of going to main memory.
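A minimal C sketch of the software variety, using the SSE prefetch intrinsic introduced alongside the Pentium III (the prefetch distance of 16 elements is a guess one would tune, not a recommended value):

    /* Software prefetch: hint the cache to pull in data we'll need a
     * few iterations from now, overlapping memory latency with work. */
    #include <stdio.h>
    #include <xmmintrin.h>   /* _mm_prefetch */

    int main(void) {
        enum { N = 4096 };
        static float data[N];
        float sum = 0.0f;
        for (int i = 0; i < N; i++) {
            if (i + 16 < N)
                _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);
            sum += data[i] * 2.0f;   /* the "real work" the latency hides behind */
        }
        printf("sum=%f\n", sum);
        return 0;
    }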

The hardware prefetch of the Pentium 4 extends this a bit further. First, because it is hardware-based, it doesn't require any support on the part of the program. Also, since no prefetch instructions are needed, there is no code dilution. However, there is a downside to prefetching.

Prefetching of any sort uses up bandwidth, simply because it is loading data. When a program is bandwidth-constrained, this can lead to performance decreases because of contention for main-memory bandwidth. However, when paired with a great deal of memory bandwidth, prefetching doesn't take any "needed" bandwidth away from the fetching of other instructions and data. In this way, prefetching can soak up "excess" bandwidth and do something useful: load data into a cache before it is needed, increasing the cache's hit rate, which in turn decreases average memory access times. And because this requires no effort on the part of the programmer, hardware prefetch allows existing programs to make use of the mammoth bandwidth afforded by a dual-channel PC800 system.

Some Of The "Guts"

Now that we've covered the basics of how the Pentium 4 gets its data, and why it needs to be able to grab lots of it at a time, we'll slide over to an area most people would start with first: the execution resources afforded by the processor.

In brief:

  • 2 "Double Pumped" ALUs (Arithmetic Logic Units: add, subtract, logical AND, logical OR). One benefit of double-pumping is that the Pentium 4 gets the same peak throughput out of half the area: it can execute up to 4 simple integer instructions per base CPU cycle, though it is constrained by the 3 uops issued per cycle.
  • 2 FPU units: one for FPU loads and stores, the other for FPU adds and subtracts.
  • 126-entry reorder buffer: the processor has a window of 126 instructions in which to search for, and execute, non-data-dependent instructions. This helps to hide latencies, as the sketch below illustrates.
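Here is a toy C model of that last idea (ours, and vastly simplified: three uops, no register renaming, and results are not forwarded within the pass), showing how a window lets independent work issue around a stalled instruction:

    /* Toy model of what a big reorder window buys: scan pending uops
     * and issue any whose source registers are ready, regardless of
     * program order.  The dependent uop waits while a later,
     * independent one issues anyway. */
    #include <stdio.h>

    struct uop { const char *txt; int src1, src2; };

    int main(void) {
        int ready[8] = { 1, 1, 0, 0, 0, 0, 0, 0 };   /* r0 and r1 hold values */
        struct uop window[] = {
            { "r2 = r0 + r1", 0, 1 },   /* inputs ready: issues       */
            { "r3 = r2 + r1", 2, 1 },   /* needs r2: must wait        */
            { "r4 = r0 + r0", 0, 0 },   /* independent: issues anyway */
        };
        for (int i = 0; i < 3; i++) {
            int ok = ready[window[i].src1] && ready[window[i].src2];
            printf("%s  %s\n", ok ? "issue:" : "wait: ", window[i].txt);
        }
        return 0;
    }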

The Pentium 4 has fewer integer units than the Athlon, and fewer floating-point units as well. Moreover, the Pentium 4 no longer executes FXCH (an instruction that shuffles data around in the archaic x86 FPU stack) for "free," which the Pentium III and Athlon do. As software has been optimized for the Pentium III, and a little for the Athlon, this means that optimizations for prior processors will actually degrade performance on the Pentium 4.

Also, the FMUL instruction is no longer pipelined, and many instructions have longer execution latencies. So the Pentium 4 has fewer execution resources than the Athlon, takes longer to complete the instructions it can issue, and "undoes" some of the optimizations that software vendors have been making since the Pentium Pro days. On the other hand, the Pentium 4 should theoretically deal with streaming code, and in particular large data sets, better than the Pentium III and Athlon, due to the massive bandwidth it has at all levels. Despite the view that the Pentium 4 is "crippled," it does have a way to make up for the lack of solid floating-point performance: more SIMD instructions!

iSSE2

Intel introduced "SWAR" (SIMD Within A Register) to the x86 world back in the days of MMX. SIMD computers had existed before, in vector machines such as those produced by Cray. The basic idea is to use one instruction on multiple data elements to increase throughput and reduce instruction count (Single Instruction, Multiple Data). While MMX was integer-only and required the use of the floating-point registers (as did 3dNow!), SSE and iSSE2 bring floating-point SIMD to the x86 world, and use eight 128-bit registers so as not to make the floating-point registers do double duty.

iSSE2 theoretically gives the Pentium 4 the same throughput on some floating-point code as the Athlon manages in pure x86 mode. Moreover, iSSE2 allows both 64-bit integer and double-precision floating-point operations to be done in SIMD mode. This means code size could theoretically decrease somewhat, though the effect is likely to be mostly negligible even when optimized.
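A small C example of what iSSE2's packed doubles look like from a programmer's perspective, using the standard intrinsics (compile with an SSE2-capable compiler, e.g. gcc -msse2):

    /* SSE2 double-precision SIMD: one instruction adds two packed
     * doubles at once in the 128-bit XMM registers. */
    #include <stdio.h>
    #include <emmintrin.h>   /* SSE2 intrinsics */

    int main(void) {
        double a[2] = { 1.5, 2.5 }, b[2] = { 10.0, 20.0 }, r[2];
        __m128d va = _mm_loadu_pd(a);
        __m128d vb = _mm_loadu_pd(b);
        _mm_storeu_pd(r, _mm_add_pd(va, vb));  /* two adds, one instruction */
        printf("%f %f\n", r[0], r[1]);
        return 0;
    }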

Despite the apparent disappointment with the FPU performance of the Pentium 4, Ace's Hardware has published a compiler analysis showing that the Pentium 4 is capable of much higher floating-point performance when optimized for iSSE2. Another important point about both SSE and iSSE2 (the Pentium 4 supports both) is that they are fully IEEE compliant: results will be exactly as defined by the universal IEEE standard. Plain x86 floating-point code must do this anyway (to be considered x86 compatible), but 3dNow! in single-precision mode does not - the numbers can be off by a very small fraction. This can cause problems in tests that need to be highly precise; double precision can take care of that for 3dNow!, though it slows down the process.

The precision afforded by 3dNow! is certainly enough for games, but in many scientific applications, anything short of full IEEE compliance can ruin results. As such, iSSE and iSSE2 have shown that they can be potentially more useful, simply because they give the exact results one would expect.

Despite all the optimizations that are theoretically possible, Intel claims that in a common workstation benchmark, SPEC2000, only about 5% of the chip's phenomenal absolute performance is due to SIMD optimizations! This is in part because some portions of SPEC2000 are bandwidth-limited, which is where the Pentium 4's quad-pumped bus and 3.2 gigabytes per second of main memory bandwidth (two PC800 RDRAM channels at 1.6GB/s each) become very useful.

Thermal Protection

Athlon cores have been cracking under the heatsinks of many a do-it-yourselfer, despite the little pads in the corners. While this can be avoided if proper time and care are taken, the Pentium 4 uses a different mounting mechanism, so the thermal solution is less likely to damage the processor - certainly a noble goal.

The Pentium 4 also has a built-in thermal diode, which is used for two things: one, to slow the processor down when it heats up too fast; and two, to shut it down when it reaches a temperature unsafe for operation. Current Thunderbird Athlons have neither ability, and thus it is not uncommon in the do-it-yourself market for people to burn up their Athlons, even multiple times. Some online reports have insisted that the "throttling" the Pentium 4 does when overheating kicks in when it shouldn't; we here at SystemLogic.net experienced no such problems, nor have any other review sites that we are aware of.

Though this is a personal preference, I would rather have my processor throttle or shut itself down than burn up my investment. The issues with improper throttling (throttling at times when it shouldn't) have not appeared in widespread instances at all. Another benefit is that temperature readings from an internal diode are more accurate and precise than those from a motherboard socket thermal sensor (which the Thunderbird is currently relegated to using). AMD certainly knows it has had a problem with overheating and with inaccurate (and imprecise) temperature readings from motherboards, as it has included an internal thermal diode with the advent of the Palomino core in the Athlon 4 and Athlon MP (both the same processor, really).

Conclusion

In some ways, the Pentium 4 represents what many may view as "backward steps" in technology - some have viewed the increasing instruction latencies as a negative side effect of the quest for more MHz. Yet this has been happening since the days of the 486, perhaps earlier, due to pipelining.

Intel has also done away with the traditional instruction cache in favor of the new Trace Cache, and has chosen to use fewer functional units than the competition, knowingly reducing average IPC (instructions per clock). But in the quest for performance, the best metric is time to complete a task. It matters not whether a processor achieves greater performance through higher IPC or greater frequency; in the end, absolute performance is what matters.

Intel has chosen to sacrifice a little average IPC in exchange for radically higher clock speeds than its previous architecture on the same process technology. While the Pentium 4 at 1.3GHz doesn't outperform the 1GHz Pentium III in many tasks, the highest-performing Pentium 4, at 1.8GHz, is indisputably faster than the fastest Pentium III (at 1.1GHz - which was a silent introduction, I might add). The Pentium 4, on the same process technology as the Pentium III, will have greater absolute performance in nearly every facet of computing (performance per watt excluded).

Of course, the Pentium 4 has more to ward off than its older sibling, the Pentium III; it also has archrival AMD's Athlon core to contend with. While the Pentium 4 may not radically outperform the Athlon series right now (and in some cases, at all), its design indicates it has that potential in the future. With Intel at least a year ahead of AMD in volume 0.13-micron process technology, Intel will be able to introduce much more highly clocked Pentium 4s at a much faster pace than AMD can match.

AMD's current hope rests with its Palomino core, which boasts even greater average IPC than the Thunderbird (itself higher than that of the Pentium 4). The Pentium 4 contains a new core intended to supplant the old and venerable Pentium Pro core in all markets (over its lifetime, not right away). As such, it was designed with many aspects of the future in mind, such as the soaring disparity between main memory and CPU clock rates. Its performance now is relatively unimportant (as long as it outperforms the Pentium III on the same process technology), which wasn't true of the Athlon (AMD needed the Athlon to beat the Pentium III right then and there to survive as a company). There are technical indications that the Athlon core isn't as future-proof as the Pentium 4, which means the battle between the two major x86 rivals will burn even hotter.
