The P4's Caches
Since the 486, no x86 Intel processor has reduced the amount of L1 cache it implements, and each architecture has proven itself faster than its predecessor overall. Given that track record, it is easy to see why people often equate more L1 cache with better absolute performance. This time around, however, Intel implemented an L1 data cache half the size of its predecessor's, and even chose to forgo the standard L1 instruction cache entirely (not a bad thing, as we'll see)!
The L1 cache of the Pentium 4 is only 8KB, but it has very low access latencies: data can be fetched from the L1 Dcache in only 2 cycles! This is very important for two reasons: first, the vast majority of computations operate on data served from the L1 cache; second, even what looks like a tiny L1 cache has a high hit-rate, in part due to its 4-way associativity.
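As an aside, the geometry of an 8KB, 4-way set-associative cache is easy to sketch. The 64-byte line size and the `split_address` helper below are illustrative assumptions, not documented Pentium 4 implementation details:

```python
# Address decomposition for an 8KB, 4-way set-associative cache.
# The 64-byte line size is an assumption for illustration.
CACHE_SIZE = 8 * 1024   # 8KB total
WAYS       = 4          # 4-way set associative
LINE_SIZE  = 64         # bytes per cache line (assumed)

NUM_LINES = CACHE_SIZE // LINE_SIZE   # 128 lines
NUM_SETS  = NUM_LINES // WAYS         # 32 sets

OFFSET_BITS = LINE_SIZE.bit_length() - 1  # 6 bits of byte offset
SET_BITS    = NUM_SETS.bit_length() - 1   # 5 bits of set index

def split_address(addr):
    """Break an address into (tag, set index, byte offset)."""
    offset = addr & (LINE_SIZE - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset

print(split_address(0x12345))
```

With only 32 sets and 4 ways per set, the lookup hardware stays small, which is part of how such a cache can be made fast.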
In the past, going to the L2 cache for anything was murder on performance, so L1 caches have been made larger over time. The Pentium 4, however, was designed with an on-die L2 cache from the get-go (unlike the Pentium III Coppermine, which has origins in the Pentium III Katmai, itself an extension of the Pentium II, whose ancestor is the Pentium Pro, none of which had on-die L2 cache). This allowed the designers to let the L2 cache focus on hit-rate and bandwidth, since a trip to the L2 cache no longer hurts too much, while the L1 cache could focus instead on reducing average memory latency (as the most heavily used cache in the system). Because the target frequency for the Pentium 4 is so high (radically higher than the Pentium III's, potentially over twice as high on the same process technology), the only way to achieve such a low access time was to reduce the size of the L1 cache. This trade-off is acceptable precisely because the on-die L2 cache can be counted on for a high hit-rate.
However, while the L1 Dcache of the Pentium 4 delivers integer loads in 2 cycles, it doesn't use the same type of speculative loading for floating point values; floating point L1 Dcache accesses instead have a latency of 6 cycles. While Intel was waging a war against memory latency with the 2-cycle integer L1 Dcache, longer latencies are not nearly so detrimental for floating point code, which tends to be more stream-oriented anyway.
Instead of going with the traditional L1 instruction cache as they have in the past, Intel opted for an instruction caching paradigm that came out of academia (with Intel research assistance, of course). Below is an excerpt from a prior article (http://www.slcentral.com/articles/01/1/intel/page3.php):
The Trace Cache concept was patented in 1994 by Peleg and Weiser, though not necessarily with the intent of caching decoded x86 instructions. Instead, it was conceived as a way to increase "real world" instruction throughput by caching instructions that had already been executed, in a contiguous manner.
In The Fundamentals of Cache, two terms describing program locality were discussed: spatial locality and locality of reference. Locality of reference has two parts, spatial and temporal. Spatial locality was covered there, but temporal locality wasn't mentioned. Temporal locality has to do with when a program uses which instructions, not where they reside, as two or more instructions can sit in completely different sections of memory.
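The distinction can be made concrete with two toy access traces (the addresses below are invented for illustration): spatial locality means consecutive accesses touch neighboring addresses, while temporal locality means the same few addresses are reused close together in time, even if they sit far apart in memory.

```python
# Spatial locality: walking an array -- neighbors in memory
# are also neighbors in time.
spatial_trace = [0x1000 + i for i in range(8)]

# Temporal locality: a loop body repeatedly calling a helper that
# lives in a distant part of the binary (addresses are made up).
loop_body, helper = 0x1000, 0x9000
temporal_trace = [loop_body, helper] * 4

# The temporal trace reuses only 2 distinct locations...
print(len(set(temporal_trace)))
# ...but jumps a long way between consecutive accesses.
print(max(abs(a - b) for a, b in zip(temporal_trace, temporal_trace[1:])))
```

A conventional Icache exploits the first pattern directly; the second pattern is exactly what a Trace Cache is built to capture, as discussed next.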
Among other things, a Trace Cache helps turn temporal locality into spatial locality. Thought of another way, it is analogous to Windows 98's "defragmenting program," which intentionally reorders a program into the order in which it is run, rather than the order in which it was originally compiled. This matters because contiguous blocks of temporally related instructions are a faster and more efficient way to issue instructions than a conventional Icache allows. In some tests, performance increased by over 25% when a Trace Cache was used instead of a regular instruction cache of the same size (128KB). For more information about Trace Caches, see the bibliography at the end of the article (you'll need a PostScript viewer, one of which can be found at http://www.cs.wisc.edu/~ghost/).
To expand upon how Intel implemented their Trace Cache: it is capable of storing around 12,000 instructions, grouped into Trace Segments of 6 "uops" (RISC-like micro-operations). However, if the part of the Pentium 4 which builds these trace segments cannot find enough instructions to fill a full 6-uop segment, it leaves some slots empty. This means the maximum capacity of 12,000 uops will not always be fully utilized.
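A rough sketch of that packing behavior follows. The policy of sealing a segment at a branch is an invented simplification to show how empty slots arise; the real fill rules are more involved:

```python
# Illustrative packing of decoded uops into 6-uop trace segments.
# Assumption for illustration: a segment is sealed early whenever
# a branch uop is seen, leaving padded (empty) slots behind.
SEGMENT_SIZE = 6

def build_segments(uops):
    """Group a uop stream into fixed-size trace segments, sealing
    the current segment at a branch or when it fills up."""
    segments, current = [], []
    for uop in uops:
        current.append(uop)
        if len(current) == SEGMENT_SIZE or uop == "branch":
            # Pad unused slots so every segment occupies 6 entries.
            current += [None] * (SEGMENT_SIZE - len(current))
            segments.append(current)
            current = []
    if current:
        current += [None] * (SEGMENT_SIZE - len(current))
        segments.append(current)
    return segments

stream = ["add", "load", "branch",
          "add", "add", "add", "add", "add", "add", "store"]
segs = build_segments(stream)
print(len(segs))            # segments built from 10 uops
print(segs[0].count(None))  # slots wasted in the branch-sealed segment
```

In this toy stream, 10 uops consume three full segments' worth of capacity, which is exactly the kind of under-utilization described above.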
The Trace Cache is able to issue three uops per cycle, or half a trace segment. While the Pentium 4 is able to execute up to 4 simple integer instructions per clock (discussed later), the bottleneck of 3 uops is not nearly as severe as one might otherwise think: it is rare for microprocessors to average anywhere near their peak functional unit usage.
Many people glossed over the Trace Cache, but it is vital to whatever performance the Pentium 4 achieves. While we previously discussed the "Hyper Pipelined Technology," the Trace Cache plays a vital role in keeping the pipeline's effective length manageable. The reason is that the Trace Cache alleviates the need to decode so many x86 instructions into friendlier uops (except for the big and nasty x86 instructions, which could potentially be decoded into a great many uops and "pollute" the trace cache). Because of the Trace Cache, fewer instructions have to be decoded at a time. This is a very positive move: without the Trace Cache, there would be 8 more stages involved in processing the majority of instructions. When an instruction isn't in the Trace Cache, the pipeline length is 28 cycles. If there were many branches in the code and the Pentium 4 had to decode instructions every time, the penalty for branch mispredicts would be much higher than the 19 cycles it is with the Trace Cache!
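Using the pipeline figures quoted above, the average cost per branch is easy to work out. The 5% misprediction rate below is an assumed, illustrative value; real rates depend entirely on the workload:

```python
# Average cycles lost to mispredicts per branch = rate * flush penalty.
# The 5% misprediction rate is an assumed value for illustration.
MISPREDICT_RATE = 0.05

def expected_penalty(mispredict_rate, penalty_cycles):
    """Expected per-branch cost of a given mispredict penalty."""
    return mispredict_rate * penalty_cycles

with_tc    = expected_penalty(MISPREDICT_RATE, 19)  # hit in Trace Cache
without_tc = expected_penalty(MISPREDICT_RATE, 28)  # full decode pipeline

print(f"with Trace Cache:    {with_tc:.2f} cycles/branch")
print(f"without Trace Cache: {without_tc:.2f} cycles/branch")
```

Scaled across the enormous number of branches in real code, that per-branch gap is why the Trace Cache matters so much to a deeply pipelined design.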
However, despite the emphasis on removing the decoders from the critical path, the biggest benefit (potentially) is the conversion of temporal locality into spatial locality. This is very, very important, and also very overlooked by many sites (usually this concept, which isn't even spelled out, is given a one-liner). So I'll say it again. The fact that a Trace Cache turns temporal locality into spatial locality is of utmost importance, and is in fact the main premise behind the original invention of the Trace Cache. That Intel has managed to use it to mitigate the fact that they still use the x86 instruction set is a wonderful side effect.
As mentioned above, the L2 "Advanced Transfer Cache" (ATC) is designed for both hit-rate and bandwidth. Like the ATC of the Pentium III, the L2 cache is 256 bits wide, 8-way associative, and non-blocking. Unlike the Pentium III's ATC, however, the Pentium 4's ATC is able to send data every clock cycle, meaning that at equal clock-speeds it delivers twice the bandwidth of the Pentium III's. Despite its roots in the Pentium III's ATC, the access latency is greater, at up to 10 cycles compared to 4 on the Pentium III. For the Pentium 4, bandwidth is exceedingly important, while it has features that allow it to tolerate latencies better than its predecessors.
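The bandwidth claim is simple arithmetic from the bus width and transfer rate. The 1.4 GHz clock below is just an example frequency, and the every-other-cycle figure for the Pentium III ATC is inferred from the comparison above:

```python
# Peak L2 bandwidth = bytes per transfer * transfers per second.
BUS_WIDTH_BITS = 256                        # both ATCs are 256 bits wide
BYTES_PER_TRANSFER = BUS_WIDTH_BITS // 8    # 32 bytes per transfer

def peak_bandwidth_gbs(clock_hz, cycles_per_transfer):
    """Peak cache bandwidth in GB/s for a given transfer cadence."""
    return BYTES_PER_TRANSFER * clock_hz / cycles_per_transfer / 1e9

clock = 1.4e9  # example: 1.4 GHz (illustrative frequency)
print(f"P4-style  (every cycle):       {peak_bandwidth_gbs(clock, 1):.1f} GB/s")
print(f"PIII-style (every other cycle): {peak_bandwidth_gbs(clock, 2):.1f} GB/s")
```

Same bus width, same clock, exactly double the bandwidth once data moves every cycle instead of every other cycle.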
>> Bandwidth And The Line-Sizes