There is yet another reason, beyond the line-size of the Pentium 4, why the platform requires such enormous bandwidth: the hardware prefetch unit.
With the Pentium III, Intel introduced software prefetch instructions, which allow a programmer to load data into a cache before it is needed. While this leaves less cache space for other potentially needed data, prefetching, if used wisely, can mask the latency of main memory. The processor stays busy with whatever it's currently working on while something it will need later is loaded into the cache, so it doesn't have to experience the painful delay of going to main memory.
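To make the idea concrete, here is a minimal sketch of software prefetching in C. It assumes a GCC- or Clang-style compiler, whose `__builtin_prefetch` extension is a hint that x86 compilers with SSE enabled typically turn into one of the prefetch instructions. The function name and the `PREFETCH_DISTANCE` value are hypothetical choices for illustration, not tuned figures for any particular processor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current one
   to request. A good value depends on memory latency and loop cost. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting that data a few iterations ahead will be
   needed soon. The prefetch is only a hint: it does not change the
   result, it only tries to hide the trip to main memory. */
long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);

    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = read access, 3 = high temporal
               locality (keep the line in all cache levels). */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

Note that the prefetch call is an extra instruction in the loop body. That is the cost the programmer pays for software prefetch, and it only helps if the loop does enough work per element to overlap with the memory access.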
The hardware prefetch of the Pentium 4 extends this a bit further. Because it is hardware based, it doesn't require any support on the part of the program. And because no prefetch instructions are needed, there is no code dilution! However, there is a downside to prefetching.
Prefetching, of any sort, uses up bandwidth, simply because it is loading something from memory. When a program is bandwidth constrained, this can decrease performance, because of contention for main-memory bandwidth. However, when paired with a great deal of memory bandwidth, prefetching doesn't take any "needed" bandwidth away from the fetching of other instructions and data. In this way, prefetching can soak up "excess" bandwidth and do something useful: load instructions and data into a cache before they are needed, increasing the cache's hit rate, which in turn means that the average memory access time decreases. And, because this requires no effort on the part of the programmer, hardware prefetch allows existing programs to make use of the mammoth bandwidth afforded by a dual-channel PC800 system.
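The link between hit rate and access time can be sketched with the standard textbook formula: average access time equals the hit time plus the miss rate multiplied by the miss penalty. The function name below is hypothetical, and the formula is a simplified single-level model rather than a description of the Pentium 4's actual cache hierarchy.

```c
#include <assert.h>

/* Average memory access time for a single cache level:
   hit_time + miss_rate * miss_penalty.
   All times are in the same unit (for example, clock cycles). */
double avg_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With made-up numbers (a 2-cycle hit and a 100-cycle miss penalty, neither of which is a Pentium 4 measurement), `avg_access_time(2.0, 0.25, 100.0)` gives 27 cycles, while halving the miss rate to 0.125 gives 14.5 cycles. That is the sense in which a prefetcher that raises the hit rate lowers the average cost of a memory access.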
Some Of The "Guts"
Now that we've covered the basics of how the Pentium 4 gets its data, and why it needs to be able to grab lots of it at a time, we'll slide over to the area most people would start with: the execution resources afforded by the processor.
- 2 "Double Pumped" ALUs (Arithmetic Logic Units: add, subtract, logical AND, logical OR). Because they run at twice the core clock, the Pentium 4 is able to get the same performance out of half the area (together they can execute 4 instructions per base CPU cycle, though the processor is constrained to only 3 uops issued per cycle).
- 2 floating-point units: one for FPU loads and stores, the other for FPU adds and subtracts.
- 126 entry Reorder buffer: This means that the processor has a window of 126 instructions in which to search for, and execute, non-data-dependent instructions. This helps to hide latencies.
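The arithmetic behind the ALU figures in the list above can be sketched as follows. This is a deliberately simplified model with a hypothetical function name: it treats sustained throughput as the smaller of what the ALUs can execute and what can be issued per base clock, and ignores every other bottleneck.

```c
/* Simplified ceiling on sustained integer ops per base clock:
   the ALUs can execute (alus * ops_per_alu_per_clock) per cycle,
   but no more than issue_width uops arrive per cycle. */
unsigned sustained_ops_per_clock(unsigned alus,
                                 unsigned ops_per_alu_per_clock,
                                 unsigned issue_width)
{
    unsigned execute_peak = alus * ops_per_alu_per_clock;
    return execute_peak < issue_width ? execute_peak : issue_width;
}
```

Plugging in the numbers from the list, `sustained_ops_per_clock(2, 2, 3)` returns 3: the two double-pumped ALUs could retire 4 simple operations per base cycle, but only 3 uops are issued per cycle.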
The Pentium 4 has fewer integer units than the Athlon, and it has fewer floating-point units as well. Moreover, the Pentium 4 no longer gets FXCH (an instruction which shuffles data around in the archaic x86 FPU stack) for "free," as the Pentium III and Athlon do. As software has been optimized for the Pentium III, and a little bit for the Athlon, this means that optimizations for prior processors can actually degrade performance on the Pentium 4.
Also, the FMUL instruction is no longer fully pipelined, and many instructions have longer execution latencies. This means that the Pentium 4 has fewer execution resources than the Athlon, it takes longer to complete the instructions that it can issue, and it "undoes" some of the optimizations that software vendors have been making since the Pentium Pro days. On the other hand, the Pentium 4 should theoretically be able to deal with streaming data, and large data sets in particular, better than the Pentium III and Athlon, due to the massive bandwidth that it has at all levels. Despite the view that the Pentium 4 is "crippled," it does have a way to make up for the lack of solid floating-point performance: more SIMD instructions!