For a long time, the secret to more performance was either to execute more instructions per cycle, that is, to exploit Instruction Level Parallelism (ILP), or to decrease the latency of instructions. To execute more instructions each cycle, more functional units (integer, floating point, load/store units, etc.) have to be added. To keep those units busy more consistently, a processing paradigm called out-of-order processing (OOP), more commonly known as out-of-order execution, can be used, and it has in fact become mainstream (notable exceptions are the UltraSPARC and IA-64).
This paradigm arose because many instructions depend on the outcome of other instructions that have already been sent into the processing pipeline. To alleviate this problem, the processor holds a larger number of instructions at once, so that any whose inputs are available can execute immediately. The purpose is to find more instructions that do not depend on each other. This area of storage is called the reorder buffer. Reorder buffers have been growing: the Pentium III has a window of 40 instructions, the Athlon has 72 (the 0.18 micron incarnation of the Athlon is purported to have 78), and the new Pentium 4 has a window of no fewer than 126 instructions! The reason for looking at nearby instructions is simple: code that is spatially related tends also to be temporally related in terms of execution (this excludes arrays of complex structures and linked lists). The only problem is that these nearby instructions also tend to depend on the outcome of prior instructions. With CPUs plowing through an ever-increasing amount of code at a time, the only current way to find more independent instructions has been to increase the size of the reorder buffer.
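The effect of window size can be illustrated with a toy model. The sketch below is not how any real processor schedules instructions; it assumes every instruction takes one cycle, ignores branches and memory, and uses hypothetical names (`simulate`, `sequential_chains`) invented for this illustration. It only shows how a larger window lets the scheduler reach past a dependent chain to find independent work.

```python
def sequential_chains(num_chains, length):
    """Build a toy program: `num_chains` dependency chains laid out one after
    another. deps[i] lists the earlier instructions that instruction i needs."""
    deps = []
    for _ in range(num_chains):
        for j in range(length):
            i = len(deps)
            deps.append([] if j == 0 else [i - 1])
    return deps


def simulate(deps, window, width):
    """Count cycles to run the program (assumption: unit latency everywhere).

    window: how many of the oldest unretired instructions the scheduler can see
    width:  maximum instructions issued per cycle (number of functional units)
    """
    n = len(deps)
    done_cycle = [None] * n  # cycle in which each instruction executed
    head = 0                 # oldest instruction not yet retired
    cycle = 0
    while head < n:
        cycle += 1
        issued = 0
        # Scan the window oldest-first for instructions whose inputs are ready.
        for i in range(head, min(head + window, n)):
            if issued == width:
                break
            if done_cycle[i] is not None:
                continue
            ready = all(
                done_cycle[d] is not None and done_cycle[d] < cycle
                for d in deps[i]
            )
            if ready:
                done_cycle[i] = cycle
                issued += 1
        # Retire strictly in program order, freeing window slots.
        while head < n and done_cycle[head] is not None:
            head += 1
    return cycle


if __name__ == "__main__":
    program = sequential_chains(num_chains=2, length=4)
    for window in (1, 4, 8):
        print(window, simulate(program, window=window, width=2))
```

In this model a window of one instruction behaves like a strictly in-order machine, while a window large enough to span both chains lets them run side by side. A single long dependent chain gains nothing from a larger window, which is the limitation described above.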
This approach has produced a rather striking slowdown in the rate of performance improvement: it is showing diminishing returns. It now takes more and more transistors to achieve the same rate of performance increase. Instead of focusing intently on extracting ILP from a single processor, one can extract performance at a coarser level, the thread level, via multithreading, but without the system bus as a major constraint.