How Cache Helps Performance
Why is it advantageous to use an exclusive cache? Consider Intel's Celeron2 and AMD's Duron. The Celeron has 32KB of L1 cache, split equally between data and instruction caches, plus a unified, on-die 128KB L2 cache. So we add the two together and get 160KB of on-die cache, right? Technically, yes. However, because the design is inclusive, the information in the L1 is duplicated in the L2. That leaves 32KB of L1 and 96KB of effective L2 (32KB of the L2 holds the same data as the L1), for a total of 128KB of useful cache. Contrast that with the Duron, which has 128KB of L1 cache, split equally between data and instruction caches, and a unified, on-die L2 cache that is 64KB in size. What!?! Only 64KB? If the Duron were designed inclusively, this would defeat the purpose of the L2 cache; indeed, an inclusive L2 equal to or smaller than the L1 would be pointless. A small L2 does work, however, in exclusive designs (including the never-released Joshua version of the Cyrix III), so AMD made the Duron's cache exclusive (as it did with the Thunderbird). The Duron has 128KB of L1 plus 64KB of L2, and because neither contains the same information, one can simply add the two together for an effective on-die cache of 192KB. The Duron's is obviously larger.
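The arithmetic above can be sketched in a few lines of Python (the function name here is just for illustration):

```python
# Illustrative comparison of effective on-die cache for an inclusive
# design (the Celeron2) versus an exclusive design (the Duron).
def effective_cache_kb(l1_kb, l2_kb, inclusive):
    """Return the total KB of unique data the two cache levels can hold."""
    if inclusive:
        # L1 contents are duplicated in the L2, so the L1 adds nothing
        # unique beyond what the L2 already holds (assuming L2 >= L1).
        return l2_kb
    # Exclusive: L1 and L2 hold disjoint data, so the sizes simply add.
    return l1_kb + l2_kb

celeron2 = effective_cache_kb(32, 128, inclusive=True)    # 128 KB useful
duron    = effective_cache_kb(128, 64, inclusive=False)   # 192 KB useful
print(celeron2, duron)
```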
So, let us consider this again using some diagrams (not to scale of course):
Exclusive L2 - L1 relationship
One point worth spelling out (this exclusive diagram is a change from the first version of it) is the nature of the cache's exclusivity: in AMD's Duron, the relationship is exclusive only between the L1 and the L2. The information in the L1 is still duplicated in main memory (and potentially, though not always or often, on the hard drive), just not in the L2 cache.
Moving on from the Celeron2 and Duron, let's now look at their more powerful siblings, the AMD Thunderbird Athlon and the Intel Pentium3 Coppermine. The Thunderbird has an on-die, exclusive 256KB L2 cache, and the Coppermine P3 features an on-die, inclusive 256KB L2 cache. Caches consume enormous numbers of transistors. The P3 went from about 9.5 million transistors to about 28 million just from adding 256KB of L2 cache. The Athlon had about 22 million, and adding 256KB of L2 cache made the Thunderbird weigh in at a hefty 37 million. These represent rather large fractions of the transistors in two top-of-the-line x86 processors, and they are there solely to feed these processors' hungry execution units.
Because these caches take up so much space, they increase die size significantly, which is not a good thing, because die size plays a crucial role in yields. In fact, the formula for die yields can be expressed as:
Die yield = (1 + (defects per mm^2 * die area in mm^2) / n)^-n

where n is the number of layers of metal in the CMOS process, assuming wafer yields are 100% (Hennessy and Patterson, 12).
As you can see, the larger the die, the lower the yields. I bring this up because the original Athlon on a .18 micron process was 102 mm^2, while the Thunderbird was about 20% larger at 120 mm^2, and that larger die hurts yields. From the standpoint of economy versus performance, then, there need to be good reasons to put the L2 cache on-die, and there are.
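To see how much that 20% matters, we can plug some illustrative numbers into the yield formula above. The defect density and metal-layer count below are assumptions chosen for demonstration, not figures for any real fab process:

```python
# Die yield = (1 + (D * A) / n) ** -n, wafer yield assumed to be 100%.
def die_yield(defects_per_mm2, area_mm2, n_layers):
    """Estimate die yield from defect density, die area, and metal layers."""
    return (1 + (defects_per_mm2 * area_mm2) / n_layers) ** -n_layers

D, n = 0.005, 6  # assumed defect density (per mm^2) and metal-layer count

athlon      = die_yield(D, 102, n)   # original Athlon, 102 mm^2
thunderbird = die_yield(D, 120, n)   # Thunderbird, ~20% larger at 120 mm^2
print(f"Athlon: {athlon:.1%}  Thunderbird: {thunderbird:.1%}")
```

With these (made-up) inputs, the larger Thunderbird die yields several percentage points worse than the original Athlon, which is the whole point of the formula.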
Now that Intel has moved the majority of its production to socketed chips, and AMD the entirety, neither really needs to put its CPUs on expensive PCBs: the cache that was formerly placed on the cartridge, off-die, has been replaced with on-die cache. Cost is not the only benefit, however; there is also performance.
Despite the common misconception, electrical signals do not travel at the speed of light. They certainly propagate far faster than sound, but only at a fraction of light speed that depends on the medium (and the electrons themselves drift far more slowly still), and this fact has an impact on the design and performance of processors. Why mention this? One must remember that computers deal with information only as low and high voltages of electricity. The speed of any given part of a computer is, at the very least, bound by the speed at which those signals can be transmitted across whatever medium carries them. This, in turn, shows us that the ultimate performance bottleneck is necessarily the speed at which electricity can move. This is also the case for cache.