Cache And The Evolution Of Form Factors
I can recall one of my friends' 486 DX 33s (hey Adam! If you are reading this... this is YOUR 486 motherboard... the one we played far too much Duke Nukem 3D on ;) ), which had a very large (for the time) 256K of L2 cache. This meant, of course, that the cache was off-die on the motherboard, since no slot 486s were around. An off-die cache like that is slower than one sitting closer to the core, due in part to the fact that the data has farther to travel to get to the processor. Because electrical signals do not travel instantaneously (though it may seem that way when sticking your finger in an electrical outlet - I swear I've never done that!), the latency involved with a cache is also tied to how far its data has to travel. In this case, even if the latency of the cache itself were zero (not possible - you cannot send anything instantaneously), latency would still be incurred because it takes time to transmit the data to the processor.
Another thing must be explained about the latency of a cache. While the latency of the L1 cache is precisely the number given, the latency of the L2 cache isn't just the number of clocks that it takes for the L2 to give the desired information to the core. It is the latency of the L2 cache, plus the latency of the L1. This happens because in most cases (some exceptions include the Itanium, which bypasses the L1 cache for FPU data and goes directly to the L2 instead), the processor has to wait for an L1 miss to occur before searching the L2 for the information. So the effective latency of the L2 is the latency of the L1 cache, plus the time required for an L2 hit. The same accumulation works its way down the memory hierarchy.
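As a rough illustration of how these latencies stack up, here is a minimal Python sketch. The function name and the per-level clock counts are made up for the example (they are not measurements of any real CPU), and it assumes the simple case where each level is searched only after the level above it has missed.

```python
def cumulative_latencies(hit_times):
    """Return the effective latency, in clocks, of a hit at each level.

    hit_times[0] is the time to look up the L1, hit_times[1] the time to
    look up the L2, and so on down the hierarchy. Each level is assumed to
    be searched only after every level above it has missed.
    """
    totals = []
    running = 0
    for clocks in hit_times:
        running += clocks          # pay for this level on top of all earlier misses
        totals.append(running)
    return totals


# Hypothetical lookup costs in clocks: L1, L2, main memory.
levels = ["L1", "L2", "memory"]
for name, total in zip(levels, cumulative_latencies([3, 7, 60])):
    print(f"{name}: {total} clocks")
```

With these invented numbers, a hit in the L2 costs the 3 clocks spent discovering the L1 miss plus the 7 clocks for the L2 itself. A design that bypasses the L1 for some data, as mentioned above, would not pay that first term.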
Socket 5 and 7 motherboards continued to use (and actually popularized) external L2 caches. Slot 1 evolved and continued the use of external caches. Slot 1, Slot 2, and Slot A also allow a "backside bus," which in this case runs at certain fractions of the core clock speed and transmits data more frequently than was possible when the L2 cache was on the motherboard, which was a great boon for performance. Socket 8 (which chronologically should be placed after Socket 7 but before Slot 1, as it housed the Pentium Pro) was an oddball with on-package, but off-die, cache, which is better than merely being on the same daughter card as the CPU, but not as good as being truly integrated onto the same die. It was, in fact, cost prohibitive, which is one of the reasons for the slot architecture.
Those who are halfway observant of the CPU industry have surely noticed the shift from socket to slot, and from slot back to socket with the advent of the Socket 370 Celeron, and subsequently the socketed Pentium IIIs. AMD is doing likewise with their Thunderbird Athlons and their Durons. There's a good reason for this about-face: it's now technically possible and economically desirable to build these processors with the cache on the CPU die, thanks to smaller fabrication processes. When the processes were larger, caches (and everything else, of course) took up more space, drew more power, and generated more heat as well, all of which are detriments to clock speed.
Because process technologies have gotten to the point where it can be commonplace to have an L2 cache on the same die as the processor, many designers have started to do this in practice (the Alpha team put 96K of L2 cache on-die back in a .35 micron process, but they're not in the cut-throat, relatively low-margin consumer PC x86 world). The benefit from this is threefold. First, manufacturers no longer need to purchase expensive SRAMs for their L2 cache, which lowers costs even while taking the larger dies into consideration, since being able to remove the cartridge and get rid of the external SRAMs saves quite a bit. Second, because the L2 cache is now on-die, there is far less distance for the information to travel from the cache to the registers, and thus much lower latency. Third, if the architects feel it worthwhile to dig back into the core and make a wider bus interface (say, from 64 to 256 bits wide), they can massively increase bandwidth. Recall that bandwidth = bus width * transfers/second. I say transfers/second because when something states "300MHz DDR," it is sometimes difficult to determine whether it is a 150MHz clock that has been "double pumped" for 300 mega-transfers per second, or a 300MHz clock at double data rate, meaning 600 mega-transfers per second.
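To make the bandwidth arithmetic concrete, here is a small Python sketch of the formula above. The helper names are hypothetical, and the bus widths and clocks are just the figures from the paragraph plugged in, not the specifications of any particular chip.

```python
def transfers_per_second(clock_hz, transfers_per_clock=1):
    """Transfers per second for a bus clocked at clock_hz.

    transfers_per_clock is 1 for a single data rate bus and 2 for a
    "double pumped" (DDR) bus.
    """
    return clock_hz * transfers_per_clock


def bandwidth_bytes_per_second(bus_width_bits, transfers_per_sec):
    """bandwidth = bus width * transfers/second, converted from bits to bytes."""
    return bus_width_bits * transfers_per_sec / 8


MHZ = 1_000_000

# The two possible readings of "300MHz DDR" on a 64-bit bus.
reading_a = transfers_per_second(150 * MHZ, transfers_per_clock=2)  # 300 mega-transfers/s
reading_b = transfers_per_second(300 * MHZ, transfers_per_clock=2)  # 600 mega-transfers/s

for label, rate in (("150MHz double pumped", reading_a), ("300MHz DDR", reading_b)):
    gb_per_s = bandwidth_bytes_per_second(64, rate) / 1e9
    print(f"{label}: {gb_per_s:.1f} GB/s on a 64-bit bus")

# Widening the bus from 64 to 256 bits at the same transfer rate.
narrow = bandwidth_bytes_per_second(64, reading_a)
wide = bandwidth_bytes_per_second(256, reading_a)
print(f"64 -> 256 bits multiplies bandwidth by {wide / narrow:.0f}")
```

The two readings of the same marketing label differ by a factor of two in bandwidth, which is why quoting transfers per second is less ambiguous than quoting a clock speed.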
This does not, of course, stop anyone from making CPUs with more than two levels of cache. Makers of enterprise-class server chips have done it many times, and the PC market saw it with the advent of the K6-III, where the L2 cache on the motherboard simply became the L3 cache, and was effectively a small bonus. The Itanium (which, it now appears, will not even become a true server chip, as it has basically become a testing platform) uses three levels of cache: two on the CPU, and one on the PCB inside the cartridge. I even recall seeing on the 'net a 4-way Xeon motherboard that had 128MB of L3 cache!
With all these "basics" down, the impact that all of these varying factors have on performance can be taken into account.