As was seen with the Mendocino Celerons (the incarnation of the Celeron which featured 128k of L2 cache at full core speed, and approached the performance of the Pentium II in many situations), speed can often make up for size. Intel used this same idea with the Coppermine P3. With its introduction, Intel caught up with, and in some cases surpassed, the Athlon in performance. This performance increase at the same clock speed over a Katmai P3 is almost exclusively due to the addition of the L2 cache onto the die of the core. It has phenomenally low L2 latency for such high clock rates - 7 cycles (3 for an L1 miss, 4 for an L2 hit)! Even better, when Intel engineers opened up the (aging) P6 core, they increased the L2 bus width from 64 bits to 256 bits, and had the L2 cache operate at core speed, double that of the Katmai.
However, while the L2 cache's frequency was doubled and its bus width quadrupled, the net bandwidth is not eight times that of the Katmai P3, but "merely" four times. While it is true that doubling the frequency and quadrupling the bus width would by themselves mean an eightfold increase in bandwidth, Intel made it so that the L2 cache does not send data on every cycle. Yet it is still full speed. Take, for example, the K6 family: its FPU ran at "full speed", just as the FPU of the P6 cores did. However, it had an average latency of two cycles for most floating point operations, as it wasn't pipelined. This is a similar situation, where the L2 cache runs at full speed, but does not send information every cycle.
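The bandwidth arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `relative_l2_bandwidth` and the `transfers_per_cycle` parameter are my own labels, and the value of 1/2 is simply what the eightfold and fourfold figures imply (a transfer on every other cycle, on average), not a number taken from an Intel document.

```python
from fractions import Fraction


def relative_l2_bandwidth(clock_ratio, width_ratio, transfers_per_cycle=1):
    """L2 bandwidth relative to a baseline that transfers on every cycle."""
    return Fraction(clock_ratio) * Fraction(width_ratio) * Fraction(transfers_per_cycle)


# Coppermine vs. Katmai: L2 clock doubled, bus widened from 64 to 256 bits.
width_ratio = Fraction(256, 64)

# If the L2 sent data on every cycle, the gain would be eightfold.
theoretical = relative_l2_bandwidth(2, width_ratio)

# Assumption: data moves on every other cycle (implied by 8x dropping to 4x).
actual = relative_l2_bandwidth(2, width_ratio, Fraction(1, 2))

print(f"theoretical: {theoretical}x, actual: {actual}x")
```

The clock still counts as "full speed" in both cases; only the fraction of cycles that carry data changes.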
In an attempt to regain some of the ground they had lost to the P3, AMD introduced the "Thunderbird" Athlon, which has a "Performance-Enhancing Cache," a phrase that sounds suspiciously similar to the "Advanced Transfer Cache" buzzword that Intel coined for the P3 Coppermine.
Nearly everyone was anticipating that the L2 cache on the new Athlon would be markedly similar to that of the P3 Coppermine: it was expected to have a 256-bit-wide bus and perhaps eight-way associativity, both like the Coppermine, and lower latencies, though no one expected them to be quite as low as the P3's. They also expected it to be exclusive. AMD surprised everyone in two ways. One, the cache is 16-way associative (higher hit rates) as opposed to the 8-way expected. Together with the large L1 cache and the fact that the L2 is exclusive, this makes for rather high hit rates (think of it as a 6-way 384kb L1.5 cache - this is a good thing). Two, it came out with merely a 64-bit L2 bus, meaning that unlike the P3 Coppermine, whose L2-to-processor bandwidth increased fourfold, the new Athlon's bandwidth increased by "only" a factor of between two and three (twofold over the versions with half-speed L2 cache, threefold over the versions with one-third-speed L2 cache). Lastly, the latency of the Athlon did indeed decrease from 24 cycles to 11, but in some situations (see here), its latency shoots up to 20 cycles. The reason for this was as follows (from here):
...the AMD Athlon processor's L1 cache is capable of efficiently handling most requests for data. As a result, the victim buffer can be drained during idle cycles to the L2 cache interface.
The reason that the L2 latency can shoot up to 20 cycles is that, if the victim buffer (victim cache) is full, it has to send its contents to the L2 cache before the L1 reads from the L2 (because the information contained in the victim buffer could be what is being requested), and the time required to send the data from the buffer to the L2 cache is 8 cycles (the same as it is for going from the L2 to the L1). Once the data is in the L2 cache, there is a total turnaround of about 4 cycles, plus the 8 cycles to go back from the L2 cache to the L1. 8 + 4 + 8 = 20 cycles, which is exactly what the Cachemem utility reported in the above link at Aceshardware when the memory footprint grew larger than the L1 cache but stayed smaller than the L2 cache.
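The two latency cases can be written out as a small Python sketch. The constant and function names are made up for illustration, the cycle counts are the ones quoted in the text, and the best case uses the quoted 11 cycles directly, since the "about 4 cycle" turnaround is too approximate to rebuild that figure from its parts.

```python
# Cycle counts as quoted in the text; the turnaround is only approximate.
VICTIM_DRAIN_CYCLES = 8    # victim buffer -> L2
L2_TURNAROUND_CYCLES = 4   # "about" 4 cycles inside the L2
L2_TO_L1_CYCLES = 8        # L2 -> L1
BEST_CASE_CYCLES = 11      # quoted latency when no drain is needed


def l2_latency(victim_buffer_full: bool) -> int:
    """Rough L2 latency in cycles for the two cases described above."""
    if not victim_buffer_full:
        # No drain needed: use the quoted figure rather than a sum.
        return BEST_CASE_CYCLES
    # The drain has to finish before the L2 read can start.
    return VICTIM_DRAIN_CYCLES + L2_TURNAROUND_CYCLES + L2_TO_L1_CYCLES


for full in (False, True):
    print(f"victim buffer full={full}: {l2_latency(full)} cycles")
```

This is a simplified model of the worst case only; it says nothing about how often the victim buffer is actually full.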
I lurk around the technical forums at Aceshardware a lot, and many people there were quite shocked by the meager 64-bit L2 bus. AMD seems to have seen these reactions, as they released the above PDF file, which states practically nothing more than why they chose the 64-bit bus and why the latency is in some cases more than 11 cycles (it has plenty of other information, but nothing that couldn't be found elsewhere).
Because the L1 cache of the Athlon already has such a high hit rate, the amount of bandwidth required of the L2 cache wasn't enough to warrant the time required to widen the bus any further. Time to market is a very important concept, one which AMD seems to have stolen from Intel as of late (Intel has seemingly forgotten this concept, but that's another one of my digressions ;) ). There is another reason that many were surprised that the L2 bus width wasn't increased: while AMD may state that the large L1 de-emphasizes the need for such a high-bandwidth L2, being exclusive in nature means that there is more traffic going over the bus because of the cache evictions. That is, the L1 has to make sure that the L2 doesn't hold the same information, and it goes over the L2 bus in order to do that, thus increasing bus congestion.
Even though the P3 Coppermine doesn't use an exclusive cache, it is only wasting 32kb of cache right now (one fourth the amount that the Athlon would be wasting if it were inclusive), and it doesn't need the bandwidth for handling cache evictions. Instead, it needs very high bandwidth because its L1 needs to be refilled quickly, and often, due to its smaller size.
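The capacity bookkeeping behind these figures can be sketched in Python. The helper names are hypothetical, and the model is deliberately crude: it treats a non-exclusive L2 as fully inclusive (every L1 line is also held in the L2), which is how the 32kb figure above is counted. The sizes are 32 KB of L1 and 256 KB of L2 for the Coppermine, and 128 KB of L1 and 256 KB of L2 for the Thunderbird.

```python
def duplicated_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """KB of L2 spent holding copies of lines that are already in the L1."""
    # Simplification: a non-exclusive L2 is assumed to mirror the whole L1.
    return 0 if exclusive else min(l1_kb, l2_kb)


def unique_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """KB of distinct data the L1 + L2 pair can hold at once."""
    return l1_kb + l2_kb - duplicated_kb(l1_kb, l2_kb, exclusive)


# Cache sizes in KB.
P3_L1, P3_L2 = 32, 256
ATHLON_L1, ATHLON_L2 = 128, 256

print("Athlon, exclusive, unique KB:", unique_kb(ATHLON_L1, ATHLON_L2, exclusive=True))
print("P3, inclusive, duplicated KB:", duplicated_kb(P3_L1, P3_L2, exclusive=False))
print("Athlon if inclusive, duplicated KB:", duplicated_kb(ATHLON_L1, ATHLON_L2, exclusive=False))
```

Under these assumptions the exclusive Athlon holds 384 KB of distinct data, while an inclusive design would duplicate 128 KB on the Athlon against only 32 KB on the P3.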
I found a rather interesting quote by Paul DeMone over here, which also helps to explain the situation:
The EV5 [Alpha 21164] has a 96 KB L2 cache on-chip. The EV6 [Alpha 21264] only has one level of cache [on-die]. When you have one level of on-chip cache you have to make it as big and associative as possible while not letting it hurt clock rate (going off chip is murder on performance). So it is a compromise. But with two levels of on-chip cache you make the L2 as big and associative as possible and make low latency the number one priority for the L1.
Processors that have "large" 3 clock L1s along with on-chip L2's (EV7, K7 t-bird) got that way because the L2 caches were add-ons in subsequent chip versions of the core and because of time to market concerns it wasn't worth opening up the CPU pipeline and layout in order to go with a more optimal 2 level cache arrangement.
Dirk Meyer, the head architect of the EV6, also happened to subsequently become an employee of AMD, and the head architect of the K7 (the Athlon). This helps to explain some of the similarities in cache design between the Alpha 21264 and the Athlon.