Cache and Architecture Terminology (Part 2)
The bandwidth in current caches is expressed in terms of gigabytes per second, or Gbytes/sec. This is calculated by the bus width (for example, 64 bits which is 8 bytes, or 256 bits which is 32 bytes, though many other combinations are possible), times the clock speed, or Clock * (bus width) = bandwidth. This is very important for processors, because, as it has been shown, processors are requiring more and more bandwidth even at the same clock speeds, because they are continuing to be able to crunch through more and more numbers each clock (though this is diminishing in some cases, but that's not the scope of this article).
There are even more variations in L1 cache design, usually between a Harvard Architecture cache, which is a cache that is split between a data cache (also called Dcache, which is the stuff to be computed) and an instruction cache(also called an Icache, which is the information about how the data is to be computed), and a Unified cache, which is where both data and instructions can be stored in the one cache. While a Unified cache might sound better, as a reader pointed out to me, dealing with cache, as stated in the beginning, is a bandwidth issue. I quote from him:
The Harvard architecture is an effort to reduce the number of read ports that a cache needs to have. In other words, it is a bandwidth issue. If we can separate tasks, then they should be separated, so they don't compete for bandwidth. Otherwise, we will have to have a higher bandwidth interface. A separate data and instruction cache is approximately equivalent to a dual(read)-ported unified cache. (the unified cache would still be slower, as the array would be bigger).
So for the sake of performance, the Harvard architecture is better, due to its more efficient use of bandwidth.
Just as pipelining a processor allows it to reach high clock speeds (an excellent discussion on this can be found at http://www.aceshardware.com/Spades/read.php?article_id=50), caches too can be pipelined. What this means is that, assuming a cache hit, the cache can start sending more information from another request before the information from the first request has gotten to the registers of the CPU. This is certainly advantageous because it allows the processor to be designed with higher latencies and reach higher clock speeds, while get maximum bandwidth at the same time. Obviously, a lower latency is more desirable, but it is not always possible depending upon cache size, associativity, and target clock frequency (more on this later).
Most caches today are pipelined, and consequently there are some hazards. One such hazard is a stall on a cache miss. On a cache miss, normally the cache would not be able to continue to send data until the required data was finally received (from a lower level of the memory hierarchy). This is called a blocking cache, but this can be thought of as an in-order CPU design. The solution? A non-blocking cache ;) This can be thought of as an out-of order cache design, because the data doesn't have to arrive at the CPU in the same order that it was requested, just like an out-of-order CPU design does not have to execute instructions upon data in the same order that it was received. This is called a "hit-under-miss" optimization. If this is layered, meaning a cache can send multiple cache hits in the face of a miss, this is called "hit under multiple-miss" or "miss under miss" (Hennessy and Patterson, 414).
Now that some design types have been discussed, the next item on the agenda is how to determine what gets to be in the cache.
As stated before, for highest performance, one would want to design a CPU with caches that allow the most recently used information to be nearest the processor so that it has fastest access to the needed information. Well, let us assume a scenario where the L1 cache (the one closest to the CPU) is full. Now assume that the cache does not contain the data needed. Therefore, the processor goes to the L2 cache. Since the data has been used recently (namely, right now), the CPU assumes it will probably be used again fairly soon, and it wants to copy the data to the L1 cache. Since the L1 cache is already full, how do it decide what to get rid of? There are two options. The processor could randomly boot some information from the cache. Alternatively, going back to the concept of locality of reference, the CPU can do the converse of this, and look for the data that has not been used lately. This latter option is called the least-recently-used (LRU) method. This method, as the name implies, boots out the information which has been needed the least of all the information in the cache.
Newer processors use what is called a victim cache (or alternatively, the victim buffer), which is simply information that was kicked out (the "victim") of the L1 cache. When the LRU kicks out data from the L1 cache, it goes instead to the victim buffer, and then to the L2 cache, so the CPU searches the data that was in the victim-cache, as it is sure to be the most-recently evicted information from the L1 (remember, although it is "older" than the cache in the L1, it is not as old as the data currently in the L2, which is full of information previously evicted, and therefore is more likely to be needed again). It should be noted that this is used only in an exclusive architecture (explained later).
What about when the CPU alters the information that it got from the cache? There are generally two options employed, write-through cache, and write-back cache. The first term means that the information is written both to the cache, and to the lower layers of the memory subsystem, which takes more time. The second term means that the information is written only to the cache, and the modified information is only written to a lower level when it is replacing it. Write-back is faster because it does not have to write to a lower cache, or to main memory, and thus is often the mode of choice for current processors. However, write-through is easier, and better for multiple CPU based systems because all CPU's see the same information. Consider a situation where the CPU's are using the same information for different tasks, but each has different values for that information. That's one reason why write-through is used, since it alleviates this problem by making sure the data seen by all processors remains the same. Write-through is also slower, so for systems that do not need to worry about other CPUs (read: uniprocessor systems), write-back is certainly better.
Furthermore, a feature called a dirty-bit allows fewer writes to be made back to memory, since it doesn't need to write back over the copy in memory if the block is "clean" (unmodified). If the bit is "dirty" (modified), then the write occurs.
Going back to the information about how each larger, slower layer of the memory hierarchy has the information contained within the smaller, faster layer of memory above it, this is not always the case. Most processor families, including Celerons, Pentium IIs and Pentium3s, and the K6-x, do use this general system, which is called an inclusive cache design. Each layer has a copy of the information contained within the layer above it. Notable exceptions are the AMD Thunderbird and Duron families, which are examples of an exclusive cache design, marketed by AMD as their "performance-enhancing cache" design.
Exclusive cache designs mean that the information contained within one layer is not contained within the layer above it. In the Thunderbird and Duron, this means that the information in the L2 cache is not contained within the L1 cache, and vice versa.