* Note: Here's a list of corrections made to this article:
- Revised the definition of the LRU
- Expanded upon Unified cache v. Harvard architecture
- Introduced the terms "Icache" and "Dcache"
- Revised the definition of the cache-line
- Redid the exclusive cache diagram
- Explained the exclusive relationships
- Corrected the formula for die yields
I would like to thank Tom McFadden (mechBgon of the AnandTech forums) for the numerous hours he spent (during his vacation!) helping me edit this article into what it is now.
In early PCs, the various components had one thing in common: they were all really slow :^). The processor was running at 8 MHz or less, and taking many clock cycles to get anything done. It wasn't very often that the processor would be held up waiting for the system memory, because even though the memory was slow, the processor wasn't a speed demon either.
That's a quote from PCGuide.
My, how times have changed. Now practically every chip is starved for memory bandwidth. From the Athlon to the PIII, the Celeron to the Duron, almost all current microprocessors are outstripping the resources their memory subsystems provide. If not for the use of special high-bandwidth memory reserves called cache, both on and off the chip, these powerful processors would give pathetic performance, simply because getting data to process, and instructions with which to process it, would be bottlenecked by the relatively snail-paced capability of the system's memory, or RAM. It's no accident that modern mainstream microprocessors have cache, even multiple layers of it. We have all heard the term tossed around in relation to our CPUs, but there's a lot more to know about cache than simply the size.
Cache is an expensive type of memory, usually made out of SRAM (static RAM), and it is distinctly different from the main memory that we've come to know (such as EDO, SDRAM, DDR SDRAM, DRDRAM, etc.): it uses more transistors for each bit of information; it draws more power because of this; and it takes up more space for the very same reason.
What makes SRAM different? Well, first it must be noted that regular DRAM must be periodically "refreshed," because the electrical charge of the DRAM cells decays with time, losing the data. SRAM, on the other hand, does not suffer this electrical decay. It uses more transistors per bit, which lets each cell hold its data for as long as power is supplied, with no refresh needed. SRAMs also have lower latencies (the amount of time that it takes to get information to the processor after being called upon).
Cache and Architecture Terminology (Part 1)
Before we delve into some of the basics of cache, let us get a simple understanding of the whole memory subsystem, known as a memory hierarchy. This term refers to the fact that most computer systems have multiple levels of memory, each level commonly being of a different size and a different speed. The fastest cache is closest to the CPU (in modern processors, on the die), and each subsequent layer gets slower, farther from the processor, and (generally) larger. One of the reasons why each progressively lower level of the memory hierarchy must be larger (we'll look at some exceptions later) is that each layer keeps a copy of the information that is in the smaller/faster layer above it. What this means is that the hard drive holds the information that's in the RAM, which holds the information that is in the cache, and if there are multiple layers of cache, this process keeps going. The reason for this is explained later. A graphical representation is shown below:
Diagram showing how each level's information is stored in a layer below it in a traditional memory hierarchy
In a typical machine, L1 means the first level of cache (smallest and fastest), L2 is the second level of cache (larger, and slower), RAM is the main memory (much larger, and much slower still), and then the hard drive, incredibly slow and incredibly large in comparison to the other layers.
Before one can understand the performance difference that cache can make (aside from disabling it in the BIOS, and watching your precious Q3 or UT framerates drop to single digits, thus realizing that cache is important :P ), one must understand how it works.
However, there are times when a lower level does not hold a copy of everything in the levels above it, such as when the computer is creating and manipulating large amounts of data rather than just reading it from the disk. An example of this would be scientific calculations that rarely write to the disk: the data in the L1, L2, and main memory might not make it all the way to the hard drive, except when the results of all the computations need to be saved.
More than one reader pointed out that my previous definition of the cache-line was incorrect:
[A] cache-line is the amount of data transferred between the main memory and the cache by a cache-line fill or write-back operation.
The size of the cache-line takes advantage of the principle called spatial locality, which states that code and data that sit close together in memory are likely to be used together. Therefore, the larger the cache-line size, the more neighboring (and therefore likely related) information is brought into the cache at any one time. The CPU only requests a small piece of information, but it will get whatever other information is contained within the cache-line. If the cache is large enough, then it can easily contain the information within a large cache-line. However, if the cache is too small in comparison to the cache-line size, it can reduce performance (because sometimes irrelevant information is in the cache-line, and takes up valuable space).
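To make the cache-line idea a little more concrete, here is a minimal Python sketch. The 64-byte line size and the function names are assumptions picked for illustration (they are not taken from any particular processor), and the model simply rounds an address down to the start of its line.

```python
LINE_SIZE = 64  # bytes per cache-line; an assumed value for illustration only


def line_base(address, line_size=LINE_SIZE):
    """Return the address of the first byte of the cache-line holding `address`."""
    return address - (address % line_size)


def line_span(address, line_size=LINE_SIZE):
    """Return (first, last) byte addresses brought in when `address` is requested."""
    base = line_base(address, line_size)
    return base, base + line_size - 1


def lines_touched(addresses, line_size=LINE_SIZE):
    """Count how many distinct cache-lines a sequence of byte addresses falls into."""
    return len({line_base(a, line_size) for a in addresses})


if __name__ == "__main__":
    first, last = line_span(0x1234)
    print(hex(first), hex(last))
    # 64 neighboring 4-byte words versus 64 words spread 64 bytes apart
    print(lines_touched(range(0, 256, 4)))
    print(lines_touched(range(0, 64 * 64, 64)))
```

Under these assumptions, 64 words that sit next to each other fall into only a handful of lines, while 64 words spaced a full line apart each need a line of their own. That is spatial locality working for, or against, the cache.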
Latency refers to the time it takes a task to be accomplished, expressed in clock cycles from the perspective of the device's clock. For instance, 100 MHz SDRAM with a latency of 9 cycles, paired with a 1 GHz CPU, means a latency of 90 cycles to the CPU, because each memory cycle lasts as long as ten CPU cycles! In the case of cache and memory, it refers to the amount of time that it takes for the cache (or memory) to send data.
A cache hit refers to an occurrence when the CPU asks for information from the cache, and gets it. Likewise, a cache miss is an occurrence when the CPU asks for information from the cache, and does not get it from that level. From this, we can derive the hit rate, or the percentage of cache accesses that result in a cache hit.
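The arithmetic behind that example can be sketched in a few lines of Python. The function names are made up for illustration; the numbers are the ones used above (9 cycles at 100 MHz, seen from a 1 GHz CPU).

```python
def latency_ns(device_cycles, device_mhz):
    """Convert a latency in device clock cycles into nanoseconds."""
    # One cycle at f MHz lasts 1000 / f nanoseconds.
    return device_cycles * 1000.0 / device_mhz


def latency_in_cpu_cycles(device_cycles, device_mhz, cpu_mhz):
    """Express a device latency as the number of CPU cycles that pass meanwhile."""
    return latency_ns(device_cycles, device_mhz) * cpu_mhz / 1000.0


if __name__ == "__main__":
    print(latency_ns(9, 100))                   # nanoseconds
    print(latency_in_cpu_cycles(9, 100, 1000))  # CPU cycles at 1 GHz
```

The same 9-cycle memory latency costs more and more CPU cycles as the processor clock rises, which is exactly why fast chips lean so heavily on cache.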
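As a small illustration, the hit rate is just hits divided by total accesses. The helper below is a hypothetical sketch, not part of any real tool, and the counts in the usage lines are invented purely to show the calculation.

```python
def hit_rate(hits, misses):
    """Return the fraction of cache accesses that were hits (0.0 to 1.0)."""
    total = hits + misses
    if total == 0:
        raise ValueError("no accesses recorded")
    return hits / total


if __name__ == "__main__":
    # Made-up counts: 90 hits and 10 misses out of 100 accesses.
    print(f"{hit_rate(90, 10):.0%}")
```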
So enough already, you say, what's the bottom line? How is cache helping my framerate in Quake 3 Arena? Well, it turns out that there are still some more important concepts that we have to cover first. Firstly, there's a term in computer science / computer architecture known as locality of reference: Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends [about] 90% of its execution time in only [about] 10% of the code (Hennessy and Patterson, 38). Well, if a processor is only using 10% of the code most of the time, why not keep that information really close to the processor so that it can get access to that information the fastest? That's exactly what cache is used for.
It is not that simple, though. Cache can be designed in many different ways. The main designs are described below, as defined by PCGuide:
Direct Mapped Cache: Each memory location is mapped to a single cache line that it shares with many others; only one of the many addresses that share this line can use it at a given time. This is the simplest technique both in concept and in implementation. Using this cache means the circuitry to check for hits is fast and easy to design, but the hit ratio is relatively poor compared to the other designs because of its inflexibility. Motherboard-based system caches are typically direct mapped.
Fully Associative Cache: Any memory location can be cached in any cache line. This is the most complex technique and requires sophisticated search algorithms when checking for a hit. It can lead to the whole cache being slowed down because of this, but it offers the best theoretical hit ratio since there are so many options for caching any memory address.
N-Way Set Associative Cache: "N" is typically 2, 4, 8 etc. A compromise between the two previous designs, the cache is broken into sets of "N" lines each, and any memory address can be cached in any of those "N" lines. This improves hit ratios over the direct mapped cache, but without incurring a severe search penalty (since "N" is kept small). The 2-way or 4-way set associative cache is common in processor level 1 caches.
As a rule of thumb, the more "associative" a cache is, the higher the hit rate, but the slower it is (thus higher latencies result as the cache is ramped up to high clock speeds). However, the larger the cache, the less of an impact associativity has upon hit rate. While associativity (read: complexity) plays a role in the speed of a cache, and thus the latencies, so does size. Given the same latencies, the more associative a cache is, generally the higher the hit rate and the better the performance. However, the larger the cache, the more difficult it is to get it to reach both high clock speeds and low latencies. Clock speed is very important in terms of bandwidth, and latencies and bandwidth go hand in hand.
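The three designs above can be sketched with one toy Python model. Everything here is an illustrative assumption rather than a description of real hardware: the class name, the 64-byte line size, the 4-line cache, and especially the use of LRU (least recently used) replacement within a set, which the definitions above do not specify. Setting `ways` to 1 gives a direct mapped cache, and setting it to the total number of lines gives a fully associative one.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy cache model: `num_lines` lines grouped into sets of `ways` lines.

    ways == 1         -> direct mapped
    ways == num_lines -> fully associative
    Replacement inside a set is LRU (an assumption made for this sketch).
    """

    def __init__(self, num_lines, ways, line_size=64):
        if num_lines % ways != 0:
            raise ValueError("num_lines must be a multiple of ways")
        self.ways = ways
        self.line_size = line_size
        self.num_sets = num_lines // ways
        # One ordered dict per set, keyed by tag, oldest entry first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up a byte address; return True on a hit, False on a miss."""
        line_number = address // self.line_size
        set_index = line_number % self.num_sets
        tag = line_number // self.num_sets
        lines = self.sets[set_index]
        if tag in lines:
            lines.move_to_end(tag)     # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = True
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def run(ways, trace, num_lines=4):
    """Replay a list of addresses through a fresh cache and return its hit rate."""
    cache = SetAssociativeCache(num_lines=num_lines, ways=ways)
    for address in trace:
        cache.access(address)
    return cache.hit_rate()


if __name__ == "__main__":
    # Two addresses that compete for the same line in the direct mapped case.
    trace = [0, 256] * 4
    for ways in (1, 2, 4):
        print(ways, run(ways, trace))
```

With this hand-picked trace, the direct mapped cache keeps evicting one address to make room for the other, while the 2-way and fully associative versions can hold both at once. Real workloads are far less tidy, so treat this as a picture of the idea and not a benchmark.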