Latencies And Bandwidth
Note: Much of this section is adapted from "Computer Organization and Design: The Hardware/Software Interface."
High latencies can adversely affect the rate at which processors execute instructions, and thus performance. Caches greatly alleviate this problem, and they arguably play a larger role in multiprocessor systems than in uniprocessor systems. Back in "The Fundamentals of Cache" (http://www.systemlogic.net/articles/00/10/cache), two general write policies for caches were described (in terms of how they write to memory locations): write-back and write-through. A quick excerpt from that article is reproduced below:
What about when the CPU alters the information that it got from the cache? There are generally two options employed: write-through cache and write-back cache. The first term means that the information is written both to the cache and to the lower layers of the memory subsystem, which takes more time. The second term means that the information is written only to the cache, and the modified information is written to a lower level only when the block is replaced. Write-back is faster because it does not have to write to a lower cache or to main memory, and thus is often the mode of choice for current processors. However, write-through is simpler, and better for multiple-CPU systems because all CPUs see the same information. Consider a situation where the CPUs are using the same information for different tasks, but each has a different value for that information. Write-through alleviates this problem by making sure the data seen by all processors remains the same. Write-through is also slower, so for systems that do not need to worry about other CPUs (read: uniprocessor systems), write-back is certainly better.
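To make the distinction concrete, here is a minimal sketch in C of the two write policies, with a single cache line standing in for a whole cache. The structures and names are invented for illustration and don't model any real processor:

```c
/* A minimal sketch (not any real hardware) contrasting the two write
 * policies. "memory" stands in for the next level down (L2 or DRAM). */
#include <stdio.h>
#include <stdbool.h>

enum policy { WRITE_THROUGH, WRITE_BACK };

struct cache_line {
    int  data;
    bool dirty;   /* only meaningful for write-back */
};

static int memory = 0;                 /* the lower memory level */

static void cpu_write(struct cache_line *line, int value, enum policy p)
{
    line->data = value;                /* the cache always gets the new value */
    if (p == WRITE_THROUGH)
        memory = value;                /* slower: every write also goes down a level */
    else
        line->dirty = true;            /* faster: defer the write until eviction */
}

static void evict(struct cache_line *line)
{
    if (line->dirty) {                 /* write-back pays its cost here */
        memory = line->data;
        line->dirty = false;
    }
}

int main(void)
{
    struct cache_line line = {0, false};

    cpu_write(&line, 42, WRITE_BACK);
    printf("after write-back store: cache=%d memory=%d\n",
           line.data, memory);         /* cache=42, memory=0: memory is stale */

    evict(&line);
    printf("after eviction:         cache=%d memory=%d\n",
           line.data, memory);         /* now memory=42 */
    return 0;
}
```

Note where the staleness appears: between the store and the eviction, the lower level holds an out-of-date value, which is exactly the window that matters once a second processor enters the picture.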
What "The Fundamentals of Cache" (http://www.systemlogic.net/articles/00/10/cache/page2.php) hinted at is something called cache-coherency protocols. This determines what information is in the highest shared-level of the memory hierarchy, and how it is done. The main idea is to make sure that all CPUs have the most up-to-date version of whatever information is being requested. The reason why this can be an issue is that one CPU could be working on the piece of data, and if the cache is write-back, then the version of the data in the highest common memory layer will not be the most current version until it has been updated to main memory, which could be awhile! Certainly one wouldn't want to do work on a data element twice, or worse yet, do inappropriate work on data! This is where these protocols come into place - they make sure that the data stays safe.
The two types of protocols most widely used in shared-bus architectures are write-invalidate and write-update. I'll begin with write-update, because it is very similar to the write-through policy. When a cache-line (or, alternatively, block) is updated in one cache, the changed contents of that block are broadcast over the shared system bus to the other processors.
Each CPU takes note of which block is being updated, and if it holds an (outdated) copy of that block, it rewrites it with the new contents. To do this, each CPU uses something called a snoop port, a port dedicated to watching bus traffic and determining whether the cache's contents need to be brought up to date. This does, of course, mean that a lot of bandwidth is used solely for keeping each processor's cached data as current as possible. With limited bus bandwidth, this can in fact slow down the rate at which the processors compute. (The EV6 bus uses a dedicated 13-bit bus for snooping, and is thus exempt from this main-memory bandwidth contention.) As stated in "The Fundamentals of Cache", latency and bandwidth go hand in hand.
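The following sketch models write-update snooping in miniature: every store is broadcast on a simulated bus, and each other cache's snoop logic refreshes its copy if it holds the same block. The structure is an illustrative simplification, not a real protocol implementation (real protocols track far more state per line):

```c
/* Write-update snooping in miniature: one cache line per CPU, and a
 * "bus broadcast" that every other CPU snoops on each store. */
#include <stdio.h>
#include <stdbool.h>

#define NCPUS 4

struct cache_line {
    bool valid;
    int  addr;     /* which block this line holds */
    int  data;
};

static struct cache_line cache[NCPUS];

/* The bus broadcast: every other CPU snoops the (addr, data) pair. */
static void bus_broadcast(int writer, int addr, int data)
{
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (cpu == writer)
            continue;
        /* the snoop port: does this cache hold a copy of the block? */
        if (cache[cpu].valid && cache[cpu].addr == addr)
            cache[cpu].data = data;    /* write-update: refresh the copy */
    }
}

static void cpu_store(int cpu, int addr, int data)
{
    cache[cpu] = (struct cache_line){ true, addr, data };
    bus_broadcast(cpu, addr, data);    /* this traffic is the bandwidth cost */
}

int main(void)
{
    /* CPUs 0 and 2 both cache block 7 with the value 5. */
    cache[0] = (struct cache_line){ true, 7, 5 };
    cache[2] = (struct cache_line){ true, 7, 5 };

    cpu_store(0, 7, 6);                /* CPU0 updates block 7 */

    printf("CPU2 now sees %d for block 7\n", cache[2].data);   /* 6 */
    return 0;
}
```

Notice that every single store triggers a broadcast, whether or not anyone else actually holds the block; that is the bandwidth price of keeping all copies current.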
Assume the bus for a shared-bus system has a maximum bandwidth of 800Mb/second, and suppose that four processors are connected to this bus. Each CPU effectively has 200Mb/second of bandwidth, assuming each is running the same type of task. If each CPU could use up 300Mb/second of bandwidth, then bandwidth becomes the constraining factor in improving performance (if this doesn't make sense, think about using a 56k modem and downloading four very large files at the same time from fast servers). Next, consider a program that not only uses 300Mb/second of bandwidth, but also updates its cache-lines frequently, broadcasting the new data each time. In addition to the bandwidth used by the program itself, the contents of the blocks (or cache-lines) are sent back out across the bus, and if the block size is large, broadcasting the data will require enormous bandwidth. In cases such as these, the frequent updating done in the write-update scheme (between a processor's cache and main memory) exacerbates the bandwidth problem. This, in turn, means that each processor has less bandwidth available for actual computing tasks, and the rate of execution stagnates. A write-update scheme used between the L1 and L2 caches on a single processor (assuming an inclusive design), though slower for the L1 cache, actually makes snooping easier: anything in the L2 cache is guaranteed to be the most up-to-date information, so other processors only have to snoop the L2 cache and never need to probe the L1.
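A quick back-of-the-envelope calculation makes the constraint visible. The 50Mb/second figure for per-CPU broadcast traffic is an assumption chosen purely for illustration, not a measured number:

```c
/* Back-of-the-envelope numbers for the scenario above; all figures are
 * hypothetical, matching the article's 800Mb/s, 4-CPU example. */
#include <stdio.h>

int main(void)
{
    double bus_bw  = 800.0;   /* total shared-bus bandwidth, Mb/s           */
    int    ncpus   = 4;
    double demand  = 300.0;   /* what each CPU's program could use, Mb/s    */
    double updates = 50.0;    /* assumed write-update broadcast traffic/CPU */

    double fair_share = bus_bw / ncpus;        /* 200 Mb/s each             */
    double useful     = fair_share - updates;  /* what's left for real work */

    printf("fair share per CPU: %.0f Mb/s (demand is %.0f Mb/s)\n",
           fair_share, demand);
    printf("after %.0f Mb/s of update broadcasts: %.0f Mb/s for computation\n",
           updates, useful);
    return 0;
}
```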