Latencies And Bandwidth Cont.
For a familiar example of a bandwidth-intensive program, think of Seti@Home. The older clients in particular (excluding version 3.0, which was small-cache friendly) were large enough that they did not fit inside the L2 caches of most processors (excluding 2MB Xeons). Because of this, the vast majority of the program had to be constantly fetched from main memory. And, due to the shared-bus architecture, if there were two processors, each would eat into the other's resources (i.e., memory bandwidth). To give an example of a better way of sharing the load, here is a post I made to the AnandTech forums a while back, where I use the pseudonym "BurntKooshie" (long story, don't ask). It discusses two Distributed Computing projects, RC5 (www.distributed.net) and Seti@Home (http://setiathome.ssl.berkeley.edu/). The following is the post, edited for spelling and grammar:
You see, with dual motherboards (you are obviously using an Intel machine, because the dual Athlon chipsets aren't out yet), the CPUs share the bandwidth of the system. The more bandwidth and the lower the latencies a program requires, the more diminishing the returns you'll see from running two instances of S@H. RC5 causes nearly no bandwidth overhead, and so runs on multiprocessors nearly flawlessly, with nearly X times as much work done in the same amount of time, where X is the number of processors.
S@H is (or at least, was) a different beast. With dual Intel motherboards, the CPUs share the bandwidth of the memory. Prior clients used a lot of bandwidth, and liked low latencies. When two (or more) instances of S@H were run, this caused the CPUs to fight for bandwidth. If you are familiar with economics (I can't believe how handy this course is coming in :D ), you can consider memory bandwidth, at least on an Intel platform, to be a rival good, whereby if one CPU is consuming it, it detracts from the other CPU's ability to consume it. This means that the output is nowhere near X times as much work, where X is the number of processors in the system.
HOWEVER, the new clients are supposedly more bandwidth/latency friendly (though not nearly to the point of being like RC5, but that is intrinsic to S@H's nature), which reduces, at least to some degree, the amount of bandwidth required. Also, CPUs with larger caches (read: Xeons, UltraSparcs, PA-RISC chips, etc.) tend to cope better with limited bandwidth, as they can rely on their larger L2 caches rather than on system bandwidth (to some extent).
SO, if you have a dual system, with my current understanding of S@H, if I were to participate in both contests, I would put one CPU on S@H, and one CPU on RC5. This will allow you to get full S@H power from one CPU, and full RC5 power from the other. IMO, this makes the most sense when looking at efficiency, and wanting to participate in both contests using multiple machines.
What I mean by that is, take a person with 2 dual systems. To participate in both contests (within this limited example), there are two options: 1) To have one dual system run RC5 on both CPUs, and one run S@H on both CPUs, or 2) To have each dual system run one instance of RC5, and one instance of S@H.
The latter leads to the same production of RC5 (because RC5 needs almost no bandwidth or low-latency memory), HOWEVER, it would at the same time yield MORE S@H units than if one were to employ the former option.
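The comparison in the post can be put into rough numbers. The following is only a sketch: the function name `total_output`, the project labels, and above all the contention factor of 0.75 are assumptions made up for illustration, not measurements. It assumes RC5 scales perfectly and that S@H slows down only when two S@H instances share one memory bus.

```python
def total_output(systems, sah_contention=0.75):
    """Return (rc5_units, sah_units) for a list of dual systems.

    Each system is a tuple of two project names, one per CPU.
    sah_contention is a HYPOTHETICAL value: the fraction of full speed
    each S@H instance reaches when both CPUs on one bus run S@H.
    """
    rc5 = 0.0
    sah = 0.0
    for system in systems:
        sah_count = system.count("sah")
        # S@H is only slowed when it competes with another S@H instance;
        # RC5 is assumed to use almost no memory bandwidth.
        per_sah = sah_contention if sah_count > 1 else 1.0
        sah += sah_count * per_sah
        rc5 += system.count("rc5") * 1.0  # assumed perfect scaling
    return rc5, sah


# Option 1: one machine all RC5, one machine all S@H
option_1 = [("rc5", "rc5"), ("sah", "sah")]
# Option 2: every machine runs one RC5 and one S@H
option_2 = [("rc5", "sah"), ("rc5", "sah")]

print(total_output(option_1))  # (2.0, 1.5)
print(total_output(option_2))  # (2.0, 2.0)
```

Whatever the real contention factor is, as long as it is below 1.0 the second option produces the same RC5 output and more S@H output.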
Many scientific applications (most Distributed Computing Projects fit into this category) behave in a similar fashion: they are too large to fit into many caches, and so make heavy use of main memory bandwidth. If the processors also use write-update from the L2 cache to main memory, they only make the situation worse.
So, how does one work around this? Take the previous write-update scheme, and instead of updating the contents of the cache-line, why not just tell all the other processors that the information contained within that block is out of date? That's exactly what the write-invalidate protocol does. While this protocol still requires information to be broadcast over the system bus, write-invalidate uses bandwidth in shared-bus architectures more frugally than write-update. This is because the CPU only needs to broadcast which cache-line must be marked as invalid, instead of the contents of a whole cache-line; the other CPUs then don't try to use their copy, because they know it is out of date. Write-invalidate is in the same spirit as write-back transfers (save the bandwidth for what's really needed), which matters especially when there are many processors on a bus.
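To make the difference in bus traffic concrete, here is a minimal sketch that counts the bytes one CPU puts on the shared bus when it writes repeatedly to a cache-line that other CPUs also hold. The line size, the size of the address broadcast, and the function name `bus_bytes` are assumptions chosen for illustration; real protocols differ in detail. It follows the description above, in which write-update sends the whole line on every write, and it assumes no other CPU re-reads the line between writes.

```python
CACHE_LINE_BYTES = 32  # assumed line size, for illustration only
ADDRESS_BYTES = 4      # assumed size of the address/command broadcast


def bus_bytes(writes, protocol):
    """Bytes broadcast on the shared bus for `writes` stores by one CPU
    to a single cache-line that other CPUs have cached."""
    if protocol == "write-update":
        # Every write sends the address plus the new line contents,
        # so the other caches can refresh their copies.
        return writes * (ADDRESS_BYTES + CACHE_LINE_BYTES)
    if protocol == "write-invalidate":
        # Only the first write broadcasts an invalidation. After that,
        # no other cache holds a valid copy, so later writes stay local.
        return ADDRESS_BYTES if writes > 0 else 0
    raise ValueError("unknown protocol: " + protocol)


for protocol in ("write-update", "write-invalidate"):
    print(protocol, bus_bytes(10, protocol))
# write-update 360
# write-invalidate 4
```

The gap grows with every additional write to the same line, which is why the savings are most visible when several processors share one bus.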
Rather than use write-invalidate, one could increase the bandwidth of the main bus, but that isn't a cheap option. One could either widen the bus or implement something like RDRAM with several channels. Every additional bit of bus width requires another trace on the motherboard. Even with RDRAM, the traces are more sensitive to their placement, so designing motherboards for use with RDRAM is more difficult. Either method means additional costs.
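The trade-off between a wider bus and more channels comes down to simple arithmetic: peak bandwidth is bus width times transfer rate times the number of channels. The sketch below is only illustrative; the function name `peak_bandwidth_mb_s` is made up here, and the PC100 SDRAM and PC800 RDRAM figures are common examples of the period rather than numbers taken from this article.

```python
def peak_bandwidth_mb_s(bus_width_bits, mega_transfers_per_s, channels=1):
    """Theoretical peak bandwidth in MB/s (ignores latency and overhead)."""
    return bus_width_bits / 8 * mega_transfers_per_s * channels


# 64-bit PC100 SDRAM: wide bus, many traces, modest transfer rate
print(peak_bandwidth_mb_s(64, 100))              # 800.0
# 16-bit PC800 RDRAM: narrow bus, high transfer rate
print(peak_bandwidth_mb_s(16, 800))              # 1600.0
# Two RDRAM channels double the peak, at the cost of more routing work
print(peak_bandwidth_mb_s(16, 800, channels=2))  # 3200.0
```

These are theoretical peaks only; the bandwidth a program actually sees is lower.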
However, there is one location in a computer system where additional wires for increased bandwidth are comparatively cheap: on the processor itself. Because everything there sits so close together, latencies are comparatively low as well. Moore's law continues to push on, despite the nay-sayers for whom process technology is always on the verge of reaching its endpoint. With more transistors at engineers' disposal, more and more functional units are thrown on, and larger caches are employed, but there are diminishing returns for the continuing use of this strategy. To quote an Intel engineer (taken from "The Register"), "The low hanging fruit is all gone. Now we have to build scaffolds around the tree. We'll stand on our head and do strange things for a little more performance." Intel engineers aren't the only ones looking for innovative technologies to speed things up. More functional units aren't showing the same amount of performance scaling as they initially did (and how could they?), and memory latencies, from the processor's perspective, are getting longer. New ways of extracting performance need to be found.