NOTE: This article intentionally ignores the discussion of a NUMA architecture. It is aimed at smaller-scale multi-processing, which is used as a lead-in to the various forms of on-chip multithreading.
I remember, quite distinctly, my first encounter with someone who had a multi-processor system as his home machine. "Bob" (a fake name to hide his identity) claimed that he owned a 466MHz computer when, at the time, the top-of-the-line Intel machine was a Pentium II 450. I told him that no, he didn't have a 466MHz machine, and since "Bob" didn't know about Alpha processors and had never heard of "overclocking" before, neither could explain the claim. Then it hit me: he had a dual Pentium II 233MHz rig. When I asked him if this was the case, he confirmed it, adding that two 233MHz chips are comparable to a 466MHz machine.
What "Bob" didn't realize was that programs have to be written in a special way (called multithreading) in order to take advantage of multiple processors. Nor did he realize that he needed to be running an OS that supports multiple processors, such as Windows NT or a *nix (any sort of Unix variant). Versions of the MacOS before OSX had only a crude form of multiprocessing support (OSX is based on BSD, a *nix variant, and has full-fledged multiprocessing support). Generally speaking, the computer needs OS support as well as application support to take advantage of multiprocessing.
The other thing "Bob" didn't realize is that for the vast majority of current consumer applications, multithreading support, where it exists at all, rarely speeds up the completion of a task in proportion to the number of processors in the system. That is not to say an application cannot scale linearly with the number of processors - indeed, some applications can even scale superlinearly, though these are few and far between.
Assuming that an application is multithreaded (that is, written to execute in a parallel manner rather than a serial, purely sequential one), there are still inherent difficulties in making it run faster in proportion to the number of processors: the program's work must actually be parallelizable, and the program itself must be resource friendly.
This is explained by Amdahl's Law: "…the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used" (Hennessy and Patterson, 29). This law applies to more than just converting sequential code to parallel code.
Assume that a program consists of two main parts, A and B, and that each can be optimized. Part A represents 90% of the execution time, and B represents the remaining 10%. Assume that B can be optimized so as to finish in one tenth the time of the original version, and that A can be optimized so as to complete in 2/3 the time it previously needed. If both optimizations require the same amount of effort, and the programmer has time to optimize only one part, which should the programmer work on? Obviously, he/she should work on part A.
Just to walk through the thinking involved, assume that this program requires 100 seconds to complete. Part A consumes 90 seconds of execution time, and B requires 10 seconds of execution time. After optimization, part A would take 60 seconds, and B a mere 1 second. The choice is between a total of 70 seconds of execution time if A is optimized, and 91 seconds if B is optimized. Though this article is not intended to be about software engineering, if one looks at the payoff for where to put programming effort, the choice should be obvious.
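To make the arithmetic concrete, here is a quick sketch in Python (the 100-second program and the two optimization factors are the ones from the example above):

```python
# The example program's runtime, split into its two parts.
part_a = 90.0  # seconds (90% of execution time)
part_b = 10.0  # seconds (the remaining 10%)

# Option 1: optimize A to run in 2/3 of its original time.
time_if_a_optimized = part_a * 2.0 / 3.0 + part_b

# Option 2: optimize B to run in 1/10 of its original time.
time_if_b_optimized = part_a + part_b / 10.0

print(time_if_a_optimized)  # 70.0
print(time_if_b_optimized)  # 91.0
```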
So where do multiprocessor-based systems fit in here? Multithreaded programs must be written to be as parallel as possible. The same reasoning about speeding up parts of a program to increase performance applies to multithreading. To achieve a linear speedup (a performance increase of X due to having X processors), generally all of the code must be written in a parallel manner (generally, and not always, due to other circumstances discussed later). The mathematical representation of Amdahl's law with respect to parallel computing follows:
Adapted from "Computer Architecture: a Quantitative Approach," second edition, page 645.
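In its standard textbook form for parallel execution, the law can be written as:

```latex
\mathrm{Speedup}_{\mathrm{overall}} =
  \frac{1}{\left(1 - \mathrm{Fraction}_{\mathrm{parallel}}\right)
         + \dfrac{\mathrm{Fraction}_{\mathrm{parallel}}}{N}}
```

where Fraction_parallel is the fraction of execution time spent in code that can run in parallel, and N is the number of processors.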
And even this formula doesn't tell the whole story. It assumes that the parallel code, per processor, is just as fast as the sequential code, which may or may not be the case. It also assumes that there is no overhead from communication latencies, which is, of course, not the case. Few commercial consumer applications come close to perfectly parallel code. The only way a superlinear performance increase occurs is when the cache added by each additional processor dramatically reduces main-memory bandwidth usage (which is not common).
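As a rough illustration, the idealized form of Amdahl's law can be evaluated in a few lines of Python. This is only a sketch: it deliberately ignores the caveats just discussed (communication latency, contention, and any per-processor slowdown of the parallel code).

```python
# Idealized Amdahl's-law speedup for a program whose fraction f of
# execution time is parallelizable, run on n processors.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Even 95% parallel code falls well short of linear scaling:
for n in (2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.95, n), 2))
# 2 1.9
# 4 3.48
# 8 5.93
# 16 9.14
```

Note how quickly the serial 5% dominates: with 16 processors the speedup is barely above 9, and no number of processors can push it past 20.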
The assumption about latencies is important because it is really two assumptions: first, that the program sees the same latencies from all processors; and second, that those latencies are not exacerbated by the presence of the other processors executing their own code.
>> Latencies And Bandwidth