As with the comparison between a traditional superscalar processor and a CMP processor, a fair comparison requires that each processor be allotted the same number of functional units, the same cache and cache-line sizes, and so on. Again, we use the handy graphics as a means of comparing the two processing paradigms:
'a' represents the traditional superscalar, while 'b' represents a coarse-grained multithreading architecture.
CMP places two or more processors on the same physical die, can share the L2 cache (whether it is on-die or off), and executes two or more threads at the same time, depending upon the number of processors on the die. Coarse-grained multithreading (CMT) architectures do not execute threads simultaneously. Instead, CMT improves the utilization of the functional units by executing one thread for a certain number of clock cycles and then switching to another. The efficiency gain comes from a decrease in vertical waste, which describes cycles in which none of the functional units are working because the one running thread has stalled.
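To make the idea of vertical waste concrete, here is a minimal sketch that counts it from a per-cycle trace. The function name and both traces are hypothetical and purely illustrative; they are not measurements from any real processor.

```python
def vertical_waste(units_busy_per_cycle):
    """Count the cycles in which no functional unit did any work."""
    return sum(1 for busy in units_busy_per_cycle if busy == 0)


# Hypothetical traces: number of busy functional units in each clock cycle.
# One thread running alone stalls in cycles 2-4 and again in cycle 7.
single_thread = [3, 2, 0, 0, 0, 1, 4, 0]
# The same cycles, assuming a second thread is switched in during the stalls.
with_switching = [3, 2, 2, 1, 3, 1, 4, 2]

print(vertical_waste(single_thread))   # 4 fully idle cycles
print(vertical_waste(with_switching))  # 0 fully idle cycles
```

Cycles in which only some of the units are busy do not count as vertical waste here, since at least one unit is doing work.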
When switching threads, the processor saves the state of the outgoing thread (i.e., where its instructions are in the pipeline and which units are being used) and switches to another one. It does so by using multiple register sets. This pays off because a thread can often run for only so long before it hits a cache miss or runs out of independent instructions to execute. A CMT processor can execute only as many threads in this way as it has hardware support for: it can hold only as many threads as there are physical locations in which to store their execution state. An n-way CMT processor would therefore be able to store the state of n threads.
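The n-way limit can be illustrated with a small model. This is only a sketch under stated assumptions: `HardwareContext`, `CMTCore`, the eight-register file, and the round-robin switch order are all hypothetical names and choices, not a description of any real design.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareContext:
    """One physical register set holding a thread's execution state."""
    thread_id: int
    pc: int = 0
    # Assumption: 8 registers per context, chosen only for illustration.
    registers: list = field(default_factory=lambda: [0] * 8)


class CMTCore:
    """Toy n-way CMT core: it can hold at most n thread contexts."""

    def __init__(self, n_ways):
        self.n_ways = n_ways
        self.contexts = []
        self.active = 0  # index of the context currently executing

    def add_thread(self, thread_id):
        if len(self.contexts) >= self.n_ways:
            raise RuntimeError("no free hardware context")
        self.contexts.append(HardwareContext(thread_id))

    def switch(self):
        # Each thread's state stays in its own register set, so a switch
        # only changes which context is active (round-robin here).
        self.active = (self.active + 1) % len(self.contexts)
        return self.contexts[self.active].thread_id


core = CMTCore(n_ways=2)
core.add_thread(0)
core.add_thread(1)
print(core.switch())  # 1
print(core.switch())  # 0
```

Calling `core.add_thread(2)` at this point raises `RuntimeError`, which mirrors the point above: a 2-way core has nowhere to keep the state of a third thread.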
A variation on this is to execute one thread until it experiences a cache miss (usually an L2 cache miss), at which point the processor switches to another thread. This simplifies the logic needed to rotate the threads through a processor, since it simply switches to another thread as soon as the running one stalls. The penalty of waiting for the requested block to be transferred back into the cache is then alleviated, because another thread does useful work in the meantime. This is similar to the hit-under-miss (or hit-under-multiple-miss) caching scheme used by many processors, but it differs because it operates on threads instead of upon instructions. The MAJC architecture makes use of CMP, and it also uses a form of CMT in which it switches threads on a cache miss, with support for 4 threads in this manner. The MAJC architecture also has a few more tricks up its sleeve for multithreading, which will be discussed later. The APRIL architecture, circa 1990, was also to use CMT.
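The switch-on-miss variation can be sketched as a tiny cycle-level simulation. Everything here is an assumption made for illustration: the function name, the trace encoding ('c' for one compute cycle, 'm' for a cache miss), a thread switch that costs zero cycles, and a fixed miss penalty. Real hardware is considerably more involved.

```python
def run_switch_on_miss(threads, miss_penalty):
    """Simulate a core that switches threads whenever one misses the cache.

    threads: list of strings, 'c' = one compute cycle, 'm' = cache miss.
    Returns (total_cycles, idle_cycles). A miss that is the last operation
    of the last thread to finish is not waited for.
    """
    pos = [0] * len(threads)       # next operation index for each thread
    ready_at = [0] * len(threads)  # first cycle each thread may run again
    cycle = idle = 0
    current = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        runnable = [i for i in range(len(threads))
                    if pos[i] < len(threads[i]) and ready_at[i] <= cycle]
        if not runnable:
            # Every unfinished thread is waiting on memory: vertical waste.
            idle += 1
            cycle += 1
            continue
        if current not in runnable:
            current = runnable[0]  # assumption: switching costs no cycles
        op = threads[current][pos[current]]
        pos[current] += 1
        if op == 'm':
            # The thread stalls for miss_penalty cycles after this one.
            ready_at[current] = cycle + 1 + miss_penalty
        cycle += 1
    return cycle, idle


print(run_switch_on_miss(["ccmcc"], 3))           # (8, 3)
print(run_switch_on_miss(["ccmcc", "ccmcc"], 3))  # (11, 1)
```

In this toy trace, one thread alone takes 8 cycles with 3 of them idle, so running the two threads back to back would take 16 cycles. With switch-on-miss, the pair finishes in 11 cycles with a single idle cycle, because each thread's miss is mostly overlapped with the other thread's compute cycles.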
The advantages of CMT over CMP are that CMT doesn't sacrifice single-thread performance and that it requires less hardware duplication (a CMP must halve its hardware between the two processors to be "equal" to a comparable CMT).
>> Fine-Grained Multithreading