A more aggressive approach to multithreading is called fine-grained multithreading (FMT). Like CMT, FMT is based on switching rapidly between threads. Unlike CMT, however, it switches threads on every single cycle. Both CMT and FMT slow down the completion of any one thread, but FMT speeds up the completion of the whole set of threads being worked on, and overall throughput is generally what matters most.
Suppose there is an n-way FMT processor. Thread one issues instructions on cycles 1, n + 1, 2n + 1, and so on; thread two issues on cycles 2, n + 2, 2n + 2, and so on; each of the n threads therefore gets every nth cycle. This hides very long latencies, because a long-running operation in one thread can complete while the other n - 1 threads take their turns. A graphical representation is shown below:
'a' is a 4-way fine-grained multithreading processor, and 'b' is a traditional superscalar.
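The round-robin issue order described above can be sketched in a few lines of Python. This is only an illustration, not a model of any real processor: the function name fmt_schedule and the zero-based thread numbering are assumptions made for the example.

```python
def fmt_schedule(num_threads, num_cycles):
    """Return the index of the thread that issues on each cycle.

    Hypothetical helper: threads are numbered from 0 and take turns
    in strict round-robin order, one thread per cycle.
    """
    return [cycle % num_threads for cycle in range(num_cycles)]


# A 4-way FMT processor over 8 cycles: each thread issues every 4th cycle.
schedule = fmt_schedule(4, 8)
print(schedule)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Each thread reappears only after every other thread has had a turn, which is where the latency hiding comes from.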
The Tera architecture is an example of an FMT architecture. As stated before, the idea is to hide long latencies and to try to eliminate vertical waste. Supercomputers are notorious for taking up a great deal of space and for being connected via high-speed networks so that they can operate in tandem. The Tera supercomputer was no different, though the architecture is now found in the Cray MTA (Multi-Threaded Architecture). The expected average latency for an instruction was 70 cycles, which includes the fraction of the time that the data must be fetched over the network. To hide massive latencies such as these, most architectures use caches, which reduce the number of times each processor has to access main memory. However, the Tera architecture is completely cacheless! The Cray MTA has L1 and L2 instruction caches, but lacks any data caches. To combat large latencies, each processor can store the state of as many as 128 separate threads. As in any other FMT design, if fewer than the maximum number of supported threads are present, the processor simply cycles through those that are available.
For the Cray MTA architecture to achieve and maintain peak performance when using FMT, there must be at least as many threads as the average latency, in cycles, of each word request. In the case of the MTA architecture, this means roughly 70 threads! While certainly not commonplace in traditional applications, in the supercomputing arena it is not unthinkable to come up with that many threads from which to execute.
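To see why the thread count has to match the latency, the following toy simulation may help. It rests on simplifying assumptions that are not from the text: each thread issues one instruction and is then blocked for exactly latency cycles, and the processor issues at most one instruction per cycle from the next ready thread. The function name fmt_utilization is hypothetical.

```python
def fmt_utilization(num_threads, latency, num_cycles=7000):
    """Fraction of cycles in which some thread was able to issue.

    Toy model (assumption): after issuing, a thread is not ready
    again until `latency` cycles later.
    """
    ready_at = [0] * num_threads  # cycle at which each thread may issue again
    issued = 0
    next_thread = 0
    for cycle in range(num_cycles):
        # Scan the threads in round-robin order for one that is ready.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                issued += 1
                next_thread = (t + 1) % num_threads
                break
    return issued / num_cycles


print(fmt_utilization(35, 70))   # 0.5: half the issue slots are wasted
print(fmt_utilization(70, 70))   # 1.0: the latency is fully hidden
print(fmt_utilization(128, 70))  # 1.0: extra threads do not hurt
```

Under this model the processor stays busy only once the number of threads reaches the latency in cycles, which matches the figure of roughly 70 threads quoted above.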
However, the Cray MTA architecture is really a hybrid architecture: it will act as a CMT processor when necessary. Each instruction carries a 3-bit tag that tells the processor how many more instructions from that thread can be issued before one that depends on the current instruction is reached. The processor keeps executing that thread until it reaches the dependent instruction, and then switches to a different thread on the next clock cycle. For example, if every instruction carries the maximum lookahead of seven, only about nine threads are needed to cover the 70-cycle latency. With both features, the Cray MTA architecture shows how CMT and FMT can work together to hide very long latencies, and do so entirely without the use of data caches.
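The arithmetic behind the nine-thread figure can be sketched as follows. The key assumption, which the text does not spell out, is that a tag value of k lets a thread issue k further instructions after the current one, so k + 1 instructions per turn; the function name threads_needed is hypothetical.

```python
import math


def threads_needed(latency_cycles, lookahead):
    """Threads required to cover a latency, given a lookahead tag value.

    Assumption: a lookahead of k means the thread can issue the current
    instruction plus k more before it has to wait.
    """
    instructions_per_turn = lookahead + 1
    return math.ceil(latency_cycles / instructions_per_turn)


MAX_LOOKAHEAD = 2**3 - 1  # largest value a 3-bit tag can hold: 7

print(threads_needed(70, 0))              # 70: pure FMT, no lookahead
print(threads_needed(70, MAX_LOOKAHEAD))  # 9: nine threads x 8 instructions = 72 cycles
```

With no lookahead the result falls back to the 70 threads required by pure FMT, while the maximum lookahead brings the requirement down to nine.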
Even if there are enough threads to make an FMT processor stall-free, research has shown that on an 8-issue processor, at best only 40% of the functional units are actually used (about 3.2 instructions per clock cycle).