Applications Of Multithreading: Redundancy Is Faster?
Because IA-64 can't execute instructions out of order, one of the features Intel chose for it was Predication. The idea is to do the work for both possible outcomes of a branch (an if/else statement - for more information, see http://www.systemlogic.net/articles/00/9/ia64) and keep only the result of the path actually taken. This turns out to be useful because, with IA-64's in-order execution, most functional units would otherwise sit idle.
For some branches, there is no good way to predict which path to take. An extension of SMT, called Threaded Multi-Path execution, does for hard-to-predict branches what predication does for instructions: instead of guessing, it executes both paths at once in separate thread contexts and discards the results of the path not taken.
Another processing concept was originally known as "Cooperative Redundant Threads" and is now called Slipstream processing. A Slipstream processor actually does the work of the whole program twice! And yet, it ends up running faster. The name stems from NASCAR, of all places (for the reasoning, go to http://www.tinker.ncsu.edu/ericro/slipstream).
Slipstreaming works by using two threads that start out identical: the A-stream (advance stream) and the R-stream (redundant stream). The R-stream remains an unmodified thread of the original program, while the hardware detects instructions that have no apparent effect on the program's outcome and strips them from the A-stream.
Because the shortened A-stream runs slightly ahead of the R-stream, it can pass information forward through a delay buffer: the R-stream learns how the program will execute before it executes those instructions itself! This is, in a sense, a real-time version of the scheme Intel uses with IA-64 and feedback-driven compiling (where the program is compiled, run, and profiled, the profile is fed back to the compiler, and the new executable runs faster).
The A-stream, now shortened, tells the R-stream (unmodified) which branches to take. The A-stream runs faster because it executes fewer instructions; current techniques have cut its instruction count by as much as 50%. The R-stream runs faster because many of its branches arrive already resolved just as it needs the answers, so it rarely has to fall back on branch prediction.
The original proposal for a slipstream processor was a 2-way CMP chip with two simple CPUs, each having half the execution resources of the single, more robust processor one would normally design. With this approach, the slipstream CMP achieved a speedup of ~12% over the larger, traditional superscalar core (though some programs ran substantially slower than on the superscalar). Thus there are ways of using the second CPU in a CMP processor even without additional threads to run. Further performance increases are possible because the two smaller CPUs, each being less complex, can run at higher clock speeds.
Another approach is to run the A-stream and R-stream on a base SMT architecture, itself an extension of the large superscalar. The key here is that the SMT would normally act as a regular superscalar when no additional threads are available; slipstreaming then attempts to speed up a single thread via entirely different means than DMT. An interesting comparison would be between the performance (and design issues) of DMT and slipstreaming on an SMT processor (as both build on a base SMT core), and between DMT and a similarly equipped slipstreaming CMP processor.
>> Summary Of The Forms Of Multithreading And Conclusion