SMT Induced Changes
In order to support multiple threads, an SMT processor requires more registers than a traditional superscalar processor. The general aim is to provide as many registers for each supported thread as there would be for a uniprocessor. For a traditional RISC chip, this implies 32 * n registers (where n is the number of threads the SMT processor can handle in one cycle), plus whatever renaming registers are required. For a 4-way SMT RISC processor, this would mean 128 registers, plus however many renaming registers are needed.
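The register arithmetic above can be written out as a small sketch. The function names and the renaming-register count in the usage line are illustrative assumptions; the text gives only the 32 * n figure and leaves the number of renaming registers open.

```python
def architectural_registers(threads: int, regs_per_thread: int = 32) -> int:
    """Registers needed so every thread sees a full register set (32 * n)."""
    return threads * regs_per_thread


def register_file_size(threads: int, renaming_registers: int,
                       regs_per_thread: int = 32) -> int:
    """Architectural registers plus however many renaming registers are added."""
    return architectural_registers(threads, regs_per_thread) + renaming_registers


# A 4-way SMT RISC processor needs 4 * 32 = 128 architectural registers.
print(architectural_registers(4))

# The renaming count below (100) is a made-up value, used only for illustration.
print(register_file_size(4, renaming_registers=100))
```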
Most SMT models are straightforward extensions of a conventional out-of-order processor. The increase in real-world throughput puts more strain on instruction issue width, which is increased accordingly. Because of the aforementioned increase in register file size, an SMT pipeline should be lengthened by 2 stages so that the clock cycle does not have to be stretched: the register read and register write stages are each broken up into two pipelined stages. Astute readers will note that additional stages would normally have a negative impact upon performance. This is where the small sacrifice in single-threaded performance comes in: studies have shown a modest performance decrease of ~2%.
So that no one thread dominates the pipeline, an effort must be made to ensure that the other threads get a realistic slice of the execution time and resources. When the functional units are requesting work to do, the fetch mechanism gives higher priority to those threads that have the fewest instructions already in the pipeline. Of course, if the other threads have little they can do, more instructions are taken from the thread that is already dominating the pipeline.
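The fetch priority described above can be sketched in a few lines. This is a simplified model and not any particular processor's logic: the function name, the per-thread in-flight counts, and the notion of a "ready" set are assumptions made for illustration.

```python
def pick_fetch_threads(in_flight: dict[int, int], ready: set[int],
                       slots: int) -> list[int]:
    """Choose which threads to fetch from this cycle.

    in_flight maps a thread id to the number of its instructions already
    in the pipeline; ready is the set of threads that have work to fetch;
    slots is how many threads can be fetched from in one cycle.
    """
    candidates = [t for t in in_flight if t in ready]
    # Fewest instructions in the pipeline first; thread id breaks ties.
    candidates.sort(key=lambda t: (in_flight[t], t))
    return candidates[:slots]


counts = {0: 12, 1: 3, 2: 7, 3: 5}

# All threads have work: the two least-represented threads are favored.
print(pick_fetch_threads(counts, ready={0, 1, 2, 3}, slots=2))

# Only the dominant thread has work: it keeps the fetch slots.
print(pick_fetch_threads(counts, ready={0}, slots=2))
```

The second call shows the caveat in the text: when the other threads have nothing to offer, the thread that already dominates the pipeline is still the one fetched from.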
Intel (in the high-end at least) is going with predication, speculation, and other ways of dealing with integer performance (though these are not terribly impressive, the floating point performance has proven to be top notch); IBM is using the standard beefy out-of-order processing paradigm together with CMP, for some truly monolithic processors and massively parallel computing; Sun is using CMP, CMT, and a few other tricks I've not discussed with MAJC; and lastly, API is moving forward with an 8-wide, 4-way SMT processor.
Concerns About SMT
SMT is about sharing whatever is possible. However, in some instances, this disrupts the traditional organization of data, as well as instruction flow. The branch prediction unit becomes less effective when shared, because it has to keep track of more threads, with more instructions, and will therefore give less accurate predictions. This means that the pipeline will need to be flushed more often due to mispredicts, but the ability to run multiple threads more than makes up for this deficit.
The penalty for a mispredict is also greater, because the pipeline of an SMT architecture is 2 stages longer, which is in turn due to the rather large register file required. However, there has been research into minimizing the number of registers needed per thread in an SMT architecture. This is done through OS and hardware support for more efficient deallocation of registers, and through the ability to borrow registers from another thread context when that thread is not using all of them. Alternatively, one could achieve better performance with the same number of registers per thread.
Another issue is the number of threads in relation to the size of caches, the line sizes of caches, and the bandwidth afforded by them. Few studies have gone into detail on this issue, yet it remains a fundamentally important topic. As is the case for single-threaded programs, increasing the cache-line size decreases the miss rate but also increases the miss penalty. Supporting more threads, which work on more varied data, exacerbates this problem, and thus less of the cache is effectively useful to each thread. This contention for the cache is even more pronounced with a multiprogrammed workload than with a multithreaded workload. Thus, naturally, the more threads are in use, the larger the caches should be made. The same applies to CMP processors with shared L2 caches. One study has shown that going above 4 supported threads on an SMT system caused a slowdown, due to the limited amount of bandwidth available from the L2 cache.
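The line-size tradeoff can be made concrete with the standard average memory access time formula (hit time plus miss rate times miss penalty). The sketch below uses that formula with made-up numbers; none of the values come from the studies mentioned in the text.

```python
def average_access_time(hit_time: float, miss_rate: float,
                        miss_penalty: float) -> float:
    """Average memory access time in cycles: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical figures, chosen only to illustrate the tradeoff:
# a longer line lowers the miss rate but raises the cost of each miss.
short_line = average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=20)
long_line = average_access_time(hit_time=1, miss_rate=0.04, miss_penalty=30)

print(f"short line: {short_line:.2f} cycles")
print(f"long line:  {long_line:.2f} cycles")
```

With these assumed numbers the longer line comes out slower overall, because the drop in miss rate does not cover the larger penalty. If extra threads erode the miss-rate benefit further, the gap widens.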
The more threads that are in use, the higher the overall performance, and the more readily apparent the differences in associativity become. Keeping the L1 cache size at a constant 64 KB, a study in France showed that the highest level of performance is achieved when using a more associative cache, despite longer access times. To emphasize the miss rate, a small 16 KB L1 cache was used to measure the performance of different block sizes, with different associativities, across different numbers of threads. As before, increasing the associativity increased performance in every case. Increasing the block size, however, decreased performance if more than 2 threads were in use, so much so that the increase in associativity could not make up for the deficit caused by the greater miss penalty of the larger block size.
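One reason associativity matters more as threads are added is that blocks from different threads can map to the same cache set and evict each other. The toy model below illustrates this with a 16 KB LRU cache. It is a sketch under simplifying assumptions (one hot address per thread, deliberately spaced to collide, no timing model) and is not the methodology of the study described above.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Minimal set-associative cache with LRU replacement (miss counting only)."""

    def __init__(self, size_bytes: int, block_bytes: int, ways: int):
        self.block_bytes = block_bytes
        self.ways = ways
        self.num_sets = size_bytes // (block_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, address: int) -> bool:
        """Return True on a hit, False on a miss."""
        block = address // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict the least recently used line
        lines[tag] = True
        return False


def interleaved_misses(ways: int, threads: int = 2, rounds: int = 100) -> int:
    """Threads take turns touching one address each; addresses are spaced
    one cache size apart so that they all land in the same set."""
    cache_size = 16 * 1024
    cache = SetAssociativeCache(cache_size, block_bytes=32, ways=ways)
    for _ in range(rounds):
        for t in range(threads):
            cache.access(t * cache_size)
    return cache.misses


for ways in (1, 2, 4):
    print(ways, "way(s):", interleaved_misses(ways, threads=2), "misses with 2 threads,",
          interleaved_misses(ways, threads=4), "with 4 threads")
```

In this contrived pattern, two threads thrash a direct-mapped cache but coexist in a 2-way cache, and four threads need 4 ways before the conflict misses disappear.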