Applications Of Multithreading: Dynamic Multithreading
While programmers can write multithreaded code, doing so is time consuming. Because time to market is rather important, many programs skip the multithreading phase altogether, even CPU-intensive ones that would benefit from the addition of another logical processor.
As a result, not all CPU-intensive tasks are multithreaded. For these, having the hardware create threads instead of the programmer would be a great boon to performance. In fact, such processing paradigms do exist, and some don't even require compiler support! (So much for the RISC approach of simplifying everything on the hardware side…) One approach is the Dynamic Multithreading Architecture (DMT), developed by Haitham Akkary, now at Intel Corp.
DMT starts from a traditional SMT pipeline and adds to it. Simply enlarging the classic reorder buffers and register files (beyond those of a traditional SMT processor) does not make sense, because they are only effective when the instructions they hold are fairly close together in time. Rather than growing them to disproportionate sizes and massively increasing latencies, DMT adds another level outside the pipeline, called Trace Buffers, with one for every thread that is supported. A Trace Buffer size of 200 instructions per thread was found to be the sweet spot: going to 300 gave only a relatively minor boost in IPC over 200.
One of the four ways in which Dynamic Multithreading breaks a sequential program into multiple threads is to search the program for a loop and, when one is found, to look beyond the loop for an additional thread. If there is sufficient work past the loop boundary that does not depend on the work done in the loop, the processor creates another thread there and executes it speculatively. In general, the idea is to look ahead through the program and run as many portions of it as possible by speculatively creating new threads.
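To make the loop-based case concrete, here is a minimal sketch in Python. It is a toy model, not the actual DMT hardware logic: the `Instr` format and the `loop_spawn_points` function are hypothetical names invented for illustration. It assumes a loop can be recognized by a branch that jumps backward, and it treats the instruction right after that branch as the candidate start of a speculative thread. It does not check whether the code past the loop depends on the loop's results, which a real design would have to handle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    """One instruction in a toy program (hypothetical format, not a real ISA)."""
    addr: int
    op: str
    target: Optional[int] = None  # branch target address, if any


def loop_spawn_points(program: List[Instr]) -> List[int]:
    """Return addresses where a speculative thread could start.

    Assumption: a branch whose target is at or before its own address
    closes a loop, and the instruction right after it (the loop
    fall-through) is the candidate start of a new thread.
    """
    points = []
    for i, ins in enumerate(program):
        is_backward_branch = (
            ins.op == "branch"
            and ins.target is not None
            and ins.target <= ins.addr
        )
        if is_backward_branch and i + 1 < len(program):
            points.append(program[i + 1].addr)
    return points


program = [
    Instr(0, "init"),
    Instr(1, "load"),        # loop body starts here
    Instr(2, "add"),
    Instr(3, "branch", 1),   # backward branch: closes the loop
    Instr(4, "store"),       # first instruction past the loop
    Instr(5, "halt"),
]

spawn_points = loop_spawn_points(program)
print(spawn_points)  # [4]
```

In this toy program the only backward branch sits at address 3, so the sketch proposes address 4 as the place to start a second thread while the first is still working through the loop.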
The last little trick that the MAJC architecture reveals is the same general idea as the above form of spawning new threads from a single thread. They've chosen to call it "Space Time Computing," but the effect is the same: it spawns a new thread from an older one. The difference is that, because MAJC is not based on an SMT architecture (rather, it is a hybrid between CMT and CMP), the newly created thread will instead be executed on another processor on the die.
What about Jackson Technology? Could it too be a form of Dynamic Multithreading? By using a Trace Cache, the Pentium 4 architecture in a sense creates quasi-threads: each trace is simply the path of execution taken the last time that code was run. If different areas of the trace cache could be scanned for "threads," then a DMT processor might make use of the trace cache for the formation of threads.
Akkary's thesis didn't come out until 1998, and who knows if the P4 was so far along in its design that they couldn't reorganize it so as to include DMT. SMT, on the other hand, was out around 1995, perhaps earlier (my earliest SMT-related source is from 1995), which is just after the introduction of the Pentium Pro, the P6 core still found in the Pentium III.
On the other hand, as Jackson Technology has still not appeared, perhaps Intel incorporated DMT in an unfinished state and disabled it so that they could finish it for later revisions. Either way, it seems that the timing works out in favor of SMT, or perhaps even DMT over CMP, which would be extremely expensive to produce (though not beyond Intel's abilities).
DMT takes a base SMT processor, whose pipeline is already lengthened by 2 stages (pipelined register read and register write), and the possibility is still open that an additional stage might have to be added so as not to significantly impact cycle time. However, even if this is the case, the additional stage showed only a ~5% performance loss compared with a DMT architecture that lacked it.
Overall, DMT was shown to increase performance of SPECint95 programs by 15% without changing the number of fetch ports or functional units, and by 30% with one additional fetch port. DMT, like SMT, shows more potential for speeding up integer applications than floating point applications. This is because integer programs tend to have more branches, and thus more opportunities for multiple threads to help hide long latencies.
The DMT architecture described in Akkary's thesis is a form of speculative multithreading that operates on a single-threaded program. It reaches far into a program and achieves higher performance by running later parts of the program on a base SMT pipeline. More recent research has shown that running multiple programs (or preexisting threads from a multithreaded program) using a traditional SMT approach with the additional support of a DMT architecture (called Dynamic Simultaneous Multithreading, or DSMT) improves performance over a pure SMT processor by 5-15%, depending upon the number and type of applications. This works by spawning new threads via DMT protocols when there are fewer threads than the processor has support for.
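The DSMT idea of filling otherwise idle thread slots can be sketched as a simple allocation rule. The Python below is only an illustration under stated assumptions: `fill_contexts`, its parameters, and the thread names are hypothetical, and the rule that preexisting threads always get a slot before speculative ones is an assumption made for the sketch, not something the research is quoted as specifying.

```python
from typing import List


def fill_contexts(
    program_threads: List[str],
    hw_contexts: int,
    spawn_candidates: List[str],
) -> List[str]:
    """Assign work to hardware thread contexts (toy model).

    Preexisting threads take contexts first (an assumption of this
    sketch); any contexts left over are filled with speculatively
    spawned threads.
    """
    running = program_threads[:hw_contexts]
    free = hw_contexts - len(running)
    running += spawn_candidates[:free]
    return running


# Two real threads on a 4-context processor: two slots are left
# over, so two speculative threads are spawned to fill them.
partly_loaded = fill_contexts(["A", "B"], 4, ["A.spec1", "A.spec2", "B.spec1"])
print(partly_loaded)  # ['A', 'B', 'A.spec1', 'A.spec2']

# Four real threads on the same processor: nothing is left to fill,
# so the machine behaves like a plain SMT processor.
fully_loaded = fill_contexts(["A", "B", "C", "D"], 4, ["A.spec1"])
print(fully_loaded)  # ['A', 'B', 'C', 'D']
```

The second call shows why the benefit depends on the number of applications: once every context holds a real thread, there is no room left for speculative ones.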