
Re-Printed From SLCentral

Fundamentals Of Multithreading
Author: Paul Mazzucco
Date Posted: June 15th, 2001
URL: http://www.slcentral.com/articles/01/6/multithreading

Introduction

NOTE: This article intentionally sets aside NUMA architectures. It is aimed at smaller-scale multiprocessing, which serves as a lead-in to the various forms of on-chip multithreading.

I remember, quite distinctly, my first encounter with someone with a multi-processor system as his home machine. "Bob" (a fake name to hide his identity) claimed that he owned a 466MHz computer when, at the time, the top-of-the-line Intel machine was a Pentium II 450. I told him that no, he didn't have a 466MHz machine, and because "Bob" didn't know about Alpha processors, and had never heard of "overclocking" before, that could not have been it. Then it hit me: he had a dual Pentium II 233MHz rig. After asking him if this was the case, he said that it was, and that two 233MHz chips are comparable to a 466MHz machine.

What "Bob" didn't realize was that programs have to be written in a special way, called multithreading, in order to take advantage of multiple processors. Nor did he realize that he needed to be running an OS which supports multiple processors, such as Windows NT or a *nix (any sort of Unix variant). Versions of MacOS prior to OSX had only a crude form of multiprocessing support (OSX is based on BSD, a *nix clone, and has full-fledged multiprocessing support). Generally speaking, the computer needs OS support as well as application support to take advantage of multiprocessing.

The other thing "Bob" didn't realize is that for the vast majority of current consumer applications, the multithreading support, if it exists at all, does not often expedite the completion of a task proportionally with the number of processors in the system. This fact does not preclude an application from experiencing linear scaling with respect to the number of processors - nay, some applications can experience superlinear scaling, though these are few and far between.
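To make the earlier point about programs being "written in a special way" concrete, here is a minimal sketch of a multithreaded program using POSIX threads on a *nix system (the array-summing task, the array size, and the use of exactly two threads are illustrative assumptions, not anything from "Bob's" setup):

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread sums half of the array. On a dual-processor machine with
     * an OS that supports it, the two threads can run on both CPUs at once;
     * a single-threaded version would use only one CPU no matter how many
     * are installed.                                                        */
    #define N 1000000

    static int  data[N];
    static long partial[2];

    static void *sum_half(void *arg)
    {
        int  id  = *(int *)arg;     /* 0 sums the first half, 1 the second */
        long sum = 0;
        for (int i = id * (N / 2); i < (id + 1) * (N / 2); i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        int ids[2] = { 0, 1 };

        for (int i = 0; i < N; i++)
            data[i] = 1;

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, sum_half, &ids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        printf("total = %ld\n", partial[0] + partial[1]);
        return 0;
    }

A single-threaded version of the same loop would run on only one processor no matter how many were installed.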

Amdahl's Law

Assuming that an application is multithreaded (written to execute in a parallel manner, rather than a serial, purely sequential one), there are inherent difficulties in making a program run faster in proportion to the number of processors: the program needs to be written in a parallel fashion, and the program itself must be resource-friendly.

This is explained by Amdahl's Law: "…the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used" (Hennessy and Patterson, 29). This law applies to more than just changing sequential code to parallel code.

Assume that a program consists of two main parts, A and B, and that each can be optimized. Part A represents 90% of the execution time, and B represents the remaining 10%. Assume that B can be optimized so as to finish in one-tenth the time of the original version, and that A can be optimized so as to complete its part of the program in 2/3 the time it previously needed. If optimizing A and optimizing B would each take the same amount of the programmer's time, and there is time to optimize only one part, which should it be? Obviously, he/she should work on part A.

Just to walk through the thinking involved, assume that this program requires 100 seconds to complete. Part A consumes 90 seconds of execution time, and B requires 10 seconds. After optimization, part A would take 60 seconds, and B a mere 1 second. The choice is between a total of 70 seconds of execution time if A is optimized, and 91 seconds if B is optimized. Though this article is not intended to be about software engineering, if one looks at the payoffs for where to put programming effort, the choice should be obvious.

So where do multiprocessor-based systems fit in here? Multithreaded programs must be written to be as parallel as possible. The same reasoning about speeding up parts of a program to increase performance applies to multithreading. To achieve a linear speedup (a performance increase of X due to having X processors), generally all of the code must be written in a parallel manner (generally, and not strictly always, due to other circumstances discussed later). The mathematical representation of Amdahl's Law with respect to parallel computing follows:

    Speedup = 1 / ((1 - Fraction_parallel) + Fraction_parallel / Number_of_processors)

Adapted from "Computer Architecture: A Quantitative Approach," second edition, page 645.
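As a quick illustration of what the formula predicts, here is a small C program that evaluates it; the parallel fractions and processor counts below are arbitrary examples, not figures from the text:

    #include <stdio.h>

    /* Amdahl's Law for parallel execution:
     *   speedup = 1 / ((1 - f) + f / n)
     * where f is the fraction of the program that runs in parallel
     * and n is the number of processors.                            */
    static double amdahl_speedup(double f, int n)
    {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void)
    {
        const double fractions[] = { 0.50, 0.90, 0.99 };  /* illustrative values */
        const int    procs[]     = { 2, 4, 8, 16 };

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 4; j++)
                printf("parallel fraction %.2f, %2d processors -> speedup %.2fx\n",
                       fractions[i], procs[j], amdahl_speedup(fractions[i], procs[j]));
        return 0;
    }

Even with 90% of the code running in parallel, 16 processors yield only about a 6.4x speedup.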

And even this formula doesn't tell the whole story. It assumes that the parallel code, per processor, is just as fast as the sequential code, which may or may not be the case. It also assumes that there is no overhead due to communication latencies, which, of course, is not the case. Few commercial consumer applications contain nearly perfectly parallel code. The only way superlinear performance increases occur is for the cache afforded by each additional processor to dramatically drop main-memory bandwidth usage (which is not common).

The assumption about latencies is important, because it is twofold: one, the program will have the same latencies to all processors; and two, the latencies themselves are not, in fact, exacerbated by the presence of the other processors executing their code.

Latencies And Bandwidth

Note: Much of this section is adapted from "Computer Organization and Design: The Hardware/Software Interface."

High latencies can adversely affect the rate at which processors execute instructions, and thus performance. Caches greatly alleviate this problem, and caches arguably have a larger role to play in multiprocessor systems than they do in uniprocessor systems. Back in "The Fundamentals of Cache" (http://www.systemlogic.net/articles/00/10/cache), two general ways of handling writes to memory were described: write-back and write-through. A quick excerpt from that article is reproduced below:

What about when the CPU alters the information that it got from the cache? There are generally two options employed, write-through cache, and write-back cache. The first term means that the information is written both to the cache, and to the lower layers of the memory subsystem, which takes more time. The second term means that the information is written only to the cache, and the modified information is only written to a lower level when it is replacing it. Write-back is faster because it does not have to write to a lower cache, or to main memory, and thus is often the mode of choice for current processors. However, write-through is easier, and better for multiple CPU based systems because all CPU's see the same information. Consider a situation where the CPU's are using the same information for different tasks, but each has different values for that information. That's one reason why write-through is used, since it alleviates this problem by making sure the data seen by all processors remains the same. Write-through is also slower, so for systems that do not need to worry about other CPUs (read: uniprocessor systems), write-back is certainly better.

What "The Fundamentals of Cache" (http://www.systemlogic.net/articles/00/10/cache/page2.php) hinted at is something called cache-coherency protocols. These determine what information is in the highest shared level of the memory hierarchy, and how it gets there. The main idea is to make sure that all CPUs have the most up-to-date version of whatever information is being requested. The reason this can be an issue is that one CPU could be working on a piece of data, and if the cache is write-back, then the version of the data in the highest common memory layer will not be the most current version until it has been written back to main memory, which could be a while! Certainly one wouldn't want to do work on a data element twice, or worse yet, do inappropriate work on stale data! This is where these protocols come into play - they make sure that the data stays consistent.

The two types of protocols that are most widely used in shared-bus architectures are write-invalidate and write-update. I'll begin with write-update, because it is very similar to the write-through protocol. When a cache-line (or alternatively, block) is updated in one level, the changes in the contents of that block are then broadcast over the shared, system bus to the other processors.

Each CPU takes note of which block is being updated, and if it has an (outdated) copy of that block, it rewrites it. Each CPU makes use of something called a snoop port, a port dedicated to the task of determining whether its contents need to be brought up to date. This does, of course, mean that a lot of bandwidth is used solely for keeping the contents of each processor's cache as recent as possible. Due to limited bandwidth, this can in fact slow down the rate at which processors compute. (The EV6 bus uses a dedicated 13-bit bus for snooping, and is exempt from this main-memory bandwidth contention.) As stated in "The Fundamentals of Cache", latency and bandwidth go hand in hand.

Assume the bus for a shared-bus system has a maximum bandwidth of 800Mb/second, and that there are 4 processors connected to this bus. Each CPU effectively has 200Mb/second of bandwidth, assuming each is running the same type of task. If each CPU could use 300Mb/second of bandwidth, then bandwidth becomes the constraining factor in improving performance (if this doesn't make sense, think about using a 56k modem and downloading four very large files at the same time from fast servers). Next, consider a program that not only uses 300Mb/second of bandwidth, but also needs to update its cache-lines frequently and broadcast the data frequently. In addition to the bandwidth used by the program itself, the contents of the blocks (or cache-lines) are sent back out across the bus, and if the block size is large, broadcasting the data will require enormous bandwidth. In cases such as these, the frequent updating done in the write-update scheme (between a processor's cache and main memory) exacerbates the bandwidth problem. This, in turn, means that each processor has less bandwidth available for actual computing tasks, and the rate of execution stagnates. A write-update scheme used between the L1 and L2 caches on a single processor (assuming an inclusive design), though slower for the L1 cache, actually makes snooping easier, because it guarantees that anything in the L2 cache is the most up-to-date information; other processors only have to snoop the L2 cache, and need not deal with the L1 cache.

For a common point of reference with a bandwidth-intensive program, think of Seti@Home. The older clients in particular (excluding version 3.0, which was small-cache friendly) were large enough that they did not fit inside the L2 caches of most processors (excluding 2Mb Xeons). Because of this, the vast majority of the program had to be constantly fetched from main memory. And, due to the shared-bus architecture, if there were two processors, they would both be impinging upon each other's resources (i.e., memory bandwidth). To give an example of a better way of sharing the load, I made a post to the AnandTech forums a while back (I use the pseudonym "BurntKooshie" - long story, don't ask). It discusses two Distributed Computing projects, RC5 (www.distributed.net) and Seti@Home (http://setiathome.ssl.berkeley.edu/). The following is the post, edited for spelling and grammar:

You see, with dual motherboards (you are obviously using an Intel machine, because the dual Athlon chipsets aren't out yet), the CPUs share the bandwidth of the system. The more bandwidth and the lower the latencies that the program requires, the more diminishing returns you'll see by running two instances of S@H. RC5 causes nearly no bandwidth overhead, and so runs on multiprocessors nearly flawlessly, with nearly X times as much work done in the same amount of time, where X is the number of processors.

S@H is (or at least, was) a different beast. With dual Intel motherboards, the CPUs share the memory bandwidth. Prior clients used a lot of bandwidth, and liked low latencies. When two (or more) instances of S@H were run, this caused the CPUs to fight for bandwidth. If you are familiar with economics (I can't believe how handy this course is coming in :D ), you can consider memory bandwidth, at least on an Intel platform, to be a rival good, whereby if one CPU is consuming it, it detracts from the other CPU's ability to consume it. This means that the output is nowhere near X times as much work, where X is the number of processors in the system.

HOWEVER, the new clients are supposedly more bandwidth/latency friendly (though not nearly to the point of being like RC5, but that is intrinsic to S@H's nature), which mitigates, at least to some degree, the amount of bandwidth required. Also, CPUs with larger caches (read: Xeons, UltraSparcs, PA-RISC chips, etc.) tend to deal with limited bandwidth better, as they can rely more on their larger L2 caches than on system bandwidth (to some extent).

SO, if you have a dual system and, with my current understanding of S@H, you want to participate in both contests, I would put one CPU on S@H and one CPU on RC5. This allows you to get full S@H power from one CPU, and full RC5 power from the other. IMO, this makes the most sense when looking at efficiency and wanting to participate in both contests using multiple machines.

What I mean by that is, take a person with 2 dual systems. To participate in both contests (within this limited example), there are two options: 1) To have one dual system run RC5 on both CPU's, and one to run S@H on both CPU's, or 2) To have each dual system run one instance of RC5, and one instance of S@H.

The latter leads to the same production of RC5 (because of its almost non-existent overhead of bandwidth and low latency memory), HOWEVER, it would at the same time give way to MORE S@H units being produced than if one were to employ the former option.

Many scientific applications (most Distributed Computing projects fit into this category) behave in a similar fashion: they are too large to fit into most caches, and they make heavy use of main-memory bandwidth. If processors use write-update between the L2 cache and main memory, they only make the situation worse.

So, how does one work around this? Take the previous write-update scheme, and instead of updating the contents of the cache-line, why not just tell all the other processors that the information contained within that block is out of date? That's exactly what the write-invalidate protocol does. While this protocol does require information to be broadcast over the system bus, write-invalidate proves to use memory bandwidth in shared-bus architectures more frugally than write-update. This occurs because the CPU only needs to broadcast which cache-line needs to be marked as invalid - the other CPUs don't try to use it, because it is out of date - instead of the contents of a whole cache-line. Write-invalidate is a protocol that is in the same spirit as write-back information transferal (save the bandwidth for what's really needed) - especially when there are many processors on a bus.
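A back-of-the-envelope sketch of the bandwidth difference (the cache-line size, address size, and write count here are assumptions made up for illustration, not measurements):

    #include <stdio.h>

    /* Toy model of snoop traffic on a shared bus. On each write to a line
     * that another CPU also caches, write-update broadcasts the whole
     * line's contents, while write-invalidate broadcasts only enough to
     * identify the line so other caches can mark their copies invalid.
     * (A later read by another CPU will still miss and re-fetch the line;
     * the savings come from repeated writes to the same line.)            */
    #define LINE_SIZE_BYTES 64
    #define ADDRESS_BYTES    4

    int main(void)
    {
        long shared_writes = 1000000;  /* writes to lines cached elsewhere */

        long update_traffic     = shared_writes * (ADDRESS_BYTES + LINE_SIZE_BYTES);
        long invalidate_traffic = shared_writes * ADDRESS_BYTES;

        printf("write-update:     %ld bytes of snoop traffic\n", update_traffic);
        printf("write-invalidate: %ld bytes of snoop traffic\n", invalidate_traffic);
        return 0;
    }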

Rather than use write-invalidate, one could increase the bandwidth of the main bus, but that isn't a cheap option: one could either widen the bus or implement something like RDRAM with several channels. For every additional bit of bus width, another trace has to be routed on the motherboard. And with RDRAM, the traces are more sensitive to their placement, so designing motherboards for RDRAM is more difficult. Either method means additional cost.

However, there is one location in a computer system where additional wires for increased bandwidth are comparatively cheap - on the processor itself; due to their placement, latencies are also comparatively low. Moore's Law continues to push on, despite the nay-sayers for whom process technology is always on the verge of reaching its endpoint. With more transistors at engineers' disposal, more and more functional units are thrown on, and larger caches are employed, but there are diminishing returns to this strategy. To quote an Intel engineer (taken from The Register), "The low hanging fruit is all gone. Now we have to build scaffolds around the tree. We'll stand on our head and do strange things for a little more performance." Intel engineers aren't the only ones looking for innovative technologies to speed things up. More functional units aren't showing the same performance scaling they initially did (and how could they?), and memory latencies, from the processors' perspective, are getting longer. New ways of extracting performance need to be found.

ILP Background

For a long time, the secret to more performance was to execute more instructions per cycle, otherwise known as exploiting Instruction Level Parallelism (ILP), or to decrease the latency of individual instructions. To execute more instructions each cycle, more functional units (integer, floating point, load/store units, etc.) have to be added. In order to more consistently execute more instructions, a processing paradigm called out-of-order processing (OOP) can be used, and it has in fact become mainstream (notable exceptions are the UltraSparc and IA-64).

This paradigm arose because many instructions depend upon the outcome of other instructions, which have already been sent into the processing pipeline. To help alleviate this problem, a larger number of instructions are stored so as to be ready to execute immediately; the purpose is to find more instructions that are not dependent upon each other. The area of storage used for this is called the reorder buffer. Reorder buffers have been growing: the Pentium III has a window of 40 instructions, the Athlon has 72 (the .18 micron incarnation of the Athlon is purported to have 78), and the new Pentium 4 has a window of no less than 126 instructions! The reason is simple: code that is spatially related tends also to be temporally related in terms of execution (this excludes arrays of complex structures and linked lists). The only problem is that these instructions also have a tendency to depend upon the outcome of prior instructions. With ever more code for a CPU to plow through at a time, the only current way to find more independent instructions has been to increase the size of the reorder buffer.

This approach, however, is showing diminishing returns: it now takes more and more transistors to achieve the same rate of performance increase. Instead of focusing intently upon uniprocessor ILP extraction, one can focus upon a coarser way of extracting performance - at the thread level, via multithreading - but without the system bus as a major constraint.

On-Chip Multiprocessing

In order to make a fair comparison between an on-chip multiprocessor (CMP) and a uniprocessor, one must compare similar architectural features - i.e., both the uniprocessor and the CMP chip should have a similar aggregate number of functional units, registers, and renaming registers. This is to say that for a die that has two CPUs on it, each individual CPU has half the registers of the single-CPU die, but because there are two processors on the CMP chip, the total resources are the same. These processors can also have an on-die L2 cache, which would be shared. If the L1 caches were write-through, the cache-coherency problem between the two processors would be solved.

The general concept behind using multiple cores on one die is to extract more performance by executing two threads at once. By doing so, the two chips together are able to keep a higher percentage of the aggregate number of functional units doing useful work at all times. An example is shown below.


Pictures adapted from Jack Lo's PhD dissertation[1], and Paul DeMone's "Simultaneous Multi-Threat." [2] a) conventional superscalar CPU, b) a 2-CPU multiprocessor.

To explain the context-switch code, I defer to Mr. DeMone's explanation:

Each thread runs for a short interval that ends when the program experiences an exception like a page fault, calls an operating system function, or is interrupted by an interval timer. When a thread is interrupted, a short segment of OS code (shown in Figure 1A as gray instructions in issue slots) is run which performs a context switch and switches execution to a new thread. Multitasking provides the illusion of simultaneous execution of multiple threads but does nothing to enhance the overall computational capability of the processor. In fact, excessive context switching causes processor cycles, which could have been used running user code, to be wasted in the OS.

The more functional units a processor has, the lower the percentage of units doing useful work at any given time. The on-chip multiprocessor lowers the number of functional units per processor, and distributes separate tasks (or threads) to each processor. In this way, it is able to achieve a higher throughput on both tasks combined. The comparable uniprocessor would get through one thread, or task, faster than a CMP chip could, because, although it wastes functional units much of the time, it also produces "bursts" of activity when it computes multiple pieces of data at the same time and uses all available functional units. The idea behind on-chip multiprocessing is to keep each processor from experiencing such "bursty" activity, and instead to use what it has more frequently, and therefore more efficiently. The non-use of some of the functional units during a clock cycle is known as horizontal waste, which CMP tries to avoid.

Another advantage of using a CMP chip instead of a larger, more robust uniprocessor is that it is less difficult to design the smaller, less complex chips. This is useful in a couple of ways: one, it allows the designers to spend less time on the chip (and thus time to market is shorter); and two, less complex, smaller processors tend to be able to run at a higher frequency. In this way, a CMP chip in a multithreaded (or multiprogrammed) environment is able to execute faster due to more efficient use of available resources over the various threads, and because of the potential to increase the clock rate over that of a monolithic processor.

The MAJC architecture from Sun Microsystems makes use of CMP. It allows one to four processors to share the same die, and each to run separate threads. Each processor is limited to 4 functional units, each of which is able to execute both integer and floating-point operations, making the MAJC architecture more flexible.

Another example of an on-chip multiprocessor is the Power4 processor from IBM. This architecture does not follow the philosophy of using smaller, easier-to-implement CPUs. Instead, it takes processors that, in and of themselves, could be considered full-fledged server chips. And yet, IBM has chosen to put two onto each die, which should yield a die size of ~400mm^2 (smaller than the HP 8500, with the same amount of on-die cache) [3].

There are problems with CMP, however. The traditional CMP chip sacrifices single-thread performance in order to expedite the completion of two or more threads. In this way, a CMP chip is comparatively less flexible for general use, because if there is only one thread, an entire half of the allotted resources sits idle and is completely useless (just as adding another processor is useless in a traditional SMP system when running a single-threaded program). Another approach to making the CPU's functional units more efficient is called coarse-grained multithreading.

Coarse-Grained Multithreading

As with a fair comparison between a traditional superscalar and a CMP processor, each processor must be allotted the same number of functional units, the same cache and cache-line sizes, etc. Again, we use the handy graphics as a means of comparing the two processing paradigms:


'a' represents the traditional superscalar, while 'b' represents a coarse-grained multithreading architecture.

While CMP shares the same physical die, can share the L2 cache (whether it is on-die or off), and executes two (or more, depending upon the number of processors on the die) threads at the same time, coarse-grained multithreading (CMT) architectures do not. CMT improves the efficiency of functional-unit usage by executing one thread at a time for a certain number of clock cycles. The efficiency is improved due to a decrease in vertical waste. Vertical waste describes situations in which none of the functional units are working because one thread has stalled.

When switching to another thread, the processor saves the state of the current thread (i.e., where its instructions are in the pipeline, which units are being used) and switches to another one. It does so by using multiple register sets.[4] The advantage of this comes from the fact that, often, a thread can only go for so long before it hits a cache miss or runs out of independent instructions to execute. A CMT processor can only execute as many different threads in this way as it has support for - that is, it can only hold as many threads as there are physical locations in which to store the state of their execution. An n-way CMT processor would therefore have the ability to store the state of n threads.

A variation on this is to simply execute one thread until it experiences a cache miss (usually an L2 cache miss), at which point the processor switches to another thread. This has the advantage of simplifying the logic needed to rotate threads through the processor, as it simply switches to another thread as soon as the prior one stalls. The penalty of waiting for the requested block to be transferred back into the cache is thus alleviated. This is similar to the hit-under-miss (or hit-under-multiple-miss) [5] caching scheme used by many processors, but it differs because it operates on threads instead of instructions. The MAJC architecture makes use of CMP, and it also uses a form of CMT, switching threads on a cache miss, with support for 4 threads in this manner. [6] The MAJC architecture also has a few more tricks up its sleeve for multithreading, which will be discussed later. The APRIL architecture, circa 1990, was also to use CMT.[4]
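A minimal sketch of the switch-on-miss idea in C (the number of threads, the fake miss pattern, and the reduction of per-thread state to a single program counter are all simplifying assumptions):

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy sketch of coarse-grained multithreading: one thread runs until it
     * takes a cache miss, then the processor switches to the next hardware
     * thread. Each thread's state is reduced to a program counter here; real
     * hardware keeps a full register set per thread.                         */
    #define NUM_THREADS 4

    static int pc[NUM_THREADS];   /* one saved program counter per thread */

    /* Stand-in for issuing one instruction; returns true on a cache miss.
     * The miss pattern is arbitrary, just to drive the example.           */
    static bool issue_one(int thread)
    {
        pc[thread]++;
        return (pc[thread] % 7) == 0;
    }

    int main(void)
    {
        int current = 0;
        for (int cycle = 0; cycle < 30; cycle++) {
            if (issue_one(current)) {
                /* On a miss, the thread's state is already saved in pc[]; by
                 * the time we rotate back to it, the miss has been serviced. */
                int next = (current + 1) % NUM_THREADS;
                printf("cycle %2d: thread %d missed, switch to thread %d\n",
                       cycle, current, next);
                current = next;
            }
        }
        return 0;
    }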

The advantages of CMT over CMP are that CMT doesn't sacrifice single-thread performance, and that there is less hardware duplication (no resources need to be halved to make two processors "equal" to one comparable CMT processor).

Fine-Grained Multithreading

A more aggressive approach to multithreading is called fine-grained multithreading (FMT). Like CMT, the basis of FMT is to switch rapidly between threads. Unlike CMT, however, the idea is to switch each and every cycle. While both CMT and FMT actually do indeed slow down the completion of one thread, FMT expedites the completion of all the threads being worked on, and it is overall throughput which generally matters most.

Suppose there is an n-way FMT processor. On one cycle, instructions from thread one are executed; on the next cycle, instructions from thread two; and so on, wrapping back to thread one after all n threads have had a cycle. What this accomplishes is to hide very long latencies. A graphical representation is shown below:


'a' is a 4-way fine-grained multithreading processor, and 'b' is a traditional superscalar.
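Restated in code, the thread chosen on a given cycle is simply the cycle number modulo the number of hardware threads - a minimal sketch (the thread count here is arbitrary):

    #include <stdio.h>

    /* Fine-grained multithreading issues from a different thread every
     * cycle, round-robin: thread = cycle modulo number of threads.     */
    #define NUM_THREADS 4

    int main(void)
    {
        for (int cycle = 0; cycle < 8; cycle++)
            printf("cycle %d issues from thread %d\n", cycle, cycle % NUM_THREADS);
        return 0;
    }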

The Tera architecture is an example of FMT. As stated before, the idea is to hide long latencies and to try to eliminate vertical waste. Supercomputers are infamous for taking up a great deal of space and being attached via high-speed networks to operate in tandem. The Tera Super Computer was no different - though the architecture is now found in the Cray MTA (Multi-Threaded Architecture). The expected average latency for an instruction was 70 cycles (which included the fraction of the time that it might need data from over the network). In order to hide massive latencies such as these, most architectures use caches, which reduce the number of times each processor accesses main memory. The Tera architecture, however, is nearly cacheless: the Cray MTA has L1 and L2 instruction caches, but lacks any data caches. To combat large latencies, each processor is capable of storing the state of no less than 128 separate threads. Like any other traditional FMT, if there are fewer than the maximum supported threads, the processor will simply cycle through those that are available.[7]

In order for the Cray MTA architecture to achieve and maintain peak performance with FMT, there must be at least as many ready threads as the average latency, in cycles, of a word request. In the case of the MTA architecture, this constitutes ~70 threads! While certainly not commonplace in traditional applications, in the supercomputing arena it is not unthinkable to have that many threads to execute.

However, the Cray MTA architecture is more of a hybrid - it will act like a CMT processor when necessary. Each instruction carries with it a 3-bit lookahead tag that tells the processor how many of the following instructions from that thread are independent of the current memory operation; the processor keeps executing from that thread until the lookahead is exhausted, and then switches to a different thread at the next clock cycle. For example, to cover the 70-cycle latency, nine threads, each providing a lookahead of seven instructions, would suffice.[7] With both features, the Cray MTA architecture shows how CMT and FMT can work together to hide very long latencies, and do so entirely without the use of data caches.
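A small worked check of that arithmetic (treating the maximum value of the 3-bit tag, 7, as the per-thread lookahead):

    #include <stdio.h>

    /* With a 3-bit lookahead tag, a thread can promise up to 7 further
     * independent instructions after the current one, i.e. up to 8 issues
     * before a switch is forced. Nine such threads give 9 * 8 = 72 issue
     * slots, enough to cover a 70-cycle memory latency without stalling.  */
    int main(void)
    {
        int threads   = 9;
        int lookahead = 7;                    /* maximum value of a 3-bit tag */
        int covered   = threads * (1 + lookahead);

        printf("%d threads x %d issues each = %d cycles covered (need 70)\n",
               threads, 1 + lookahead, covered);
        return 0;
    }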

Even if there are enough threads to make an FMT processor stall-free, research has shown that on an 8-issue processor, at best only 40% of the functional units are actually used (or, ~3.2 instructions per clock cycle).[8]

Simultaneous Multithreading

On-chip multiprocessors can remove some horizontal waste. Coarse- and fine-grained multithreading can remove some (or all) vertical waste. Yet there has to be a way to have one's cake and eat it too. Such is the philosophy of a still more advanced form of multithreading, called Simultaneous Multithreading (SMT).

The high concept for SMT is to have the ability to run instructions from different threads, at any given time, in any given functional unit. By rotating through threads, an SMT chip acts like a FMT processor, and by executing instructions from different threads at the same time, it acts like a CMP processor. Because of this, it allows architects to design wider cores without the worry of diminishing returns.


'a' is a traditional superscalar, while 'b' represents a Simultaneous Multithreading architecture.

Figure b) shows an example of a 4-issue, 4-way SMT processor. The figure shows it as nearly "fully utilized" with 4 threads (which is not to say that it will happen that way in real life). However, it is realistic for SMT to achieve higher efficiency than FMT due to its ability to share "unused" functional units amongst differing threads; in this way, SMT achieves the efficiency of a CMP machine. Unlike a CMP machine, however, an SMT machine makes little to no sacrifice (the small sacrifice is discussed later) in single-threaded performance.

The reason for this is simple: whereas much of a CMP processor remains idle when running a single thread -- the more processors on the die, the more this is pronounced -- an SMT processor can dedicate all functional units to that thread. While obviously not as nice as being able to run multiple threads, the ability to balance between single thread and multithreaded environments is a wonderful feature. This means that an SMT processor can exploit thread-level parallelism (TLP) if it is present, and if not, will give full attention to instruction level parallelism (ILP).

SMT Induced Changes

In order to support multiple threads, an SMT processor requires more registers than a traditional superscalar processor. The general aim is to provide as many registers for each supported thread as there would be for one processor. For a traditional RISC chip, this implies 32 * n registers (where n is the number of threads an SMT processor can handle in one cycle), plus whatever renaming registers are required. For a 4-way SMT RISC processor, this would mean 128 registers, plus however many renaming registers are needed.

Most SMT models are straightforward extensions of a conventional out-of-order processor. With an increase in real-world throughput comes more strain upon instruction issue width, which is increased accordingly. Because of the aforementioned increase in register file size, an SMT pipeline gains 2 stages so as not to lengthen the clock cycle: the register-read and register-write stages are each broken into two pipelined stages. Astute readers will note that additional stages would normally have a negative impact upon performance. This is where the small sacrifice in single-threaded performance comes in: studies have shown a modest performance decrease of ~2%.[9]

So as not to allow any one thread to dominate the pipeline, an effort must be made to ensure that the other threads get a realistic slice of the execution time and resources. When the functional units are requesting work to do, the fetch mechanism gives higher priority to those threads that have the fewest instructions already in the pipeline. Of course, if the other threads have little work ready, instructions from the dominating thread will fill the pipeline anyway.[9]
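A sketch of that fetch heuristic (the thread count and the in-flight instruction counts are made-up numbers, purely for illustration):

    #include <stdio.h>

    /* Each cycle, fetch from the thread with the fewest instructions
     * already in the pipeline, so no single thread can monopolize it. */
    #define NUM_THREADS 4

    static int in_flight[NUM_THREADS] = { 12, 3, 7, 9 };

    static int pick_fetch_thread(void)
    {
        int best = 0;
        for (int t = 1; t < NUM_THREADS; t++)
            if (in_flight[t] < in_flight[best])
                best = t;
        return best;
    }

    int main(void)
    {
        int t = pick_fetch_thread();
        printf("fetch from thread %d (%d instructions in flight)\n", t, in_flight[t]);
        return 0;
    }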

Intel (in the high end, at least) is going with predication, speculation, and other ways of dealing with integer performance (though these are not terribly impressive, the floating-point performance has proven to be top notch); IBM is using the standard beefy out-of-order processing paradigm, plus CMP, for some truly monolithic processors and massively parallel computing; Sun is using CMP, CMT, and a few other tricks with MAJC that I've not yet discussed; and lastly, API is moving forward with an 8-wide, 4-way SMT processor.

Concerns About SMT

SMT is about sharing whatever can be shared. However, in some instances, this disrupts the traditional organization of data, as well as instruction flow. The branch prediction unit becomes less effective when shared, because it has to keep track of more threads, with more instructions, and will therefore be less accurate. This means that the pipeline will need to be flushed more often due to mispredicts, but the ability to run multiple threads more than makes up for this deficit.

The penalty for a mispredict is greater due to the longer pipeline used by an SMT architecture (by 2 stages), which is in turn due to the rather large register file required. However, there has been research into minimizing the number of registers needed per thread in an SMT architecture. This is done by more efficient OS and hardware support for better deallocation of registers, and the ability to share registers from another thread context if another thread is not using all of them [10]. Alternatively, one could achieve better performance with the same number of registers per thread.

Another issue is the number of threads in relation to the size of caches, the line sizes of caches, and the bandwidth they afford. Few studies have gone into detail on this issue, yet it remains a fundamentally important topic. As with single-threaded programs, increasing the cache-line size decreases the miss rate but also increases the miss penalty. Supporting more threads, each using different data, exacerbates this problem, so less of the cache is effectively useful to each thread. This contention for the cache is even more pronounced with a multiprogrammed workload than with a multithreaded one. Thus, naturally, the more threads in use, the larger the caches should be [11]. The same applies to CMP processors with shared L2 caches. One study has shown that supporting more than 4 threads on an SMT system caused a slowdown, due to the limited amount of bandwidth available from the L2 cache.[12]

The more threads that are in use, the higher the overall performance, and the more readily apparent the differences in associativity become. Keeping the L1 cache size at a constant 64Kb, a study in France [11] showed that the highest level of performance is achieved with a more associative cache, despite longer access times. To emphasize the miss rate, a small 16Kb L1 cache was used to determine the performance of differing block sizes, with different associativities, for differing numbers of threads. As before, increasing the associativity increased performance at all times; however, increasing the block size decreased performance once more than 2 threads were in use, so much so that the increase in associativity could not make up for the deficit caused by the larger block size's greater miss penalty.

Jackson Technology And SMT

The much-discussed Jackson Technology, which has not yet debuted, is rumored to be SMT. No one outside of Intel (and NDA holders) knows whether Jackson Technology is indeed SMT, but there are reasons to back up the rumor.

First, as an x86 CPU, Pentium 4 would only require 8 registers, plus the renaming registers, to be duplicated. This means that far fewer aggregate registers are needed for the same number of threads for an x86 SMT chip than on a traditional RISC chip. Due to the already-long pipeline of the Pentium 4 (20 stages), the reduced number of registers required would also mean that, depending upon the number of threads supported by the processor, it might be possible for the Pentium 4 architecture to remain "as is" without adding the two additional stages to the pipeline. This would bode well, as branch mispredicts already hurt the Pentium 4's performance.

Despite the 100MHz quad-pumped bus, the ratio of core speed to memory speed is still radically out of place (though closer than it has been since the days of the Pentium II 450), and, despite the use of dual-channel RDRAM, RDRAM still has high latencies. The use of SMT would allow the Pentium 4 to deal even better than it does now with long latencies, which is a must considering its soaring clock speeds.

Next, of course, there are the hints from the Linux community, and the question from Anand Lal Shimpi and the subsequent responses:

A very big hint was when I asked a lot of the software developers on the floor what they thought of SMT and they immediately responded with "Jackson technology?"

SMT shows radically higher performance than a similarly equipped CMP processor (on single-threaded applications), a much better price/performance ratio (imagine two P4s on one die - even if they share an L2 cache), and better performance overall. It seems that my earlier reasons [13] are outweighed by the evidence for the more aggressive and elegant form of multithreading found in SMT.

This is all fine and well, but to some degree it does require software support: either the programs need to be multithreaded, or multiple programs need to run at the same time, expediting their aggregate execution time. Judging by the typical PC program, which tends to use only one thread at a time, most programs would not receive the intrinsic advantages of running on an SMT architecture over a superscalar.

Applications Of Multithreading: Dynamic Multithreading

While programmers can write code to be multithreaded, it is time consuming to do so. Considering that time to market is rather important, many programs may forgo multithreading entirely (multithreading here in the sense of CPU-intensive programs that would benefit from an additional logical processor).

Yet not all CPU-intensive tasks are multithreaded. For these, having the hardware, rather than the programmer, be responsible for creating threads would be a great boon to performance. In fact, such processing paradigms do exist, and some don't even require compiler support (so much for the RISC approach of simplifying everything on the hardware side…). One approach is found in the Dynamic Multithreading Architecture (DMT), proposed by Haitham Akkary, now at Intel Corp.

DMT takes a traditional SMT pipeline and adds onto it. Simply enlarging the classic reorder buffers and register files (beyond those of a traditional SMT processor) does not make sense, because for them to be effective, the instructions they hold must have fairly close temporal locality. Rather than increase them to disproportionate sizes and massively increase latencies, another level of storage outside the pipeline, called trace buffers, is included for every supported thread. The optimal size for the trace buffers is about 200 instructions per thread; 300 resulted in only a relatively minor boost in IPC over 200.[14]

One way (among four) in which Dynamic Multithreading breaks a sequential program into multiple threads is to search through the program for a loop and, when one is found, to look beyond the loop for an additional thread. If there is sufficient work beyond the loop boundary that does not depend upon the work done in the loop, the hardware creates another thread and speculatively executes it. Generally, the idea is to look ahead through the program and run as many portions of it as possible by speculatively creating new threads, as the sketch below illustrates.
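As a hypothetical illustration of the kind of code this targets, consider the fragment below: the work after the loop reads nothing the loop writes, so a DMT-style processor scanning ahead could spawn it as a speculative second thread while the loop is still running.

    #include <stdio.h>

    /* Hypothetical fragment of a single-threaded program. The post-loop
     * work is independent of the loop, making it a candidate for a
     * speculatively spawned hardware thread.                           */
    int main(void)
    {
        int table[1000];
        long checksum = 0;

        /* The loop the hardware is currently executing. */
        for (int i = 0; i < 1000; i++)
            table[i] = i * i;

        /* Work beyond the loop boundary that does not depend on the loop:
         * this is what could be executed speculatively as another thread. */
        for (int i = 1; i <= 100; i++)
            checksum += i;

        printf("table[999] = %d, checksum = %ld\n", table[999], checksum);
        return 0;
    }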

The last little trick that the MAJC architecture reveals is the same general idea as the above form of spawning new threads from a single thread. They've chosen to call it "Space Time Computing," [6] but the effect is the same - it spawns a new thread from an older one. The difference is that, because MAJC is not based on an SMT architecture (rather a hybrid between CMT and CMP), the newly created thread will instead be executed on another processor on the die.

What about Jackson Technology? Could it, too, be a form of Dynamic Multithreading? By using a trace cache, the Pentium 4 architecture in a sense creates quasi-threads, which are simply the paths of execution taken the last time the code was run. If different areas of the trace cache could be scanned for "threads," then a DMT processor might make use of the trace cache in forming threads.

Akkary's thesis didn't come out until 1998, and who knows whether the P4 was so far along in its design that it couldn't be reorganized to include DMT. SMT, on the other hand, was published around 1995, perhaps earlier (my earliest SMT-related source is from 1995), which is around the time of the introduction of the Pentium Pro - the P6 core still found in the Pentium III.

On the other hand, as Jackson Technology has still not appeared, perhaps Intel incorporated DMT in an unfinished state and disabled it so that it could be finished for later revisions. Either way, the timing seems to work out in favor of SMT, or perhaps even DMT, over CMP, which would be extremely expensive to produce (though not beyond Intel's abilities).

Although DMT takes a base SMT processor, which is already lengthened by 2 pipeline stages (pipelined register read and register write), the possibility is still open that an additional stage might have to be added so as not to significantly impact cycle time. However, even if this is the case, the additional stage showed only a ~5% performance loss relative to a DMT architecture without it.[15]

Overall, DMT was shown to increase performance of SPECInt 95 programs by 15% without changing the number of fetch ports or functional units, and by 30% with one additional fetch port. DMT, like SMT, shows more potential for speeding up integer applications than floating point applications. This is because integer programs tend to have more branches, and thus more times when having multiple threads is beneficial in hiding long latencies.

The DMT architecture described in Akkary's thesis is a form of speculative multithreading that operates on a single-threaded program: it reaches far into the program and achieves higher performance by running later parts of it on a base SMT pipeline. More recent research has shown that running multiple programs (or preexisting threads from a multithreaded program) using a traditional SMT approach, with the additional support of a DMT architecture (called Dynamic Simultaneous Multithreading, or DSMT), improves performance over a purely SMT processor by 5-15%, depending upon the number and type of applications. This works by spawning new threads via the DMT protocols when there are fewer threads than the processor has support for. [12]

Applications Of Multithreading: Redundancy Is Faster?

To compensate for the fact that IA-64 can't execute instructions out of order, one of the features Intel chose to use was predication. The idea is to do the work twice, once for each of the two possible outcomes of a branch (an if/else statement - for more information, see http://www.systemlogic.net/articles/00/9/ia64). This is actually useful because, since IA-64 calls for in-order processing, many functional units would otherwise sit idle.
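As a rough illustration in C (the article does not show IA-64 code, so this is only an analogy), predication amounts to computing both sides of a branch and selecting one result, rather than jumping over code:

    #include <stdio.h>

    /* Branchy version: the processor must predict which path is taken. */
    static int with_branch(int x)
    {
        int result;
        if (x > 0)
            result = x * 3;
        else
            result = -x;
        return result;
    }

    /* Predicated in spirit: both candidate values are computed, and the
     * condition merely selects between them, so there is no branch to
     * mispredict. (A compiler may emit a conditional move for this.)    */
    static int predicated(int x)
    {
        int taken     = x * 3;
        int not_taken = -x;
        return (x > 0) ? taken : not_taken;
    }

    int main(void)
    {
        printf("%d %d\n", with_branch(-5), predicated(-5));
        return 0;
    }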

For some branches, there is no good way to predict which path will be taken. An extension of SMT, called Threaded Multi-Path, does for hard-to-predict branches what predication does for instructions: instead of guessing, it executes both paths at once, as separate threads, and discards the unused result.[15]

Another processing concept was originally known as "Cooperative Redundant Threads" and is now called Slipstream processing. A Slipstream processor actually does the work of the whole program twice! And yet, it ends up running faster. The name stems from NASCAR, of all places (for the reasoning, go to http://www.tinker.ncsu.edu/ericro/slipstream).

Slipstreaming works by using two threads that start out exactly the same - the A-stream (advance stream), and the R-stream (redundant stream). What happens is that the R-stream remains as an unmodified thread of the original program, the hardware works to remove instructions that don't have any apparent effect, and the A-stream is then stripped accordingly.

As the shortened A-stream runs slightly ahead of the R-stream via a delay buffer, the R-stream is able to get information about how the program will execute, even before it executes! This, in a sense, is a real-time version of the schemes used by Intel with IA-64 and feedback-driven compiling (where the program is compiled, run, profiled, given to the compiler once more with information about the program, and then runs faster with the new executable).

The A-stream, now shortened, tells the (unmodified) R-stream which branches it should take. The A-stream runs faster because it is a smaller executable; current techniques have shown a decrease in instruction count of up to 50% on average.[16] The R-stream runs faster because many of its branches are resolved just as it needs the answers, so it even more rarely needs to rely on branch prediction.

The original means of creating a slipstream processor was a 2-way CMP chip with two simple CPUs, each with half the execution resources of the more robust processor one would normally design. With this approach, the slipstream CMP processor achieved a speedup of ~12% over the larger, traditional superscalar core[16] (though some programs did run substantially slower than on the superscalar). Thus there are ways of using the second CPU in a CMP processor even without additional threads to run. Additional performance increases are possible because the two smaller CPUs can run at higher clock rates, due to the lower complexity of each.

Another approach is to use a base SMT architecture, which is an extension of the large superscalar, and to run the A-stream and R-stream on that. The key here is that the SMT processor would act as a regular superscalar when there are no additional threads, and the scheme attempts to speed up a single thread via entirely different means than DMT. An interesting comparison would be between the performance (and design issues) of DMT and slipstreaming on an SMT processor (as both build on a base SMT pipeline), and between DMT and a similarly equipped CMP slipstream processor.

Summary Of The Forms Of Multithreading And Conclusion

Finally, a recap of all the forms of multithreading.


Here, a) is a traditional superscalar, b) a 2-way CMP (on-chip multiprocessor), c) a 4-way CMT (coarse-grained multithreading), d) a 4-way FMT (fine-grained multithreading), and e) a 4-way SMT (simultaneous multithreading). CMP uses two (or more) smaller cores to increase functional-unit efficiency (removing horizontal waste), and has shown up in new processors such as the Sun MAJC architecture and the IBM POWER4; the SledgeHammer from AMD should follow suit. CMT and FMT both use the ability to switch rapidly between threads to hide memory latencies and decrease vertical waste; CMT can be found in the MAJC architecture, and FMT in the Tera supercomputing architecture. SMT operates by running any thread, in any functional unit, on any clock cycle, thus removing both horizontal and vertical waste, and will be found in the Alpha 21464, and possibly in a later incarnation of the P4 architecture. DMT and slipstream processors both aim at increasing single-thread performance.

Though the many varied forms of multithreading take very different approaches, the goal is the same: higher real-world throughput. All of these techniques allow additional functional units to be added to a processor while showing something closer to returns to scale than to diminishing returns. While previous processors tended not to go far beyond 4 functional units per processor, due to diminishing returns, there are now techniques available which allow more units to be added and used efficiently. We shall see some strange architectures in the future….

Bibliography

[1] Lo, Jack Lee-jay. "Exploiting Thread-Level Parallelism on Simultaneous Multithreaded Processors," PhD dissertation, University of Washington, 1998. http://www.cs.washington.edu/homes/jlo/papers/dissertation.ps

[2] DeMone, Paul. "Alpha EV8 (Part 2): Simultaneous Multi-Threat." http://www.realworldtech.com/page.cfm?ArticleID=RWT122600000000. Dec. 26, 2000.

[3] Diefendorff, Keith. "Power4 Focuses on Memory Bandwidth - IBM Confronts IA-64, Says ISA Not Important," Microprocessor Report, Oct. 6, 1999.

[4] Agarwal, Anant et al. "APRIL: A Processor Architecture for Multiprocessing."

[5] Mazzucco, Paul. "The Fundamentals of Cache." http://www.systemlogic.net/articles/00/10/cache. Oct. 17, 2000.

[6] "MAJC Architecture tutorial," Sun Microsystems, White Paper. http://www.sun.com/microelectronics/MAJC/documentation/docs/majctutorial.pdf

[7] Alverson, et al. "The Tera Computer System." Tera Computer Company. http://www.cag.lcs.mit.edu/6.893-f2000/readings.html

[8] Tullsen, Dean M. et al. "Simultaneous Multithreading: Maximizing On-Chip Parallelism." Department of Computer Science and Engineering, University of Washington. http://www.cs.washington.edu/research/smt/papers/ISCA95.ps

[9] Lo, Jack L. et al. "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading." http://www.cs.washington.edu/research/smt/papers/tlp2ilp.final.pdf

[10] Lo, Jack L. et al. "Software-Directed Register Deallocation for Simultaneous Multithreaded Processors." Dept. of Computer Science and Engineering University of Washington. http://www.cs.washington.edu/research/smt/papers/regDealloc.ps

[11] Hily, Sébastien et al. "Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading." Campus Universitaire de Beaulieu. ftp://ftp.irisa.fr/techreports/1997/PI-1086.ps.gz

[12] Akkary, Haitham et al. "The Case for Speculative Multithreading on SMT Processors." Intel Microprocessor Research Labs. http://link.springer-ny.com/link/service/series/0558/papers/1940/19400059.pdf

[13] Mazzucco, Paul. "Why Intel is TwoFaced." http://www.systemlogic.net/articles/01/1/intel/page7.php. January 9th, 2001.

[14] Akkary, Haitham et al. "A Dynamic Multithreading Processor." http://iacoma.cs.uiuc.edu/CS497/SP7.pdf

[15] Akkary, Haitham. "A Dynamic Multithreading Processor," Ph.D. Dissertation, Department of Electrical and Computer Engineering, Portland State University, Technical Report PSU-ECE-199811, June 1998. http://www.ee.pdx.edu/~driscoll/techreports/PSU-ECE-199811.pdf

[16] Purser, Zach et al. "A Study of Slipstream Processors." North Carolina State University. Department of Electrical and Computer Engineering. http://www.tinker.ncsu.edu/ericro/slipstream/papers/slipstream_micro33.final.pdf

More resources on SMT can be found at http://www.cs.washington.edu/research/smt.
