    Fundamentals Of Multithreading
    Author: Paul Mazzucco
    Date Posted: June 15th, 2001

    Fine-Grained Multithreading

    A more aggressive approach to multithreading is called fine-grained multithreading (FMT). Like CMT, the basis of FMT is to switch rapidly between threads; unlike CMT, however, the idea is to switch on each and every cycle. While both CMT and FMT slow the completion of any single thread, FMT expedites the completion of all the threads being worked on, and it is overall throughput that generally matters most.

    Suppose there is an n-way FMT processor. The processor cycles through its threads in round-robin order: instructions from thread one are executed on one cycle, instructions from thread two on the next, and so on, returning to thread one after n cycles. What this accomplishes is to hide very long latencies. A graphical representation is shown below:

    Figure: 'a' is a 4-way fine-grained multithreading processor; 'b' is a traditional superscalar.
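    The round-robin issue pattern can be sketched in a few lines of Python (a toy model, not any real scheduler): each cycle the processor issues from the next thread in turn, so a given thread is revisited only every n cycles, and any operation whose latency fits inside that gap is hidden.

```python
def fmt_schedule(num_threads, cycles):
    """Thread issued on each cycle under round-robin fine-grained MT."""
    return [cycle % num_threads for cycle in range(cycles)]

# 4-way FMT, as in panel 'a' of the figure: threads 0-3 rotate every cycle.
trace = fmt_schedule(4, 8)
print(trace)  # [0, 1, 2, 3, 0, 1, 2, 3]

# Thread 0 is revisited every 4 cycles, so a load issued by thread 0 with
# a latency of up to 4 cycles completes before the thread runs again.
revisit_gap = trace.index(0, 1) - trace.index(0)
print(revisit_gap)  # 4
```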

    The Tera architecture is an example of FMT. As stated before, the idea is to hide long latencies and to eliminate vertical waste. Supercomputers are known for occupying a great deal of space and being connected via high-speed networks to operate in tandem. The Tera supercomputer was no different, though the architecture now lives on in the Cray MTA (Multi-Threaded Architecture). The expected average latency for an instruction was 70 cycles, which included the fraction of the time the data might need to come from across the network. To hide latencies such as these, most architectures use caches, which reduce the number of times each processor must access main memory. The Tera architecture, however, is essentially cacheless: the Cray MTA has L1 and L2 instruction caches but lacks any data caches. To combat large latencies, each processor can instead store the state of up to 128 separate threads. Like any other traditional FMT design, if there are fewer than the maximum supported threads, the processor simply cycles through those that are available.[7]

    In order for the Cray MTA architecture to achieve and maintain peak performance under FMT, there must be at least as many threads as the average latency, in cycles, of each word request. In the case of the MTA architecture, this constitutes roughly 70 threads! While certainly not commonplace in traditional applications, in the supercomputing arena it is not unthinkable to have that many threads available to execute.

    However, the Cray MTA is more of a hybrid architecture: it acts as a CMT processor when it can. Each instruction carries a 3-bit tag telling the processor how many more instructions can be expected from that thread before a dependent instruction is encountered; the processor keeps issuing from that thread until the dependent instruction is reached, then switches to a different thread on the next clock cycle. For example, to cover the 70-cycle latency, ten threads each supplying a run of seven independent instructions would suffice.[7] With both features, the Cray MTA shows how CMT and FMT together can hide very long latencies, entirely without the use of data caches.
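    The thread counts above follow from a simple coverage argument: with round-robin issue and a run of r independent instructions per visit, roughly L / r threads are needed in total to fill an L-cycle latency with useful work. A minimal sketch (the function name is illustrative, not taken from any MTA documentation):

```python
import math

def threads_to_hide(latency_cycles, run_length=1):
    """Threads needed so round-robin issue covers a given latency.

    Each visit to a thread issues `run_length` back-to-back independent
    instructions, so `num_threads * run_length` cycles pass between
    visits to the same thread; that product must reach `latency_cycles`.
    """
    return math.ceil(latency_cycles / run_length)

print(threads_to_hide(70))     # 70: one instruction per visit, pure FMT
print(threads_to_hide(70, 7))  # 10: runs of seven, the hybrid CMT/FMT case
```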

    Even with enough threads to keep an FMT processor stall-free, research has shown that on an 8-issue processor, at best only about 40% of the functional units are actually used (roughly 3.2 instructions per clock cycle).[8]
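    That 40% figure translates directly into sustained instructions per cycle on an 8-issue machine:

```python
issue_width = 8       # functional-unit slots available per cycle
utilization = 0.40    # best-case fraction actually used, per [8]
print(issue_width * utilization)  # 3.2 sustained instructions per cycle
```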

    >> Simultaneous Multithreading


    Article Navigation
    1. Introduction/Amdahl's Law
    2. Latencies And Bandwidth
    3. Latencies And Bandwidth Cont.
    4. ILP Background
    5. On-Chip Multiprocessing
    6. Coarse-Grained Multithreading
    7. Fine-Grained Multithreading
    8. Simultaneous Multithreading
    9. SMT Induced Changes/Concerns About SMT
    10. Jackson Technology And SMT
    11. Applications Of Multithreading: Dynamic Multithreading
    12. Applications Of Multithreading: Redundancy Is Faster?
    13. Summary Of The Forms Of Multithreading And Conclusion
    14. Bibliography
    Copyright 1998-2007 SLCentral. All Rights Reserved.