SLCentral - Your logical choice for computing and technology
    Intel Pentium 4: In-Depth Technical Overview
    Author: Paul Mazzucco
    Date Posted: August 3rd, 2001

    Hardware Prefetch

    There is yet another reason, beyond the line-size of the Pentium 4, why the platform requires such enormous bandwidth: the hardware prefetch unit.

    With the Pentium III, Intel introduced software prefetch instructions, which allow a programmer to load data into a cache before it is needed. While this leaves less cache space for other potentially needed data, prefetching, if used wisely, can mask the latency of main memory. The processor stays busy with its current work while something it will need later is loaded into the cache, so it does not have to endure the painful delay of going to main memory when that time comes.
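
The latency-masking argument can be made concrete with a deliberately simple timing model. This is only a sketch: the function name, the cycle counts, and the notion of a "prefetch distance" measured in loop iterations are illustrative assumptions, not figures from Intel, and the model ignores warm-up iterations and bandwidth limits.

```python
def loop_cycles(iterations, work_cycles, memory_latency, prefetch_distance=0):
    """Estimate total cycles for a loop that needs one new cache line per iteration.

    prefetch_distance is how many iterations ahead the prefetch is issued;
    0 means no prefetch, so every iteration waits out the full memory latency.
    """
    # Work done between issuing the prefetch and using the data overlaps the fetch.
    hidden = prefetch_distance * work_cycles
    stall = max(0, memory_latency - hidden)
    return iterations * (work_cycles + stall)


# Hypothetical numbers, chosen only for illustration.
no_prefetch = loop_cycles(1000, work_cycles=50, memory_latency=200)
with_prefetch = loop_cycles(1000, work_cycles=50, memory_latency=200,
                            prefetch_distance=4)
print(no_prefetch, with_prefetch)  # 250000 50000
```

With these made-up values, prefetching four iterations ahead covers the whole 200-cycle latency, so the loop is bounded by its own work rather than by memory.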

    The hardware prefetch of the Pentium 4 extends this a bit further. Because it is hardware based, it requires no support from the program. For the same reason, there is no code dilution, since no prefetch instructions need to be added to the program! However, there is a downside to prefetching.
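
To illustrate why a hardware prefetcher needs no help from the program, here is a toy simulation of a generic "next-line" prefetcher. It is not the Pentium 4's actual prefetch algorithm; the 64-byte line size, the unbounded cache, and the assumption that prefetched lines arrive instantly are all simplifications made for the sketch.

```python
def hit_rate(addresses, line_size=64, prefetch_next_line=False):
    """Fraction of accesses that hit in an idealized, unbounded cache."""
    cached_lines = set()
    hits = 0
    for addr in addresses:
        line = addr // line_size
        if line in cached_lines:
            hits += 1
        else:
            cached_lines.add(line)  # demand fetch on a miss
        if prefetch_next_line:
            # The "hardware" fetches the following line without being asked.
            cached_lines.add(line + 1)
    return hits / len(addresses)


# The same access trace is used in both runs: the "program" is unchanged.
trace = list(range(0, 64 * 100, 8))  # sequential 8-byte reads over 100 lines

print(hit_rate(trace))                           # 0.875
print(hit_rate(trace, prefetch_next_line=True))  # 0.99875
```

The trace is identical in both runs; only the simulated hardware changes. That is the property described above: existing code benefits without modification.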

    Prefetching of any sort uses up bandwidth, simply because it loads instructions and data from memory. When a program is bandwidth constrained, this can decrease performance because of contention for main-memory bandwidth. However, when paired with a great deal of memory bandwidth, prefetching doesn't take any "needed" bandwidth away from the fetching of other instructions and data. In this way, prefetching can soak up "excess" bandwidth and do something useful: load instructions and data into a cache before they are needed, thus increasing the cache's hit rate, which in turn means that the average memory access time decreases. And, because this requires no effort on the part of the programmer, hardware prefetch allows existing programs to make use of the mammoth bandwidth afforded by a dual-channel PC800 system.
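
The link between hit rate and access time can be sketched with the standard average-memory-access-time formula (hit time plus miss rate times miss penalty). The cycle counts and hit rates below are hypothetical, picked only to show the direction of the effect; they are not measurements of any Pentium 4 system.

```python
def average_access_time(hit_time, hit_rate, miss_penalty):
    """Average memory access time in cycles.

    Every access pays hit_time; the fraction that misses also pays miss_penalty.
    """
    return hit_time + (1.0 - hit_rate) * miss_penalty


# Hypothetical values: a 2-cycle cache hit and a 200-cycle trip to main memory.
before = average_access_time(hit_time=2, hit_rate=0.90, miss_penalty=200)
after = average_access_time(hit_time=2, hit_rate=0.95, miss_penalty=200)
print(before, after)
```

Under these assumed numbers, raising the hit rate from 90% to 95% cuts the average access time from roughly 22 cycles to roughly 12, which is why spending otherwise idle bandwidth on prefetching can pay off.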

    Some Of The "Guts"

    Now that we have covered the basics of how the Pentium 4 gets its data, and why it needs to be able to grab lots of it at a time, we'll slide over to an area most people would start with: the execution resources afforded by the processor.

    In brief:

    • 2 "Double Pumped" ALUs (Arithmetic Logic Units: add, subtract, logical AND, logical OR). These run at twice the core clock, so one benefit is that the Pentium 4 gets the same performance out of half the area (together they can execute 4 instructions per base CPU cycle, though the processor is constrained to 3 uops issued per cycle).
    • 2 FPUs: one for FPU loads and stores, the other for FPU adds and subtracts.
    • 126-entry reorder buffer: the processor has a window of 126 instructions in which to search for, and execute, non-data-dependent instructions. This helps to hide latencies.
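
The arithmetic behind these figures is short enough to write out. The numbers (2 ALUs, double pumped, 3 uops per cycle, 126 entries) come from the list above; the function names are made up for this sketch, and the "cycles covered" figure is only a rough way of sizing the window, not a measured latency tolerance.

```python
def peak_alu_ops_per_cycle(alu_count=2, ops_per_alu_per_cycle=2):
    """Peak simple-integer operations per base clock for double-pumped ALUs."""
    return alu_count * ops_per_alu_per_cycle


def sustained_ops_per_cycle(peak_ops, issue_width=3):
    """The execution units cannot be fed faster than uops are issued."""
    return min(peak_ops, issue_width)


def cycles_covered_by_window(window_entries=126, issue_width=3):
    """How many cycles of issue it takes to fill the reorder window."""
    return window_entries // issue_width


peak = peak_alu_ops_per_cycle()
print(peak)                           # 4
print(sustained_ops_per_cycle(peak))  # 3
print(cycles_covered_by_window())     # 42
```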

    The Pentium 4 has fewer integer units than the Athlon, and it has fewer floating-point units as well. Moreover, the Pentium 4 no longer has FXCH (an instruction which shuffles data around in the archaic x86 FPU stack) for "free," which the Pentium III and Athlon do have. As software has been optimized for the Pentium III, and a little bit for the Athlon, this means that some optimizations made for prior processors will actually degrade performance on the Pentium 4.

    Also, the FMUL instruction is no longer fully pipelined, and many instructions have longer execution latencies. This means that the Pentium 4 has fewer execution resources than the Athlon, it takes longer to complete the instructions that it can issue, and it "undoes" some of the optimizations that software vendors have been making since the Pentium Pro days. On the other hand, the Pentium 4 should theoretically be able to deal with streaming data, and with large data sets in particular, better than the Pentium III and Athlon, due to the massive bandwidth that it has at all levels. Despite the view that the Pentium 4 is "crippled," it does have a way to make up for the lack of solid floating-point performance: more SIMD instructions!
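
The cost of a multiplier that cannot accept a new operation every cycle can be sketched with a simple throughput model. The latency and issue-interval values below are hypothetical and are not the Pentium 4's actual FMUL timings; the model also assumes the operations are independent of one another.

```python
def cycles_for_independent_ops(count, latency, issue_interval):
    """Cycles to finish `count` independent operations on one execution unit.

    issue_interval is how many cycles must pass before the unit accepts the
    next operation: 1 for a fully pipelined unit, `latency` for an
    unpipelined one.
    """
    if count == 0:
        return 0
    return latency + (count - 1) * issue_interval


# Hypothetical 5-cycle multiply, 100 independent multiplies.
pipelined = cycles_for_independent_ops(100, latency=5, issue_interval=1)
unpipelined = cycles_for_independent_ops(100, latency=5, issue_interval=5)
print(pipelined, unpipelined)  # 104 500
```

Even with identical latency, the unit that accepts work less often takes several times longer on a long run of multiplies, which is why code tuned to keep a pipelined multiplier busy can fare worse here.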

    Copyright 1998-2007 SLCentral. All Rights Reserved.