SLCentral - Your logical choice for computing and technology
    SLCentral > Articles > Tech Explanations
    Intel Pentium 4: In-Depth Technical Overview
    Author: Paul Mazzucco
    Date Posted: August 3rd, 2001

    Bandwidth And The Line-Sizes

    A member of our forums had this to say about the Pentium 4: "I have to say that I think the whole idea of the P4 was based on the extreme memory bandwidth..." This statement holds true for a great number of the Pentium 4's design decisions. No desktop platform can rival the Pentium 4 in sheer bandwidth, and as we'll explain, this is a very forward-looking design decision.

    It's important to know the line-sizes that a CPU architecture uses. Whenever a processor searches for a piece of data, it works its way down the memory hierarchy: if an instruction or data element misses in the L1 cache, the processor searches the L2 cache. If it finds it in the L2 cache, it grabs not only that data element but also a number of physically contiguous elements. The reason for this is a concept called "spatial locality," which states that data and code that are physically close to each other are often needed at about the same time.
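    The line-fill behavior just described can be sketched in a few lines of Python (the 64-byte line-size and the addresses here are illustrative, not any particular CPU's):

```python
# Sketch of a cache-line fill: a miss on one address pulls in the whole
# aligned line of physically contiguous bytes.
LINE_SIZE = 64  # bytes; an illustrative value, not any particular CPU's

def line_fill(addr, line_size=LINE_SIZE):
    """Byte addresses copied up the memory hierarchy on a miss at addr."""
    base = addr - (addr % line_size)      # align down to the line boundary
    return range(base, base + line_size)  # the whole line comes along

# A miss on byte 100 drags in bytes 64..127, not just byte 100 itself.
filled = line_fill(100)
print(min(filled), max(filled))  # -> 64 127
```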

    Memory latencies are detrimental to maintaining peak processing efficiency, so it makes sense to hide them as much as possible. This is why CPUs and caches fetch more than one data element or instruction at a time - the processor will then likely be able to use data that was brought in ahead of actually being requested (note that this is still distinctly different from hardware prefetch). At the same time, complex data structures (such as nested structures in C, and large objects in C++) can have very negative effects.

    Complex data structures are stored contiguously just as simple arrays are; however, their usage tends to be quite distinct, and they behave differently. It is far more common to process one element of an object or structure and then move on to the next piece - especially in linked lists, where traversing the list means accessing a pointer, doing some comparison, and likely moving on. Arrays, on the other hand, are more likely to have their data elements accessed one right after another (especially in strings). Where the line-size comes into play is that the more complex the data structure, the less likely it is for all of its elements to be needed; yet a line fill grabs a bunch of extra data elements that have to be transferred, and in complex data structures, much of what is transferred is likely to do nothing more than waste space in precious caches.
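    As a rough sketch of that difference, assuming a 32-byte line and 4-byte pointers and integers (illustrative sizes):

```python
# Fraction of a fetched cache line actually consumed (illustrative model).
LINE_SIZE = 32  # bytes

def line_utilization(bytes_used, line_size=LINE_SIZE):
    return bytes_used / line_size

# Array scan: all eight 4-byte ints in the line are consumed in turn.
array_util = line_utilization(8 * 4)  # -> 1.0
# Linked-list probe: only a 4-byte pointer and a 4-byte key are touched.
list_util = line_utilization(2 * 4)   # -> 0.25
print(array_util, list_util)
```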

    A graphical example is shown below:

    [Figure: a section of memory divided into 32-bit cells; red cells are copied by a cache-line fill, blue dots mark the data actually needed]
    In this case we have a small section of memory, broken up into 32-bit sections (4 bytes, represented by one square unit). Regular integers on 32-bit machines occupy 32 bits of space. Red cells represent those elements that have their contents copied to the next (higher) layer of the memory hierarchy, and cells with the blue spot represent the data that is (or will shortly be) needed.

    Suppose we have a processor that uses 32-byte lines (such as the Pentium III), and suppose the first element that needs to be processed is in the lower left-hand corner of 'a'. Suppose the program calls for the following seven data elements right afterward as well (hence why they are marked with blue). In a case such as this, the 32-byte cache-line works perfectly: all the data elements that need to be accessed are in an array (as an example). The processor then makes a similar grab at data and uses it in the same way. Fetching all 32 bytes at once paid off, as the processor doesn't have to go back for more data (which takes a lot of time from the CPU's perspective). Here, no bandwidth was wasted.

    Now we'll look at the opposite extreme. Here, perhaps a program is searching through a linked list: it grabs the pointer to a structure, compares an integer to a value (hence the two data elements being "used"), and moves on because it hasn't found the necessary node of the list. In this case, full 32-byte lines are still being dragged to the next higher layer of the memory hierarchy, even though only 8 bytes are being used (the pointer and the integer). The search wasted a lot of bandwidth - 3/4 of it, in fact. Only 1/4 of the data brought in by each cache-line fill was actually used, so much space and bandwidth went to waste.

    We'll take a look at both of these scenarios with a larger line-size (128 bytes), à la the Pentium 4.

    Perhaps the first thing that comes to mind is just how much more red these graphs have! Let's walk through this as we did above. We take the same data in the same organization in memory, but change the line-size. Here, the line-size is 128 bytes, which means two columns (rather than just half a column, as with a 32-byte line). While this isn't truly representative of reality (caches, as small as they are, have thousands or even millions of such "units" or cells), it illustrates the point.

    [Figure: the same section of memory with 128-byte line fills]
    In the first case, there are two arrays whose data elements are all used, with some space between them; a 128-byte line gets both arrays at the same time. This means the higher levels don't have to experience a cache miss before moving to the data in the second array, while the 32-byte-line design would! This greatly decreases average memory access latency for contiguously used data. Even so, half the bandwidth in this case was wasted, as half of the data transferred wasn't used.
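    A toy model makes the miss counts concrete. Here, two 8-integer arrays sit a short gap apart (hypothetical addresses); counting the distinct lines touched shows one 128-byte fill covering what takes two 32-byte fills:

```python
def line_fills(addresses, line_size):
    """Number of distinct cache lines touched by these byte addresses."""
    return len({addr // line_size for addr in addresses})

# Two arrays of eight 4-byte integers with a gap between them.
first  = [4 * i for i in range(8)]        # bytes 0..31
second = [64 + 4 * i for i in range(8)]   # bytes 64..95, after a gap
addrs = first + second

print(line_fills(addrs, 32))   # -> 2 fills with 32-byte lines
print(line_fills(addrs, 128))  # -> 1 fill: one 128-byte line covers both
```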

    However, the story changes when moving toward code that "jumps" around a lot, as is potentially the case with linked lists (because they are often used to grow the number of nodes dynamically, linked lists are often not contiguous in memory, unlike arrays). In a case such as this, much spatial locality is lost. Here, compared to a system with merely a 32-byte line, the amount of wasted bandwidth is staggering. While the 32-byte-line system wasted 3/4 of its bandwidth, the 128-byte-line system would waste 15/16 of it - five times as many wasted bytes per fill! This is of course an extreme example, but it proves the point.
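    The arithmetic behind those fractions, assuming the same 8 useful bytes (pointer plus integer) per fill:

```python
USED = 8  # bytes actually consumed per fill (pointer + integer)

waste_32  = 1 - USED / 32     # -> 0.75: 3/4 of a 32-byte fill wasted
waste_128 = 1 - USED / 128    # -> 0.9375: 15/16 of a 128-byte fill wasted
wasted_bytes_ratio = (128 - USED) / (32 - USED)  # -> 5.0 times the wasted bytes
print(waste_32, waste_128, wasted_bytes_ratio)
```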

    So one should easily be able to see one reason why the Pentium 4 requires so much bandwidth: in some cases, it wastes a great deal of it.

    Programs are constantly becoming more complex, with more abstract, intricate, and large structures and objects, especially in applications whose dataset size must be variable. This means that sometimes only a small piece of an object or structure will be used at a time, which causes ever-increasing waste of bandwidth when large line-sizes are used.

    As I've said in other articles, latency and bandwidth go hand in hand. With large line-sizes, the miss-penalty is much greater, because if there isn't much bandwidth, it takes longer for the whole block (the size of a line) to be transferred; and because it takes longer to send a block, the latency that the processor sees increases. To combat the penalties induced by such waste, a system that can send much larger amounts of data at a time will have a much-reduced miss-penalty, because it doesn't take as long to transfer the block (the line). In this case, greatly increased bandwidth stops the bleeding that the Pentium 4 would otherwise suffer with integer and "jumpy" programs.
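    This trade-off can be written as a toy miss-penalty model: a fixed access latency plus the time to stream the whole line across the bus. All numbers below are illustrative, not measured:

```python
def miss_penalty_ns(line_bytes, bandwidth_gb_s, latency_ns=50.0):
    """Toy model: fixed access latency plus line transfer time.
    1 GB/s is treated as 1 byte/ns for simplicity."""
    transfer_ns = line_bytes / bandwidth_gb_s
    return latency_ns + transfer_ns

# The same 128-byte line hurts far less on a higher-bandwidth bus:
print(miss_penalty_ns(128, 1.0))  # -> 178.0 ns on a 1 GB/s bus
print(miss_penalty_ns(128, 3.2))  # -> 90.0 ns on a 3.2 GB/s bus
```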

    On the other hand, with streaming data, the large bandwidth and huge line-sizes don't merely "stop the bleeding" - they allow greatly enhanced peak performance. Many floating-point-intensive applications don't tend to use objects that are as nested or complex, and their data tends to be more contiguous. Add to that the fact that with double-precision floating point, the smallest data element is 64 bits (8 bytes), twice the size of an integer. Thus, fewer elements fit in each line transfer, so having more bandwidth and larger line-sizes increases performance dramatically (assuming the processor has a strong FPU).
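    A quick check of that element count, assuming 4-byte integers, 8-byte doubles, and the Pentium 4's 128-byte line:

```python
LINE = 128  # bytes per line

ints_per_line = LINE // 4     # -> 32 four-byte integers per line
doubles_per_line = LINE // 8  # -> 16 double-precision values per line
print(ints_per_line, doubles_per_line)
```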

    So where do these design decisions come into play? There has always been a balancing act between the line-size and the miss-penalty. The more bandwidth a system has, the more the miss-penalty is reduced, and the balancing point shifts toward a larger line-size. Such is the case with the Pentium 4. This is why it is so important that the Pentium 4 have so much bandwidth at all levels of the memory hierarchy. It is no wonder that this chip was designed with RDRAM in mind, as RDRAM offers incredible bandwidth per pin (which is becoming important). Just for reference, the Athlon uses 64-byte line-sizes at all levels of the memory hierarchy.

    >> Hardware Prefetch/Some Of The "Guts"

    Article Navigation

    1. Introduction/Hyper Pipelined/Branch Prediction
    2. The P4's Caches
    3. Bandwidth And The Line-Sizes
    4. Hardware Prefetch/Some Of The "Guts"
    5. iSSE2
    6. Thermal Protection
    7. Conclusion
    8. Bibliography

    Copyright 1998-2007 SLCentral. All Rights Reserved.