Intel introduced "SWAR" (SIMD Within A Register) to the x86 world back in the days of MMX. SIMD itself is older, having existed in vector computers such as those produced by Cray. The basic idea is to apply one instruction to multiple data elements at once (Single Instruction, Multiple Data), increasing throughput and reducing instruction count. While MMX was integer-only and required the use of the floating point registers (as did 3DNow!), SSE and iSSE2 bring floating point SIMD to the x86 world and use eight 128-bit registers of their own, so the floating point registers no longer have to do double duty.
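To make the "within a register" idea concrete, here is a minimal sketch in portable C that treats an ordinary 32-bit integer as four packed 8-bit lanes and adds all four with a handful of plain integer operations. It is an illustration of the concept only, not MMX code; the function name `add_u8x4` and the mask constants are invented for this example.

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed into one 32-bit word.
   Each lane wraps modulo 256, and no carry leaks into its neighbour.
   (Hypothetical helper for illustration; real MMX does this in hardware.) */
static uint32_t add_u8x4(uint32_t a, uint32_t b)
{
    const uint32_t low7 = 0x7F7F7F7Fu;  /* low 7 bits of every lane */
    const uint32_t top1 = 0x80808080u;  /* top bit of every lane */

    /* Adding only the low 7 bits means a carry can reach bit 7 of its own
       lane but can never cross into the next lane. */
    uint32_t sum = (a & low7) + (b & low7);

    /* Fold the top bits back in with XOR, which adds them without
       producing a carry out of the lane. */
    return sum ^ ((a ^ b) & top1);
}
```

For example, `add_u8x4(0x01FF7F10, 0x01010101)` gives `0x02008011`: the lanes 0x01, 0xFF, 0x7F and 0x10 each have 1 added independently, and the 0xFF lane wraps to 0x00 without disturbing the lane above it. A dedicated SIMD instruction set does the same job in a single instruction, with no masking tricks.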
iSSE2 theoretically gives the Pentium 4 the same throughput on some floating point code as the Athlon achieves in pure x86 mode. Beyond that, iSSE2 allows both 64-bit integer and double precision floating point operations to be done in SIMD mode. In theory this means code size could decrease somewhat, though the saving is likely to be negligible even in optimized code.
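As a rough sketch of what double precision SIMD looks like to a programmer, the C function below adds two arrays of doubles two elements at a time using the SSE2 compiler intrinsics from `<emmintrin.h>`. The function name `add_f64` is made up for this example, and the `__SSE2__` check is an assumption about the compiler (GCC and Clang define it); where it is not defined, the code simply falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* out[i] = a[i] + b[i] for n doubles (hypothetical helper for illustration). */
static void add_f64(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Two doubles fit in one 128-bit register, so each pass of this
       loop handles two elements with a single packed add. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* unaligned load of 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
#endif
    /* Scalar tail for an odd element, and the whole loop on targets
       built without SSE2. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The packed loop issues half as many add instructions as the scalar one, which is where the theoretical throughput gain comes from. Whether that shows up in practice depends on the compiler and on how memory-bound the code is.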
Despite the apparent disappointment with the FPU performance of the Pentium 4, aceshardware has done a compiler analysis showing that the Pentium 4 is capable of much higher floating point performance when code is optimized for iSSE2. Another important point about both SSE and iSSE2 (the Pentium 4 contains both) is that they are fully IEEE compliant. This means that results will be exactly as defined by a specific, universal standard, IEEE 754. Plain x86 floating point code does this by itself (a processor has to in order to be considered x86 compatible), but 3DNow! in single precision mode does not: its results can be off by a very small fraction. This can cause problems in tests that need to be highly precise. Double precision can take care of that for 3DNow!, though it slows things down.
The precision afforded by 3DNow! is certainly enough for games, but in many scientific applications anything short of full IEEE compliance can ruin results. As such, iSSE and iSSE2 have the potential to be more useful, simply because they give exactly the results one would expect.
Despite all the optimizations that are theoretically possible, Intel claims that on SPEC2000, a common workstation benchmark, only about 5% of the Pentium 4's phenomenal absolute performance is due to SIMD optimizations! This is in part because some portions of SPEC2000 are bandwidth limited, which is where the Pentium 4's quad-pumped bus, with its 3.2 gigabytes per second of main memory bandwidth, becomes very useful.