In order to make a fair comparison between an on-chip multiprocessor (CMP) and a uniprocessor, one must compare similar architectural features: both the uniprocessor and the CMP chip should have a similar aggregate number of functional units, registers, and renaming registers. That is to say, on a die that has two CPUs on it, each individual CPU has half the registers of the single-CPU die, but because there are two processors on the CMP chip, the total resources are the same. These processors can also have an on-die L2 cache, which would be shared. If the L1 caches were write-through, the cache-coherency problem between the two processors would be much easier to solve, because the shared L2 would always hold the current copy of every line (a stale copy in the other processor's L1 would still have to be invalidated).
The general concept behind using multiple cores on one die is to extract more performance by executing two threads at once. By doing so, the two cores together are able to keep a higher percentage of the aggregate number of functional units doing useful work at all times. An example is shown below.
Pictures adapted from Jack Lo's PhD dissertation, and Paul DeMone's "Simultaneous Multi-threat."  a) conventional superscalar CPU, b) a 2 CPU multiprocessor.
To explain the context-switch code, I defer to Mr. DeMone's explanation:
Each thread runs for a short interval that ends when the program experiences an exception like a page fault, calls an operating system function, or is interrupted by an interval timer. When a thread is interrupted, a short segment of OS code (shown in Figure 1A as gray instructions in issue slots) is run which performs a context switch and switches execution to a new thread. Multitasking provides the illusion of simultaneous execution of multiple threads but does nothing to enhance the overall computational capability of the processor. In fact, excessive context switching causes processor cycles, which could have been used running user code, to be wasted in the OS.
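The cost of excessive context switching can be put into rough numbers. The sketch below is only an illustration: the function name, the one-switch-per-interval assumption, and the cycle counts are all hypothetical and do not come from Mr. DeMone's article.

```python
def os_overhead_fraction(quantum_cycles, switch_cycles):
    """Fraction of all cycles spent in context-switch code.

    Assumes every run interval of `quantum_cycles` user cycles is followed
    by exactly one switch that costs `switch_cycles` cycles.
    """
    return switch_cycles / (quantum_cycles + switch_cycles)


# Hypothetical numbers, chosen only to show the trend.
SWITCH_CYCLES = 2_000
for quantum in (1_000_000, 100_000, 10_000):
    frac = os_overhead_fraction(quantum, SWITCH_CYCLES)
    print(f"quantum {quantum:>9,} cycles -> {frac:.2%} of cycles in the OS")
```

The shorter the interval between switches, the larger the share of cycles that goes to the OS instead of user code, which is the waste the quoted passage describes.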
The more functional units a processor has, the lower the percentage of those units doing useful work at any given time. The on-chip multi-processor lowers the number of functional units per processor, and distributes separate tasks (or threads) to each processor. In this way, it is able to achieve a higher throughput on both tasks combined. The comparable uniprocessor would be able to get through one thread, or task, faster than a CMP chip could because, although it wastes functional units much of the time, it also has "bursts" of activity in which it computes multiple pieces of data at the same time and uses all available functional units. The idea behind multi-processors is to keep the processor from experiencing such "bursty" activity, and instead to use what it has more frequently, and therefore more efficiently. The non-use of some of the functional units during a clock cycle is known as horizontal waste, which CMP tries to avoid.
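The trade-off can be illustrated with a toy issue-slot model. This is a sketch under stated assumptions, not a model of any real processor: each thread is a hypothetical list of dependent instruction groups, a group of g independent instructions takes ceil(g / width) cycles on a core that issues `width` instructions per cycle, and cache misses and context-switch cost are ignored. All names and workloads below are made up for the example.

```python
from math import ceil


def cycles_for(groups, width):
    """Cycles to issue a thread whose instructions arrive in dependent groups.

    Each entry in `groups` is the number of independent instructions that
    could issue together; a group must finish before the next one starts.
    """
    return sum(ceil(g / width) for g in groups)


def uniprocessor(threads, width):
    """One wide core runs the threads back to back (switch cost ignored)."""
    cycles = sum(cycles_for(t, width) for t in threads)
    issued = sum(sum(t) for t in threads)
    return cycles, issued / (width * cycles)


def cmp_chip(threads, width_per_core):
    """One narrow core per thread; all cores run concurrently."""
    cycles = max(cycles_for(t, width_per_core) for t in threads)
    issued = sum(sum(t) for t in threads)
    total_width = width_per_core * len(threads)
    return cycles, issued / (total_width * cycles)


# Hypothetical workloads: mostly low parallelism with occasional bursts of 8.
thread_a = [8, 1, 2, 1, 8, 1, 1, 2]
thread_b = [2, 1, 1, 8, 1, 2, 1, 8]

uni_cycles, uni_util = uniprocessor([thread_a, thread_b], width=8)
cmp_cycles, cmp_util = cmp_chip([thread_a, thread_b], width_per_core=4)

print(f"8-wide uniprocessor: {uni_cycles} cycles, {uni_util:.1%} of issue slots used")
print(f"2 x 4-wide CMP:      {cmp_cycles} cycles, {cmp_util:.1%} of issue slots used")
print(f"thread A alone: {cycles_for(thread_a, 8)} cycles on the wide core, "
      f"{cycles_for(thread_a, 4)} on one narrow core")
```

With these made-up workloads, the 8-wide core needs 16 cycles for both threads and fills 37.5% of its issue slots, while the two 4-wide cores finish both in 10 cycles and fill 60%. Thread A by itself, however, takes 8 cycles on the wide core and 10 on a narrow one, so the uniprocessor still wins on a single thread.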
Another advantage of using a CMP chip instead of a larger, more robust uniprocessor is that a smaller, less complex chip is easier to design. This is useful in a couple of ways: one, it allows the designers to spend less time on the chip (and thus time to market is shorter); and two, less complex, smaller processors tend to be able to run at a higher clock frequency. In this way, a CMP chip in a multithreaded (or multiprogrammed) environment is able to execute faster, both because it uses the available resources more efficiently across the various threads and because of the potential to increase the clock rate over that of a monolithic processor.
The MAJC architecture from Sun Microsystems makes use of CMP. It allows one to four processors to share the same die, and for each to run separate threads. Each processor is limited to 4 functional units (each of which is able to execute both integer and floating point operations, making the MAJC architecture more flexible).
Another example of an on-chip multi-processor is the Power4 processor from IBM. This architecture does not follow the philosophy of using smaller, easier-to-implement CPUs. Instead, it uses processors that, in and of themselves, could be considered full-fledged server chips. Even so, IBM has chosen to put two onto each die, which should have a die size of ~400mm^2 (smaller than the HP PA-8500, with the same amount of on-die cache).
There are problems with CMP, however. The traditional CMP chip sacrifices single-thread performance in order to expedite the completion of two or more threads. In this way, a CMP chip is comparatively less flexible for general use, because if there is only one thread, an entire half of the allotted resources is idle and completely useless (just as adding another processor is useless in a traditional SMP system running a single-threaded program). Another approach to making the CPU's functional units more efficient is called coarse-grained multithreading.
>> Coarse-Grained Multithreading