EnlightenedOsote's blog: "TECH."

created on 07/01/2007  |  http://fubar.com/tech/b97754
By Jon Stokes | Published: April 08, 2008 - 11:47PM CT

Nehalem's core

The basic building block of Intel's Nehalem family is a new version of the Core microarchitecture, which sports a number of major changes from its Core 2 Duo incarnation. In fact, Nehalem's core represents the biggest overhaul that this microarchitecture has undergone since the transition from Core to Core 2. Most areas of the processor have undergone major revisions in order to take advantage of the amount of bandwidth made available by QuickPath. The important exception here is the execution hardware, which, except for the addition of some floating-point and integer shuffle blocks on port 5, is substantially unchanged from Core 2.

The execution engine and instruction window

At a high level, you can think of Nehalem as a design that takes the very wide, extremely robust execution engine from its predecessor, the Core 2 Duo, and focuses on keeping it as busy as possible by feeding it code and data at an unprecedented rate.

[Figure: The Core 2 Duo's execution engine, which is substantially the same as Nehalem's.]

Another, related way to conceive of Nehalem's overall goal is to imagine that the Core 2 Duo's thirsty execution engine has been separated from the pools of code and data that lie in main memory by relatively thin pipes (the frontside bus and cache hierarchy) and a strong pump (the front end and memory unit) that does the best it can to keep instructions and data flowing given the circumstances. Nehalem, then, is all about replacing the plumbing with very wide pipes and beefing up the pump in order to take full advantage of all this new capacity; this way, the execution engine can get much closer to reaching its full potential.

In terms of keeping the execution engine fed, the return of simultaneous multithreading (SMT) to Intel's mainstream product line has an important impact on Nehalem's design.
By letting each core on the die run two instruction streams at the same time, SMT increases overall system bandwidth usage and keeps the core's execution units busier with code and data, so that they waste less time (and power) sitting idle each cycle. Because of the increased flow of instructions and data through the core that SMT enables, the (re)introduction of SMT meant that buffers on both the instruction and data sides of Nehalem had to be enlarged.

On the instruction side, Intel had to enlarge Nehalem's instruction window to accommodate more instructions in-flight. Specifically, Nehalem's reorder buffer (ROB) has been enlarged to 128 entries from Core 2's 96 entries, a 33 percent increase. (Architecture trivia buffs will recall that the Pentium 4 could track 126 instructions in-flight.) To go along with this much larger number of in-flight instructions, Nehalem's reservation station has been expanded to 36 entries from Core 2's 32.

The reorder buffer is statically partitioned so that the number of instructions from any one thread can never dominate the structure, thereby ensuring that this critical resource is shared fairly by the two running threads. The reservation station is competitively shared, so that instructions from one thread can dominate it from time to time as needed.

On the data side, the number of load buffers has gone from 32 in Core 2 to 48 in Nehalem, and the number of store buffers has gone from 20 to 32. Both of these resources are statically partitioned to ensure fairness.

It bears pointing out that this static partitioning strategy for shared resources effectively reduces the number of entries available to each thread below the number available in Core 2; that is, the two threads sharing Nehalem's ROB will have only 64 entries each, instead of the full 96 entries afforded a thread in Core 2's ROB.
I won't speculate on the degree to which this will impact Nehalem's single-threaded performance, because I don't know a) the threshold beyond which a decrease in ROB entries significantly impacts the core's ability to extract maximum instruction-level parallelism, or b) whether Intel has some way of mitigating this decrease, like, say, letting one thread use all the available entries if it's the only one executing. Intel has been a bit coy about the details of how this partitioning is managed, but I expect more details to emerge as we get closer to Nehalem's launch.

Update: Intel says that they do option b, i.e., if only one thread is executing, that thread gets full use of the shared resources.

The front end

In order to push more instructions into the enlarged instruction window, Nehalem's front end has undergone some significant changes. In fact, Nehalem's front end is probably the most altered part of the processor.

The first major innovation that Nehalem brings to Intel's line is the addition of a dedicated loop stream detector (LSD) to the instruction pipeline after the decode stage. In the Core 2 Duo, Intel introduced an 18-entry instruction queue between the fetch and decode stages. This queue was big enough to cache a small loop, so that a cached loop could execute repeatedly from the queue without having to constantly re-fetch the necessary instructions from the L1 instruction cache. By keeping the fetch hardware idle during such loops, Core 2 was able to save power.

Nehalem takes this loop-caching concept a bit further by moving the loop cache down below the decode units, so that it caches up to 28 decoded uops instead of raw x86 instructions. Because the instructions cached in the LSD are already decoded, a loop can now execute without activating either the fetch or decode hardware, a feature that saves power and boosts performance.
As David Kanter points out in his excellent article on Nehalem, the LSD provides much of the benefit of the Pentium 4's trace cache without the added complexity and the negative impact on the decode and L1 cache hardware. The improved LSD is a great feature that will stand the core in good stead in every context, from servers to mobile devices.

Intel made other major improvements to Nehalem's front end in the area of macrofusion. You may recall from my article on the Core 2 Duo that macrofusion is a technique that the Core microarchitecture uses to fuse some pairs of x86 instructions (compare + jump) together prior to decoding. On a cycle when two instructions can be macrofused, this gives the processor's front end a "virtual" fifth decoder, because it is decoding an extra x86 instruction on that cycle.

Nehalem widens the range of x86 instructions that can be macrofused in two ways. First, it adds four new compare + jump branch conditions to the list of macrofusable instruction pairs. Second (and this is major), it can now macrofuse 64-bit instructions; Core 2's macrofusion hardware is limited to 32-bit instructions only. By expanding the list of instructions that can be macrofused to include new branch conditions and 64-bit instructions, Nehalem gains that virtual fifth decoder for a greater number of instructions. This improves overall decode bandwidth and plays a key role in keeping the core's execution engine fed.

There are a few other improvements to Nehalem's front end that I'll mention only in passing here. First is the processor's new multi-level branch predictor, which can store more branch history and give better performance on code that has too many branches to fit into a regular predictor. As always, branch prediction is one area where improvements translate directly into increased performance and power efficiency. Nehalem also improves Core 2's return stack buffer by renaming it; a renamed return stack buffer helps performance in SMT situations.