The Core i7 Family Photo. If you want to see all of our Core i7 benchmarks for each one of these CPUs, head over to AnandTech. Sit in a chair, lie back, and dream of the year you looked at that old Core 2 Duo rig, or Athlon II system, and decided it was time for an upgrade.
You had seen Nehalem arrive, and heard that the new Core i7 chips were handy overclockers kicking some butt. It was a pleasant time, until Intel went and released a truly disruptive product whose nostalgia still rings with us today. That product was Sandy Bridge. AnandTech scored the exclusive on the review, and the results were almost impossible to believe, for many reasons. In our results at the time, it was far and away a leap ahead of anything else we had seen, especially given the thermal monstrosities that the Pentium 4 had produced several years prior.
Intel managed to stand on the shoulders of its previous best product and score a grand slam. In that core design, Intel shook things up considerably. One key component was the micro-op cache, which holds recently decoded instructions so that, when they are needed again, they can be reused as-is rather than wasting power being decoded a second time.
For Intel with Sandy Bridge, and more recently for AMD with Ryzen, the inclusion of the micro-op cache has done wonders for single-threaded performance. Intel also set about improving its simultaneous multi-threading, which it has branded Hyper-Threading for generations, slowly improving the core by making more of its structures dynamically allocated between threads, rather than statically partitioned and potentially losing performance.
With Intel unable to recreate the uplift of Sandy Bridge, and with the core microarchitecture defining a key moment in x86 performance, users who purchased a Core i7-2600K (I had two) stayed on it for a long time. So much so, in fact, that a lot of people expecting another big jump became increasingly frustrated: why invest in a new Kaby Lake quad-core processor when an overclocked Sandy Bridge chip still kept pace? This is why the Core i7-2600K defined a generation.
Every aspect of the core was improved over its predecessor, with every functional block getting attention. The Sandy Bridge core focuses on extracting performance and reducing power in a great number of ways. Intel placed heavy emphasis on features that deliver a better-than-linear performance-to-power ratio, as well as features that provide more performance while reducing power.
The various enhancements can be found in both the front-end and the back-end of the core. The front-end is tasked with fetching the complex x86 instructions from memory, decoding them, and delivering them to the execution units. If the front-end cannot keep the back-end fully fed, the core cannot reach its full performance: a weak or under-performing front-end directly starves the back-end, resulting in a poorly performing core. In the case of Sandy Bridge, this challenge is further complicated by various redirections such as branches, and by the complex nature of the x86 instructions themselves. The entire front-end was redesigned from the ground up in Sandy Bridge, and the new features not only improve performance but also reduce power draw at the same time.
Blocks of memory arrive at the core either from its own cache slice, from one of the other cache slices further down the ring, or, on occasion and far less desirably, from main memory.
On their first pass, instructions should have already been prefetched from the L2 cache into the L1 instruction cache. The L1I is a 32 KiB, 8-way set associative cache with 64-byte lines. The instruction cache is identical in size to Nehalem's, but its associativity was increased to 8-way.
Sandy Bridge fetching is done on a 16-byte fetch window, a window size that has not changed in a number of generations. Up to 16 bytes of code can be fetched each cycle. Note that the fetcher is shared evenly between two threads, so that each thread gets every other cycle. At this point the instructions are still macro-ops, i.e. variable-length x86 instructions.
Instructions are brought into the pre-decode buffer for initial preparation, where instruction boundaries get detected and marked. This is a fairly difficult task because each instruction can vary from a single byte all the way up to fifteen, and determining the length requires inspecting several bytes of the instruction. In addition to boundary marking, prefixes are also decoded and instructions are checked for various properties, such as being branches.
As with previous microarchitectures, the pre-decoder has a throughput of 6 macro-ops per cycle, or until all 16 bytes are consumed, whichever happens first. Note that the pre-decoder will not load a new byte block until the previous block has been fully exhausted.
For example, suppose a newly loaded block contains 7 instructions. In the first cycle, 6 instructions will be processed, and a whole second cycle will be spent on that last instruction, producing a much lower throughput of 3.5 instructions per cycle. Likewise, if the block contains just 4 complete instructions plus 1 byte of a 5th, the first 4 instructions will be processed in the first cycle and a second cycle will be required for that last instruction, for an average throughput of 2.5 instructions per cycle. Note that there is a special case for length-changing prefixes (LCPs), which incur additional pre-decoding costs. Real instructions are often less than 4 bytes long, which usually results in a good rate.
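To make the arithmetic above concrete, here is a tiny, purely illustrative model in C. It is not Intel's hardware, and the function name is ours; it only captures the "at most 6 per cycle, block consumed whole" rule described above (the block-spanning case adds its extra cycle the same way):

```c
#include <stdio.h>

/* Toy model of the pre-decoder's throughput: a 16-byte block yields at
 * most 6 instructions per cycle and must be consumed whole, so a block
 * holding 7 instructions needs 2 cycles. An instruction spilling into
 * the next block likewise costs an extra cycle. Illustrative only. */
static double predecode_ipc(int insns_in_block)
{
    int cycles = (insns_in_block + 5) / 6;   /* ceil(insns / 6) */
    return (double)insns_in_block / cycles;
}

int main(void)
{
    printf("7 insns: %.1f IPC\n", predecode_ipc(7));  /* prints 3.5 */
    printf("6 insns: %.1f IPC\n", predecode_ipc(6));  /* prints 6.0 */
    return 0;
}
```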
The fetcher works along with the branch prediction unit (BPU), which attempts to guess the flow of instructions.
All branches utilize the BPU for their predictions, including returns, indirect calls and jumps, direct calls and jumps, and conditional branches. As with almost every iteration of Intel's microarchitecture, the branch predictor has been improved. An improvement to the branch predictor has the unique consequence of directly improving both performance and power efficiency: due to the deep pipeline, a flush is a rather expensive event, discarding the many instructions that are in flight.
One of the big changes made in Nehalem and carried over into Sandy Bridge is the further decoupling of the branch prediction unit from the back-end: prior to Nehalem, the entire pipeline had to be fully flushed before the front-end could resume operation after a mispredict. This decoupling results in a reduced penalty, i.e. the front-end can begin fetching from the corrected path sooner. In addition, a large portion of the branch predictor in Sandy Bridge was entirely redesigned. For near returns, Sandy Bridge keeps the same return stack buffer as its predecessor. These changes should increase prediction coverage.
It's interesting to note that prior to Nehalem, Intel used a single-level BTB design in Core. Sandy Bridge manages to track more branches through compactness: since most branches do not need nearly as many bits per target, short displacements are stored compactly, while a separate table is used for larger displacements. Sandy Bridge appears to have a single BTB table holding the same number of targets as NetBurst, organized as sets of 4 ways. The global branch history table was not enlarged in Sandy Bridge, but it was enhanced by removing from the history certain branches that did not improve predictions. Additionally, the unit retains longer history for data-dependent behaviors and has more effective history storage.
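Intel doesn't disclose the exact layout, but as a generic sketch of how a set-associative BTB lookup works, consider the following. All sizes and names here are invented for illustration and are not Sandy Bridge's real parameters:

```c
#include <stdint.h>
#include <stdbool.h>

/* Generic set-associative BTB sketch; set/way counts are invented. */
#define BTB_SETS 1024
#define BTB_WAYS 4

typedef struct {
    bool     valid;
    uint64_t tag;     /* high bits of the branch address */
    uint64_t target;  /* predicted target address        */
} btb_entry_t;

static btb_entry_t btb[BTB_SETS][BTB_WAYS];

/* Look up the predicted target for a branch at 'ip'; false on a miss. */
static bool btb_lookup(uint64_t ip, uint64_t *target)
{
    unsigned set = ip % BTB_SETS;     /* index bits pick the set  */
    uint64_t tag = ip / BTB_SETS;     /* remaining bits are tag   */
    for (unsigned way = 0; way < BTB_WAYS; way++) {
        if (btb[set][way].valid && btb[set][way].tag == tag) {
            *target = btb[set][way].target;
            return true;
        }
    }
    return false;
}
```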
The pre-decoded instructions are delivered to the Instruction Queue (IQ). In Nehalem, the Instruction Queue had been increased to 18 entries, shared by both threads. Sandy Bridge increased that number to 20 and duplicated it for each thread (i.e. 2 × 20 entries). One key optimization the instruction queue performs is macro-op fusion.
Sandy Bridge can fuse two macro-ops into a single complex one in a number of cases: when a test or compare instruction followed by a conditional jump is detected, the pair is converted into a single compare-and-branch instruction.
With Sandy Bridge, Intel expanded the macro-op fusion capabilities further, so that more cases are now fusable. Perhaps the most important case is the typical loop, which ends with a counter update followed by an exit condition; those pairs should now be fused. Fused instructions remain fused throughout the entire pipeline and are executed as a single operation by the branch unit, saving bandwidth everywhere. Only one such fusion can be performed each cycle.
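As a concrete illustration (hypothetical code, not from the article): compiled with a typical x86 compiler, the back-edge of the loop below ends in a compare followed by a conditional jump, which is exactly the pair Sandy Bridge can fuse:

```c
/* Hypothetical example: the loop back-edge typically compiles to
 * something like  inc i ; cmp i, n ; jl loop  -- on Sandy Bridge the
 * cmp+jl pair can be fused into one compare-and-branch macro-op. */
static long sum_array(const int *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```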
Up to four pre-decoded instructions (or five, in cases where one of the instructions was macro-fused) are sent to the decoders each cycle. Like the fetcher, the decoders alternate between the two threads each cycle. The decoder organization in Sandy Bridge has been kept more or less the same as in Nehalem.
As with its predecessor, Sandy Bridge features four decoders. The decoders are asymmetric: the first one, Decoder 0, is a complex decoder, while the other three are simple decoders. Sandy Bridge also introduced the AVX instruction set extension, which expanded the sixteen pre-existing 128-bit XMM registers into 256-bit YMM registers for floating-point vector operations (note that Haswell later extended this to integer operations as well).
Most of the new AVX instructions have been designed as simple instructions that can be decoded by the simple decoders. There remain, however, instructions that are not trivial to decode even for the complex decoder; those are handled by the microcode sequencer ROM, and while it is active, the decoders are disabled.
Sandy Bridge also incorporates a dedicated Stack Engine. Incoming stack-modifying operations (e.g. push, pop, call, ret) are caught by the Stack Engine, which performs the implied stack-pointer adjustment itself; the subtraction a push implies, for instance, is done by the Stack Engine. Without such specialized hardware, those operations would need to be sent to the back-end for execution on the general-purpose ALUs, using up some of the bandwidth and tying up scheduler and execution-unit resources. In other words, it's cheaper and faster to calculate stack-pointer targets in the Stack Engine than to send those operations down the pipeline to be done by the execution units.
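As a small illustration (GCC/Clang x86-64 inline assembly; a hypothetical demo function), each push and pop below implies a stack-pointer adjustment that the Stack Engine resolves in the front-end instead of occupying a back-end ALU:

```c
/* Each push/pop implicitly adjusts rsp by 8. On Sandy Bridge the Stack
 * Engine tracks these deltas with its own adder, so no back-end ALU
 * micro-op is spent on the pointer arithmetic. Demo only: pushing
 * inside inline asm can clash with the x86-64 red zone in functions
 * that keep locals there, so this function deliberately has none. */
static void push_pop_demo(void)
{
    __asm__ volatile (
        "pushq %%rax\n\t"   /* rsp -= 8: handled by the Stack Engine */
        "popq  %%rax\n\t"   /* rsp += 8: likewise                    */
        : : : "memory");
}
```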
Decoding the variable-length, inconsistent, and complex x86 instructions is a nontrivial task. It's also expensive in terms of performance and power. Therefore, the best way for the pipeline to avoid those costs is simply not to decode the instructions at all. This is exactly what Intel has done with Sandy Bridge, and it's perhaps the single biggest feature added to the core: the micro-op cache. The micro-op cache is unique in that it not only substantially improves performance, but does so while significantly reducing power.
What's unique about it is that it stores actual decoded instructions, i.e. micro-ops. While it shares many of the goals of NetBurst's trace cache, the two implementations are entirely different, especially as it pertains to how each augments the rest of the front-end. The idea behind both mechanisms is to increase front-end bandwidth by reducing reliance on the decoders. The cache is competitively shared between the two threads and can also hold pointers to the microcode sequencer ROM.
At any given time, the core operates on contiguous 32-byte chunks of the instruction stream. Intel refers to the traditional pipeline path as the "legacy decode pipeline". On the initial iteration, all instructions go through the legacy decode pipeline while the resulting micro-ops are stored into the micro-op cache; this occurs simultaneously with all other operations, i.e. filling the cache adds no extra stages. On all subsequent iterations, the cached pre-decoded stream is sent directly to the allocation queue, bypassing fetching, pre-decoding, and decoding of actual x86 instructions, saving power and increasing throughput.
This is also a much shorter pipeline, so latency is reduced as well. Code with a very large active footprint may miss in the micro-op cache frequently; however, such scenarios are fairly rare, and most workloads do benefit from this feature.
During the cycles in which micro-ops are streamed from the cache, the rest of the front-end is entirely clock-gated, which is where the substantial power saving comes from. Any partial hits are required to go through the legacy decode pipeline as if nothing was cached.
The choice not to handle partial cache hits is rooted in efficiency: not only would such a mechanism add complexity, it's also unclear how much benefit, if any, it would actually bring. The micro-op cache shares its goals with NetBurst's trace cache, but that's where the similarities end. The trace cache was costly and complicated, with dedicated components such as its own trace BTB, and it had side effects such as needing to be flushed on context switches. The micro-op cache, by contrast, is a compact structure whose design implies a significant storage efficiency, four-fold or greater.
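As a rough sketch of the hit/miss behavior described above (a toy model: the real organization, sizes, and indexing are not these, and the figures below are invented):

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy micro-op cache lookup keyed by the 32-byte instruction window the
 * text mentions. Sizes are invented; a real line would also hold the
 * decoded micro-ops themselves. */
#define UC_SETS 32
#define UC_WAYS 8

typedef struct {
    bool     valid;
    uint64_t tag;   /* identifies a 32-byte window of the code stream */
} uc_line_t;

static uc_line_t uop_cache[UC_SETS][UC_WAYS];

/* On a hit, cached micro-ops go straight to the allocation queue and the
 * fetch/decode stages can be clock-gated; on a miss (or partial hit) the
 * window takes the legacy decode pipeline instead. */
static bool uop_cache_hit(uint64_t ip)
{
    uint64_t window = ip >> 5;            /* 32-byte-aligned window address */
    unsigned set    = window % UC_SETS;
    for (unsigned way = 0; way < UC_WAYS; way++)
        if (uop_cache[set][way].valid && uop_cache[set][way].tag == window)
            return true;
    return false;
}
```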
The Allocation Queue (IDQ) acts as the interface between the in-order front-end and the out-of-order back-end. The Allocation Queue in Sandy Bridge has not changed from Nehalem, still 28 entries per thread. The IDQ performs a number of additional optimizations as it queues instructions.
One notable optimization is the Loop Stream Detector (LSD): when a small loop fits entirely within the queue, its micro-ops can be streamed from the IDQ directly, and streaming continues indefinitely until a branch mis-prediction ends it. The LSD is particularly effective for the tight loops found in many common algorithms. It is a very simple but efficient power-saving mechanism, because while the LSD is active the rest of the front-end is effectively disabled, including both the decoders and the micro-op cache.
The back-end, or execution engine, of Sandy Bridge deals with the out-of-order execution of operations. Sandy Bridge's back-end is a happy merger of both NetBurst and P6, though the implementation itself is quite different. Sandy Bridge borrows the tracking and renaming architecture of NetBurst, which is far more efficient: renaming is based on a physical register file (PRF).
Sandy Bridge returned to a PRF, meaning all of the data is now stored in the PRF, with a separate component dedicated to metadata such as status information. It's worth pointing out that since Sandy Bridge introduced AVX, which extends registers to 256 bits, moving to a PRF-based renaming architecture was more than likely a hard requirement, as the complexity of shuttling 256-bit values around would have negatively impacted the entire design.
Unlike with a retirement register file (RRF), retirement is considerably simpler, requiring only a mapping change between the architectural registers and the PRF and eliminating any actual data transfers - something that would undoubtedly have worsened with the new 256-bit AVX extension. An additional component, the Register Alias Table (RAT), is used to maintain the mapping of logical registers to physical registers. This covers both the architectural state and the most recent speculative state.
Each micro-op is allocated an entry in the reorder buffer, which is used to track the correct execution order and status. It is at this stage that architectural registers are mapped onto the underlying physical registers. Other bookkeeping tasks are also done at this point, such as allocating resources for stores and loads and determining all possible scheduler ports. Renaming itself is controlled by the Register Alias Table (RAT), which marks where the data each micro-op depends on is coming from (after that value, too, came from an instruction that was previously renamed).
Sandy Bridge's move to PRF-based renaming has a fairly substantial impact on power, too. With the new instruction set extension allowing for 256-bit operations, each retirement would otherwise have meant needlessly copying large numbers of 256-bit values to the Retirement Register File. This is entirely eliminated in Sandy Bridge. Since Sandy Bridge performs speculative execution, it can speculate incorrectly; when this happens, the architectural state is invalidated and needs to be rolled back to the last known valid state.
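A minimal sketch of PRF-style renaming under stated assumptions (table sizes, names, and the naive allocator are all invented for illustration): the RAT points architectural registers at PRF entries, so retirement and rollback are mapping updates rather than data copies:

```c
#include <stdint.h>

#define ARCH_REGS 16     /* x86-64 integer architectural registers */
#define PHYS_REGS 160    /* PRF size here is purely illustrative   */

static uint8_t  rat[ARCH_REGS];   /* architectural -> physical mapping */
static uint64_t prf[PHYS_REGS];   /* all register data lives here      */
static uint8_t  next_free;        /* naive allocator, for the sketch   */

/* Rename the destination register of an incoming micro-op: grab a fresh
 * physical register and repoint the architectural register at it. The
 * previous mapping is retained until the micro-op retires, so a
 * mis-speculation can be undone by restoring mappings alone -- no
 * 256-bit values ever need to be moved. */
static uint8_t rename_dest(uint8_t arch_reg)
{
    uint8_t phys = next_free++ % PHYS_REGS; /* real hardware: a free list */
    prf[phys] = 0;                          /* value produced later by the op */
    rat[arch_reg] = phys;
    return phys;
}
```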
Sandy Bridge introduced a number of new optimizations performed before micro-ops enter renaming and the out-of-order engine. Two of those are zeroing idioms and ones idioms. The first common optimization is zeroing-idiom elimination, a dependency-breaking optimization: a number of common zeroing idioms are recognized and consequently eliminated.
Eliminated zeroing idioms have zero latency and are entirely removed from the pipeline, i.e. they consume no execution resources. The ones idiom is another dependency-breaking idiom that can be optimized: any of the various PCMPEQx instructions that perform a packed comparison of a register with itself will always set all bits to one.
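To see the idiom itself (GCC/Clang inline assembly; a hypothetical helper function), this is the canonical dependency-breaking zeroing pattern that compilers emit and Sandy Bridge eliminates at rename:

```c
#include <stdint.h>

/* "xor reg,reg" always yields zero regardless of the register's prior
 * value, so it breaks the dependency chain. Sandy Bridge recognizes the
 * idiom at rename time and resolves it without using an execution port.
 * (PCMPEQx reg,reg is the analogous all-ones idiom for SSE registers.) */
static uint32_t zero_idiom(void)
{
    uint32_t x;
    __asm__ ("xorl %0, %0" : "=r"(x));  /* preferred over mov $0, reg */
    return x;
}
```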
Sandy Bridge features a very large unified scheduler that is dynamically shared between the two threads. The scheduler is exactly one and a half times bigger than the reservation station found in Nehalem, for a total of 54 entries. The various internal reordering buffers have been significantly enlarged as well. Sandy Bridge has two distinct physical register files (PRFs), one for integer registers and one for vector and floating-point registers. It's worth pointing out that prior to Sandy Bridge, code that relied on reading many registers was bottlenecked by a limitation in the register file, which was limited to three reads.
This restriction has been eliminated in Sandy Bridge.

In order to promote the replacement of the old lineup, Intel has set prices accordingly: the successors of the outgoing QM and XM quad-core processors cost virtually the same (according to insiders, marginally more in the beginning).
Considering the performance gains of the new processors, manufacturers might find this offer tempting. The successors of the Arrandale dual-core processors are going to be launched in the first weeks of 2011 and will also completely replace the existing lineup.
What's inside the new chips? From now on, the quad-cores are manufactured on 32-nanometer process technology. This is also true for the graphics processor incorporated into the architecture, which in the previous Arrandale processors was manufactured on 45 nm and visibly placed on the package beside the CPU.
As in the Arrandale and Clarksfield processors, Intel's Turbo Boost technology is a key feature. It allows dynamic adjustment of the cores' maximum clock frequencies to match the load at any given time. Thus, both applications that need high clock rates on a few cores and those profiting from many parallel cores can be optimally served.
We are already familiar with the general concept from the former Intel Core processors. New is a range above the fixed maximum clock rates of standard Turbo Boost, which the processor can exploit in a smart way when the cooling system has reserves to spare. So, after a period without load, or with sufficient cooling reserves, the CPU can clock very high for a short time. Furthermore, the Turbo control was significantly improved; as a consequence, the processor can remain at higher turbo levels for longer. In addition, the user might find a number of features Intel packed into its new platform interesting. The integrated graphics processor also supports DirectX and high maximum display resolutions.
Details regarding the performance of the new integrated Intel HD Graphics solution are available in our special review. The more points achieved, the higher the performance reserves of the chip. We used the 64-bit version of the benchmark tests, which usually produces higher scores than the 32-bit version; however, this cannot be interpreted as a real performance gain.
The difference between the two XM Extreme models is similar if they are not overclocked. The potential performance gain within the CPU lineup is also of interest. The SuperPi benchmark is especially popular with overclockers.
It calculates the number pi to a pre-defined precision and records the time needed; the lower the result, the faster the built-in CPU. SuperPi uses only a single core, so a high clock frequency is the decisive factor.