Multi-Cores: The Gateway to Next-Gen SBCs and Blades

With the introduction of the Intel Core microarchitecture into embedded systems, history could very well repeat itself. The company that invented the microprocessor in 1971 and created the first microcontroller in 1976 is about to revolutionize the embedded space once again. By bringing the power of parallel processing to embedded developers in an open-standards-based building-block architecture, Intel hopes to break down cost barriers and take embedded systems performance to levels once reserved for expensive computer systems purpose-built for symmetric multiprocessing (SMP), while also reaching new levels of power efficiency.

Today, the industry’s state-of-the-art Single Board Computers (SBCs) are equipped with multiple processor cores for dual-processing capability. In December, Intel unveiled the Intel NetStructure MPCBL0050, a new AdvancedTCA-compliant blade server powered by the Dual-Core Intel Xeon processor LV 5138. The design puts four Intel Core microarchitecture-based cores into a 200 W form factor. With these scalable building blocks, Intel is hoping to significantly improve performance for the compute-intensive and database-driven applications of tomorrow, such as IP Multimedia Subsystems (IMS), IPTV, and unified messaging.

Multicore processors will increase the computational density of embedded systems, including new AdvancedTCA communications switches, while reducing thermal issues.

Even though Moore’s Law (which states that the number of transistors in a semiconductor device doubles about every 18 months) has helped embedded processors keep pace with the ever-increasing requirements of compute-intensive applications, power consumption and heat flux have become the principal barriers to further advancement. For example, just two generations ago, Intel produced the 64-bit, single-core Xeon processor codenamed “Nocona,” found in the Intel NetStructure MPCBL0030. According to the www.spec.org Web site (a popular source of benchmark results), it scored a SPECint_rate_base2000 of about 33 in a dual-socket configuration, ran at 2.8 GHz, and consumed about 55 Watts per processor (110 Watts total).

The 64-bit Intel processor found in the Intel NetStructure MPCBL0050 is a dual-core part that runs at only 2.13 GHz and has a TDP of 35 Watts (70 Watts total), yet it achieves a SPECint_rate_base2000 score of 81 in a dual-socket configuration. In other words, by going dual-core, continuing to drive the best available lithography process, and continuously improving the Intel microarchitecture, the Dual-Core Intel Xeon processor LV 5138 delivers roughly 2.5 times the performance while consuming roughly a third less power at the processor. As with all benchmarks, this is only one measure of performance and should not be taken as an absolute value, but the pattern of markedly better performance at lower power is the kind of jump Intel is achieving by going multi-core.
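For readers who want to see where those ratios come from, the short program below simply recomputes them from the figures quoted above (the scores and wattages are the approximate values cited in this article, not official ratings):

    #include <stdio.h>

    int main(void)
    {
        /* Figures quoted in the text (approximate, for illustration only) */
        double nocona_score = 33.0, nocona_watts = 110.0;   /* dual-socket "Nocona" */
        double lv5138_score = 81.0, lv5138_watts = 70.0;    /* dual-socket Xeon LV 5138 */

        double perf_gain  = lv5138_score / nocona_score;              /* ~2.45x */
        double power_cut  = 1.0 - lv5138_watts / nocona_watts;        /* ~36% less */
        double perf_per_w = (lv5138_score / lv5138_watts) /
                            (nocona_score / nocona_watts);            /* ~3.9x */

        printf("Performance gain:      %.2fx\n", perf_gain);
        printf("Processor power saved: %.0f%%\n", power_cut * 100.0);
        printf("Performance per watt:  %.1fx\n", perf_per_w);
        return 0;
    }

Running it prints a performance gain of about 2.45x, a power reduction of about 36%, and close to a 4x improvement in performance per watt at the processor.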

Optimizing for Multi-Core

Parallel processing can deliver very high performance across a wide range of applications, depending on how the software is written. It is clear that multi-core is the new path that will help the computing industry avoid the physical roadblocks of size, power consumption, heat generation, and the increasingly expensive manufacturing required to produce ever-smaller critical dimensions on microchips. These challenges became especially apparent as chipmakers approached clock speeds of 4 GHz.

But taking advantage of two processors or two cores is not as simple as it may sound. For the blade server marketplace to truly benefit from all that multi-core has to offer, there will need to be a shift in the fundamental way software is created. Many applications have used a monolithic approach since the very first Intel processors were introduced. Spreading the code base across many cores, both within a die and across a board, is the next stage in that evolution. Intel recognizes this and has been helping developers make the transition by providing compilers, VTune analyzers, and cluster tools that support multi-core applications.
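As a hedged illustration of what that shift looks like in practice, the sketch below (a hypothetical example, not drawn from Intel’s tools) takes work that would traditionally run as one monolithic loop and splits it across two POSIX worker threads, one per core of a dual-core device; the thread count and workload are assumptions made purely for illustration:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define NUM_CORES 2              /* one worker per core on a dual-core device */

    static double data[N];

    struct slice { int begin, end; double sum; };

    /* Each worker processes its own portion of the array independently. */
    static void *worker(void *arg)
    {
        struct slice *s = arg;
        s->sum = 0.0;
        for (int i = s->begin; i < s->end; i++)
            s->sum += data[i];
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            data[i] = i * 0.001;

        pthread_t tid[NUM_CORES];
        struct slice part[NUM_CORES];

        /* Divide the former monolithic loop across the available cores. */
        for (int c = 0; c < NUM_CORES; c++) {
            part[c].begin = c * (N / NUM_CORES);
            part[c].end   = (c + 1) * (N / NUM_CORES);
            pthread_create(&tid[c], NULL, worker, &part[c]);
        }

        double total = 0.0;
        for (int c = 0; c < NUM_CORES; c++) {
            pthread_join(tid[c], NULL);
            total += part[c].sum;
        }

        printf("sum = %f\n", total);
        return 0;
    }

Built with gcc -O2 -pthread, the two workers can run concurrently, one on each core, under an SMP operating system such as Linux.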

While Intel’s software tools can be used for both asymmetric and symmetric multiprocessing, SMP can simplify life for the developer (because only one set of tools is required) as well as for the underlying system, which uses fewer resources. Because work is scheduled uniformly across a set of identical cores, processing resources can be used wherever they are available: a task can run on either core, and cycles left idle by one core can be recovered by the other. More importantly, while SMP supports the programming models used on asymmetric processors, it also maps easily onto the compute-intensive instruction streams typical of multimedia applications. In this sense, SMP has enabled multi-core technology to greatly enhance media and signal processing capabilities.

Real-time multimedia software can run more efficiently on a parallel processor architecture to achieve higher performance, lower cost, and lower power. From a hardware point of view, a multicore processor is simply a single die on which two or more cores reside side by side, providing capabilities similar to those of a traditional SMP machine. Along with the components typically associated with dual-core platforms, such as the front-side bus (FSB), cache, high-bandwidth data paths, and additional DMA controllers, dual-core processor designs let the two cores on each physical processor share the on-die cache, greatly reducing the number of cache misses and improving performance. In addition, the latest dual-socket designs leveraging the Xeon LV 5138 support dual independent buses (DIBs), with the Intel 5000 chipset family maintaining cache coherency across the relatively large amount of cache in the system.

The end result is a new level of performance and power efficiency for compute- and I/O-intensive designs. Manufactured globally on an industry-leading 65-nm fabrication process, the new Intel Core microarchitecture-based processors feature execution pipelines that are 33% wider in each core than previous generations. Benefits include:

  • Wide dynamic execution allowing each core to expand from three- to four-way execution — i.e., simultaneously fetch, dispatch, execute, and retire up to four instructions.
  • Advanced smart cache: A shift from a maximum of 2 MB to 4 MB of shared L2 cache significantly reduces latency to frequently used data, improving performance and efficiency by increasing the probability that each execution core of a multi-core processor can access data from a higher-performance, more efficient cache subsystem.
  • Smart memory access: Improves system performance by optimizing the use of the available data bandwidth from the memory subsystem and hiding the latency of memory accesses.
  • Advanced digital media boost: Enables 128-bit Streaming SIMD Extensions (SSE/SSE2/SSE3) instructions to be completely executed at a throughput rate of one per clock cycle, effectively doubling, on a per-clock basis, the speed of execution for these instructions compared to previous generations (see the code sketch following this list).
  • Intelligent power capability: Better power-control efficiency with micro-gating of processor circuitry, which de-energizes inactive portions of the processor with finer granularity than other processors.
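As a rough illustration of the digital media boost item above, the following fragment (a minimal sketch, assuming a GCC-style compiler with SSE support, and not taken from Intel documentation) uses 128-bit SSE intrinsics to add four single-precision floats per instruction; on Core microarchitecture-based processors each such instruction can execute at a rate of one per clock:

    #include <stdio.h>
    #include <xmmintrin.h>           /* SSE intrinsics: 128-bit, 4 x float */

    #define N 16                     /* kept a multiple of 4 for this sketch */

    int main(void)
    {
        float a[N] __attribute__((aligned(16)));
        float b[N] __attribute__((aligned(16)));
        float c[N] __attribute__((aligned(16)));

        for (int i = 0; i < N; i++) {
            a[i] = (float)i;
            b[i] = (float)(N - i);
        }

        /* Each _mm_add_ps adds four floats in a single 128-bit SSE instruction. */
        for (int i = 0; i < N; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));
        }

        for (int i = 0; i < N; i++)
            printf("%.1f ", c[i]);   /* every element should print as 16.0 */
        printf("\n");
        return 0;
    }

Because each vector add handles four elements, the loop issues a quarter as many add instructions as a scalar equivalent would.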

To help embedded developers make better use of the capabilities of parallel processing, the Intel Communications Alliance has adopted a modular approach to systems design and solutions. The latest iteration of this modular communications platform is embodied by the AdvancedTCA specification. AdvancedTCA enables equipment providers to quickly integrate commercially available building blocks so they can accelerate development and deployment schedules for new network elements, and helps service providers deploy new services faster than with proprietary implementations. AdvancedTCA also means that IT staff will need to be familiar with fewer platform types. Multicore processor-based AdvancedTCA systems save costs as well, thanks to reduced power and system real-estate requirements. The latest such system geared for the communications space was announced December 4, 2006.

The Intel NetStructure MPCBL0050 blade server delivers almost three times the performance per slot of the leading competitive blade server, enabling service providers to deliver new, revenue-generating services with fewer blades. It runs a Carrier Grade Linux operating system and offers significant performance improvements for compute-intensive and database-access applications, including IP Multimedia Subsystems (IMS), wireless control plane, and IPTV. Additionally, it is designed to be the first blade server to comply with the proposed Communications Platforms Trade Association (CP-TA) 1.0 standard, improving industry interoperability.

As network operators, service providers, and ISVs continue their relentless quest for the next “killer app” — whether it be voice, video, data, or wireless services — they will undoubtedly turn to parallel processing more and more to meet their power and performance requirements. And while parallel processing remains at a nascent stage, Intel intends to build upon its 30-year legacy with the promise of further advancements similar to its milestones in dual-core processing as it continues to march down the path of multi-core.

This article was written by Eric Mantion of the Embedded and Communications Group of Intel, Santa Clara, CA. For more information, visit http://info.hotims.com/10964-401.