From the first computers of the 1940s through the machines of the 1990s, all computer systems were CPU-bound. In other words, the I/O interfaces could deliver more data than the CPU could process. In the 1990s Moore’s Law took over and clock speeds doubled every 18 months, along with the addition of multi-core processors. So, from 1990 through today, we have been I/O-bound, meaning CPUs can now process more data than the I/O links can deliver. Increases in CPU performance have been revolutionary while the increases in interconnect bandwidth have been incremental for many decades. However, bandwidth increases in RapidIO, InfiniBand, and Ethernet are breaking this bottleneck, giving us the ability to design incredibly powerful embedded supercomputing architectures for today’s data-intensive applications.
From PCI to InfiniBand
The primary reasons for the debilitating delay in I/O bandwidth innovation over the years can be attributed to Intel and the PC. The incredibly slow and high-latency interfaces like PCI and PCIe were more than adequate for slow pedestrian applications, which are the Holy Land of PC usage. The accumulated knowledge and incorrigible RTL code (for the PCI chips) perpetuated the continued use of these outdated I/O interface concepts, and they have put the computing industry twenty years behind where we should be on the bandwidth performance curve. But, that seems to be changing. Back in January, Intel bought the InfiniBand design team and the product line from QLogic. InfiniBand is the highest-speed, lowest-latency interconnect on the market today, especially for InterProcessor Communications (IPC) links in multiprocessor systems.
To date, InfiniBand has been used to hook together hundreds or thousands of processors to build clustered Linux servers. So why, after all these years, has Intel finally taken an interest in high-speed supercomputing interconnects like InfiniBand? In two words: Cloud Computing. And what is the primary application in the Cloud? It’s data mining. Google, Facebook, LinkedIn, Amazon, Yahoo... they are all drooling over the prospects of data mining. But building these advanced machines will require entirely new architectures that break the chains that bound us in the past.
And yes, there are many applications for supercomputing architectures in embedded applications, especially in the military. These applications range from advanced radar systems to sonar, signal intelligence (SIGINT), communications intelligence (COMINT), systems that run SWARM algorithms (for squadrons of UAVs and UUVs), and electronic warfare (EW) systems. In addition, data mining is another arena where supercomputers can be used in military intelligence data gathering and analysis.
The cloud-based computing machines will be highly commoditized boxes made in China, so that market is not very attractive. But, the components used in those commodity-oriented cloud machines (InfiniBand chips and the advanced CPUs) can be used, under the influence of intelligent thinking, to build extremely powerful embedded supercomputers.
I/O and Architecture
Let’s look at how we got into this I/O-bound mess. We used buses as the main interconnect in computers for decades, up through the mid-2000s. We not only increased their clock frequencies, but we “widened” the buses (from 8 bits, to 16, 32, and 64 bits wide) to increase the data transfer rates. But this technique revealed the first law of diminishing returns: every time you double your clock speed, the distance you can run the bus goes down by 50%. Single-ended signals do not like to run at very fast speeds due to the transmission line effects on copper traces and through cheap connectors on backplanes.
So, we moved to high-speed serial differential signals for the I/O early in the 2000s. When you start trying to calculate the bandwidths of these serial links, you run into esoteric, laborious technical arguments about frequency, bit rate, baud rate, and true speeds. Rather than do calculations here, and get sucked into that morass, take a look at this table (http://en.wikipedia.org/wiki/List_of_device_bit_rates) and draw your own conclusions. Let’s just say we are running serial links at over 200 MB/s today, and that speed is doubling about every 3-4 years. Now, we have the ability to break the I/O-bound problems in computer architectures.
Next, we have to look at how we can hook processors together with these links efficiently. All computer architectures of the past (and most in the present) are crude 2-dimensional architectures. Even the switched-serial and point-to-point architectures (stars, meshes, etc.) are 2-D. The first 3-D architecture you can build is a cube. The total number of nodes in an n-dimensional architecture is 2^n, where n is the number of dimensions in the architecture. For a cube, that’s 8. The number of links on each node is equal to the number of dimensions (n) of the architecture, or 3 for a cube. The total number of links in the system is n × 2^(n-1), or 12 for the cube. And most importantly, how many hops (i.e., how many nodes must the data pass through) before the data arrives at its destination in the worst case? For all n-dimensional architectures, that’s the same as the number of dimensions, n. For the 3-D cube, that means 3 hops in the worst case.
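The counting rules above are easy to check with a few lines of Python. This is only a sketch of the arithmetic stated in the text; the function name hypercube_stats is made up for illustration and is not from any library.

```python
def hypercube_stats(n):
    """Return (nodes, links_per_node, total_links, max_hops) for an
    n-dimensional cube, using the formulas given in the text.
    The function name is hypothetical, chosen for this example."""
    nodes = 2 ** n                    # 2^n nodes
    links_per_node = n                # one link per dimension
    total_links = n * 2 ** (n - 1)    # each link is shared by two nodes
    max_hops = n                      # worst case: cross every dimension once
    return nodes, links_per_node, total_links, max_hops


if __name__ == "__main__":
    for dims in (3, 4, 6):
        print(dims, hypercube_stats(dims))
```

For the 3-D cube this gives 8 nodes, 3 links per node, 12 links in total, and 3 hops in the worst case, matching the figures quoted above.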
In the past, there have been some aberrant architectures used to build multiprocessor systems. Take a ring, for example. The problem with a ring is that the worst-case number of hops is (n-1), where (n) is the number of nodes. So, the bigger the ring, the greater the latency. Additionally, if you break a ring at any place, the whole machine dies. So, designers connected two counter-rotating rings to each node, in case one ring failed. That requires 4 links per node and the maximum number of hops is the number of nodes divided by 2 (n/2). So, the bigger the ring, the more latency you introduce here too. DEC took this one step further with the Torus architecture (using PDP-11 minicomputers). A torus consists of counter-rotating rings at right angles to other counter-rotating rings. Here again, the maximum number of hops is (n/2), but the number of links per node goes up to 6, not a good trade-off. Along the way, there were trees, fat trees, and variations on the theme of rings (http://pg-server.csc.ncsu.edu/mediawiki/index.php/PG_MediaWiki:Community_Portal). All these techniques fall apart when you get above 8 processors.
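To see how ring latency grows with size, here is a small Python sketch that measures the worst-case hop count by breadth-first search. It rests on some assumptions that are not in the text: nodes are numbered 0 to n-1, a hop is counted as one link traversal, and the function name ring_worst_case_hops is invented for this example.

```python
from collections import deque


def ring_worst_case_hops(num_nodes, bidirectional):
    """Worst-case number of link traversals between two nodes on a ring.

    bidirectional=False models a single one-way ring;
    bidirectional=True models two counter-rotating rings.
    Hypothetical helper written for this illustration.
    """
    def next_nodes(node):
        out = [(node + 1) % num_nodes]            # clockwise link
        if bidirectional:
            out.append((node - 1) % num_nodes)    # counter-clockwise link
        return out

    # Breadth-first search from node 0; a ring is symmetric, so one
    # starting node is enough to find the worst case.
    dist = {0: 0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for nb in next_nodes(node):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return max(dist.values())


if __name__ == "__main__":
    for size in (8, 16, 64):
        print(size,
              ring_worst_case_hops(size, bidirectional=False),
              ring_worst_case_hops(size, bidirectional=True))
```

Under these assumptions a 64-node single ring needs 63 hops in the worst case and a pair of counter-rotating rings needs 32, which is the (n-1) and (n/2) behavior described above.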
So, to overcome the peculiar inefficiencies in these deviant 2-D and 3-D structures, we must increase the number of dimensions in the architecture. The first 4-dimensional architecture we can build is a hypercube (Figure 1). The number of nodes or processors (2^n) is 16. The number of links per node is the number of dimensions (n), or 4. The total number of links in the system [n × 2^(n-1)] is 32. The maximum number of hops (n) is 4 (Figure 2). Take this to a 6-dimensional hypercube (Figure 3) and you get 64 processor nodes, 6 links per node, 192 total links in the system, and the maximum number of hops is 6.
Wire up these 4- and 6-dimensional architectures with optical links, using GPGPUs (General-Purpose Graphics Processing Units) as the processors, and we can build extraordinarily powerful supercomputers in a small box. These machines will certainly appeal to many of the I/O-bound, algorithm-driven, data-intensive applications we have today.
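One common way to see why the hop count never exceeds the number of dimensions is to give each node an n-bit address and link nodes whose addresses differ in exactly one bit. That addressing scheme is an assumption made here for illustration and is not spelled out in the article; the function names neighbors and route are likewise hypothetical. A message then reaches its destination by correcting one differing bit per hop:

```python
def neighbors(node, n):
    """Nodes directly linked to `node` in an n-dimensional hypercube,
    assuming binary addresses that differ in exactly one bit."""
    return [node ^ (1 << d) for d in range(n)]


def route(src, dst, n):
    """Return the list of nodes visited from src to dst, fixing the
    differing address bits in order from dimension 0 upward.
    This is a simple dimension-order scheme, shown as a sketch only."""
    path = [src]
    current = src
    for d in range(n):
        bit = 1 << d
        if (current ^ dst) & bit:     # addresses differ in dimension d
            current ^= bit            # cross the link in that dimension
            path.append(current)
    return path


if __name__ == "__main__":
    n = 4
    path = route(0b0000, 0b1111, n)
    print([format(p, "04b") for p in path], "hops:", len(path) - 1)
```

With these assumptions, the route from 0000 to 1111 in the 4-dimensional hypercube visits 0001, 0011, and 0111 on the way, for 4 hops in total. No pair of nodes can differ in more than n bits, so n is the worst case.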