GPU vs ManyCore

by Tim Anderson

NVIDIA is pitting its graphics cards against Intel’s MIC solutions. Tim Anderson evaluates the technologies and the tools.

HardCopy Issue: 56 | Published: May 1, 2012

Software development experts have been warning for years that programmers can no longer expect ever-faster CPUs to speed their code with each new generation of hardware. It was in December 2004 that Herb Sutter, chair of the ISO C++ standards committee, wrote that the “free lunch” is over:


NVIDIA’s latest GPU, the GTX 690, has over 3,000 cores.

“The major processor manufacturers and architectures, from Intel and AMD to Sparc and PowerPC, have run out of room with most of their traditional approaches to boosting CPU performance. Instead of driving clock speeds and straight-line instruction throughput ever higher, they are instead turning en masse to hyperthreading and multicore architectures… if you want your application to benefit from the continued exponential throughput advances in new processors, it will need to be a well-written concurrent (usually multithreaded) application.”

Sutter called this the next revolution in software development, and has evangelised concurrent programming ever since.

Although Sutter’s predictions with respect to CPU speeds have proved broadly correct, massively parallel computing has been slow to come to our desktops. An Intel Core i7 runs at 2.5GHz and has just 4 processor cores and 8 threads (the additional threads thanks to Hyper-Threading). Designed for servers, an Intel Xeon E5 has up to 8 cores and 16 threads. Of course you can get a great performance boost by making use of all available threads. However, compare that to the latest NVIDIA dual-Kepler GPU (Graphics Processing Unit), the GTX 690, which has 3,072 cores. Even the humble (and affordable) GTS 450 has 192 cores.

These GPU cores are nothing like as powerful as the cores in an i7. However, they are real cores and they can be used for general-purpose programming as well as to drive a display. Suitable code runs dramatically faster than it would on the CPU alone, a fact which is exploited by recent versions of Adobe Photoshop and other desktop software, as well as in custom code for workstations and supercomputers.

Using the GPU for general-purpose programming is effective, but also challenging. The first obstacle is that GPUs are designed to drive displays, and the earliest efforts to exploit them for other purposes involved hacks where the programmer had to use graphics APIs in order to get results.


Aubrey Isle, the CPU used by Intel’s prototype MIC board.

Fortunately the hardware vendors realised that this had to be made easier. Apple developed a specification called OpenCL, which was handed over to the Khronos cross-industry group for standardisation, while NVIDIA came up with CUDA, an architecture with supporting libraries for C and Fortran designed for general-purpose GPU programming. CUDA has been a success, and NVIDIA has a range of GPU boards branded Tesla which are designed specifically for High Performance Computing (HPC) applications. NVIDIA also supports CUDA in its Quadro range aimed at workstations, and in most of its desktop GPUs as well, making CUDA acceleration widely available, although only on NVIDIA GPUs.

AMD has also worked on supporting general-purpose GPU computing. It developed its own API and CUDA alternative called Close to the Metal, though its AMD APP (Accelerated Parallel Processing) SDK is now focused on OpenCL.

OpenCL is the vendor-neutral alternative to CUDA and is widely supported. There are even SoCs (Systems on a Chip) designed for smartphones and tablets which support OpenCL, including those using the Mali GPU design from ARM. GPU programming for a smartphone may seem an odd idea at first, but there is intensive processing involved in features like voice recognition and image processing, and GPU acceleration will be important in accelerating such processing.

Despite these advances, general-purpose GPU programming is still challenging, not least because the GPU has its own memory, so pointers are not valid across the CPU and the GPU and data has to be copied back and forth. The general approach is known as hybrid computing since it splits processing between the CPU, best for single-threaded operations, and accelerator devices, best for programming streams of data in parallel.


Following the supercomputer

The world’s fastest computers are massively parallel systems which reside in datacentres where they are used for academic, industrial and government research. There is friendly competition amongst them for who can perform the most floating point calculations per second (FLOPS or flop/s). The ranking is published twice a year as the Top 500 list, and the current leader is Japan’s K Computer which achieved 10.50 Petaflop/s (a Petaflop/s is 10¹⁵ flop/s). There is much talk in supercomputing circles of the path to exascale, or more than an Exaflop/s, where an Exaflop/s is 10¹⁸ flop/s. Most supercomputers (76.8 per cent of the Top 500 according to the list) run Intel processors, but there is a trend towards GPU computing: 39 of the Top 500 used GPU accelerators in November 2011, up from 17 six months earlier, including the machines at numbers 2, 4 and 5 in the list.

The fastest supercomputer in the USA is called Jaguar, at Oak Ridge National Laboratory (ORNL) in Tennessee, which in November was the third fastest in the world. Jaguar is being rebuilt under the name Titan: nearly 1,000 NVIDIA GPUs were added earlier this year, with plans for “a much larger GPU installation later in the year” according to the ORNL press release. By the autumn it should be capable of 10 to 20 Petaflop/s, taking it ahead of K Computer, though other supercomputer installations will not stand still.

Although the architectures of the world’s fastest computers may seem remote to most of us, they indicate the firm hold that heterogeneous computing has at the highest end of the market. The way the supercomputer industry has taken to GPU computing confirmed to Intel that it needed to take action in order to keep its products competitive.


Intel fights back


Intel’s chief evangelist for software products, James Reinders

Intel is fighting back with a three-strand strategy. Firstly there are its multicore and Hyper-Threading CPUs. Intel has been making multicore CPUs since 2005, with the Pentium D, and CPUs with support for Hyper-Threading since the Pentium 4 in 2002. Hyper-Threading is a technique which looks to the operating system like multiple cores, although it shares the execution resources of a single physical core, so enabling more efficient use. The majority of Intel’s CPUs have two or more cores, including most of the low-end Atom range.

Then there is its support for OpenCL on CPUs and on HD Graphics. Intel released its first OpenCL SDK in Spring 2011, targeting Intel Core or Xeon processors on Windows or Linux. OpenCL is primarily designed to take advantage of accelerator devices such as GPUs, but can also be used with multicore CPUs. In April 2012 Intel extended its OpenCL SDK to include support for integrated HD Graphics 4000/2500 on Windows 7. HD Graphics 4000/2500 are GPUs that are integrated with the CPU in the range codenamed Ivy Bridge and launched as third generation Core Processors in April 2012.

NVIDIA's Steve Scott

NVIDIA on GPGPU versus Intel MIC

We spoke to NVIDIA CTO Steve Scott about how Intel MIC changes the hybrid computing landscape.

“I think that MIC is the right technical approach for Intel. It’s an acknowledgment of heterogeneous processing, of power efficiency being the primary constraint, and having to move to an architecture that combines the Xeon which provides fast single-thread performance with the MIC processor which provides more power-efficient performance for throughput-oriented parallel code. It’s very similar to the approach we’ve taken with GPUs,” Scott told us.

That said, Scott is sceptical of Intel’s claim that moving to MIC will be an easy transition for developers, thanks to the use of familiar Intel CPUs.

“First, the instruction set in general is really a non-issue. Most HPC programmers – indeed most programmers in general – don’t write assembly language. They’re not writing in the actual machine instruction set of the architecture; they write in C or Fortran or Java, and compile their code.

“The next question then is what sort of programming model do you use, or how will you use the machine? The programming model people will use in practice will be a hybrid programming model where they use both the Xeon and the MIC. They use the MIC to offload parallel regions and accelerate them. That programming model does not allow you to take your existing multicore code and just recompile it for MIC.

“If you are doing OpenMP, even if you have existing OpenMP codes running on the Xeon, you’re unlikely to have enough parallelism exposed in those codes to run well on MIC, because OpenMP loops traditionally have been too low down in the call tree, and exposing parallelism at the innermost loop level does not give sufficient parallelism to keep 50 vector cores profitably utilised.

“The programming model that people will use in practice will be a hybrid programming model where you modify the code to identify what regions run on Xeon, what regions will run on MIC, and use directives to express that. You have to think about copying the data back and forth from the host memory to the memory on the PCIe card where MIC lives, and that’s effectively the same programming model that you have on a GPU. We think it will open up GPU programming to a broader set of people than has used it in the past.”

Scott emphasises the importance of OpenACC, a new directive-based model for programming GPUs. “CUDA C and CUDA Fortran are good ways to write new programs, but if you have an existing program that’s running on x86 multicore processors, and you want to port it to CUDA, it can be a fair amount of work. That’s why we introduced the directive-based programming model; and that is very new. We just announced that last year, and the first production OpenACC compilers are only coming online later this quarter.”

The consequence, according to Scott, is that coding for the GPU need not be any harder than coding for MIC, though he adds: “I do expect that Intel is going to have a good tool story.”

From the developer’s perspective, it would be ideal if both Intel and NVIDIA supported the same directives, so that the same code would take advantage of either MIC or NVIDIA GPUs. Intel is not part of the OpenACC group, but does support OpenMP. “We are moving to get OpenACC into the OpenMP standard,” says Scott. “We do have enthusiastic support from OpenMP to make that happen.”

Perhaps NVIDIA’s biggest advantage is that it has CUDA GPUs out there now, which work on any computer from desktop to supercomputer, whereas MIC is a forthcoming product that initially is only intended for HPC.

The other mitigating factor for NVIDIA is that Intel MIC validates the hybrid approach. “NVIDIA and Intel, along with AMD and IBM, are all in strong agreement that the future is going to a hybrid computing architecture, because of power. It will be a growing market because it will be the way that people have to do computing in the future. We’re headed in the same direction, we’re going to have similar approaches, and we’re looking forward to having a real machine to talk about.”

In May 2010, Intel announced a new HPC product codenamed Knights Ferry, based on 50 or more processor cores on a single board. The Many Integrated Core (MIC) architecture uses multiple CPUs based on the older Pentium design, chosen for its power efficiency, but supplemented by over 100 new VPU (Vector Processing Unit) instructions and with 64-bit extensions. Although it is not in any sense a GPU, the MIC approach is similar to general-purpose GPU computing in that MIC is an accelerator board with its own memory, so it presents the same challenge to programmers: data has to be copied back and forth between host and device. The production version of MIC, codenamed Knights Corner, is expected in late 2012 or early 2013.

“Knights Corner is superior to any GPGPU [General Purpose computing on GPUs] type solution for two reasons,” says Intel evangelist James Reinders. “First, we don’t have the extra power-sucking silicon wasted on graphics functionality when all we want to do is compute in a power efficient manner; and second, we can dedicate our design to being highly programmable because we aren’t a GPU (we’re an x86 core – a Pentium-like core for ‘in order’ power efficiency). These two turn out to be substantial advantages that the Intel MIC architecture has over GPGPU solutions that will allow it to have the power efficiency we all want for highly parallel workloads, but able to run an enormous volume of code that will never run on GPGPUs (and every algorithm that can run on GPGPUs will certainly be able to run on a MIC co-processor).”

MIC is initially aimed at HPC applications so it is not something you can expect to find in desktops or workstations. Rather, Intel must hope that it eats into NVIDIA’s market for Tesla.

The implication for developers is that, unless you are working with HPC, MIC is not yet something you can plan to target. However, by learning OpenCL or CUDA, you can take advantage of GPU acceleration now and be ready for MIC if and when Intel brings the technology more into the mainstream.


Coding for hybrid computing

Parallel programming presents multiple challenges. Multi-threaded programming is inherently harder than single-threaded development, and vulnerable to race conditions or synchronisation errors that deliver unpredictable results and can be hard to debug. Hybrid computing adds the further complexity of dealing with two distinct memory systems.

Tool and library vendors have risen to the challenge by coming up with a range of products and techniques to make concurrent programming easier, and to support hybrid computing. In June 2011 Microsoft announced C++ AMP (Accelerated Massive Parallelism) which is a new specification to support hybrid computing in C++. Microsoft’s implementation builds on DirectX, benefiting from the existing work the company and its hardware partners have invested in high performance DirectX support. C++ AMP is an STL-like library that is part of Visual C++ in the forthcoming Visual Studio 11.

Intel is revamping its parallel programming products to support MIC and OpenCL alongside multicore CPU development. Parallel Studio XE is a suite of tools for compiling, debugging and profiling applications, from multicore CPU to MIC and through to cluster computing.

Part of Parallel Studio, Intel’s Threading Building Blocks (TBB) is a C++ template library which scales a processing workload across available cores. Cilk Plus is an extension to C and C++ that enables task parallelism with three simple keywords: cilk_for to parallelise a for loop, cilk_spawn to call a function that may run in parallel with its caller, and cilk_sync to wait for all spawned calls to complete. OpenMP is an open standard which adopts a directive-based approach: by adding directives to your code you instruct the compiler to generate parallel code. MPI (Message Passing Interface), another open standard, is critical for coding applications that run on computer clusters.

Another key component is Intel’s Math Kernel Library (MKL) for C and Fortran which automatically takes advantage of parallelisation when available.

Intel’s strategy is to support these same libraries, already in use for multicore CPU programming, on the MIC accelerator board too. “We want to make sure that the very same models you have used on the classical architecture will work on the MIC architecture too,” says Intel engineer Heinz Bast. “Yes we have some extensions, but there are no specific tools for MIC.”

Parallel computing jargon

Aubrey Isle: The processor used by Intel in the Knights Ferry prototype board.

C++ AMP: Standing for Accelerated Massive Parallelism, a new open specification created by Microsoft to enable hybrid computing on the Windows DirectX platform.

CUDA: First announced in 2006, Compute Unified Device Architecture is NVIDIA’s proprietary GPU-based parallel computing platform. You can code in C/C++, Fortran, or via wrappers in other languages including Java and Microsoft .NET. CUDA support is included in GeForce 8 and higher, as well as in Tesla and Quadro cards.

Host and Device: In many-core computing, the Host is the CPU and the system memory, and the Device is the GPU or MIC accelerator which has its own memory.

Kernel: A function executed in parallel across many device threads, typically doing the work of one iteration of a loop.

MIC: stands for Many Integrated Core which is Intel’s way of distinguishing its new accelerator boards, containing 50 or more processor cores, from its multicore CPUs. The prototype MIC board is called Knights Ferry, and the production version Knights Corner.

MPI: stands for Message Passing Interface and is designed to support parallel computing across a group of processors or physical machines by standardising communication and data transfer through a system of message passing.

OpenACC: A set of compiler directives to support hybrid programming with a host and an accelerator. OpenACC was developed by PGI, Cray and NVIDIA with support from CAPS, and is supported by compilers from PGI, Cray and CAPS. While it lacks the breadth of support enjoyed by OpenMP, the intent is that OpenACC eventually merges with the OpenMP specification.

OpenMP: A specification and API to support shared-memory parallel programming in C/C++ and Fortran. OpenMP does not currently support accelerators such as GPUs, but this is planned in association with OpenACC. In addition, Intel will support OpenMP on its MIC accelerator boards. The OpenMP specification is owned by a cross-industry Architecture Review Board (ARB), with members including AMD, Cray, HP, IBM, Intel, Microsoft, NVIDIA, Oracle and others. Its wide support and directive-based model make it a key standard for parallel computing.

OpenCL: stands for Open Computing Language, a C-like language for programming accelerators. It is managed by a cross-vendor group called Khronos which also manages OpenGL, WebGL and other visual computing standards. Khronos members include most of the names you would expect, but not Microsoft. NVIDIA GPUs support OpenCL, but other vendors’ GPUs do not support CUDA, so OpenCL has the advantage of breadth, while CUDA has some additional features.

Tesla: NVIDIA’s brand name for high-end GPU cards designed specifically for HPC hybrid programming.

At a high level, the extensions to which Bast refers are in two categories. The first, which Intel expects to be the most common approach, is called the offload model. The idea is that you use a directive to specify that a particular block of code should be ‘offloaded’ to the MIC accelerator. For example, you might have some C code which uses ‘#pragma omp parallel for’: an OpenMP directive that instructs the compiler to generate parallel code for the loop which follows. By adding the further directive ‘#pragma offload target(mic)’, you specify that the code should be run on MIC where available.

The second model is called virtual shared memory. This tackles the problem of two non-shared memory spaces by allowing virtual addresses: pointers which are valid on both the host and the device, while under the covers the data is copied back and forth.

Using these models, Intel says that all its parallel libraries, including MKL, TBB, OpenMP and Cilk Plus, will run on MIC as well as on multicore CPUs. This will make it relatively easy for developers familiar with Intel’s existing tools to transition to MIC. In addition, Intel is supporting OpenCL on CPUs, embedded HD Graphics GPUs, and in due course on MIC.

Another approach to hybrid computing which involves relatively little pain is to use third-party libraries such as the Rogue Wave IMSL libraries. The IMSL C Numerical Library provides a range of mathematical and statistical functions for C/C++, with a wrapper for Python, and supports NVIDIA CUDA for what the company says is a speed-up of over 100 times against single-threaded code, or up to four times faster than using four CPU threads. Rogue Wave also supports CUDA with IMSL Fortran Numerical Library.

And finally, the Rogue Wave TotalView Debugger, which is aimed at HPC developers, supports memory analysis and debugging across both CPU and GPU for CUDA applications on Linux.

NVIDIA supports CUDA with a range of tools, including the free CUDA toolkit for Windows, Linux and Mac, and Parallel Nsight, a free Visual Studio add-in for building and debugging CUDA, OpenCL and OpenGL applications.