Introduction (or “What is this GPU thing?”)
GPUs, or Graphics Processing Units, are dedicated hardware usually utilized for graphics' rendering. They can be found as an integrated chip in a motherboard or an add-in card plugged in an expansion slot on the motherboard. They are build for the execution
of highly parallel structures, like pixels, vertex and fragments. Programmers wrote code for the GPUs through
Considering Flynn's taxonomy, GPUs can be seen as SIMD machines (Single Instruction, Multiple Data), executing the same functions over large amounts of data (like the pixels on an image). These characteristics attracted the public, that began to use these
graphic processors for non-graphic applications. This was named GPGPU - General-Purpose computation on GPUs.
GPGPU applications used to map their data to pixels and fragments, so they could be run on the GPU using the APIs cited before. This was a hard and discouraging, but other ways to enjoy the graphic processor capabilities appeared. We can cite
Close to Metal
as languages or language extensions that make the programmers' lives easier. From now on, we will focus more on NVIDIA's CUDA as the object of our work, seen that it is the most utilized GPGPU programming language currently.
Motivation (or “Why should I use GPUs for HPC?”)
First, let's see some comparisons of GPUs and CPUs performance on GFLOPS (billions of floating point operations per second) and memory bandwidth (on gigabytes per second). All the following images were extracted from
NVIDIA CUDA Programming Guide
Figure 1. GFLOPS for the CPU and GPU._
As we can see in Figure 1, the newest GPUs can achieve almost 1 TFLOPS. This is almost one order of magnitude more than the fastest CPU. This GPU, GTX 280, can be bought for less than US$500.00 (November 2008) (8800 GTX can be found for less than US$100.00!!!).
So we now have a fast co-processor that can be added to any computer with enough power and a PCIe slot.
But how do they do that? Why is the GPU faster than the CPU? If you can remember, we said before that GPUs were built to accelerate the processing of pixels and fragments. These pixels have little or no dependencies among them. The architecture of the GPUs
is massively parallel, were many simple processors execute some instructions in many pixels in parallel. We can compare the usual CPU's and GPU's architectures in Figure 2.
Figure 2. Simplified comparison of the GPU and CPU architectures.
As illustrated in Figure 2, CPUs are designed with most of their transistors devoted to data caching and flow control. In contrast, the GPUs are build with little caches and most transistors dedicated to processing. Rough estimates show 20% of the transistors
of a CPU are dedicated to
computation, compared to 80% of GPU transistors.
You can see below more detailed information about NVIDIA's GPUs and CUDA.
Architecture characteristics (or “How does it work?”)
For illustrative reasons, all examples below are based on data about the GeForce GTX 280.
The GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs
). Each SM consists of 8 Scalar Processors (SPs
) cores. Each SP can execute 4 threads simultaneously (each instruction takes at least 4 cicles,
multiplexing the threads in time). All the SPs in a multiprocessor execute the same instruction simultaneously. For an example, the GTX 280 has 30 SMs. Considering that a SP can execute 3 operations in 1 cicle (using the multiply-add unit to perform a multiplication
and add, and the special functions unit to perform another multiplication, all in single precision), we can have 30 SMs, each with its 8 SPs executing threads. In total, we have (1296MHz * 30SMs * 8SPs * 3 floating-point operations) = 933.12GFLOPS.
Each multiprocessor has on chip memory of four types:
- A set of registers per processor. They are used only by the threads that execute in each scalar processor. They can be accessed with no cost (0 cycles) and happen to be in a limited number.
- A shared memory that is shared among all scalar processors in the same multiprocessor. It can be accessed with no cost. It has about 16 KB.
- A read only constant cache that is shared by all scalar processors and speeds up reads from the constant memory space in the device memory.
- A read only texture cache that is shared by all scalar processors and speeds up reads from the texture memory space (a read only space) in the device memory.
In addition, we have the local
memory spaces in the device, which are read-write and not cached in the multiprocessors. The memory hierarchy is depicted in Figure 3.
Figure 3. Memory hierarchy.
The performance of an application is usually linked with a smart use of the memory hierarchy. The latency can be hidden with the multiplexing of threads executing. Some access can be coalesced. It means that if the various threats in a same half-warp (which
will be explained really soon in the Thread hierarchy section) access an aligned piece of memory, it can be done with only one transaction (and not 16!!!).
Each thread is a part of a block of threads. Each block is mapped to a multiprocessor. In each SM we can have 32 threads executing at the same time. This is named a
. A half warp
is half of a warp (obviously). All threads in a same warp execute the same instruction in different data. If a thread diverges in the execution (because of an
instruction, for an example), it executes the same instructions that the other threads
it does not save its results. Different paths of execution in a same warp are serialized, compromising the performance.
This blocks can be organized in a grid
. The grid just helps organizing the different blocks in one or two dimensions. Each thread block is required to execute independently. It means that there can be no dependencies between blocks in a same grid. They
must have the capacity to be executed in parallel, in any order or in series.
Programming characteristics (or “How do I use it?”)
Programmers write functions to run in the GPU using CUDA. CUDA stands for Compute Unified Device Architecture
. It englobes the programming language extension of C/C++ and the direct mapping of code to NVIDIA's GPUs. At the moment (November 2008),
only NVIDIA has CUDA Compatible
graphics processors (and this will probably remain like that for some time).
As a I just said before, CUDA is an extension of C/C++. To write code in this language, you only need to add some qualifiers in your functions and some calls for memory transfer between host and device, and
! Ok, it may not be that simple to get some beautiful speedup, but you'll probably improve your performance with some more programming.
Here is an example of code. (I know, it is not an intelligent code, but is just an example)
- //Visual studio header
- #include "stdafx.h"
- //Headers for CUDA and I/O
- #include <cuda.h>
- #include <stdio.h>
- // Kernel that executes on the CUDA device
- __global__ void vector_plus_plus(float *vector)
- int idx = **blockIdx.x* * *blockDim.x* + *threadIdx.x *;
- vector[idx] = vector[idx] +1;
- //main routine that executes on the host_
- int main(void)
- float *vector_at_host, *vector_at_device; _// Pointer to host and device arrays_
- const int n = 256; _// Number of elements in the arrays_
- size_t size = n * sizeof(float); __// Size of the array. Float -> single precision floating-point_
- vector_at_host = (float *)malloc(size); _// Allocate array on host_
- **cudaMalloc*((void **) &vector_at_device, size); _// Allocate array on device_
- _// Initialize host array_
- for (float i=0; i<n; i++) vector_at_host[i] = i;
- // Copy the array to the device
- **cudaMemcpy*(vector_at_device, vector_at_host, size, *cudaMemcpyHostToDevice*);
- _// Vector++ on device:_
- int block_size = 32;
- int number_of_blocks = n/block_size;
- vector_plus_plus *<<<* number_of_blocks, block_size *>>>* (vector_at_device);
- __// Retrieve the results from device_
- *cudaMemcpy*(vector_at_host, vector_at_device, size, *cudaMemcpyDeviceToHost*);
- _// Print results_
- for (int i=0; i<n; i++) printf("%d++ == %f\n", i, vector_at_host[i]);
- _// Cleanup_
This little piece of code takes an array of 256 singe precision floating-points, ranging from 0 to 255, and adds one to each component of it. Let's take a look at some details.
- Line 7. The qualifier __global__ in the definition of the function means that it is a kernel. A kernel is a function that executes in the device (the GPU, if someone already forgot that). All threads in the same call (line 27) will execute the exact
same instructions (they cannot diverge without “ifs”) in different data.
- Line 27. This is how we call the function. The <<< >>> are used to give some informations about the size of each block, and the number of blocks. I defined 32 (line 25) as the size of the block, but it could have been a much bigger
value. In this code, we execute 8 blocks of 32 threads each. It is good to use multiples of 32 because this is the size of a warp.
- Line 9. blockDim.x is the size of each block in the x dimension. A block, as well as a grid, can have multiple dimensions. This is used to help mapping problems to the threads.
blockId.x is the identifier of the block among all the block in the x dimension of the grid.
threadIdx.x identifies the thread among the other threads in the same block. Every code written to the device will show some use of these environment variables.
- Line 19. Memory allocations to the device are made with different calls. CUDA functions usually begin with
- Lines 23 and 29. The data will not get in its own in the device memory! You have to copy all data between device and host with some special functions. The direction (device->host,
host->device, device->device_) of the copy is stipulated with the use of constants like
cudaMemcpyDeviceToHost and cudaMemcpyHostToDevice.
More examples can be found