EP Performance Results
Here I compare the performance of my CUDA implementation with that of the original OpenMP version. The tests were run on the same computer. Some parameters of these runs are listed below:
Hardware and software used:
- Intel Core 2 Duo E8500 (3.16 GHz, FSB 1333 MHz, 6 MB Cache)
- OCZ Reaper HPC Edition 4GB (2048MB x 2) DDR2 1066 MHz
- XFX GeForce GTX 280 1024MB DDR3 XXX (GX-280N-ZDDU) (670 MHz, 1024MB DDR3@2500MHz)
- Ubuntu Linux 9.04 32-bit
- CUDA Driver 2.3 (190.18 Beta)
- nvcc compiler for the CUDA version, gcc 4.3.1 for the OpenMP version, both with -O2
Figure 1. Time comparison between the CUDA and OpenMP versions of the EP benchmark.
Figure 1 shows the execution times of both versions, reported with 95% confidence intervals. The horizontal axis shows the different instances (classes) of the EP benchmark. The left vertical axis shows the time in seconds and the right vertical axis shows the speedup. Even with OpenMP using both cores of the CPU, the CUDA version is much faster, and the speedup grows with the size of the instance. This happens because larger instances do more computation on the GPU while using the same amount of memory, so the overhead of memory transfers is absorbed by the computation time. This can be seen in Figure 2.
Figure 2. Million operations per second comparison between the CUDA and OpenMP versions of the EP benchmark.
Figure 2 presents the random number generation throughput of both versions. The left vertical axis shows it in millions of operations per second. The throughput of the OpenMP version is almost constant because it is already using all the available hardware on the host. The throughput of the CUDA version increases with the instance size because the memory transfer overhead is absorbed by the longer computation time.
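To make the "absorbed overhead" argument concrete, here is a minimal C sketch of a simple cost model: a run pays a fixed overhead (for example, memory transfers) plus a time proportional to the number of operations. The function name, its parameters, and the model itself are my assumptions for illustration. They are not taken from the benchmark code, and no measured values are implied.

```c
#include <assert.h>

/* Hypothetical cost model, not the benchmark's own code:
 *   total time = fixed overhead + ops / peak rate
 * Returns the resulting throughput in million operations per second. */
double throughput_mops(double ops, double overhead_s, double peak_ops_per_s)
{
    assert(ops > 0.0);
    assert(overhead_s >= 0.0);
    assert(peak_ops_per_s > 0.0);

    double total_s = overhead_s + ops / peak_ops_per_s;
    return ops / total_s / 1e6;
}
```

Under this model, calling throughput_mops with a growing ops and a fixed overhead_s gives values that rise toward peak_ops_per_s / 1e6, which would match the shape described for the CUDA curve. With overhead_s equal to zero the result is constant, as described for the OpenMP curve.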
The EP benchmark showed that CUDA can increase the performance of a parallel benchmark even when it depends heavily on double-precision floating-point instructions. The nature of the benchmark helped: the computations are mostly independent of each other, the memory constraints could be circumvented by reusing the allocated arrays, and there was enough computation to be done on the device.
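The idea of reusing allocated arrays can be sketched in plain C as follows. This is only an illustration of the general pattern (one fixed-size buffer, allocated once and refilled for every chunk of independent work), not the code of the CUDA port. The names fill_chunk and sum_squares_chunked and the squared-index workload are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-element work: writes the squares of
 * the global indices start, start + 1, ... into buf[0..count). */
static void fill_chunk(double *buf, size_t start, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        double x = (double)(start + i);
        buf[i] = x * x;
    }
}

/* Sums i * i for i in [0, n) by processing the range in chunks.
 * A single buffer of `chunk` elements is allocated once and reused for
 * every chunk, so memory use does not grow with n.
 * Returns -1.0 if the buffer cannot be allocated. */
double sum_squares_chunked(size_t n, size_t chunk)
{
    assert(chunk > 0);

    double *buf = malloc(chunk * sizeof *buf);
    if (buf == NULL)
        return -1.0;

    double total = 0.0;
    for (size_t start = 0; start < n; start += chunk) {
        /* The last chunk may be shorter than the others. */
        size_t count = (n - start < chunk) ? n - start : chunk;
        fill_chunk(buf, start, count);
        for (size_t i = 0; i < count; i++)
            total += buf[i];
    }

    free(buf);
    return total;
}
```

Because each chunk is independent, the result does not depend on the chunk size. Only the amount of memory in use at any one time changes.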