CG Performance Results
Hardware and software used:
- Intel Core 2 Duo E8500 (3.16 GHz, FSB 1333 MHz, 6 MB cache)
- OCZ Reaper HPC Edition 4GB (2048MB x 2) DDR2 1066 MHz
- XFX GeForce GTX 280 1024MB DDR3 XXX (GX-280N-ZDDU) (670 MHz, 1024MB DDR3@2500MHz)
- Ubuntu Linux 9.04, 32-bit
- CUDA Driver 2.3 (190.18 Beta)
- nvcc 0.2.1221 compiler for the CUDA version, gcc 4.3.1 for the OpenMP version, both with -O3
Figure 1. Time comparison between the CUDA and OpenMP versions of the CG benchmark.
Figure 2. Million operations per second comparison between the CUDA and OpenMP versions of the CG benchmark.
Figures 1 and 2 show the execution time in seconds and the throughput in millions of operations per second (Mop/s) for the CUDA and OpenMP versions of the CG benchmark. For the smaller instance (CG W), the dual-core CPU outperformed the GPU. This is mostly due to the limited parallelism available in this instance and to its sparse, irregular memory accesses. However, the CUDA version sustained the same throughput on both instances, while the dual-core throughput dropped on the second, larger instance. On the GPU, the additional parallelism offset the additional sparse accesses; on the CPU, it did not.
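The sparse memory accesses come mainly from the sparse matrix-vector product performed in every CG iteration. The following is a minimal sketch in Python, assuming the matrix is stored in compressed sparse row (CSR) format. The function and array names are hypothetical, and the code only illustrates the access pattern, not the benchmark's implementation. The load `x[col_idx[k]]` depends on the matrix structure, so consecutive iterations touch scattered addresses. This defeats CPU caches and prevents the GPU from coalescing memory accesses.

```python
def csr_spmv(row_ptr, col_idx, val, x):
    """Return y = A * x for a sparse matrix A in CSR format (assumed layout).

    row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i,
    col_idx[k] is the column of nonzero k and val[k] is its value.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    # Rows are independent: this outer loop is what a parallel
    # (OpenMP or CUDA) version would split across threads.
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect, data-dependent load: the source of the
            # irregular ("sparse") memory access pattern.
            acc += val[k] * x[col_idx[k]]
        y[i] = acc
    return y


# 3x3 example matrix:
#   [4 0 1]
#   [0 2 0]
#   [1 0 3]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
val = [4.0, 1.0, 2.0, 1.0, 3.0]
print(csr_spmv(row_ptr, col_idx, val, [1.0, 2.0, 3.0]))  # [7.0, 4.0, 10.0]
```

Mapping one row per thread parallelizes the outer loop easily. The inner indirect load, however, stays irregular however the rows are distributed. This is why adding more rows (more parallelism) can help a GPU hide memory latency without removing the underlying access problem.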
Sparse memory accesses and synchronization were the main performance bottlenecks on the GPU. Even so, applications of this kind do not achieve large speedups on other parallel systems either.
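The synchronization cost can be seen in a textbook form of the unpreconditioned conjugate gradient iteration. The sketch below is written under that assumption and is not the benchmark's source code; the names `conjugate_gradient`, `dot`, `matvec`, `max_iters` and `tol` are hypothetical. Each iteration performs one sparse matrix-vector product and two dot products. Every dot product is a global reduction, so all threads (on the GPU, all thread blocks) must finish before the scalar `alpha` or `beta` can be used. In CUDA this grid-wide step is commonly implemented by ending one kernel and launching another, which adds a fixed overhead to every iteration.

```python
def dot(u, v):
    # Global reduction: in a parallel version, every thread's partial
    # sum must be combined before the result is usable (sync point).
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, max_iters=100, tol=1e-20):
    """Solve A x = b for symmetric positive definite A (textbook CG).

    matvec(p) must return A * p; a sparse (e.g. CSR) product would go here.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual r = b - A x, with x = 0
    p = list(r)            # initial search direction
    rho = dot(r, r)        # reduction before the loop
    for _ in range(max_iters):
        if rho < tol:      # squared residual norm small enough: converged
            break
        q = matvec(p)                      # sparse mat-vec: irregular access
        alpha = rho / dot(p, q)            # reduction 1: sync point
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rho_new = dot(r, r)                # reduction 2: sync point
        beta = rho_new / rho
        rho = rho_new
        p = [ri + beta * pi for ri, pi in zip(r, p)]
    return x


# Usage with a small dense SPD matrix standing in for the sparse one.
A = [[4.0, 1.0], [1.0, 3.0]]
mv = lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]
print(conjugate_gradient(mv, [1.0, 2.0]))  # approx. [1/11, 7/11]
```

The vector updates between the reductions are embarrassingly parallel. The two reductions per iteration, by contrast, force all workers to wait for one another. For a small instance there is little work between these waits, so the synchronization overhead dominates.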