FT Performance Results

Hardware and software used:

  • Intel Core 2 Duo E8500 (3.16 GHz, FSB 1333 MHz, 6 MB Cache)
  • OCZ Reaper HPC Edition 4GB (2048MB x 2) DDR2 1066 MHz
  • XFX GeForce GTX 280 1024MB DDR3 XXX (GX-280N-ZDDU) (670 MHz, 1024MB DDR3@2500MHz)
  • Ubuntu Linux 9.04, 32-bit
  • CUDA Driver 2.3 (190.18 Beta)
  • nvcc 0.2.1221 compiler for the CUDA version, gcc 4.3.1 for the OpenMP version, both with -O3

Tests

Two versions of the benchmark were implemented.
  • Version 1 (V1): the kernels access global memory without coalescing, but all memory permutations are coalesced;
  • Version 2 (V2): the kernels access global memory with coalescing, but one memory permutation is not coalesced.
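
The difference between a coalesced and an uncoalesced access can be sketched with two small kernels. This is only an illustration under stated assumptions, not the benchmark's code: the kernel names (copy_coalesced, permute_strided) and the array sizes NX and NY are hypothetical. On GPUs of the GTX 280 generation, coalescing is evaluated per half-warp of 16 threads, so the sketch uses blocks that are 16 threads wide.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical sizes for illustration; these are not the FT W or FT A dimensions.
#define NX 256
#define NY 128

// Coalesced: threads with consecutive threadIdx.x read and write
// consecutive addresses.
__global__ void copy_coalesced(const float *in, float *out, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny)
        out[y * nx + x] = in[y * nx + x];
}

// Permutation with an uncoalesced write: reads are consecutive, but
// neighbouring threads write to addresses that are ny elements apart.
__global__ void permute_strided(const float *in, float *out, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny)
        out[x * ny + y] = in[y * nx + x];
}

int main(void)
{
    const int n = NX * NY;
    const size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((NX + block.x - 1) / block.x, (NY + block.y - 1) / block.y);

    // Plain copy: output must equal input element by element.
    copy_coalesced<<<grid, block>>>(d_in, d_out, NX, NY);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    int copy_errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i])
            ++copy_errors;

    // Permutation: element (x, y) of the input lands at (y, x) of the output.
    permute_strided<<<grid, block>>>(d_in, d_out, NX, NY);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    int perm_errors = 0;
    for (int y = 0; y < NY; ++y)
        for (int x = 0; x < NX; ++x)
            if (h_out[x * NY + y] != h_in[y * NX + x])
                ++perm_errors;

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    printf("copy mismatches: %d, permutation mismatches: %d\n",
           copy_errors, perm_errors);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return (err == cudaSuccess && copy_errors == 0 && perm_errors == 0) ? 0 : 1;
}
```

Both kernels move the same amount of data. Only the write pattern differs, which is the kind of trade-off that separates V1 from V2.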

The main idea behind these tests is to see which approach performs better, since the answer is not obvious. The tests consider two instances of the benchmark, FT W and FT A, which use 3-dimensional matrices of different sizes.

Results

ft_time.png
Figure 1. Execution time of the two CUDA versions and the OpenMP version of the FT benchmark.

ft_mops.png
Figure 2. Throughput, in millions of operations per second, of the two CUDA versions and the OpenMP version of the FT benchmark.

Figures 1 and 2 show the execution time in seconds and the throughput in millions of operations per second for the two CUDA versions of the benchmark and for the original OpenMP version. As the results show, the V2 implementation obtained the best performance. Still, its throughput decreases as the test instance grows, while the throughput of V1 increases. To understand this behavior, Table 1 presents the time spent in each of the three main functions of the benchmark for the two CUDA versions.

functions.png
Table 1. Times of the three main functions of the FT benchmark.

As the table shows, the main difference in behavior between test instances occurs in the function cuda_cffts1() of V2. This is the one function that permutes data without coalesced memory access.
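
This page does not say how the per-function times were collected. One common way to time a single GPU function is with CUDA events, sketched below as an assumption rather than as the method used here. The kernel scale() is a hypothetical placeholder standing in for a function such as cuda_cffts1().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of the benchmark's functions.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded in the same stream as the kernel, so the
    // elapsed time covers only the work issued between them.
    cudaEventRecord(start, 0);
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("scale kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Timing each function separately in this way makes it possible to attribute a change in total throughput to a single function.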

Conclusion

The FT benchmark obtained speedups when migrated to CUDA. Still, the different memory access patterns of the benchmark must be treated individually, since whether an access is coalesced can affect the final performance in different ways, as the opposite scaling of V1 and V2 with instance size shows.

Last edited Mar 10, 2010 at 11:26 AM by Pilla, version 9
