Benchmarks’ Execution on Windows Server 2008 with CUDA

In this section, I present the details involved in migrating the selected NAS benchmarks to Windows Server 2008 using Visual Studio 2008 and CUDA.
For each benchmark, we have to:
  1. Modify the C code to compile on Windows
  2. Modify the code to use CUDA
After each step, we have to ensure that our implementation still works correctly. This is easy, since each benchmark comes with self-verifying code.

UNIX code to Interop code

Firstly, you will need the benchmarks. More details can be found here and here. I will use the EP benchmark as an example.
When compiling EP on Linux, you will get a file named npbparams.h. It contains definitions for the problem class that you chose to compile (S, W, A, B or C) and is generated at compilation time. You may generate several instances of this file, one for each problem size.
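For reference, a generated npbparams.h for EP looks roughly like the sketch below. The exact contents depend on the class and build settings; the values here are illustrative, not copied from a real build:

/* Sketch of a generated npbparams.h for EP (illustrative values). */
#define CLASS 'A'
#define M 28                  /* problem size parameter; class A uses M = 28 */
#define CONVERTDOUBLE FALSE
#define COMPILETIME "..."     /* filled in by the build */
#define NPBVERSION "2.3"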

Visual Studio 2008 Project

To create a new project in VS2008, go to File -> New -> Project (or Ctrl+Shift+N). Choose Visual C++ -> Win32 -> Win32 Console Application. In the Win32 Application Wizard window, click Next > and choose the Empty project option.
Now go to Project -> Add Existing Item (or Ctrl+Shift+A). Add the following files to your project (copying them into the project folder first makes things easier):
  • npb-C.h : contains some base definitions, used by all benchmarks;
  • npbparams.h : as said before;
  • ep.c : the benchmark;
  • c_randdp.c : contains the randomization functions;
  • c_timers.c : timing functions;
  • wtime.c, wtime.h : the low-level wall-clock function used by the timers;
  • c_print_results.c : contains the function that prints the final results.

Changes for Interoperability

wtime.c

Original headers:
#include "wtime.h"
#include <sys/time.h>

Interop. version:
#include "wtime.h"
#ifdef _WIN32
#include <winsock.h>     /* defines struct timeval */
#include <sys/timeb.h>   /* _ftime() and struct _timeb */
#include <sys/types.h>
#else
#include <sys/time.h>
#endif

#ifdef _WIN32
/* Windows has no gettimeofday(), so we emulate it with _ftime().
   _ftime() only has millisecond resolution, which is enough here. */
void gettimeofday(struct timeval *t, void *timezone)
{
    struct _timeb timebuffer;
    _ftime(&timebuffer);
    t->tv_sec  = timebuffer.time;
    t->tv_usec = 1000 * timebuffer.millitm;
}
#endif

void wtime(double *t)
{
    static int sec = -1;
    struct timeval tv;

    gettimeofday(&tv, (struct timezone *)0);
    if (sec < 0) sec = tv.tv_sec;
    *t = (tv.tv_sec - sec) + 1.0e-6 * tv.tv_usec;
}

Since sys/time.h is a UNIX header, we have to change this file a little so that it works on Windows. Still, this is the only code change needed for this step.
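For context, the timing functions in c_timers.c are thin wrappers around wtime(); simplified (this is a sketch, not the verbatim NPB source), they look roughly like this:

/* Sketch of the timing layer in c_timers.c (simplified). */
static double start[64], elapsed[64];

static double elapsed_time(void)
{
    double t;
    wtime(&t);               /* now portable thanks to the change above */
    return t;
}

void timer_clear(int n)  { elapsed[n] = 0.0; }
void timer_start(int n)  { start[n] = elapsed_time(); }
void timer_stop(int n)   { elapsed[n] += elapsed_time() - start[n]; }
double timer_read(int n) { return elapsed[n]; }

Because every benchmark measures time through this layer, fixing wtime.c is enough to make all the timing code portable.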

Guarantees of performance

To verify that we at least maintain the performance we get on UNIX with the benchmark, I ran some tests. I chose three problem classes (sizes) for this: W, A and B. Each instance of the EP benchmark was executed 50 times on each operating system (Windows Server 2008 64-bit and Ubuntu Linux 8.10). The machine was dedicated to these tests (which means that no one else was using the computer).

All instances were compiled with the Intel C Compiler on both operating systems. Figure 1 presents the results.
  • Computer configuration:
    • Intel Core 2 Duo E8500 (3.16 GHz, FSB 1333 MHz, 6 MB cache)
    • MSI P7N SLI Platinum (LGA 775), chipset: NVIDIA nForce 750i SLI
    • OCZ Reaper HPC Edition 4 GB (2 x 2048 MB) DDR2 1066 MHz
    • Seagate 250 GB SATA 2 (7200 RPM, 16 MB cache) ST3250410AS
    • XFX GeForce GTX 280 1024 MB DDR3 XXX (GX-280N-ZDDU) (670 MHz, 1024 MB DDR3 @ 2500 MHz)

[ep_performance.jpg]
Figure 1. Performance of the EP benchmark on Windows Server 2008 64-bit and Ubuntu Linux 8.10.

As we can see in Figure 1, the benchmark has similar performance on both operating systems, with a gain of a little more than 1% on Windows across the different problem sizes. With this, we can guarantee that the performance on Windows Server 2008 will be at least equal to that on the Linux OS installed on the same machine.

C code to CUDA code

Firstly, I changed all the .c files to .cu (the CUDA extension). This makes it clear that the files should be compiled by nvcc. Secondly, I created two files, parallel.cu and parallel.h, so that all the kernels live in the same file. Some parameters had to be moved into these files. It is important to emphasize that functions called from a kernel must be written in the same file as the kernel.
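To illustrate that last point, here is a minimal sketch (the function names are made up for this example, not taken from the EP port): a __device__ helper must live in the same .cu file as the __global__ kernel that calls it.

/* parallel.cu -- minimal sketch; names are illustrative only. */
__device__ double square(double x)       /* helper callable from device code */
{
    return x * x;
}

__global__ void example_kernel(double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = square((double)i);      /* device helper in the same file */
}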

The EP benchmark works mainly with double precision floating point. I tried changing it to single precision, but the results were not precise enough. Because of this, the benchmark can only execute on GPUs with Compute Capability 1.3 or above, the first to support double precision.
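Note that nvcc must also be told to target such a device, otherwise it demotes doubles to floats. Assuming a CUDA 2.x toolchain like the one used here (the file list is illustrative), the compilation would look like:

# Target Compute Capability 1.3 so that double precision is preserved.
nvcc -arch=sm_13 -o ep ep.cu parallel.cu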

The computation kernel, which was originally parallelized with OpenMP, was moved to a CUDA kernel that executes on the device. Some of the variables had to be turned into arrays, since thousands of threads execute the kernel in parallel; consequently, the memory needed by the benchmark increases. The per-thread results in these arrays are then reduced to smaller arrays or variables with a reduction kernel that I implemented. This reduction could be optimized in the future.
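The idea is sketched below (an illustrative shared-memory sum reduction, not the exact kernel from the port): each thread loads one partial result, and the block folds its values pairwise until one sum remains.

/* Illustrative block-level sum reduction; assumes blockDim.x is a power of two. */
__global__ void reduce_sum(double *partial, double *out, int n)
{
    extern __shared__ double cache[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    cache[tid] = (i < n) ? partial[i] : 0.0;
    __syncthreads();

    /* Fold the block's values pairwise until one sum remains. */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = cache[0];      /* one partial sum per block */
}

Launched as reduce_sum<<<blocks, threads, threads * sizeof(double)>>>(partial, out, n), this leaves one sum per block; the remaining few values can be reduced again or summed on the host.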

Due to this memory increase, we cannot keep the arrays, or even parts of them, in shared memory. To gain performance, the access pattern to the arrays was changed so that memory accesses coalesce. Due to memory limitations, some instances of the benchmark could not allocate enough memory on the device. To deal with this, I changed the kernel to repeat the computations, reusing the same memory space per thread. This is not a problem, since the original OpenMP implementation does the same thing. Still, it has a side effect: since my GPU is not dedicated to computations (my display is connected to it), the driver's watchdog kills any kernel that executes for more than about 5 seconds. On Linux I can execute for longer, but there is still a watchdog.
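The two techniques combine naturally in a grid-stride loop: with a stride equal to the total thread count, neighboring threads touch neighboring elements (so accesses coalesce), while each thread sweeps over several work items reusing the same per-thread storage. A sketch, with made-up names:

/* Illustrative coalesced grid-stride loop; names are hypothetical. */
__device__ double do_work(int i)         /* stand-in for the real computation */
{
    return (double)i * 0.5;
}

__global__ void ep_chunk(double *thread_sum, int total_work)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x; /* total number of threads */
    double sum = 0.0;

    /* Neighboring threads read neighboring indices on every pass, so the
       accesses coalesce; the loop reuses the same per-thread slot instead
       of allocating more device memory for each repetition. */
    for (int i = tid; i < total_work; i += stride)
        sum += do_work(i);

    thread_sum[tid] = sum;
}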

With all this, plus the removal of dead code, we have an implementation of the EP benchmark that runs with CUDA.
Some experimental results can be seen here.
