OpenCL on Kaveri

Kaveri APUs from AMD are the first APUs with hUMA support. This is a big step for OpenCL development: the GPU can now read and write global RAM directly, so copying large amounts of memory from RAM to GPU memory and back is no longer necessary. I want to give a short overview of the characteristics of OpenCL programming with Kaveri and its performance.

Overview

By default your kernel is compiled with a 32-bit address width. To access the complete RAM, set the environment variable GPU_FORCE_64BIT_PTR to 1. The GPU device of my Kaveri (A10-7850K) has the following specifications:

Device Name: Spectre (AMD Accelerated Parallel Processing, OpenCL 1.2 AMD-APP (1445.5))
Address Bits: 64
Little Endian: true
Global Memory Size: 512 MB
Base Address Alignment Bits: 2048
Global Memory Cache Size: 16 KB
Local Memory Size: 32 KB
Clock Frequency: 720 MHz
Compute Units: 8
Constant Buffer Size: 64 KB
Max Workgroup Size: 256

The GPU has 512 processing units spread over 8 compute units, that is 64 per compute unit, which matches the wavefront size of 64 that is typical for AMD. The reported global memory size is a bit confusing: it suggests that we can only access 512 MB of global memory, which is not true.
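The GPU_FORCE_64BIT_PTR setting mentioned above is normally exported in the shell before starting the program. As a minimal sketch, it could also be set from the host program itself. This assumes the call happens before the first OpenCL call so that the runtime still picks it up; that timing requirement is an assumption and has not been verified here. The helper name force_64bit_gpu_pointers is made up for illustration, setenv is the POSIX call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for setenv() */
#endif
#include <stdlib.h>

/* Hypothetical helper: sets GPU_FORCE_64BIT_PTR=1 for the current process.
 * Returns 0 on success, -1 on failure (same convention as setenv). */
int force_64bit_gpu_pointers(void)
{
    /* overwrite = 0: a value already exported in the shell is kept */
    return setenv("GPU_FORCE_64BIT_PTR", "1", 0);
}
```

Passing 0 as the last argument means a value the user exported by hand wins over the default chosen by the program.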

Performance

Now let's briefly look at the OpenCL performance of Kaveri and compare it to the CPU performance.

Example 1

Our first example is very simple. We have an array of n elements of type int. We want to increment every element.
A C function for this task would look like this:

void increment(int* buf, size_t n)
{
    for (size_t index = 0; index < n; ++index)
        ++buf[index];
}

Our OpenCL kernel looks like this:

__kernel void increment(__global int* buf)
{
    const size_t index = get_global_id(0);
    ++buf[index];
}
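The kernel has no loop and no length parameter: every work-item handles exactly one element, so the host has to launch exactly n work-items. The following sketch works out the launch geometry in plain C. The struct and function names are hypothetical; the work-group size of 256 and the wavefront size of 64 are the values given in this article. Because the kernel has no bounds check, the sketch refuses sizes that are not a multiple of the work-group size instead of rounding them up (OpenCL 1.2 also requires the global size to be divisible by the local size).

```c
#include <stddef.h>

/* Hypothetical description of a 1-D launch for a kernel like increment(). */
typedef struct {
    size_t global_size;           /* total work-items, one per array element */
    size_t local_size;            /* work-items per work-group */
    size_t num_groups;            /* global_size / local_size */
    size_t wavefronts_per_group;  /* local_size / wavefront, rounded up */
} launch_geometry;

/* Fills *out and returns 0 if n is a non-zero multiple of local_size,
 * otherwise returns -1 and leaves *out untouched. */
int make_launch_geometry(size_t n, size_t local_size, size_t wavefront,
                         launch_geometry *out)
{
    if (n == 0 || local_size == 0 || wavefront == 0 || n % local_size != 0)
        return -1;

    out->global_size = n;
    out->local_size = local_size;
    out->num_groups = n / local_size;
    out->wavefronts_per_group = (local_size + wavefront - 1) / wavefront;
    return 0;
}
```

For 2^29 elements, a work-group size of 256 and a wavefront size of 64, this gives 2,097,152 work-groups of 4 wavefronts each. The global_size and local_size fields are the values one would pass to clEnqueueNDRangeKernel.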

To access the host memory from our kernel, we have to call clCreateBuffer with the flag CL_MEM_USE_HOST_PTR. For example:

clCreateBuffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE, 
               bufferSize, buffer, &error);
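The buffer pointer passed to clCreateBuffer above has to come from somewhere on the host. A minimal sketch of such an allocation follows, assuming the host memory should be aligned to the device's base address alignment (2048 bits, that is 256 bytes, in the specifications listed earlier). Whether the AMD runtime actually needs this alignment to avoid an internal copy is an assumption here, and alloc_host_buffer is a hypothetical helper name. aligned_alloc is standard C11.

```c
#include <stddef.h>
#include <stdlib.h>

/* Base address alignment reported by the device: 2048 bits = 256 bytes. */
#define BASE_ALIGN_BYTES ((size_t)(2048 / 8))

/* Hypothetical helper: allocates room for n ints, aligned to 256 bytes.
 * Stores the exact payload size in *buffer_size (the value to hand to
 * clCreateBuffer). Returns NULL on overflow or allocation failure. */
int *alloc_host_buffer(size_t n, size_t *buffer_size)
{
    size_t bytes = n * sizeof(int);
    /* aligned_alloc expects a size that is a multiple of the alignment */
    size_t rounded = (bytes + BASE_ALIGN_BYTES - 1)
                     / BASE_ALIGN_BYTES * BASE_ALIGN_BYTES;

    if (n == 0 || bytes / sizeof(int) != n || rounded < bytes)
        return NULL;

    *buffer_size = bytes;
    return aligned_alloc(BASE_ALIGN_BYTES, rounded);
}
```

The returned pointer is released with free() once the OpenCL buffer object that wraps it has been released.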

We no longer need to map the buffer, because the GPU accesses the host memory directly.
To compare OpenCL on the GPU with the CPU, we increment 2^29 integers, which at 4 bytes each equals 2 GB. The C function is executed on the CPU, the OpenCL kernel on the GPU.

CPU: 1071 ms
OpenCL+GPU: 493 ms

The GPU was slightly more than twice as fast as the CPU, which is not much for OpenCL. But the test function was clearly memory bound, so this was not a good example for showing the advantages of Kaveri.
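To put numbers on "slightly more than twice as fast" and on "memory bound", the timings can be turned into a speedup and an effective memory throughput. The sketch below assumes that every element is read once and written once, so 2 x 2 GB are moved in total; the function names are made up for illustration.

```c
#include <stddef.h>

/* How many times faster the GPU run was than the CPU run. */
double speedup(double cpu_ms, double gpu_ms)
{
    return cpu_ms / gpu_ms;
}

/* Effective throughput in GB/s (10^9 bytes per second), assuming each
 * element is read once and written once. */
double effective_bandwidth_gbs(size_t n_elements, size_t elem_size, double ms)
{
    double bytes_moved = 2.0 * (double)n_elements * (double)elem_size;
    return bytes_moved / (ms / 1000.0) / 1e9;
}
```

With the timings above, this gives a speedup of about 2.17, and roughly 8.7 GB/s for the GPU against roughly 4.0 GB/s for the CPU.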

Example 2

If we increase the computational complexity, the advantage becomes clearer.

void randomcalculation(int* buf, size_t n)
{
    for (size_t index = 0; index < n; ++index)
    {
        const int v = buf[index];
        buf[index] = ((v-1379) / 7 + 13) * 5 / 13 * v * 17 / 23 * (20+v);
    }
}

__kernel void randomcalculation(__global int* buf)
{
    const size_t index = get_global_id(0);
    const int v = buf[index];
    buf[index] = ((v-1379) / 7 + 13) * 5 / 13 * v * 17 / 23 * (20+v);
}

Now the GPU is about seven times faster than the CPU.

CPU: 3479 ms
OpenCL+GPU: 501 ms

The difference in execution time between CPU and GPU increases even more with more complex functions. The following diagram is the result of some tests I did by adding more complexity to the test function.
[Figure: Execution time by function complexity]

Additional information

  • For these tests I used a workgroup size of 256.
  • I excluded the compilation time of the kernel from the measured execution time.
  • The test system has 16 GB of RAM at 2133 MHz.
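The measurement code itself is not shown in this article. As a rough sketch of how the CPU side could be timed, the following helper wraps a single call of a function such as increment with a monotonic clock. The use of clock_gettime (POSIX) and the helper name time_ms are assumptions, not the method actually used for the numbers above. For the GPU side, the same idea would apply to the kernel launch plus a wait for completion, with the kernel compilation done beforehand.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#endif
#include <stddef.h>
#include <time.h>

/* CPU reference function from Example 1. */
static void increment(int *buf, size_t n)
{
    for (size_t index = 0; index < n; ++index)
        ++buf[index];
}

/* Hypothetical helper: runs fn(buf, n) once and returns the elapsed
 * wall-clock time in milliseconds. */
double time_ms(void (*fn)(int *, size_t), int *buf, size_t n)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    fn(buf, n);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

A call like time_ms(increment, buf, n) then returns the duration of one pass over the buffer.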

One thought on “OpenCL on Kaveri”

  1. Hi Abrok. Do you think it is possible to port SF into OpenCL? A GPU-powered SF would be unbeatable. Maybe memory bottleneck is a problem though.
