Kaveri APUs from AMD are the first APUs with hUMA support. This is a big step for OpenCL development: the GPU can now read and write system RAM directly, so copying large amounts of data from RAM to GPU memory and back is no longer necessary. I want to give a short overview of the characteristics of OpenCL programming on Kaveri and of its performance.
By default, your kernel is compiled with a 32-bit address width. Set the environment variable GPU_FORCE_64BIT_PTR to 1 to access the complete RAM. The GPU device of my Kaveri (A10-7850K) has the following specifications:
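Since forgetting the variable silently limits kernels to 32-bit pointers, a host program might check for it before setting up OpenCL. The sketch below only inspects the process environment and does not change how the OpenCL runtime behaves; the helper names are hypothetical. The variable itself still has to be set where the program is launched, for example by exporting it in the shell.

```c
#define _POSIX_C_SOURCE 200112L /* request POSIX declarations such as setenv() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if GPU_FORCE_64BIT_PTR is set to "1"
 * in this process's environment, 0 otherwise. */
int force_64bit_ptr_enabled(void)
{
    const char *value = getenv("GPU_FORCE_64BIT_PTR");
    return value != NULL && strcmp(value, "1") == 0;
}

/* Hypothetical helper: print a warning before OpenCL setup if the
 * variable is missing, since kernels may then use 32-bit pointers. */
void warn_if_32bit_pointers(void)
{
    if (!force_64bit_ptr_enabled()) {
        fprintf(stderr,
                "warning: GPU_FORCE_64BIT_PTR is not 1; "
                "kernels may be compiled with 32-bit pointers\n");
    }
}
```

Calling `warn_if_32bit_pointers()` at the start of `main`, before the first OpenCL call, is enough to catch a missing setting early.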
Device Name: Spectre (AMD Accelerated Parallel Processing, OpenCL 1.2 AMD-APP (1445.5))
Address Bits: 64
Little Endian: true
Global Memory Size: 512 MB
Base Address Alignment Bits: 2048
Global Memory Cache Size: 16 KB
Local Memory Size: 32 KB
Clock Frequency: 720 MHz
Compute Units: 8
Constant Buffer Size: 64 KB
Max Workgroup Size: 256
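Note that the base address alignment is given in bits (it corresponds to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`), so 2048 bits means 256 bytes. If you want host buffers to start on that boundary, for instance before wrapping them with `CL_MEM_USE_HOST_PTR` for zero-copy access, a minimal sketch using C11 `aligned_alloc` could look like this. The helper names are hypothetical, and whether alignment matters for your workload is an assumption you should measure.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* The device reports its base address alignment in bits:
 * 2048 bits / 8 = 256 bytes. */
size_t align_bits_to_bytes(unsigned int align_bits)
{
    return (size_t)align_bits / 8;
}

/* Hypothetical helper: allocate a host buffer aligned to the device's
 * base address alignment. C11 aligned_alloc() requires the size to be a
 * multiple of the alignment, so the size is rounded up first.
 * Returns NULL on failure; release the buffer with free(). */
void *alloc_device_aligned(size_t size, unsigned int align_bits)
{
    size_t align = align_bits_to_bytes(align_bits);
    if (align == 0 || size == 0) {
        return NULL;
    }
    size_t rounded = (size + align - 1) / align * align;
    return aligned_alloc(align, rounded);
}

/* Returns 1 if ptr lies on a multiple of align_bytes, 0 otherwise. */
int is_aligned(const void *ptr, size_t align_bytes)
{
    return align_bytes != 0 && (uintptr_t)ptr % align_bytes == 0;
}
```

For the values above, `alloc_device_aligned(n, 2048)` returns a buffer whose address is a multiple of 256 bytes.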
The GPU has 512 processing units spread over 8 compute units, i.e. 64 per compute unit, and it uses a wavefront size of 64, which is typical for AMD GPUs. The reported global memory size is misleading: it suggests that only 512 MB of global memory can be accessed, which is not true.
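The figures above can be related with a little arithmetic: 512 processing units over 8 compute units gives 64 per compute unit, and a maximum work-group of 256 work-items spans 256 / 64 = 4 wavefronts. The sketch below encodes this with hypothetical helper names. Padding the global work size to a multiple of the wavefront size is a common tuning practice, not something stated in the specifications above.

```c
#include <stddef.h>

/* Processing units per compute unit, e.g. 512 / 8 = 64 on this device. */
size_t units_per_compute_unit(size_t processing_units, size_t compute_units)
{
    return compute_units ? processing_units / compute_units : 0;
}

/* Number of wavefronts needed for one work-group. A partially filled
 * wavefront still occupies a full wavefront, so the result rounds up. */
size_t wavefronts_per_group(size_t group_size, size_t wavefront_size)
{
    if (wavefront_size == 0) {
        return 0;
    }
    return (group_size + wavefront_size - 1) / wavefront_size;
}

/* Round a global work size up to a multiple of the wavefront size.
 * The kernel must then ignore work-items whose get_global_id(0) is
 * beyond the real problem size. */
size_t round_up_to_wavefront(size_t n, size_t wavefront_size)
{
    if (wavefront_size == 0) {
        return n;
    }
    return (n + wavefront_size - 1) / wavefront_size * wavefront_size;
}
```

With these helpers, a problem of 1000 elements would be launched with a global size of 1024, and a work-group of 100 work-items would still occupy 2 wavefronts, leaving 28 lanes idle.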