OpenCL on i.MX6
As a GPGPU aficionado, I am pretty excited to be able to run OpenCL on a small, low-power device. While playing around with it I wrote down some notes and performance tips.
I'll begin with mentioning the software versions used, and a workaround that had to be in place to be able to build against OpenCL at all. After that I'll list some performance tips, and the article ends with some limitations encountered along the way.
Hardware: GC2000, revision 0x5108, 4 shader cores.
Software: I wanted to put the CL compiler version here, but alas the driver refused to cough up any specific version information using the OpenCL API. Anyhow, it is included with
- Freescale rootfs "L3.0.35_1.1.0_oneiric.tgz"
- libHAL.so reports Vivante version "4.6.9:1478"
Newer versions of the driver may have solved (some of) the later mentioned limitations.
=== 1 OpenCL platform(s) found: === -- 0 -- PROFILE = EMBEDDED_PROFILE VERSION = OpenCL 1.1 NAME = Vivante OpenCL Platform VENDOR = Vivante Corporation EXTENSIONS = === 1 OpenCL device(s) found on platform: -- 0 -- DEVICE_NAME = Vivante OpenCL Device DEVICE_VENDOR = Vivante Corporation DEVICE_VERSION = OpenCL 1.1 DEVICE_EXTENSIONS = cl_khr_byte_addressable_store DRIVER_VERSION = OpenCL 1.1 DEVICE_MAX_COMPUTE_UNITS = 4 DEVICE_MAX_CLOCK_FREQUENCY = 500 DEVICE_GLOBAL_MEM_SIZE = 100663296
Workaround: link against libGALAt least on my Freescale rootfs, libOpenCL needs functions from libGAL but is not linked against it. This means that
-lGALhas to be added to the linker command line along with
PerformanceFloating point performance can be up to 16 GFLOPS according to shoc OpenCL benchmark maxflops (but beware, the board this was executed on was likely not tuned for full performance). If you want to repeat the benchmark yourself you can use my shoc version adapted to run on this hardware.
Mind your compiler optimizations
A quick test revealed that adding
-cl-mad-enable -cl-unsafe-math-optimizations -cl-fast-relaxed-math to
clBuildProgram halved the number of generated instructions for a simple A*B kernel. The worst-case path was explicitly comparing to NaN and Inf, likely to enforce IEEE 754 edge-case compliance. In some cases this may be required, but keep in mind that full performance of the device can only be reached with those options enabled.
Like on ATI hardware, float2/3/4 operations directly translate to vectorized instructions. Floating point instructions always perform operations on four values, so using smaller vectors under-utilizes the hardware. For integer operations it does not matter as the underlying instructions are scalar.
See for example the Madd test results from shoc:
Add1-SP Size:2097152 GFLOPS 0.891504 Add4-SP Size:2097152 GFLOPS 8.00293 Add8-SP Size:2097152 GFLOPS 1.59215 Add16-SP (result missing) MAdd1-SP Size:2097152 GFLOPS 4.58433 MAdd4-SP Size:2097152 GFLOPS 16.0061 MAdd8-SP Size:2097152 GFLOPS 3.1843 MAdd16-SP *Size:2097152 GFLOPS 14.8157
float4 and float16 are clearly the sweet spots. I am not sure why float8 suffers, it could be due to certain instruction interleaving pattern or a compiler limitation.
libCLC.so shows that the CL compiler is based on clang+LLVM, which is an advanced optimizing compiler combo. This means there is likely some room for further performance tuning.
LimitationsHere I've listed some limitations compared to OpenCL on giant, power-hungry PC graphics cards.
Global work size
The GLOBAL_WORK_SIZE per dimension cannot exceed 1<<16 with the Vivante driver. This limitation is not documented anywhere and I am not sure if this is violation of spec, however the documentation for
The values specified in global_work_size + corresponding values specified in global_work_offset cannot exceed the range given by the sizeof(size_t) for the device on which the kernel execution will be enqueued. The sizeof(size_t) for a device can be determined using CL_DEVICE_ADDRESS_BITS in the table of OpenCL Device Queries for clGetDeviceInfo. If, for example, CL_DEVICE_ADDRESS_BITS = 32, i.e. the device uses a 32-bit address space, size_t is a 32-bit unsigned integer and global_work_size values must be in the range 1 .. 2^32 - 1. Values outside this range return a CL_OUT_OF_RESOURCES error.
CL_DEVICE_ADDRESS_BITS is 32, so if we were to believe the above snippet it should allow for very large global work sizes. Maybe it is different for the embedded profile.
In the shoc maxflops benchmark I worked around this by emulating a larger global work size by using a 2D global work size, ie.
Another possibility proposed by @pixelio is to invoke
clEnqueueNDRangeKernel in a loop with increasing offset. But this turned out to be problematic, as the limit holds for the work offset as well.
Allowed workgroup size very strongly depends on the kernel, and can be anywhere between 16 to 256 (256 is the number of threads per shader core). Do not assume that the reported maximum work group size (1024) can be used but query it for every kernel separately. shoc fails to do this, as well as many OpenCL examples found on the web.
Another potential pitfall: the returned maximum workgroup size is not always a power of two (and there is nothing in the standard that says it has to be).
Keep your kernels small.
As could be expected on an embedded device, it is quite easy to run out of shader memory. For example, the bitcoin miner is too large, as well as a few of the kernels used in shoc. This will result in
CL_OUT_OF_RESOURCES errors during
The "query chip identity" ioctl call on GC2000 reports an instructionCount of 512. Looking at command stream dumps from etna_viv the device appears to have 0x0E000 - 0x0C000 = 8192 bytes of instruction memory, with Vivante's 128 bit instructions this confirms that there is only room for 512 instructions.
Unfortunately, I am not currently aware of a way to query the shader size in native instructions from the OpenCL API, as
CL_PROGRAM_BINARY_SIZES shows the size of the IR.
The OpenCL 1.1 doc lists atomic functions such as
atomic_add, but any attempt to use them (on global and local memory) resulted in "atomic function XXX not supported" compiler error.
I've tried various ways to work around this, such as specifying
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable (shouldn't be needed for 1.1 though), but that's greeted with "warning: pragma OPENCL EXTENSION is not supported, ignoring pragma".