AMD OpenCL on Ubuntu 16.04

I was interested in using my brand new “gaming” card for parallel computation. Of course I want to do this by using using free software if possible, not by installing proprietary drivers.

Ubuntu 16.04 has an mesa-opencl-icd package, as well as the libclc-* packages that should be enough to support open source OpenCL on AMD hardware.

However it turned out that my card is not supported yet.

Having gained some experience from etnaviv I wasn’t exactly looking forward to building the Mesa graphics driver and all its dependencies from source. Thankfully Paulo Dias supplies a PPA that has the bleeding edge packages build directly from LLVM and Mesa necessary.

$ sudo add-apt-repository ppa:paulo-miguel-dias/mesa 
$ sudo apt-get update
$ sudo apt-get install libclc-amdgcn mesa-opencl-icd

I experimented in a headless machine so did not have to worry about graphics drivers and X. If you do, be careful as this may (or may not) break the graphical environment. The following section should help in that case.

Removing a PPA

To revert to standard Ubuntu drivers:

$ sudo apt-get install ppa-purge
$ sudo ppa-purge ppa:oibaf/graphics-drivers

Enumerating devices

To see if everything went well, you can run the following to list the OpenCL platforms and devices. Get devices.c and compile it

$ gcc devices.c -o hello -O2 /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 
$ ./devices

Example output:

1. Platform
  Profile: FULL_PROFILE
  Version: OpenCL 1.1 MESA 11.3.0-devel (padoka PPA)
  Name: Clover
  Vendor: Mesa
  Extensions: cl_khr_icd
1. Device: AMD TONGA (DRM 3.1.0, LLVM 3.9.0)
 1.1 Hardware version: OpenCL 1.1 MESA 11.3.0-devel (padoka PPA)
 1.2 Software version: 11.3.0-devel (padoka PPA)
 1.3 OpenCL C version: OpenCL C 1.1 
 1.4 Parallel compute units: 28

Running an OpenCL test

A small OpenCL example that squares a bunch of numbers can be found here. Additionally it dumps the binary to disk for disassembly (see below).

$ gcc hello.c -o hello -O2 /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
$ ./hello 
success: got back 1 binaries, total size 8450
binary 0: size 8450 dumped to square0.gallium_bin
Computed '1024/1024' correct values!

Disassembly

I’m always interested in how the underlying machine code looks. A good tool for assembling and disassembling binaries for this particular architecture is CLRX.

I needed to do some patching before it would accept the binaries, possibly because Gallium or LLVM switched the format recently. I reported this so possibly by the time you read this it has been fixed and you don’t need the patch. This was fixed.

git clone https://github.com/CLRX/CLRX-mirror.git
cd CLRX-mirror
mkdir build
cd build
cmake ..
make -j4
export CLRXPATH=$PWD/programs

This will create an executable clrxdisasm in programs.

The disassembler doesn’t, for Gallium binaries, currently detect the GPU architecture (I am not even sure whether this is encoded), so this has to be specified explicitly. If not, the instructions will look strange, as there is a significant difference between GCN 1.0, 1.1 and 1.2 ISAs.

Using it with the square0.gallium_bin that was produced by hello.c

${CLRXPATH}/clrxdisasm -g tonga square0.gallium_bin

This should give an output such as

.gallium
.gpu Tonga
.kernel square
.text
square:
        s_load_dword    s3, s[0:1], 0x18
        s_nop           0x0
        s_load_dword    s4, s[0:1], 0x34
        s_waitcnt       lgkmcnt(0)
        s_mul_i32       s3, s3, s2
        v_add_u32       v0, vcc, s3, v0
        v_cmp_gt_u32    vcc, s4, v0
        s_and_saveexec_b64 s[2:3], vcc
        s_xor_b64       s[2:3], exec, s[2:3]
        s_cbranch_execz .L136_0
        s_load_dwordx2  s[4:5], s[0:1], 0x2c
        s_nop           0x0
        s_load_dwordx2  s[0:1], s[0:1], 0x24
        v_ashrrev_i32   v1, 31, v0
        v_lshlrev_b64   v[0:1], 2, v[0:1]
        s_waitcnt       lgkmcnt(0)
        v_add_u32       v2, vcc, s4, v0
        v_mov_b32       v3, s5
        v_addc_u32      v3, vcc, v1, v3, vcc
        v_add_u32       v4, vcc, s0, v0
        v_mov_b32       v0, s1
        v_addc_u32      v5, vcc, v1, v0, vcc
        flat_load_dword v0, v[4:5]
        s_waitcnt       vmcnt(0) & lgkmcnt(0)
        v_mul_f32       v0, v0, v0
        flat_store_dword v[2:3], v0
        s_waitcnt       vmcnt(0) & lgkmcnt(0)
.L136_0:
        s_or_b64        exec, exec, s[2:3]
        s_endpgm

There is also an assembler clrxdisasm that should be able to convert this back to a binary, but I haven’t experimented with that yet.

This reminds me of the old times with decuda and cudasm, except that there is no need for any reverse engineering here. AMD helpfully provides everything necessary to write free software drivers.

References

Written on May 6, 2016
Tags:
Filed under