AMD OpenCL on Ubuntu 16.04
I was interested in using my brand new “gaming” card for parallel computation. Of course I want to do this by using using free software if possible, not by installing proprietary drivers.
Ubuntu 16.04 has an mesa-opencl-icd
package, as well as the libclc-* packages that should be enough to support
open source OpenCL on AMD hardware.
However it turned out that my card is not supported yet.
Having gained some experience from etnaviv I wasn’t exactly looking forward to building the Mesa graphics driver and all its dependencies from source. Thankfully Paulo Dias supplies a PPA that has the bleeding edge packages build directly from LLVM and Mesa necessary.
$ sudo add-apt-repository ppa:paulo-miguel-dias/mesa
$ sudo apt-get update
$ sudo apt-get install libclc-amdgcn mesa-opencl-icd
I experimented in a headless machine so did not have to worry about graphics drivers and X. If you do, be careful as this may (or may not) break the graphical environment. The following section should help in that case.
Removing a PPA
To revert to standard Ubuntu drivers:
$ sudo apt-get install ppa-purge
$ sudo ppa-purge ppa:oibaf/graphics-drivers
Enumerating devices
To see if everything went well, you can run the following to list the OpenCL platforms and devices. Get devices.c and compile it
$ gcc devices.c -o hello -O2 /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
$ ./devices
Example output:
1. Platform
Profile: FULL_PROFILE
Version: OpenCL 1.1 MESA 11.3.0-devel (padoka PPA)
Name: Clover
Vendor: Mesa
Extensions: cl_khr_icd
1. Device: AMD TONGA (DRM 3.1.0, LLVM 3.9.0)
1.1 Hardware version: OpenCL 1.1 MESA 11.3.0-devel (padoka PPA)
1.2 Software version: 11.3.0-devel (padoka PPA)
1.3 OpenCL C version: OpenCL C 1.1
1.4 Parallel compute units: 28
Running an OpenCL test
A small OpenCL example that squares a bunch of numbers can be found here. Additionally it dumps the binary to disk for disassembly (see below).
$ gcc hello.c -o hello -O2 /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
$ ./hello
success: got back 1 binaries, total size 8450
binary 0: size 8450 dumped to square0.gallium_bin
Computed '1024/1024' correct values!
Disassembly
I’m always interested in how the underlying machine code looks. A good tool for assembling and disassembling binaries for this particular architecture is CLRX.
I needed to do some patching
before it would accept the binaries, possibly because Gallium or LLVM switched
the format recently. I reported this so possibly by the time you read
this it has been fixed and you don’t need the patch. This was fixed.
git clone https://github.com/CLRX/CLRX-mirror.git
cd CLRX-mirror
mkdir build
cd build
cmake ..
make -j4
export CLRXPATH=$PWD/programs
This will create an executable clrxdisasm in programs.
The disassembler doesn’t, for Gallium binaries, currently detect the GPU architecture (I am not even sure whether this is encoded), so this has to be specified explicitly. If not, the instructions will look strange, as there is a significant difference between GCN 1.0, 1.1 and 1.2 ISAs.
Using it with the square0.gallium_bin that
was produced by hello.c
${CLRXPATH}/clrxdisasm -g tonga square0.gallium_bin
This should give an output such as
.gallium
.gpu Tonga
.kernel square
.text
square:
s_load_dword s3, s[0:1], 0x18
s_nop 0x0
s_load_dword s4, s[0:1], 0x34
s_waitcnt lgkmcnt(0)
s_mul_i32 s3, s3, s2
v_add_u32 v0, vcc, s3, v0
v_cmp_gt_u32 vcc, s4, v0
s_and_saveexec_b64 s[2:3], vcc
s_xor_b64 s[2:3], exec, s[2:3]
s_cbranch_execz .L136_0
s_load_dwordx2 s[4:5], s[0:1], 0x2c
s_nop 0x0
s_load_dwordx2 s[0:1], s[0:1], 0x24
v_ashrrev_i32 v1, 31, v0
v_lshlrev_b64 v[0:1], 2, v[0:1]
s_waitcnt lgkmcnt(0)
v_add_u32 v2, vcc, s4, v0
v_mov_b32 v3, s5
v_addc_u32 v3, vcc, v1, v3, vcc
v_add_u32 v4, vcc, s0, v0
v_mov_b32 v0, s1
v_addc_u32 v5, vcc, v1, v0, vcc
flat_load_dword v0, v[4:5]
s_waitcnt vmcnt(0) & lgkmcnt(0)
v_mul_f32 v0, v0, v0
flat_store_dword v[2:3], v0
s_waitcnt vmcnt(0) & lgkmcnt(0)
.L136_0:
s_or_b64 exec, exec, s[2:3]
s_endpgm
There is also an assembler clrxdisasm that should be able to convert this
back to a binary, but I haven’t experimented with that yet.
This reminds me of the old times with decuda and cudasm, except that there is no need for any reverse engineering here. AMD helpfully provides everything necessary to write free software drivers.
References
- AMD GCN3 Instruction Set Architecture - document describing the GCN1.2 (“generation 3”) ISA. AMD is super helpful in providing this instruction documentation.