VPU proof of concept Ingenic JZ4770
Lately I've tried to get to the second (AUX) core of the Ingenic JZ4770 in the GCW Zero. This is part of the VPU (Video Processing Unit) and not really documented, so this was the result of quite some trial and error. But after clocking down the AHB1 bus to 166MHz I was suddenly able to reliably run code on the extra core. The interesting thing about the VPU in the JZ4770 is that it simply runs MIPS code like the main core (albeit at half clock rate) and not another "secret" ISA.
The repository contains source for the tests in
src/tests, the code for the AUX core is in
src/firmware. Build instructions are provided in
README.md in the repository.
Here comes an overview of the current test cases.
The first proof of concept for running code on the AUX core. Firmware is loaded an executed that updates values at specified locations in memory. The main core polls for this change and displays the result.
Loading code... Firmware size: 112 Executing code... Result: 87654321 TCSM0 87654321 87654322 87654323 87654324 87654325 87654326 87654327 87654328 87654329 8765432a 8765432b 8765432c 8765432d 8765432e 8765432f 87654330
Shows a moving pattern written by VPU directly to the framebuffer. It appears that the VPU can read and write directly from and to any physical address.
Shows using an interrupt for completion notification. This test does a memory benchmark (details of the memory area can be configured in test4_p1.h).
Allocated physical memory buffer at 0eca0000 Loading code... Firmware size: 124 Executing code... Completion token: deadcafe Elapsed time: 4.00s Total writes: 400.00MB Write rate: 99.92MB/s
The VPU has three DMA engines: GP0 GP1 and GP2. In my experiments it appears that only the third (GP2) can be used to copy to main memory. The DMA engines can copy 2D surfaces which means that the length of a row and the source and destination stride can be supplied. They take a task list, so multiple commands can be queued at the same time.
This test case generates a 64x64 pattern in SRAM, then creates a task list that tiles it to the framebuffer with a single DMA invocation. Output:
Allocated physical memory buffer at 0eca0000 Loading code... Firmware size: 496 Executing code... Bytes 0e70 Task 11 End=0 Bytes 0f88 Task 14 End=0 Bytes 0f60 Task 15 End=0 Bytes 0e98 Task 16 End=0 Bytes 0e20 Task 17 End=0 Bytes 0d30 Task 18 End=0 Bytes 0c40 Task 19 End=0 Bytes 0000 Task 19 End=1 Completion token: b01dface Elapsed time: 0.00s
With these capabilities it looks like the VPU could be used for offloading some tasks in games, even ignoring the video-specific hardware blocks. A developer-friendly framework around it would be useful, but for now I don't have time to do anything beyond these basic proof of concepts. If you'd like to use this or pick up from here and have any questions, let me know.