Not so long ago a number of our customers ran into a peculiar problem - applications which used the GPU on Raspberry Pi devices were not working. This included both 2D and 3D acceleration, so not only 3D rendering was affected but also video playback and even accelerated window managers.
The symptoms presented themselves rather strangely - the application would start and perhaps play a few frames, then freeze and fail to come back to life.
The first task in solving the problem was to reproduce it with a small example program. We chose the Hello Triangle example application provided by the Raspberry Pi Foundation in their Firmware repo, which displays a spinning 3D cube with a different image on each face.
Straight away we were able to reproduce the issue - when run as a resin.io application the cube would rotate for a few frames and then freeze. What was puzzling, however, was that this application would run perfectly well on a Raspberry Pi when executed stand-alone (as you might expect, being an example program).
It turned out that the key difference was running the application in a container - outside of a container the 3D would work correctly, inside it'd freeze.
One huge clue was the following error message, which showed up in the output of the vcdbg log msg GPU diagnosis tool:
025819.942: *** No KHAN handle found for pid 24
This indicated to us that the interface with the GPU used PID to identify the client process. What made this particularly pertinent to containers is that a core part of container technology is PID namespacing:
A PID namespace 'renumbers' PIDs for all processes that reside in that namespace so, as far as all the processes that live there are aware, their PIDs are assigned starting from 1 again. This is obviously very useful for containers which run processes that need to 'believe' that they are the only processes running on the system.
Even when a process is namespaced in this way, the kernel still tracks its 'global' PID, i.e. the PID that is unique across the whole system and that the process would have been assigned had no renumbering taken place.
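The renumbering can be observed from userspace. The following is a minimal sketch, assuming a Linux system and a caller with enough privileges to create a PID namespace (usually root); the function name is made up for illustration. A helper process calls unshare(CLONE_NEWPID), forks, and the resulting child reports what getpid() returns to it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stores in *seen the PID that the first process of a fresh PID namespace
 * gets back from getpid(). Returns 0 on success, or -1 if the namespace
 * could not be set up (usually because the caller lacks privileges).
 */
static int pid_seen_in_new_namespace(long *seen)
{
    int fds[2];

    assert(seen != NULL);
    if (pipe(fds) != 0)
        return -1;

    pid_t helper = fork();
    if (helper < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (helper == 0) {
        close(fds[0]);
        /* After unshare(), children of this helper start in a new namespace. */
        if (unshare(CLONE_NEWPID) != 0)
            _exit(2);

        pid_t child = fork();
        if (child < 0)
            _exit(3);
        if (child == 0) {
            /* Inside the namespace: this is the virtual PID. */
            long pid = (long)getpid();
            ssize_t n = write(fds[1], &pid, sizeof pid);
            _exit(n == (ssize_t)sizeof pid ? 0 : 4);
        }
        /* Here `child` holds the same process's PID as seen from outside. */
        waitpid(child, NULL, 0);
        _exit(0);
    }

    close(fds[1]);
    long pid = -1;
    ssize_t n = read(fds[0], &pid, sizeof pid);
    close(fds[0]);
    waitpid(helper, NULL, 0);
    if (n != (ssize_t)sizeof pid)
        return -1;

    *seen = pid;
    return 0;
}
```

When the call succeeds the child reports PID 1, even though the helper that forked it sees an ordinary, much larger PID for the very same process.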
Coming back to our problem - the machinery surrounding communication with the GPU uses the PID as a unique identifier, so any mismatch between the global PID in kernel-side code and the virtual PID in userland code would explain the error.
My colleague (and resin.io CTO) Petros was able to confirm that PID namespacing alone triggered the issue: he created minimal repro code that simply enabled namespacing and executed Hello Triangle, and this demonstrated the issue once again.
Raspberry Pi devices use the VideoCore processor architecture to implement 3D and accelerated graphics. Part of the mechanism for interfacing with this architecture is a message queue system which allows userspace to communicate with the GPU via a kernel component - VCHIQ - the VideoCore Host Interface Queue (this thread has some more details.)
Userland processes interface with VCHIQ by opening the /dev/vchiq file and sending ioctls to access functions and send messages to the GPU.
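In outline, that interaction is an ordinary open() followed by ioctl() calls on the resulting file descriptor. The sketch below shows only this shape: the two helper names are hypothetical, and the request codes are deliberately left as a parameter because they are defined by the driver's own header and are not reproduced here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Opens the VCHIQ device node; returns a file descriptor or -1 on error. */
static int vchiq_device_open(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDWR);
}

/*
 * Sends one ioctl request on an open descriptor. `request` has to be one of
 * the request codes from the driver's header (an assumption of this sketch:
 * the caller supplies it). Returns the ioctl result, or -1 on error.
 */
static int vchiq_device_request(int fd, unsigned long request, unsigned long arg)
{
    return ioctl(fd, request, arg);
}
```

A client would keep the descriptor returned by vchiq_device_open("/dev/vchiq") open for as long as it talks to the GPU, and close() it when finished.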
Digging around in the VCHIQ driver code in the Raspberry Pi kernel fork, my colleague Andrei found a highly pertinent function - vchiq_open(). This is called when /dev/vchiq is first opened and initialises a new VCHIQ session associated with the process that opened it.
This function stores the current 'thread group ID' into the VCHIQ instance pid field (in context):
instance->pid = current->tgid;
The thread group ID is the PID of the process to which the currently running thread belongs (strictly, the PID of the thread group's leader), so regardless of which thread interfaced with VCHIQ in a multi-threaded application, the stored PID would be the same.
What's interesting here is that, because this reads the kernel-observed PID, it stores the global value rather than the namespaced one.
We were able to observe that messages sent via VCHIQ from userspace programs were embedding their observed PID, so a mismatch between those PIDs and the instance PID was a huge contender for the source of our problems.
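The suspected failure can be reproduced in a toy model. This is not the driver or firmware logic, just a sketch under the assumption that sessions are recorded under one PID and later looked up by the PID carried in a message; every name in it is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SESSIONS 8

/* Toy stand-in for per-open session state; not the driver's real struct. */
struct session {
    int in_use;
    int pid;    /* PID recorded when the session was opened */
};

struct session_table {
    struct session slots[MAX_SESSIONS];
};

/* Records a new session under the PID the kernel sees; returns slot or -1. */
static int session_open(struct session_table *t, int kernel_pid)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_SESSIONS; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].in_use = 1;
            t->slots[i].pid = kernel_pid;
            return i;
        }
    }
    return -1;
}

/* Finds the session for the PID embedded in a message; returns slot or -1. */
static int session_find(const struct session_table *t, int message_pid)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_SESSIONS; i++) {
        if (t->slots[i].in_use && t->slots[i].pid == message_pid)
            return i;
    }
    return -1;
}
```

If a session is opened under a global PID such as 1337 while the containerised process stamps its messages with its virtual PID 24, session_find() comes back empty, which is the same kind of failure the "No KHAN handle found for pid 24" log line points to.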
It's important to note here that the use of PIDs as a unique identifier is fairly arbitrary - all the VCHIQ machinery needs to do is to uniquely identify each client so messages get to the right place, it's just unfortunate that PIDs were used for this.
Fix Attempt 1: The Simple Solution
Andrei experimented with simply changing this line to the one below, which sets the pid field to the namespaced rather than the global PID:
instance->pid = task_tgid_vnr(current);
This completely fixed the issue, proving our theory correct. Andrei then submitted a PR with this change with a request for comment.
It was clear that this could not be a permanent fix - it was possible for more than one process to share the same ID which would be disastrous - messages intended for one VCHIQ instance would end up with another and vice-versa.
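To make the collision concrete, here is a small illustrative model (the struct and function names are hypothetical). Two clients in different PID namespaces can both see themselves as, say, PID 24, so an identifier built from the namespaced PID no longer picks out a single client, whereas the global PID still does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of a client: its system-wide PID and the PID it sees
 * inside its own namespace. */
struct client {
    int global_pid;
    int virtual_pid;
};

/* Counts the clients that would claim a message addressed to `pid` when
 * clients are identified by their namespaced PID. */
static size_t claimants_by_virtual_pid(const struct client *c, size_t n, int pid)
{
    size_t count = 0;
    assert(c != NULL);
    for (size_t i = 0; i < n; i++)
        if (c[i].virtual_pid == pid)
            count++;
    return count;
}

/* Same question, but with clients identified by their global PID. */
static size_t claimants_by_global_pid(const struct client *c, size_t n, int pid)
{
    size_t count = 0;
    assert(c != NULL);
    for (size_t i = 0; i < n; i++)
        if (c[i].global_pid == pid)
            count++;
    return count;
}
```

With one client per container, each seeing itself as PID 24, the first function reports two claimants for the same identifier while the second reports exactly one per global PID.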
It was however a useful temporary solution - in the usual resin.io use case it would be unlikely there'd be multiple clients running in different namespaces.
Regardless, we were committed to finding a better solution and one that could be contributed back to the upstream project, so pressed forward and researched the problem further.
Fix Attempt 2: Kernel Spelunking
At this point I got involved in the project and started by exploring the kernel to see whether it might be able to perform some kind of translation between the PID provided in the userspace messages and process global PID. This way, the kernel could be altered to silently 'fix up' the issue and neither the userland process nor the GPU using the PIDs as unique identifiers would experience any change in behaviour.
Unfortunately it turned out not to be possible to do this in any nice way - the messages are in large part transmitted as raw bytes, and at least some of them carry header values that include the process PID.
With help from Petros I was, however, able to get some spelunky code together which manually changed the header values and performed a sort of 'man in the middle' fix-up even in these raw bytes.
If I could get a clear list of possible message formats, positively identify which messages contained a PID in their header and then fix them up, this solution could work. However, it was messy, potentially slow, fragile in the face of any API change and generally something of a hack (albeit a clever one).
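The flavour of such a fix-up can be sketched as follows. The header layout here is purely hypothetical (a 32-bit PID at a fixed byte offset); the real message formats were exactly the part that was not clearly known, which is why the approach was fragile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed for illustration only: the PID sits 4 bytes into the message. */
#define HYPOTHETICAL_PID_OFFSET 4

/*
 * If the raw message carries `virtual_pid` at the assumed offset, overwrite
 * it with `global_pid`. Returns 1 if the message was rewritten, 0 if it was
 * too short or carried a different value.
 */
static int rewrite_pid_in_header(uint8_t *msg, size_t len,
                                 uint32_t virtual_pid, uint32_t global_pid)
{
    uint32_t found;

    assert(msg != NULL);
    if (len < HYPOTHETICAL_PID_OFFSET + sizeof found)
        return 0;

    /* memcpy avoids unaligned access into the raw byte buffer. */
    memcpy(&found, msg + HYPOTHETICAL_PID_OFFSET, sizeof found);
    if (found != virtual_pid)
        return 0;

    memcpy(msg + HYPOTHETICAL_PID_OFFSET, &global_pid, sizeof global_pid);
    return 1;
}
```

Even in this tiny form the weaknesses show: any payload that happens to contain the same value at that offset would be rewritten too, and a change to the message layout would silently break the fix-up.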
In order to get some insight from the Raspberry Pi team I created an issue discussing the approach and showing some early code.
Through this it became obvious this approach was sadly not going to be workable, though perhaps we could create our own hacky fix if we needed to. Of course we weren't going to be satisfied with that :)
Fix Attempt 3: Userland
I was inspired by a comment from Phil Elwell of the Raspberry Pi team suggesting I create a new ioctl to obtain global PID and update the userland code to use this, rather than attempt to modify things on the kernel side.
With this in mind I dived back into experimenting with the code, and used my explorations kernel-side to have a look around the userland tools. As I explored I realised there was an existing ioctl - GET_CLIENT_ID - that already did what we needed, and there was even a (then unused) helper function for using it -
I carefully checked for functions which retrieved PID for use in VCHIQ and found that the problematic area was centred around the khronos interface and in particular the
Part of the task was to change as little code as possible and avoid risking changing generic PID helper functions which might break other userland applications. It seemed that changing code only in the khronos interface was the way to go in this regard.
I first ensured that callers would have sufficient state to be able to interface with VCHIQ. khronos_platform_get_process_id() is called without parameters, so if the calling process had no way of working out how to send an ioctl, I'd have had to create an explicit new ioctl for translating between virtual and global PIDs. That would not only have been inefficient, it would also have been something of an icky potential information leak security-wise.
Thankfully I found that there was thread state sufficient to be able to interface with any ioctl I chose, and so I was able to adjust khronos_platform_get_process_id() to use my ioctl wrapper,
Since all of the callers of this function had already retrieved at least the local thread state, it turned out to be unnecessary to go through the khronos function at all, so I changed all of the callers to use rpc_client_id() directly - I wanted to avoid harming performance as much as possible. The khronos function is exported externally, so it was still worth keeping in place.
We Love Open Source
After a lot of testing which confirmed the approach worked, I submitted a PR containing the change and as much description of the problem as I could provide.
I had a number of concerns about the patch ranging from performance to completeness (was only updating the khronos portion of the userland code sufficient?), and raised each of these. I really think it's important to be as clear as possible in open source discussions to try to get the best input from the most knowledgeable people.
While waiting for feedback, the resin.io team wanted to get the fix out to our users as quickly as possible. We created a patched userland package for Raspbian and set up our Docker base images to give this package a higher priority than the official one, meaning anybody using the userland libraries would get the fixed versions.
We were able to directly assist a number of users in getting their applications working, and looked forward to contributing the fix upstream so that nobody else had to run into the problem.
Thankfully, the Raspberry Pi team came back positively and merged the PR, indicating that the patch had no impact on performance (in fact it seemed oddly to have slightly enhanced it) and that the implementation was the correct one!
At resin.io we really love open source so were very happy to take part in the kind of collaboration open source allows between two disparate teams who are trying to make the best software they can and solve problems for users.
As Linus's law states - given enough eyeballs, all bugs are shallow.
Do you want to contribute to a Linux distribution optimised for containers on embedded devices? We're hiring! Drop us a line at email@example.com!