drm/v3d: Address race-condition between per-fd GPU stats and fd release #6973

Merged

Conversation

mairacanal
Contributor

This patch series was initially motivated by a race condition (exposed in PATCH 4/4) where we lacked synchronization for job->file access. This led to use-after-free issues when a file descriptor was closed while a job was still running.

However, beyond fixing this specific race, the series introduces broader improvements to active-job management and locking. While PATCH 1/4 and 2/4 are primarily code refactors, PATCH 3/4 changes the locking scheme. Previously, all queues shared the same spinlock, which caused unnecessary contention whenever several queues were busy at once. PATCH 3/4 allows queues to operate more independently.

[1] Upstream series: https://lore.kernel.org/dri-devel/[email protected]/T/

Closes #6958

Instead of storing a pointer to the DRM file data, store a pointer
directly to the private V3D file struct.

Signed-off-by: Maíra Canal <[email protected]>
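
A minimal sketch of the change PATCH 1/4 describes, assuming the field layout of the mainline v3d driver (`struct v3d_job`, `driver_priv`); the actual diff may differ:

```c
#include <linux/kref.h>
#include <drm/gpu_scheduler.h>

/* Sketch only: illustrates the refactor, not the literal patch. */
struct v3d_job {
	struct drm_sched_job base;
	struct kref refcount;
	struct v3d_dev *v3d;

	/*
	 * Before: struct drm_file *file, which forced every per-fd
	 * stats update to chase file->driver_priv at completion time.
	 * Now the V3D-private file struct is resolved once, when the
	 * job is created, and cached here.
	 */
	struct v3d_file_priv *file_priv;	/* was: struct drm_file *file */
};
```

Caching the private struct directly also gives the later patches a single field to synchronize and clear when the file descriptor goes away.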
Instead of storing the queue's active job in four different variables,
store the active job inside the queue's state. This way, it's possible
to access all active jobs using an index based on `enum v3d_queue`.

Signed-off-by: Maíra Canal <[email protected]>
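
Sketched concretely, with names assumed from the mainline `v3d_queue_state` (the `v3d_active_job()` helper is illustrative, not necessarily part of the patch):

```c
#include <drm/gpu_scheduler.h>

/* Sketch: each queue's state owns its active job. */
struct v3d_queue_state {
	struct drm_gpu_scheduler sched;
	struct v3d_job *active_job;	/* was v3d->{bin,render,tfu,csd}_job */
};

struct v3d_dev {
	/* ... other device state ... */
	struct v3d_queue_state queue[V3D_MAX_QUEUES];
};

/* Any queue's active job is now one array index away: */
static inline struct v3d_job *
v3d_active_job(struct v3d_dev *v3d, enum v3d_queue q)
{
	return v3d->queue[q].active_job;
}
```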
Each V3D queue works independently and all the dependencies between the
jobs are handled through the DRM scheduler. Therefore, there is no need
to use a single lock for all queues. Sharing one lock creates unnecessary
contention between queues that could otherwise operate independently.

Replace the global spinlock with per-queue locks to improve parallelism
and reduce contention between different V3D queues (BIN, RENDER, TFU,
CSD). This allows independent queues to operate concurrently while
maintaining proper synchronization within each queue.

Signed-off-by: Maíra Canal <[email protected]>
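
A sketch of the locking change under the same assumptions (the `queue_lock` field name follows the upstream series [1]; the retire helper is illustrative):

```c
#include <linux/spinlock.h>
#include <drm/gpu_scheduler.h>

/* Sketch: the per-queue state grows its own lock, replacing the
 * single global v3d->job_lock. */
struct v3d_queue_state {
	struct drm_gpu_scheduler sched;
	struct v3d_job *active_job;
	spinlock_t queue_lock;	/* protects active_job (and, with
				 * PATCH 4/4, job->file_priv) */
};

/* Illustrative completion path: a TFU completion now takes only the
 * TFU queue's lock, so it no longer serializes against BIN, RENDER,
 * or CSD activity. */
static void v3d_queue_retire_job(struct v3d_dev *v3d, enum v3d_queue q)
{
	struct v3d_queue_state *queue = &v3d->queue[q];
	unsigned long flags;

	spin_lock_irqsave(&queue->queue_lock, flags);
	queue->active_job = NULL;
	spin_unlock_irqrestore(&queue->queue_lock, flags);
}
```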
When the file descriptor is closed while a job is still running,
there's a race condition between the job completion callback and the
file descriptor cleanup. This can lead to accessing freed memory when
updating the per-fd GPU stats, as in the following oops:

[56120.512903] Unable to handle kernel paging request at virtual address 0000330a92b9688a
[56120.520881] Mem abort info:
[56120.523687] ESR = 0x0000000096000005
[56120.527454] EC = 0x25: DABT (current EL), IL = 32 bits
[56120.532785] SET = 0, FnV = 0
[56120.535847] EA = 0, S1PTW = 0
[56120.538995] FSC = 0x05: level 1 translation fault
[56120.543891] Data abort info:
[56120.546778] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[56120.552289] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[56120.557362] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[56120.562690] user pgtable: 16k pages, 47-bit VAs, pgdp=0000000023f54000
[56120.569239] [0000330a92b9688a] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[56120.577975] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
               CPU: 0 UID: 1000 PID: 1497409 Comm: mpv Not tainted 6.12.37-ncvm5+ #1
               Hardware name: Raspberry Pi 5 Model B Rev 1.0 (DT)
               pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
               pc : v3d_job_update_stats+0x64/0x168 [v3d]
               lr : v3d_job_update_stats+0x40/0x168 [v3d]
               sp : ffffc00080003e60
               x29: ffffc00080003e60 x28: ffff800002860000 x27: 0000000000000000
               x26: 0000000000000000 x25: ffff800002860000 x24: ffff800002630800
               x23: ffff800060786000 x22: 0000330a933c31fb x21: 0000000000000001
               x20: 0000330a92b96302 x19: ffff800060786b10 x18: 0000000000000000
               x17: ffffaf90506a0000 x16: ffffd06fce57c360 x15: 0000000000000000
               x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
               x11: 0000000000000000 x10: 0000000000000000 x9 : ffffd06f5d0fec40
               x8 : 0000000000000000 x7 : 0000000000000000 x6 : 000002978dbd535a
               x5 : 00ffffffffffffff x4 : 0000000000000015 x3 : 0000300001fddf88
               x2 : 0000000000000020 x1 : 0000000000010001 x0 : 0000330a92b96872
               Call trace:
                 v3d_job_update_stats+0x64/0x168 [v3d]
                 v3d_irq+0x118/0x2e0 [v3d]
                 __handle_irq_event_percpu+0x60/0x220

Fix such an issue by protecting all accesses to `job->file_priv` with
the queue's lock. With that, we can clear `job->file_priv` before the
V3D per-fd structure is freed and ensure that `job->file_priv` is still
valid during the per-fd GPU stats updates.

Fixes: e1bc3a1 ("drm/v3d: Avoid NULL pointer dereference in `v3d_job_update_stats()`")
Signed-off-by: Maíra Canal <[email protected]>
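
Putting the pieces together, the fix might take the shape sketched below. `v3d_stats_update()` exists in the mainline driver, but the helper names and exact placement here are assumptions, not the literal patch:

```c
#include <linux/spinlock.h>
#include <linux/sched/clock.h>

/* fd-release side (sketch): detach any in-flight job from the per-fd
 * state, under each queue's lock, before that state is freed. */
static void v3d_detach_file_priv(struct v3d_dev *v3d,
				 struct v3d_file_priv *file_priv)
{
	enum v3d_queue q;

	for (q = 0; q < V3D_MAX_QUEUES; q++) {
		struct v3d_queue_state *queue = &v3d->queue[q];
		unsigned long flags;

		spin_lock_irqsave(&queue->queue_lock, flags);
		if (queue->active_job &&
		    queue->active_job->file_priv == file_priv)
			queue->active_job->file_priv = NULL;
		spin_unlock_irqrestore(&queue->queue_lock, flags);
	}
	/* Only now is it safe to free file_priv. */
}

/* Completion side (sketch): the IRQ-path stats update holds the same
 * lock and tolerates a cleared file_priv instead of dereferencing
 * freed memory. */
static void v3d_job_update_stats_sketch(struct v3d_job *job,
					enum v3d_queue q)
{
	struct v3d_queue_state *queue = &job->v3d->queue[q];
	u64 now = local_clock();
	unsigned long flags;

	spin_lock_irqsave(&queue->queue_lock, flags);
	if (job->file_priv)	/* fd may have closed mid-flight */
		v3d_stats_update(&job->file_priv->stats[q], now);
	spin_unlock_irqrestore(&queue->queue_lock, flags);
}
```

Because both paths take the same per-queue lock, `job->file_priv` is either valid or NULL whenever the stats update runs; it can no longer point at freed memory.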
@popcornmix
Collaborator

Changes look sane, and a test of kodi, neverball and webgl aquarium worked fine.

@mairacanal
Contributor Author

> Changes look sane, and a test of kodi, neverball and webgl aquarium worked fine.

I also built a kernel out of this branch and submitted it to Mesa CI to run the full CTS testing suite. No regressions were identified.

Thanks for your review!

@pelwell pelwell merged commit 5a72e3a into raspberrypi:rpi-6.12.y Jul 25, 2025
11 of 12 checks passed
popcornmix added a commit to raspberrypi/firmware that referenced this pull request Aug 1, 2025
kernel: drm/v3d: Address race-condition between per-fd GPU stats and fd release
See: raspberrypi/linux#6973

kernel: pwm: rp1: use pwmchip_get_drvdata() instead of container_of()
See: raspberrypi/linux#6972
popcornmix added a commit to raspberrypi/rpi-firmware that referenced this pull request Aug 1, 2025
kernel: drm/v3d: Address race-condition between per-fd GPU stats and fd release
See: raspberrypi/linux#6973

kernel: pwm: rp1: use pwmchip_get_drvdata() instead of container_of()
See: raspberrypi/linux#6972
Successfully merging this pull request may close these issues.

Pi 5: Crash after heavy GPU load for hours.