Issue description

When an already-exiting task oopses (for example, in a file_operations::release handler), make_task_dead() calls do_task_dead() with preemption enabled. That makes it possible to get preempted in do_task_dead(), between becoming TASK_DEAD and entering the scheduler explicitly. That is bad because finish_task_switch() assumes that once the scheduler has switched away from a TASK_DEAD task, the task can never run again and its stack is no longer needed; but that assumption apparently doesn't hold if the dead task was preempted (the SM_PREEMPT case).

Fixing it

do_exit() calls preempt_disable() a few lines before calling do_task_dead(); I think also doing that in make_task_dead() will fix it.
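
The race, and the effect of the preempt_disable() call, can be illustrated with a small userspace toy model. This is only a sketch under simplifying assumptions. It is not kernel code and not the proposed patch. All toy_* names are hypothetical, the task stack refcount is modeled as a plain int, and a preemption is assumed to trigger the same dead-task cleanup as the final explicit schedule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel concepts; none of this is kernel code. */
enum toy_state { TOY_RUNNING, TOY_DEAD };
enum toy_mode  { TOY_SM_NONE, TOY_SM_PREEMPT };

struct toy_task {
    enum toy_state state;
    int preempt_count;  /* > 0 means preemption is disabled */
    int stack_refs;     /* models the task stack refcount */
};

/* Models finish_task_switch(): a dead previous task has its stack released. */
void toy_finish_task_switch(struct toy_task *prev)
{
    if (prev->state == TOY_DEAD)
        prev->stack_refs--;
}

/* Models switching away from prev; returns true if a switch happened. */
bool toy_schedule(struct toy_task *prev, enum toy_mode mode)
{
    assert(prev != NULL);
    /* A preemption attempt is refused while preemption is disabled. */
    if (mode == TOY_SM_PREEMPT && prev->preempt_count > 0)
        return false;
    toy_finish_task_switch(prev);
    return true;
}

/*
 * Models the tail of make_task_dead() followed by do_task_dead().
 * disable_preempt selects the preempt_disable() variant; preempt_in_window
 * injects a preemption between becoming dead and the explicit schedule.
 * Returns the final stack refcount: 0 is correct, -1 is an underflow.
 */
int toy_make_task_dead(bool disable_preempt, bool preempt_in_window)
{
    struct toy_task t = { TOY_RUNNING, 0, 1 };

    if (disable_preempt)
        t.preempt_count++;                  /* models preempt_disable() */

    t.state = TOY_DEAD;                     /* the task marks itself dead */
    if (preempt_in_window)
        toy_schedule(&t, TOY_SM_PREEMPT);   /* the problematic preemption */
    toy_schedule(&t, TOY_SM_NONE);          /* the explicit, final schedule */
    return t.stack_refs;
}
```

In this model, toy_make_task_dead(false, true) returns -1, which corresponds to the stack reference being dropped twice, while toy_make_task_dead(true, true) returns 0 because the preemption attempt is refused.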

I have written a proposed patch and will send that in a minute; but since this is scheduler magic, I don't feel very confident in my patch.

Affected versions

I have tested this at kernel commit 8ab992f815d6736b5c7a6f5fd7bfe7bc106bb3dc (current mainline). I have also tested that v7.0.4 (latest stable) is affected as well.

The bug seems to have been introduced in commit 7f80a2fd7db9 ("exit: Stop poorly open coding do_task_dead in make_task_dead").

Impact

This bug can turn a simple NULL dereference into a task stack UAF or double-free, meaning that two tasks can end up running on the same stack, which leads to all kinds of memory corruption. NULL dereferences are typically not considered to be security bugs on their own; because this bug escalates them into memory corruption, I think this is a security issue.

panic_on_oops mitigates this because it panics before make_task_dead() can be reached.

(I stumbled over this by accident: I unintentionally hit a kernel NULL deref, and was very surprised when the kernel started spitting out errors about stack cookie check failures and such.)

Disclosure deadline

This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2026-08-05.

For more details, see the Project Zero vulnerability disclosure policy: https://projectzero.google/vulnerability-disclosure-policy.html

Reproducer

Triggering the bug requires that the scheduler is configured to do preemption (for example with CONFIG_PREEMPT=y).

To trigger this bug, two things need to happen, I think:

    1. A NULL deref or similar fault needs to happen during task exit (for example, in a file_operations::release handler).
    2. A preemption needs to happen in the middle of do_task_dead().

I have attached a kernel diff that creates an artificial NULL deref in eventfd_release(), adds an artificial delay in do_task_dead(), and adds a bunch of debug logging in the scheduler. With that patch applied, the following trigger should cause some kind of kernel memory corruption:

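#include <sys/prctl.h>
#include <sys/eventfd.h>
int main(void) {
  /* Name the task so it is easy to spot in the kernel log. */
  prctl(PR_SET_NAME, "test123");
  /* Never closed: the eventfd is released while the task exits,
     which runs the patched eventfd_release() and oopses there. */
  eventfd(0, 0);
}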

This should first generate a normal NULL deref oops message, followed by debug logs like:

[  138.857730][  T553] note: test123[553] exited with irqs disabled
[  138.862099][  T553] Fixing recursive fault but reboot is needed!
[  138.863685][  T553] context_switch: GOING TO SWITCH FROM DEAD 553 to kworker/0:2/55; sched_mode=1
[  138.865899][  T140] __pick_next_task: PICKED DEAD TASK (point1) (kworker/2:2/140 -> test123/553)
[  138.865911][  T140] context_switch: SWITCHING TO DEAD TASK (kworker/2:2/140 -> test123/553); sched_mode=0
[  138.865925][   T55] finish_task_switch: SWITCHED FROM DEAD 553 to kworker/0:2/55 on cpu 0
[  138.865936][   T55] finish_task_switch: DONE SWITCHING FROM DEAD to kworker/0:2/55 on cpu 0
[  138.868014][  T553] context_switch: GOING TO SWITCH FROM DEAD 553 to ksoftirqd/2/27; sched_mode=1
[  138.877776][  T120] __pick_next_task: PICKED DEAD TASK (point1) (systemd-journal/120 -> test123/553)
[  138.877782][   T27] finish_task_switch: SWITCHED FROM DEAD 553 to ksoftirqd/2/27 on cpu 2
[  138.877785][  T120] context_switch: SWITCHING TO DEAD TASK (systemd-journal/120 -> test123/553); sched_mode=0

You should then see some indications of memory corruption, which often include a refcount underflow in put_task_stack(); for example, on one run, I'm seeing the following:

[  138.877836][    C3] WARNING: kernel stack frame pointer at ffffc90000288fe8 in test123:553 has bad value ffffc9000167fb98
[...]
[  138.883415][   T27] WARNING: kernel/cgroup/cgroup.c:7030 at cgroup_task_dead+0x237/0x290, CPU#2: ksoftirqd/2/27
[...]
[  139.324134][   T27] refcount_t: underflow; use-after-free.
[  139.324138][   T27] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x8f/0xd0, CPU#2: ksoftirqd/2/27
[...]
[  139.351068][   T27] Modules linked in:
[  139.351843][   T27] CPU: 2 UID: 0 PID: 27 Comm: ksoftirqd/2 Tainted: G    B D W           7.1.0-rc2-00117-g8ab992f815d6-dirty #7 PREEMPT
[  139.354291][   T27] Tainted: [B]=BAD_PAGE, [D]=DIE, [W]=WARN
[...]
[  139.357239][   T27] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[  139.362850][   T27] RIP: 0010:refcount_warn_saturate+0x8f/0xd0
[  139.364037][   T27] Code: 77 79 1e 03 67 48 0f b9 3a eb 4a e8 bb 60 5d ff 48 8d 3d 74 79 1e 03 67 48 0f b9 3a eb 37 e8 a8 60 5d ff 48 8d 3d 71 79 1e 03 <67> 48 0f b9 3a eb 24 e8 95 60 5d ff 48 8d 3d 6e 79 1e 03 67 48 0f
[...]
[  139.369656][   T27] RSP: 0018:ffffc900001f7cb0 EFLAGS: 00010293
[  139.370893][   T27] RAX: ffffffff82074d18 RBX: 0000000000000003 RCX: ffff8881026a1200
[  139.372440][   T27] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffffff8525c690
[  139.374046][   T27] RBP: ffffc900001f7cc0 R08: ffff88811a4bafbb R09: 1ffff110234975f7
[...]
[  139.380290][   T27] R10: dffffc0000000000 R11: ffffed10234975f8 R12: ffff8881026a19d8
[  139.381878][   T27] R13: ffff88811a4ba400 R14: ffff88811a4bafb8 R15: 0000000000000000
[  139.383426][   T27] FS:  0000000000000000(0000) GS:ffff88827198b000(0000) knlGS:0000000000000000
[  139.385187][   T27] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[...]
[  139.388226][   T27] CR2: 00007f61c26eef90 CR3: 000000010d31a000 CR4: 0000000000350ef0
[  139.389844][   T27] Call Trace:
[  139.393462][   T27]  <TASK>
[  139.394158][   T27]  put_task_stack+0x109/0x110
[  139.395101][   T27]  finish_task_switch+0x552/0x630
[...]
[  139.397917][   T27]  ? __switch_to+0x5a9/0x8d0
[  139.398831][   T27]  __schedule+0xf1e/0x17a0
[  139.399754][   T27]  ? asm_sysvec_call_function+0x1b/0x20
[  139.400856][   T27]  ? handle_softirqs+0x31b/0x350
[  139.401874][   T27]  schedule+0x8d/0x140
[  139.402682][   T27]  smpboot_thread_fn+0x343/0x4e0
[  139.403659][   T27]  ? __pfx_smpboot_thread_fn+0x10/0x10
[  139.404735][   T27]  kthread+0x20c/0x260
[...]
[  139.407351][   T27]  ? __pfx_smpboot_thread_fn+0x10/0x10
[  139.411424][   T27]  ? __pfx_kthread+0x10/0x10
[  139.412342][   T27]  ret_from_fork+0x16d/0x370
[  139.413256][   T27]  ? __pfx_kthread+0x10/0x10
[  139.414208][   T27]  ret_from_fork_asm+0x1a/0x30
[  139.415195][   T27]  </TASK>


Related CVE Number: CVE-2026-46173.

Credit: Jann Horn


