@ryanbreen

Summary

  • Fixes CI failure on QEMU 8.x (GitHub runners) that passes on QEMU 10.x (local)
  • Root cause: the return to idle_loop used the IST stack instead of the idle thread's kernel stack
  • IST stacks are small (~4KB) and overflow when timer interrupts fire during idle_loop

Root Cause Analysis

When a userspace process faults and the page fault handler terminates it, the handler
sets up the exception frame to return to idle_loop. Previously, we used:

let current_rsp: u64;
core::arch::asm!("mov {}, rsp", out(reg) current_rsp);
frame.stack_pointer = x86_64::VirtAddr::new(current_rsp + 256);

This captures the current RSP, which at that point is on the page fault handler's IST stack (IST[1]).
When IRET returns to idle_loop with this RSP, idle_loop keeps running on the small IST stack.
Timer interrupts then push frames onto this stack, eventually causing overflow and
corrupting RSP to values like 0xffffc97ffffffff0.
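
One way to catch this class of bug early is to check, before IRET, that the stack pointer written into the frame does not lie inside an IST stack. The following is a minimal sketch and not code from this PR: `StackRegion`, `contains`, and `old_return_rsp` are hypothetical names, plain `u64` stands in for `x86_64::VirtAddr`, and the 4 KiB size follows the approximate figure given above.

```rust
/// Hypothetical description of a stack: `top` is the highest address,
/// and the stack grows down toward `top - size`.
#[derive(Clone, Copy)]
pub struct StackRegion {
    pub top: u64,
    pub size: u64,
}

impl StackRegion {
    /// Returns true if `rsp` points into this stack.
    /// An empty stack has `rsp == top`, so the top address is included.
    pub fn contains(&self, rsp: u64) -> bool {
        let bottom = self.top - self.size;
        rsp >= bottom && rsp <= self.top
    }
}

/// Models the old behavior: the return RSP is derived from whatever
/// stack the handler is currently running on.
pub fn old_return_rsp(current_rsp: u64) -> u64 {
    current_rsp + 256
}
```

With an IST stack of 4 KiB and a handler running a few hundred bytes below its top, `old_return_rsp` yields an address that `contains` still reports as inside the IST stack, which is the condition described above.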

Fix

Use per_cpu::kernel_stack_top() to get the idle thread's actual kernel stack,
which is large (64KB) and meant for normal execution.
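
The shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: `ReturnFrame` and `PerCpu` are simplified stand-ins, the method on `PerCpu` only mimics `per_cpu::kernel_stack_top()` (whose real signature is not shown here), and the parameters of `setup_idle_return` are hypothetical.

```rust
/// Simplified stand-in for the interrupt frame that IRET restores.
/// The real frame has more fields; only the relevant ones are modeled.
pub struct ReturnFrame {
    pub instruction_pointer: u64,
    pub stack_pointer: u64,
}

/// Hypothetical per-CPU state holding the idle thread's kernel stack top.
pub struct PerCpu {
    pub idle_kernel_stack_top: u64,
}

impl PerCpu {
    /// Stand-in for `per_cpu::kernel_stack_top()`; the real function
    /// may take no receiver and return a different type.
    pub fn kernel_stack_top(&self) -> u64 {
        self.idle_kernel_stack_top
    }
}

/// Points the frame at idle_loop on the idle thread's kernel stack.
/// The stack the handler is currently running on is deliberately ignored.
pub fn setup_idle_return(frame: &mut ReturnFrame, cpu: &PerCpu, idle_loop_addr: u64) {
    frame.instruction_pointer = idle_loop_addr;
    frame.stack_pointer = cpu.kernel_stack_top();
}
```

The point of the design is that the return stack pointer no longer depends on the current RSP at all, so it does not matter whether the handler was entered on an IST stack.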

Test plan

  • Local boot-stages test passes (138/138)
  • CI boot-stages test passes

🤖 Generated with Claude Code

When returning to idle_loop from exception handlers (page fault, etc.),
we were using `current_rsp + 256` as the stack pointer. This is wrong
when running on IST stacks (page fault uses IST[1]).

IST stacks are small (~4KB) and meant only for exception handling.
When idle_loop runs on the IST stack and timer interrupts fire,
the interrupt frames and nested calls can overflow the small IST stack,
causing memory corruption and crashes.

This bug manifested as kernel page faults at 0xffffc97ffffffff0 (top of
PML4[402] region) - a corrupted RSP value. It only appeared on QEMU 8.x
(GitHub CI) but not QEMU 10.x (local) due to timing differences.
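
The PML4[402] observation can be checked with standard x86_64 4-level paging arithmetic, where bits 39 to 47 of a virtual address select the PML4 entry. The helper names below are illustrative and not taken from the kernel.

```rust
/// Index into the top-level page table (PML4): bits 39..=47 of the address.
pub fn pml4_index(addr: u64) -> u64 {
    (addr >> 39) & 0x1ff
}

/// Byte offset of `addr` within the 512 GiB region covered by one PML4 entry.
pub fn offset_in_pml4_region(addr: u64) -> u64 {
    addr & ((1u64 << 39) - 1)
}
```

For 0xffffc97ffffffff0, `pml4_index` gives 402 and the offset is 16 bytes below the end of that 512 GiB region, consistent with a stack pointer sitting just under the top of the region.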

Fix: Use per_cpu::kernel_stack_top(), which returns the idle thread's
actual kernel stack. That stack is large enough for normal execution.

Changed in two places:
- kernel/src/interrupts.rs: page fault handler recovery path
- kernel/src/interrupts/context_switch.rs: setup_idle_return()

Co-Authored-By: Ryan Breen <ryanbreen@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ryanbreen ryanbreen merged commit 4f8782e into main Jan 11, 2026
1 check passed