Description
I'm seeing a reproducible kernel crash when node-agent 1.27.0 is running and the host is under a high-throughput workload. The crash is a kernel Oops (general protection fault in perf_callchain_user()), and the call trace shows it is triggered from a BPF perf_event program calling bpf_get_stackid().
Environment
- OS: Ubuntu 24.04 LTS + HWE kernel
- Kernel at time of crash: 6.11.0-19-generic (#19~24.04.1-Ubuntu, PREEMPT_DYNAMIC, NOPTI)
- The only eBPF-capable agent running is node-agent
- Workload: several high network throughput Ruby processes (Docker Swarm deployment, Sidekiq workers)
- Host is bare metal, not a VM/VPS
Impact
- Host "disappears" - journalctl stops abruptly
- Host with default configuration -> seems to hang indefinitely until cold-booted; this happened in production, so we moved quickly to bring it back up
- Host configured to panic on oops -> panics and reboots
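For reference, the panic-on-oops behavior above is just the stock kernel sysctl knobs (values and path here are examples; the kdump/crash-kernel setup is separate):

```
# e.g. in an /etc/sysctl.d/ fragment (path and values illustrative)
kernel.panic_on_oops = 1   # escalate any Oops to a full panic
kernel.panic = 10          # reboot 10 seconds after panicking
```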
Reproduction notes
- The failure isn't instant; in a few reproductions we've seen the host run the workload and node-agent for 1-3 hours before Oopsing (when configured with a crash kernel) or disappearing off the network
- Does not appear to reproduce on 6.14.0-37 - same node, same production workload, longer timelines (several hours)
- Does not appear to reproduce with a heavy synthetic load similar to the real one (database/Redis chatter, HTTP load, etc.) - same node, much longer timelines (days)
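In case a generic trigger is useful: the path shown in the dmesg excerpt below (software cpu-clock perf event -> BPF perf_event program -> user callchain walk) can be exercised with bpftrace. This is only a guess at a synthetic reproducer, not what node-agent actually attaches:

```
# profile:hz:99 attaches a BPF program to a software (hrtimer-driven)
# perf event on every CPU; ustack makes it collect user callchains,
# which goes through get_perf_callchain()/perf_callchain_user().
sudo bpftrace -e 'profile:hz:99 { @[comm, ustack] = count(); }'
```

Running this alongside the real Ruby workload on the failing kernel might help narrow down whether node-agent's specific program matters.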
dmesg excerpt
Note that client.rb is likely the process that was interrupted, not the culprit - I'm pretty sure the Ruby workload does not do anything eBPF-related by itself.
```
Oops: general protection fault, probably for non-canonical address 0x7c31a167d3ba9ce9: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 UID: 450 PID: 224431 Comm: client.rb:283 Kdump: loaded Not tainted 6.11.0-19-generic #19~24.04.1-Ubuntu
Hardware name: [redacted]
RIP: 0010:perf_callchain_user+0x223/0x330
...
Call Trace:
 <IRQ>
 get_perf_callchain+0x15a/0x250
 bpf_get_stackid+0x61/0xc0
 bpf_get_stackid_pe+0xdd/0x110
 bpf_prog_002ea22e11430d4a_do_perf_event+0x600/0x673
 __perf_event_overflow+0x25d/0x340
 perf_swevent_hrtimer+0xd4/0x150
 ...
 </IRQ>
```
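For anyone triaging from the trace alone: bpf_get_stackid_pe implies a perf_event-type BPF program requesting a user stack. A minimal sketch of that shape, assuming libbpf conventions (this is not node-agent's actual program; the map size and names are made up):

```c
// BPF object (kernel side), not userspace code.
#include "vmlinux.h"            // assumption: generated with bpftool
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 16384);               // illustrative size
    __uint(key_size, sizeof(u32));
    __uint(value_size, 127 * sizeof(u64));    // max frames * sizeof(ip)
} stacks SEC(".maps");

SEC("perf_event")
int do_perf_event(struct bpf_perf_event_data *ctx)
{
    // BPF_F_USER_STACK asks the kernel to walk the *user* callchain,
    // i.e. the perf_callchain_user() path that faults in the Oops above.
    bpf_get_stackid(ctx, &stacks, BPF_F_USER_STACK);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```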
I reviewed newer releases for promising commits, but I haven't found anything suggesting this might already be addressed. Reproducing this is unfortunately non-trivial and not very safe - I need to divert a portion of production workload onto a known-failing node :) That being said, I'm open to limited experimentation - let me know what would be helpful. Thank you 🙏