Kernel crash on Ubuntu 24.04 HWE 6.11.0-19 with node-agent 1.27.0 #275

@paweljw

Description

I’m seeing a reproducible kernel crash when node-agent 1.27.0 is running and the host is under a high-throughput workload. The crash is a kernel Oops (general protection fault) in perf_callchain_user(), and the call trace shows it is triggered from a BPF perf_event program calling bpf_get_stackid().

Environment

  • OS: Ubuntu 24.04 LTS + HWE kernel
  • Kernel at time of crash: 6.11.0-19-generic (#19~24.04.1-Ubuntu, PREEMPT_DYNAMIC, NOPTI)
  • The only eBPF-capable agent running is node-agent
  • Workload: several high network throughput Ruby processes (Docker Swarm deployment, Sidekiq workers)
  • Host is bare metal, not a VM/VPS

Impact

  • Host "disappears" - journalctl stops abruptly
  • Host with default configuration -> seems to hang indefinitely until cold-booted; this happened in production, so we moved quickly to bring it back up
  • Host configured to panic on oops -> panics and reboots
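
For anyone comparing hosts, the difference between the two outcomes above comes down to the kernel.panic_on_oops and kernel.panic sysctls. Below is a minimal Python sketch that reads both from /proc/sys and summarizes the resulting policy. The helper names (read_sysctl_int, describe_oops_policy) are made up for illustration and are not part of node-agent or any existing tool.

```python
from pathlib import Path


def read_sysctl_int(name, root="/proc/sys"):
    """Return the integer value of a sysctl such as 'kernel.panic_on_oops'.

    Returns None if the file is missing, unreadable, or not an integer.
    The 'root' parameter exists only so the function can be tested
    against a temporary directory.
    """
    path = Path(root) / name.replace(".", "/")
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_oops_policy(panic_on_oops, panic_timeout):
    """Summarize what the host is configured to do when the kernel oopses."""
    if panic_on_oops is None:
        return "unknown: sysctl not readable"
    if panic_on_oops == 0:
        return "oops does not force a panic"
    if panic_timeout is None:
        return "oops panics; reboot behaviour unknown"
    if panic_timeout == 0:
        # kernel.panic = 0 means the kernel does not reboot after a panic.
        return "oops panics; no automatic reboot"
    if panic_timeout > 0:
        return f"oops panics; reboot after {panic_timeout}s"
    # A negative kernel.panic value requests an immediate reboot.
    return "oops panics; immediate reboot"


if __name__ == "__main__":
    print(describe_oops_policy(
        read_sysctl_int("kernel.panic_on_oops"),
        read_sysctl_int("kernel.panic"),
    ))
```

Running this on both a default host and a "panic on oops" host should make the configuration difference explicit before attempting another reproduction.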

Reproduction notes

  • The failure isn't instant; in a few reproductions we've seen the host run the workload and node-agent for 1-3 hours before Oopsing (when configured with a crash kernel) or disappearing off the network
  • Does not appear to reproduce on 6.14.0-37 - same node, same production workload, longer timelines (several hours)
  • Does not appear to reproduce with a heavy synthetic load, similar to the real one (database/Redis chatter, HTTP load, etc) - same node, much longer timelines (days)
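
To make it easier to tell which nodes match the observations above, here is a small Python sketch that parses an Ubuntu kernel release string and looks it up against the two kernels mentioned in this report. The table only restates what was observed here; it is not a statement about which kernel versions are actually affected, and the function names are hypothetical.

```python
import platform
import re

# Observations from this report only, keyed by (major, minor, patch, abi).
OBSERVATIONS = {
    (6, 11, 0, 19): "crash reproduced in this report",
    (6, 14, 0, 37): "no crash observed so far in this report",
}


def parse_ubuntu_release(release):
    """Turn '6.11.0-19-generic' into (6, 11, 0, 19); return None if it does not match."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    return tuple(int(g) for g in m.groups()) if m else None


def classify(release):
    """Map a kernel release string to what this report observed for it."""
    return OBSERVATIONS.get(parse_ubuntu_release(release), "not covered by this report")


if __name__ == "__main__":
    running = platform.release()
    print(f"{running}: {classify(running)}")
```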

dmesg excerpt

Note that client.rb is likely the process that was interrupted, not the culprit - I'm pretty sure the Ruby workload does not do anything eBPF-related by itself.

Oops: general protection fault, probably for non-canonical address 0x7c31a167d3ba9ce9: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 UID: 450 PID: 224431 Comm: client.rb:283 Kdump: loaded Not tainted 6.11.0-19-generic #19~24.04.1-Ubuntu
Hardware name: [redacted]
RIP: 0010:perf_callchain_user+0x223/0x330
...
Call Trace:
 <IRQ>
  get_perf_callchain+0x15a/0x250
  bpf_get_stackid+0x61/0xc0
  bpf_get_stackid_pe+0xdd/0x110
  bpf_prog_002ea22e11430d4a_do_perf_event+0x600/0x673
  __perf_event_overflow+0x25d/0x340
  perf_swevent_hrtimer+0xd4/0x150
  ...
 </IRQ>
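
If it helps with triaging other captures, the following Python sketch pulls the symbol names out of a "Call Trace:" section and checks whether a bpf_get_stackid frame is present. The regular expression is an assumption based on the frame format in the excerpt above (symbol+0xoffset/0xsize, optionally preceded by a dmesg timestamp or "? "); it has not been checked against other trace formats, and the function names are made up for illustration.

```python
import re

# Matches frames such as "  bpf_get_stackid+0x61/0xc0", with an optional
# "[12345.678901]" timestamp and an optional "? " marker in front.
FRAME_RE = re.compile(
    r"^\s*(?:\[\s*\d+\.\d+\]\s*)?(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

EXCERPT = """\
Call Trace:
 <IRQ>
  get_perf_callchain+0x15a/0x250
  bpf_get_stackid+0x61/0xc0
  bpf_get_stackid_pe+0xdd/0x110
  bpf_prog_002ea22e11430d4a_do_perf_event+0x600/0x673
  __perf_event_overflow+0x25d/0x340
  perf_swevent_hrtimer+0xd4/0x150
  ...
 </IRQ>
"""


def call_trace_symbols(dmesg_text):
    """Return the symbol names of all frames after the first 'Call Trace:' line."""
    symbols = []
    in_trace = False
    for line in dmesg_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.match(line)
        if match:
            symbols.append(match.group(1))
    return symbols


def involves_bpf_stackid(symbols):
    """True if any frame belongs to the bpf_get_stackid family of helpers."""
    return any(s.startswith("bpf_get_stackid") for s in symbols)


if __name__ == "__main__":
    frames = call_trace_symbols(EXCERPT)
    print(frames)
    print("bpf_get_stackid in trace:", involves_bpf_stackid(frames))
```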

I reviewed newer releases for promising commits, but I haven't found anything suggesting this might already be addressed. Reproducing this is unfortunately non-trivial/not very safe - I need to divert a portion of production workloads onto a known-failing node :) That being said, I'm open to limited experimentation - let me know what would be helpful. Thank you 🙏
