Description
I'm seeing a reproducible kernel crash when node-agent 1.27.0 is running and the host is under a high-throughput workload. The crash is a kernel Oops (general protection fault in perf_callchain_user()), and the call trace shows it is triggered from a BPF perf_event program calling bpf_get_stackid().
Environment
- OS: Ubuntu 24.04 LTS + HWE kernel
- Kernel at time of crash: 6.11.0-19-generic (#19~24.04.1-Ubuntu, PREEMPT_DYNAMIC, NOPTI)
- The only eBPF-capable agent running is node-agent
- Workload: several high network throughput Ruby processes (Docker Swarm deployment, Sidekiq workers)
- Host is bare metal, not a VM/VPS
Impact
- Host "disappears" - journalctl stops abruptly
- Host with default configuration -> seems to hang indefinitely until cold-booted; this happened in production, so we moved quickly to bring it back up
- Host configured to panic on oops -> panics and reboots
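For reference, the panic-on-oops behavior above is just the stock kernel sysctl knobs (values and path here are examples; the kdump/crash-kernel setup is separate):

```
# e.g. in an /etc/sysctl.d/ fragment (path and values illustrative)
kernel.panic_on_oops = 1   # escalate any Oops to a full panic
kernel.panic = 10          # reboot 10 seconds after panicking
```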
Reproduction notes
- The failure isn't instant; in a few reproductions we've seen the host run the workload and node-agent for 1-3 hours before Oopsing (when configured with a crash kernel) or disappearing off the network
- Does not appear to reproduce on 6.14.0-37 - same node, same production workload, longer timelines (several hours)
- Does not appear to reproduce with a heavy synthetic load similar to the real one (database/Redis chatter, HTTP load, etc.) - same node, much longer timelines (days)
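In case a generic trigger is useful: the path shown in the dmesg excerpt below (software cpu-clock perf event -> BPF perf_event program -> user callchain walk) can be exercised with bpftrace. This is only a guess at a synthetic reproducer, not what node-agent actually attaches:

```
# profile:hz:99 attaches a BPF program to a software (hrtimer-driven)
# perf event on every CPU; ustack makes it collect user callchains,
# which goes through get_perf_callchain()/perf_callchain_user().
sudo bpftrace -e 'profile:hz:99 { @[comm, ustack] = count(); }'
```

Running this alongside the real Ruby workload on the failing kernel might help narrow down whether node-agent's specific program matters.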
dmesg excerpt
Note that client.rb is likely the process that was interrupted, not the culprit - I'm pretty sure the Ruby workload does not do anything eBPF-related by itself.
```
Oops: general protection fault, probably for non-canonical address 0x7c31a167d3ba9ce9: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 UID: 450 PID: 224431 Comm: client.rb:283 Kdump: loaded Not tainted 6.11.0-19-generic #19~24.04.1-Ubuntu
Hardware name: [redacted]
RIP: 0010:perf_callchain_user+0x223/0x330
...
Call Trace:
 <IRQ>
 get_perf_callchain+0x15a/0x250
 bpf_get_stackid+0x61/0xc0
 bpf_get_stackid_pe+0xdd/0x110
 bpf_prog_002ea22e11430d4a_do_perf_event+0x600/0x673
 __perf_event_overflow+0x25d/0x340
 perf_swevent_hrtimer+0xd4/0x150
 ...
 </IRQ>
```
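For anyone triaging from the trace alone: bpf_get_stackid_pe implies a perf_event-type BPF program requesting a user stack. A minimal sketch of that shape, assuming libbpf conventions (this is not node-agent's actual program; the map size and names are made up):

```c
// BPF object (kernel side), not userspace code.
#include "vmlinux.h"            // assumption: generated with bpftool
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 16384);               // illustrative size
    __uint(key_size, sizeof(u32));
    __uint(value_size, 127 * sizeof(u64));    // max frames * sizeof(ip)
} stacks SEC(".maps");

SEC("perf_event")
int do_perf_event(struct bpf_perf_event_data *ctx)
{
    // BPF_F_USER_STACK asks the kernel to walk the *user* callchain,
    // i.e. the perf_callchain_user() path that faults in the Oops above.
    bpf_get_stackid(ctx, &stacks, BPF_F_USER_STACK);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```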
I reviewed newer releases for promising commits, but I haven't found anything suggesting this might already be addressed. Reproducing this is unfortunately non-trivial and not very safe - I need to divert a portion of production workload onto a known-failing node :) That being said, I'm open to limited experimentation - let me know what would be helpful. Thank you 🙏