Enhance health check robustness and observability #1554
ArangoGutierrez wants to merge 5 commits into NVIDIA:main from
Conversation
Force-push: 4749eae → 79e665e
internal/rm/nvml_manager.go (Outdated)
```go
// CheckDeviceHealth performs a simple health check on a single device by
// verifying it can be accessed via NVML and responds to basic queries.
// This is used for recovery detection - if a previously unhealthy device
// passes this check, it's considered recovered. We intentionally keep this
// simple and don't try to classify XIDs as recoverable vs permanent - that's
// controlled via DP_DISABLE_HEALTHCHECKS / DP_ENABLE_HEALTHCHECKS env vars.
func (r *nvmlResourceManager) CheckDeviceHealth(d *Device) (bool, error) {
```
I don't agree with this mechanism for transitioning the device back to healthy. This is an oversimplification and will lead to unhealthy devices being considered healthy.
For example, if a device becomes unhealthy due to repeated ECC memory errors, it is LIKELY that query functions such as the device name will continue to succeed, resulting in the device being marked as healthy when it actually needs a RESET.
Before we add this logic to the device plugin, let us properly define and agree upon how we are detecting health.
Furthermore, although XID-based health checking is a means to an end, our ideal state is that some other component decides whether a device is healthy and the device plugin responds to those signals. Defining the unhealthy → healthy transition here goes against this premise.
I've removed the healthy transition logic from this PR.
The current implementation:
- Only transitions devices from healthy → unhealthy (never the reverse)
- Devices stay unhealthy until the pod/node is restarted
- Tracks `LastUnhealthyTime` and `UnhealthyReason` for observability only
I agree that:
- Automatic healthy transitions based on query success are an oversimplification
- The ideal state is an external component (like DCGM or node-health-checker) deciding health
- This PR should focus on robust unhealthy detection, not recovery
The `MarkUnhealthy()` method added here is one-way by design. Any future healthy transitions should be driven by external signals, as you suggest.
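For concreteness, here is a minimal sketch of what a one-way `MarkUnhealthy` could look like. The `Device` struct, its field names, and the "preserve the first recorded reason" behavior are illustrative assumptions drawn from the discussion above, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Device is a trimmed-down sketch of the device-plugin Device type; the
// real struct in internal/rm/devices.go has more fields.
type Device struct {
	mu                sync.RWMutex
	Health            string
	LastUnhealthyTime time.Time
	UnhealthyReason   string
}

// MarkUnhealthy transitions a device to Unhealthy and records when and why.
// The transition is one-way: repeated calls do not overwrite the original
// reason, and nothing here ever flips Health back to "Healthy".
func (d *Device) MarkUnhealthy(reason string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.Health == "Unhealthy" {
		return // preserve the first recorded reason and timestamp
	}
	d.Health = "Unhealthy"
	d.LastUnhealthyTime = time.Now()
	d.UnhealthyReason = reason
}

func main() {
	d := &Device{Health: "Healthy"}
	d.MarkUnhealthy("XID-79")
	d.MarkUnhealthy("channel-timeout") // no-op: already unhealthy
	fmt.Println(d.Health, d.UnhealthyReason) // Unhealthy XID-79
}
```

Keeping the reason write inside the same lock as the health write also sidesteps the read/write race flagged later in this review.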
internal/plugin/server_test.go (Outdated)
```go
	return &x
}

func TestTriggerDeviceListUpdate_Phase2(t *testing.T) {
```
As a matter of interest, what is Phase2? (Were these tests generated?)
Yes, it was generated; it has now been removed.
```go
// nvmlHealthProvider encapsulates the state and logic for NVML-based GPU
// health monitoring. This struct groups related data and provides focused
// methods for device registration and event monitoring.
type nvmlHealthProvider struct {
```
Question: Why is the refactoring done AFTER the functional changes in this PR?
```go
	stats *healthCheckStats
}

// registerDeviceEvents registers NVML event handlers for all devices in the
```
How is this actually different from the changes proposed in a6a9f18?
```go
if result.ret == nvml.ERROR_TIMEOUT {
	continue
}
```
Why do we even send the event in the case of a timeout?
```go
// Try to send event result, but respect context cancellation
select {
case <-ctx.Done():
	return
case eventChan <- eventResult{event: e, ret: ret}:
}
```
This seems like the wrong way to ensure that the context has not been closed before sending to the event channel. What are we concerned about here? Is there a better way to ensure that this goroutine terminates when the context is cancelled and doesn't block permanently on the send?
The commit message mentions adding tests, but I only see code being removed here.
Force-push: 79e665e → aa0d5c8
Pull request overview
This PR enhances the GPU health check system to improve robustness, observability, and graceful shutdown capabilities. The changes address production stability issues by implementing context-based shutdown coordination, non-blocking device reporting, granular error handling, and health check statistics tracking.
Changes:
- Refactored health check system with structured error handling and observability
- Added device health state tracking with timestamps and reasons
- Implemented non-blocking unhealthy device reporting with buffered channel
- Added comprehensive mock NVML infrastructure for testing
Reviewed changes
Copilot reviewed 6 out of 19 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| internal/rm/health.go | Major refactoring: added stats tracking, context-based shutdown, structured health provider, and non-blocking device reporting |
| internal/rm/devices.go | Added health tracking fields (LastUnhealthyTime, UnhealthyReason) and helper methods |
| internal/plugin/server.go | Increased health channel buffer size to 64 to handle event bursts |
| internal/rm/health_test.go | Added test for checkHealth with mock NVML infrastructure |
| vendor/* | Added NVML mock packages for testing (generated code) |
Force-push: c8f5009 → 4cb9a80
Force-push: 4cb9a80 → ea630af
Pull request overview
Copilot reviewed 8 out of 21 changed files in this pull request and generated 2 comments.
```go
r := &nvmlResourceManager{
	nvml: server,
}
```
The test creates an nvmlResourceManager without initializing the config field, but line 414 in health.go accesses *r.config.Flags.FailOnInitError. This will cause a nil pointer dereference when the test runs. Initialize the config field with appropriate test values.
```go
unhealthySendTimeout = 30 * time.Second

// nvmlInvalidInstanceID represents an invalid/unset value for MIG GPU and
// Compute instance IDs. Used as a sentinel value for non-MIG devices.
```
The constant nvmlInvalidInstanceID (0xFFFFFFFF) replaces the magic number used in the original code, but this value appears to be hardcoded in multiple places. Consider documenting why this specific value is used (is it from NVML specification?) or verifying if NVML provides a constant for this sentinel value.
```diff
 // Compute instance IDs. Used as a sentinel value for non-MIG devices.
+// The value 0xFFFFFFFF matches the "invalid instance ID" sentinel defined
+// by the NVML C API. The Go bindings do not currently expose a dedicated
+// constant for this, so we centralize the literal here for clarity.
```
Pull request overview
Copilot reviewed 8 out of 21 changed files in this pull request and generated 3 comments.
internal/rm/health.go (Outdated)
```go
	eventSet nvml.EventSet,
	handleError func(nvml.Return, Devices, chan<- *Device) bool,
) error {
	// Event receive channel with buffer
```
The hardcoded magic number 10 for the eventChan buffer size should be documented with a comment explaining the rationale, similar to how healthChannelBufferSize is documented in server.go.
```diff
-// Event receive channel with buffer
+// Event receive channel with buffer.
+// Buffer size 10 is chosen to absorb short bursts of NVML events without
+// blocking the eventSet.Wait goroutine, while keeping memory usage
+// negligible for typical workloads.
```
internal/rm/health.go (Outdated)
```go
// Timeout - ListAndWatch is likely stalled
klog.Errorf("Timeout after %v sending device %s to unhealthy channel. "+
	"ListAndWatch may be stalled. Device state updated directly but "+
	"kubelet may not be notified.", unhealthySendTimeout, d.ID)
// Mark unhealthy directly as last resort - kubelet won't see this
// until ListAndWatch resumes, but at least internal state is correct
d.MarkUnhealthy("channel-timeout")
```
In the sendUnhealthyDevice function's timeout case, calling d.MarkUnhealthy("channel-timeout") may cause confusion because this overwrites the original unhealthy reason (e.g., "XID-79") with "channel-timeout". The device was already marked unhealthy at line 303 with the actual XID reason, and this timeout scenario only represents a communication failure, not a new device failure. Consider either skipping the MarkUnhealthy call in the timeout case, or logging the original reason.
```diff
-// Timeout - ListAndWatch is likely stalled
-klog.Errorf("Timeout after %v sending device %s to unhealthy channel. "+
-	"ListAndWatch may be stalled. Device state updated directly but "+
-	"kubelet may not be notified.", unhealthySendTimeout, d.ID)
-// Mark unhealthy directly as last resort - kubelet won't see this
-// until ListAndWatch resumes, but at least internal state is correct
-d.MarkUnhealthy("channel-timeout")
+// Timeout - ListAndWatch is likely stalled; device has already been
+// marked unhealthy with the original reason, so avoid overwriting it
+// here. We only log the communication failure.
+klog.Errorf("Timeout after %v sending device %s to unhealthy channel. "+
+	"ListAndWatch may be stalled. Device state may not be visible to "+
+	"kubelet until ListAndWatch resumes.", unhealthySendTimeout, d.ID)
```
```diff
 // CheckHealth is disabled for the tegraResourceManager
-func (r *tegraResourceManager) CheckHealth(stop <-chan interface{}, unhealthy chan<- *Device) error {
+func (r *tegraResourceManager) CheckHealth(_ context.Context, _ <-chan interface{}, _ chan<- *Device) error {
```
The context parameter name "_" should be given a meaningful name (e.g., "ctx"), as this is a public interface. The blank identifier typically indicates an intentionally unused parameter, but here the parameter is part of the interface contract and represents a deliberate API change to support context-based cancellation.
```diff
-func (r *tegraResourceManager) CheckHealth(_ context.Context, _ <-chan interface{}, _ chan<- *Device) error {
+func (r *tegraResourceManager) CheckHealth(ctx context.Context, _ <-chan interface{}, _ chan<- *Device) error {
```
Pull request overview
Copilot reviewed 8 out of 21 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
internal/rm/health.go:1
- Race condition in UnhealthyDuration: The Health field is read without holding healthMu lock, but it's written in MarkUnhealthy under lock. This creates a data race. Move the Health check inside the locked section in UnhealthyDuration method.
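A sketch of the fix the suppressed comment suggests, with the `Health` read moved inside the locked section. The struct and method shapes here are assumptions based on the review text, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Device sketch focusing on the locking bug flagged above: Health must be
// read under the same lock that MarkUnhealthy writes it under.
type Device struct {
	healthMu          sync.RWMutex
	Health            string
	LastUnhealthyTime time.Time
}

func (d *Device) MarkUnhealthy() {
	d.healthMu.Lock()
	defer d.healthMu.Unlock()
	d.Health = "Unhealthy"
	d.LastUnhealthyTime = time.Now()
}

// UnhealthyDuration reads Health inside the locked section, avoiding the
// data race of checking Health before taking healthMu.
func (d *Device) UnhealthyDuration() time.Duration {
	d.healthMu.RLock()
	defer d.healthMu.RUnlock()
	if d.Health != "Unhealthy" {
		return 0
	}
	return time.Since(d.LastUnhealthyTime)
}

func main() {
	d := &Device{Health: "Healthy"}
	fmt.Println(d.UnhealthyDuration()) // 0s: still healthy
	d.MarkUnhealthy()
	fmt.Println(d.UnhealthyDuration() >= 0)
}
```

Running the code under `go test -race` with concurrent callers is the easiest way to confirm the race is gone.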
```go
close(unhealthy)
<-collectorDone
```
Potential deadlock or test hang: The unhealthy channel is closed after checkHealth returns, but if checkHealth is still trying to send to the channel when the collector goroutine exits and closes the channel, this could cause a panic (send on closed channel). The test should ensure checkHealth has fully exited before closing the channel.
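One way to avoid the hazard described above is to join the producer before closing the channel. A runnable sketch of that ordering (`runPipeline` and its names are illustrative, not from the test file):

```go
package main

import (
	"fmt"
	"sync"
)

// runPipeline mimics the test structure under discussion: a producer
// (standing in for checkHealth) sends n items, a collector drains them.
// The key ordering: wait for the producer to fully exit BEFORE closing
// the channel, so there is no window for a "send on closed channel" panic.
func runPipeline(n int) int {
	ch := make(chan int, 4)

	collected := 0
	collectorDone := make(chan struct{})
	go func() {
		defer close(collectorDone)
		for range ch {
			collected++
		}
	}()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			ch <- i
		}
	}()

	wg.Wait()       // producer (checkHealth) has returned
	close(ch)       // only now is closing safe
	<-collectorDone // collector observed the close and exited
	return collected
}

func main() {
	fmt.Println(runPipeline(3)) // 3
}
```

Only the sending side may safely close a channel; closing while the producer could still send is exactly the panic the comment warns about.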
Add mock packages from go-nvml to enable unit testing of health monitoring code. Includes dgxa100 mock server for simulating DGX A100 GPU configurations in tests.

New packages:
- github.com/NVIDIA/go-nvml/pkg/nvml/mock
- github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Improve GPU health monitoring with better testability, thread-safety, and graceful shutdown support.

Changes:
- Extract nvmlHealthProvider struct for modular, testable health logic
- Add thread-safe device health tracking with sync.RWMutex
- Add buffered health channel (64) with timeout fallback to prevent goroutine blocking when ListAndWatch is slow
- Add context.Context propagation to CheckHealth for graceful shutdown
- Add healthCheckStats for observability (events, XIDs, errors)
- Add withDevicePlacements wrapper for testable device placement logic
- Document XID skip list aligned with k8s-dra-driver-gpu

The buffer size of 64 handles 8 GPUs with multiple events per GPU. Devices marked unhealthy remain in that state until external intervention (node drain, GPU reset, reboot).

Reference: http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add unit test for health check event processing using the dgxa100 mock server.

The test validates:
- XID critical error events trigger unhealthy device reports
- Events for devices in the hardcoded ignore list are skipped
- Device state is correctly tracked through the health channel

Uses the newly vendored go-nvml mock packages to simulate GPU events without requiring actual hardware.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Initialize config in TestCheckHealth to prevent nil pointer dereference if NVML Init() fails (addresses Copilot review comment)
- Expand nvmlInvalidInstanceID documentation to reference NVML C API origin (addresses Copilot suggestion for better clarity)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Remove redundant MarkUnhealthy call in sendUnhealthyDevice timeout case. The device is already marked unhealthy with the real reason (e.g., XID-79) before sendUnhealthyDevice is called; calling it again would overwrite the diagnostic information.
- Document eventChan buffer size rationale.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Force-push: 24b6813 → 748f504
Enhance health check robustness and observability
This PR improves the device health check system with the following changes:
Changes
- Extract `nvmlHealthProvider` struct - Modularize health monitoring logic for better testability and separation of concerns
- Add buffered health channel - Prevent health check goroutine from blocking when ListAndWatch is slow to consume events. Uses a 64-entry buffer with non-blocking send and fallback error logging
- Add device state tracking - Track `LastUnhealthyTime` and `UnhealthyReason` on the `Device` struct for better observability
- Add `withDevicePlacements` wrapper - Enable unit testing of device placement logic independently of the full resource manager
- Add `TestCheckHealth` test - Unit test for health check flow using the `dgxa100` mock
- Add `healthCheckStats` - Track events processed, devices marked unhealthy, errors, and XID distribution for operational visibility
Commits
- refactor: extract nvmlHealthProvider for modular health monitoring
- feat: add buffered health channel and device state tracking
- refactor: add withDevicePlacements wrapper and TestCheckHealth test
- fix: improve test type assertion and buffer size documentation

Testing
- `TestCheckHealth` test validates health check event processing
- `golangci-lint`, `go vet`, and `go build`

Related