
Enhance health check robustness and observability #1554

Open
ArangoGutierrez wants to merge 5 commits into NVIDIA:main from ArangoGutierrez:feature/modular-health-check

Conversation

@ArangoGutierrez (Collaborator) commented Dec 4, 2025

Enhance health check robustness and observability

This PR improves the device health check system with the following changes:

Changes

  1. Extract nvmlHealthProvider struct - Modularize health monitoring logic for better
    testability and separation of concerns

  2. Add buffered health channel - Prevent health check goroutine from blocking when
    ListAndWatch is slow to consume events. Uses a 64-entry buffer with non-blocking send
    and fallback error logging

  3. Add device state tracking - Track LastUnhealthyTime and UnhealthyReason on
    Device struct for better observability

  4. Add withDevicePlacements wrapper - Enable unit testing of device placement logic
    independently of the full resource manager

  5. Add TestCheckHealth test - Unit test for health check flow using dgxa100 mock

  6. Add healthCheckStats - Track events processed, devices marked unhealthy, errors,
    and XID distribution for operational visibility
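The buffered-channel change (item 2) can be sketched as follows. This is a hypothetical illustration, not the PR's exact code; the names `healthChannelBufferSize` and `sendUnhealthy` are stand-ins.

```go
// Hypothetical sketch of a non-blocking send on a buffered health channel.
// If the buffer is full, the health-check goroutine logs an error instead
// of blocking while ListAndWatch is slow to consume events.
package main

import "log"

const healthChannelBufferSize = 64 // absorbs event bursts from multiple GPUs

// Device is a minimal stand-in for the plugin's device type.
type Device struct{ ID string }

// sendUnhealthy attempts a non-blocking send; on a full buffer it falls
// back to error logging rather than stalling the producer.
func sendUnhealthy(unhealthy chan<- *Device, d *Device) {
	select {
	case unhealthy <- d:
	default:
		log.Printf("health channel full; could not report device %s", d.ID)
	}
}

func main() {
	ch := make(chan *Device, healthChannelBufferSize)
	sendUnhealthy(ch, &Device{ID: "GPU-0"})
	log.Printf("queued events: %d", len(ch))
}
```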

Commits

  • refactor: extract nvmlHealthProvider for modular health monitoring
  • feat: add buffered health channel and device state tracking
  • refactor: add withDevicePlacements wrapper and TestCheckHealth test
  • fix: improve test type assertion and buffer size documentation

Testing

  • All existing tests pass
  • New TestCheckHealth test validates health check event processing
  • Verified with golangci-lint, go vet, and go build

Related

Comment on lines +98 to +104
// CheckDeviceHealth performs a simple health check on a single device by
// verifying it can be accessed via NVML and responds to basic queries.
// This is used for recovery detection - if a previously unhealthy device
// passes this check, it's considered recovered. We intentionally keep this
// simple and don't try to classify XIDs as recoverable vs permanent - that's
// controlled via DP_DISABLE_HEALTHCHECKS / DP_ENABLE_HEALTHCHECKS env vars.
func (r *nvmlResourceManager) CheckDeviceHealth(d *Device) (bool, error) {
Member commented:

I don't agree with this mechanism for transitioning the device back to healthy. This is an oversimplification and will lead to unhealthy devices being considered healthy.

For example, if a device becomes unhealthy due to repeated ECC memory errors, it is LIKELY that query functions such as the device name will continue to succeed, resulting in the device being marked as healthy when it actually needs a RESET.

Before we add this logic to the device plugin, let us properly define and agree upon how we are detecting health.

Furthermore, although XID-based health checking is a means to an end, our ideal state is that some other component decides whether a device is healthy and the device plugin responds to those signals. Defining the unhealthy -> healthy transition here goes against this premise.

Collaborator (Author) replied:

I've removed the healthy transition logic from this PR.

The current implementation:

  • Only transitions devices from healthy → unhealthy (never the reverse)
  • Devices stay unhealthy until the pod/node is restarted
  • Tracks LastUnhealthyTime and UnhealthyReason for observability only

I agree that:

  1. Automatic healthy transitions based on query success is an oversimplification
  2. The ideal state is an external component (like DCGM or node-health-checker) deciding health
  3. This PR should focus on robust unhealthy detection, not recovery

The MarkUnhealthy() method added here is one-way by design.
Any future healthy transitions should be driven by external signals as you suggest.
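The one-way `MarkUnhealthy()` design described above can be sketched roughly as follows. Field and method names mirror those discussed in the thread (`LastUnhealthyTime`, `UnhealthyReason`, `MarkUnhealthy`), but the exact PR code may differ.

```go
// Hypothetical sketch of one-way health marking with observability fields.
// There is deliberately no MarkHealthy: devices stay unhealthy until
// external intervention (node drain, GPU reset, reboot).
package main

import (
	"fmt"
	"sync"
	"time"
)

type Device struct {
	ID                string
	healthMu          sync.RWMutex
	Healthy           bool
	LastUnhealthyTime time.Time
	UnhealthyReason   string
}

// MarkUnhealthy transitions the device to unhealthy and records when and
// why, under the mutex that guards all health state.
func (d *Device) MarkUnhealthy(reason string) {
	d.healthMu.Lock()
	defer d.healthMu.Unlock()
	d.Healthy = false
	d.LastUnhealthyTime = time.Now()
	d.UnhealthyReason = reason
}

func main() {
	d := &Device{ID: "GPU-0", Healthy: true}
	d.MarkUnhealthy("XID-79")
	fmt.Println(d.Healthy, d.UnhealthyReason) // false XID-79
}
```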

return &x
}

func TestTriggerDeviceListUpdate_Phase2(t *testing.T) {
Member commented:

As a matter of interest, what is Phase2? (Were these tests generated?)

Collaborator (Author) replied:

Yes, it was generated; it has now been removed.

// nvmlHealthProvider encapsulates the state and logic for NVML-based GPU
// health monitoring. This struct groups related data and provides focused
// methods for device registration and event monitoring.
type nvmlHealthProvider struct {
Member commented:

Question: Why is the refactoring done AFTER the functional changes in this PR?

stats *healthCheckStats
}
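The `healthCheckStats` the provider carries (events processed, devices marked unhealthy, errors, XID distribution) might look roughly like this. A hypothetical sketch; counter names are assumptions, not the PR's exact fields.

```go
// Hypothetical sketch of health-check statistics for operational
// visibility: event counts, unhealthy transitions, errors, and a
// per-XID histogram, all guarded by a mutex for concurrent updates.
package main

import (
	"fmt"
	"sync"
)

type healthCheckStats struct {
	mu               sync.Mutex
	eventsProcessed  uint64
	devicesUnhealthy uint64
	errors           uint64
	xidCounts        map[uint64]uint64 // XID -> occurrence count
}

func newHealthCheckStats() *healthCheckStats {
	return &healthCheckStats{xidCounts: make(map[uint64]uint64)}
}

// recordXID counts one processed event and bumps that XID's histogram bucket.
func (s *healthCheckStats) recordXID(xid uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.eventsProcessed++
	s.xidCounts[xid]++
}

func main() {
	s := newHealthCheckStats()
	s.recordXID(79)
	s.recordXID(79)
	s.recordXID(48)
	fmt.Println(s.eventsProcessed, s.xidCounts[79]) // 3 2
}
```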

// registerDeviceEvents registers NVML event handlers for all devices in the
Member commented:

How is this actually different from the changes proposed in a6a9f18?

Comment on lines +198 to +200
if result.ret == nvml.ERROR_TIMEOUT {
continue
}
Member commented:

Why do we even send the event in the case of a timeout?
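One way to address this question, as a hypothetical sketch rather than the PR's code, is to drop timeout results at the producer so consumers never see them. The `eventResult` shape follows the snippet above; the `nvml` constants here are local stand-ins.

```go
// Hypothetical sketch: filter out NVML wait timeouts at the producer so
// the event channel only carries real events or real errors. A timeout is
// an expected outcome of a bounded Wait and carries no information.
package main

import "fmt"

type nvmlReturn int

const (
	nvmlSuccess      nvmlReturn = iota // stand-in for nvml.SUCCESS
	nvmlErrorTimeout                   // stand-in for nvml.ERROR_TIMEOUT
)

type eventResult struct {
	xid uint64
	ret nvmlReturn
}

// filterTimeouts forwards everything except timeouts, then closes out.
func filterTimeouts(in <-chan eventResult, out chan<- eventResult) {
	for r := range in {
		if r.ret == nvmlErrorTimeout {
			continue
		}
		out <- r
	}
	close(out)
}

func main() {
	in := make(chan eventResult, 2)
	out := make(chan eventResult, 2)
	in <- eventResult{xid: 79, ret: nvmlSuccess}
	in <- eventResult{ret: nvmlErrorTimeout}
	close(in)
	filterTimeouts(in, out)
	for r := range out {
		fmt.Println(r.xid) // only the real event (XID 79) survives
	}
}
```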

Comment on lines +175 to +180
// Try to send event result, but respect context cancellation
select {
case <-ctx.Done():
return
case eventChan <- eventResult{event: e, ret: ret}:
}
Member commented:

This seems like the wrong way to ensure that the context has not been closed before sending to the event channel. What are we concerned about here? Is there a better way to ensure that this goroutine terminates when the context is cancelled and doesn't block permanently on the send?

Member commented:

The commit message mentions adding tests, but I only see code being removed here.

Copilot AI left a comment

Pull request overview

This PR enhances the GPU health check system to improve robustness, observability, and graceful shutdown capabilities. The changes address production stability issues by implementing context-based shutdown coordination, non-blocking device reporting, granular error handling, and health check statistics tracking.

Changes:

  • Refactored health check system with structured error handling and observability
  • Added device health state tracking with timestamps and reasons
  • Implemented non-blocking unhealthy device reporting with buffered channel
  • Added comprehensive mock NVML infrastructure for testing

Reviewed changes

Copilot reviewed 6 out of 19 changed files in this pull request and generated 6 comments.

Show a summary per file:

  • internal/rm/health.go: Major refactoring: added stats tracking, context-based shutdown, structured health provider, and non-blocking device reporting
  • internal/rm/devices.go: Added health tracking fields (LastUnhealthyTime, UnhealthyReason) and helper methods
  • internal/plugin/server.go: Increased health channel buffer size to 64 to handle event bursts
  • internal/rm/health_test.go: Added test for checkHealth with mock NVML infrastructure
  • vendor/*: Added NVML mock packages for testing (generated code)


@ArangoGutierrez force-pushed the feature/modular-health-check branch 2 times, most recently from c8f5009 to 4cb9a80 on January 12, 2026 15:16
@ArangoGutierrez marked this pull request as ready for review on January 12, 2026 15:41
@ArangoGutierrez changed the title from [WIP] Enhance health check robustness and observability to Enhance health check robustness and observability on Jan 12, 2026
@ArangoGutierrez force-pushed the feature/modular-health-check branch from 4cb9a80 to ea630af on January 15, 2026 13:23
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 21 changed files in this pull request and generated 2 comments.



Comment on lines +273 to +275
r := &nvmlResourceManager{
nvml: server,
}
Copilot AI commented Jan 15, 2026:

The test creates an nvmlResourceManager without initializing the config field, but line 414 in health.go accesses *r.config.Flags.FailOnInitError. This will cause a nil pointer dereference when the test runs. Initialize the config field with appropriate test values.

unhealthySendTimeout = 30 * time.Second

// nvmlInvalidInstanceID represents an invalid/unset value for MIG GPU and
// Compute instance IDs. Used as a sentinel value for non-MIG devices.
Copilot AI commented Jan 15, 2026:

The constant nvmlInvalidInstanceID (0xFFFFFFFF) replaces the magic number used in the original code, but this value appears to be hardcoded in multiple places. Consider documenting why this specific value is used (is it from NVML specification?) or verifying if NVML provides a constant for this sentinel value.

Suggested change
// Compute instance IDs. Used as a sentinel value for non-MIG devices.
// Compute instance IDs. Used as a sentinel value for non-MIG devices.
// The value 0xFFFFFFFF matches the "invalid instance ID" sentinel defined
// by the NVML C API. The Go bindings do not currently expose a dedicated
// constant for this, so we centralize the literal here for clarity.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 21 changed files in this pull request and generated 3 comments.



eventSet nvml.EventSet,
handleError func(nvml.Return, Devices, chan<- *Device) bool,
) error {
// Event receive channel with buffer
Copilot AI commented Jan 15, 2026:

The hardcoded magic number 10 for the eventChan buffer size should be documented with a comment explaining the rationale, similar to how healthChannelBufferSize is documented in server.go.

Suggested change
// Event receive channel with buffer
// Event receive channel with buffer.
// Buffer size 10 is chosen to absorb short bursts of NVML events without
// blocking the eventSet.Wait goroutine, while keeping memory usage
// negligible for typical workloads.

Comment on lines +89 to +95
// Timeout - ListAndWatch is likely stalled
klog.Errorf("Timeout after %v sending device %s to unhealthy channel. "+
"ListAndWatch may be stalled. Device state updated directly but "+
"kubelet may not be notified.", unhealthySendTimeout, d.ID)
// Mark unhealthy directly as last resort - kubelet won't see this
// until ListAndWatch resumes, but at least internal state is correct
d.MarkUnhealthy("channel-timeout")
Copilot AI commented Jan 15, 2026:

In the sendUnhealthyDevice function's timeout case, calling d.MarkUnhealthy("channel-timeout") may cause confusion because this overwrites the original unhealthy reason (e.g., "XID-79") with "channel-timeout". The device was already marked unhealthy at line 303 with the actual XID reason, and this timeout scenario only represents a communication failure, not a new device failure. Consider either skipping the MarkUnhealthy call in the timeout case, or logging the original reason.

Suggested change
// Timeout - ListAndWatch is likely stalled
klog.Errorf("Timeout after %v sending device %s to unhealthy channel. "+
"ListAndWatch may be stalled. Device state updated directly but "+
"kubelet may not be notified.", unhealthySendTimeout, d.ID)
// Mark unhealthy directly as last resort - kubelet won't see this
// until ListAndWatch resumes, but at least internal state is correct
d.MarkUnhealthy("channel-timeout")
// Timeout - ListAndWatch is likely stalled; device has already been
// marked unhealthy with the original reason, so avoid overwriting it
// here. We only log the communication failure.
klog.Errorf("Timeout after %v sending device %s to unhealthy channel. "+
"ListAndWatch may be stalled. Device state may not be visible to "+
"kubelet until ListAndWatch resumes.", unhealthySendTimeout, d.ID)


// CheckHealth is disabled for the tegraResourceManager
func (r *tegraResourceManager) CheckHealth(stop <-chan interface{}, unhealthy chan<- *Device) error {
func (r *tegraResourceManager) CheckHealth(_ context.Context, _ <-chan interface{}, _ chan<- *Device) error {
Copilot AI commented Jan 15, 2026:

The context parameter name "_" should be given a meaningful name (e.g., "ctx") as this is a public interface. Underscore-prefixed parameter names typically indicate intentionally unused parameters, but here the parameter is part of the interface contract and represents a deliberate API change to support context-based cancellation.

Suggested change
func (r *tegraResourceManager) CheckHealth(_ context.Context, _ <-chan interface{}, _ chan<- *Device) error {
func (r *tegraResourceManager) CheckHealth(ctx context.Context, _ <-chan interface{}, _ chan<- *Device) error {

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 21 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

internal/rm/health.go:1

  • Race condition in UnhealthyDuration: The Health field is read without holding healthMu lock, but it's written in MarkUnhealthy under lock. This creates a data race. Move the Health check inside the locked section in UnhealthyDuration method.
/*


Comment on lines +319 to +320
close(unhealthy)
<-collectorDone
Copilot AI commented Jan 15, 2026:

Potential deadlock or test hang: The unhealthy channel is closed after checkHealth returns, but if checkHealth is still trying to send to the channel when the collector goroutine exits and closes the channel, this could cause a panic (send on closed channel). The test should ensure checkHealth has fully exited before closing the channel.

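The safe ordering this comment asks for can be sketched as follows: wait for the producer (standing in for `checkHealth`) to return, then close the channel, then wait for the collector. A hypothetical illustration; `runShutdown` and the string-typed channel are simplifications.

```go
// Hypothetical sketch of a panic-free test shutdown: no goroutine can send
// on the unhealthy channel after it is closed, because the close happens
// only once the producer has provably exited.
package main

import (
	"fmt"
	"sync"
)

// runShutdown returns every device the collector drained.
func runShutdown() []string {
	unhealthy := make(chan string, 4)
	collectorDone := make(chan struct{})
	var got []string

	// Collector drains until the channel is closed.
	go func() {
		defer close(collectorDone)
		for d := range unhealthy {
			got = append(got, d)
		}
	}()

	// Producer stands in for checkHealth reporting an unhealthy device.
	var producer sync.WaitGroup
	producer.Add(1)
	go func() {
		defer producer.Done()
		unhealthy <- "GPU-0"
	}()

	producer.Wait()  // producer has exited: no further sends possible
	close(unhealthy) // only now is it safe to close
	<-collectorDone  // collector has finished draining
	return got
}

func main() {
	fmt.Println(runShutdown()) // [GPU-0]
}
```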
Add mock packages from go-nvml to enable unit testing of health
monitoring code. Includes dgxa100 mock server for simulating
DGX A100 GPU configurations in tests.

New packages:
- github.com/NVIDIA/go-nvml/pkg/nvml/mock
- github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Improve GPU health monitoring with better testability, thread-safety,
and graceful shutdown support.

Changes:
- Extract nvmlHealthProvider struct for modular, testable health logic
- Add thread-safe device health tracking with sync.RWMutex
- Add buffered health channel (64) with timeout fallback to prevent
  goroutine blocking when ListAndWatch is slow
- Add context.Context propagation to CheckHealth for graceful shutdown
- Add healthCheckStats for observability (events, XIDs, errors)
- Add withDevicePlacements wrapper for testable device placement logic
- Document XID skip list aligned with k8s-dra-driver-gpu

The buffer size of 64 handles 8 GPUs with multiple events per GPU.
Devices marked unhealthy remain in that state until external
intervention (node drain, GPU reset, reboot).

Reference: http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add unit test for health check event processing using the dgxa100
mock server. The test validates:
- XID critical error events trigger unhealthy device reports
- Events for devices in the hardcoded ignore list are skipped
- Device state is correctly tracked through the health channel

Uses the newly vendored go-nvml mock packages to simulate
GPU events without requiring actual hardware.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Initialize config in TestCheckHealth to prevent nil pointer dereference
  if NVML Init() fails (addresses Copilot review comment)
- Expand nvmlInvalidInstanceID documentation to reference NVML C API origin
  (addresses Copilot suggestion for better clarity)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Remove redundant MarkUnhealthy call in sendUnhealthyDevice timeout case.
  The device is already marked unhealthy with the real reason (e.g., XID-79)
  before sendUnhealthyDevice is called; calling it again would overwrite
  the diagnostic information.
- Document eventChan buffer size rationale.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
3 participants