
Enhance health check robustness and observability #1554

Open
ArangoGutierrez wants to merge 5 commits into NVIDIA:main from ArangoGutierrez:feature/modular-health-check

Conversation

@ArangoGutierrez (Collaborator) commented Dec 4, 2025

Enhance health check robustness and observability

This PR improves the device health check system with the following changes:

Changes

  1. Extract nvmlHealthProvider struct - Modularize health monitoring logic for better
    testability and separation of concerns

  2. Add buffered health channel - Prevent health check goroutine from blocking when
    ListAndWatch is slow to consume events. Uses a 64-entry buffer with non-blocking send
    and fallback error logging

  3. Add device state tracking - Track LastUnhealthyTime and UnhealthyReason on
    Device struct for better observability

  4. Add withDevicePlacements wrapper - Enable unit testing of device placement logic
    independently of the full resource manager

  5. Add TestCheckHealth test - Unit test for health check flow using dgxa100 mock

  6. Add healthCheckStats - Track events processed, devices marked unhealthy, errors,
    and XID distribution for operational visibility
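The buffered-channel change (item 2) can be sketched as follows. This is a hypothetical illustration, not the PR's exact code; the names `healthChannelBufferSize` and `sendUnhealthy` are stand-ins.

```go
// Hypothetical sketch of a non-blocking send on a buffered health channel.
// If the buffer is full, the health-check goroutine logs an error instead
// of blocking while ListAndWatch is slow to consume events.
package main

import "log"

const healthChannelBufferSize = 64 // absorbs event bursts from multiple GPUs

// Device is a minimal stand-in for the plugin's device type.
type Device struct{ ID string }

// sendUnhealthy attempts a non-blocking send; on a full buffer it falls
// back to error logging rather than stalling the producer.
func sendUnhealthy(unhealthy chan<- *Device, d *Device) {
	select {
	case unhealthy <- d:
	default:
		log.Printf("health channel full; could not report device %s", d.ID)
	}
}

func main() {
	ch := make(chan *Device, healthChannelBufferSize)
	sendUnhealthy(ch, &Device{ID: "GPU-0"})
	log.Printf("queued events: %d", len(ch))
}
```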

Commits

  • refactor: extract nvmlHealthProvider for modular health monitoring
  • feat: add buffered health channel and device state tracking
  • refactor: add withDevicePlacements wrapper and TestCheckHealth test
  • fix: improve test type assertion and buffer size documentation

Testing

  • All existing tests pass
  • New TestCheckHealth test validates health check event processing
  • Verified with golangci-lint, go vet, and go build

Related

Comment on lines +98 to +104
// CheckDeviceHealth performs a simple health check on a single device by
// verifying it can be accessed via NVML and responds to basic queries.
// This is used for recovery detection - if a previously unhealthy device
// passes this check, it's considered recovered. We intentionally keep this
// simple and don't try to classify XIDs as recoverable vs permanent - that's
// controlled via DP_DISABLE_HEALTHCHECKS / DP_ENABLE_HEALTHCHECKS env vars.
func (r *nvmlResourceManager) CheckDeviceHealth(d *Device) (bool, error) {
Member commented:

I don't agree with this mechanism for transitioning the device back to healthy. This is an oversimplification and will lead to unhealthy devices being considered healthy.

For example, if a device becomes unhealthy due to repeated ECC memory errors, it is LIKELY that query functions such as the device name will continue to succeed, resulting in the device being marked as healthy when it actually needs a RESET.

Before we add this logic to the device plugin, let us properly define and agree upon how we are detecting health.

Furthermore, although XID-based health checking is a means to an end, our ideal state is that some other component decides whether a device is healthy and the device plugin responds to those signals. Defining the unhealthy -> healthy transition here goes against this premise.

Collaborator (Author) replied:

I've removed the healthy transition logic from this PR.

The current implementation:

  • Only transitions devices from healthy → unhealthy (never the reverse)
  • Devices stay unhealthy until the pod/node is restarted
  • Tracks LastUnhealthyTime and UnhealthyReason for observability only

I agree that:

  1. Automatic healthy transitions based on query success is an oversimplification
  2. The ideal state is an external component (like DCGM or node-health-checker) deciding health
  3. This PR should focus on robust unhealthy detection, not recovery

The MarkUnhealthy() method added here is one-way by design.
Any future healthy transitions should be driven by external signals as you suggest.
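The one-way `MarkUnhealthy()` design described above can be sketched roughly as follows. Field and method names mirror those discussed in the thread (`LastUnhealthyTime`, `UnhealthyReason`, `MarkUnhealthy`), but the exact PR code may differ.

```go
// Hypothetical sketch of one-way health marking with observability fields.
// There is deliberately no MarkHealthy: devices stay unhealthy until
// external intervention (node drain, GPU reset, reboot).
package main

import (
	"fmt"
	"sync"
	"time"
)

type Device struct {
	ID                string
	healthMu          sync.RWMutex
	Healthy           bool
	LastUnhealthyTime time.Time
	UnhealthyReason   string
}

// MarkUnhealthy transitions the device to unhealthy and records when and
// why, under the mutex that guards all health state.
func (d *Device) MarkUnhealthy(reason string) {
	d.healthMu.Lock()
	defer d.healthMu.Unlock()
	d.Healthy = false
	d.LastUnhealthyTime = time.Now()
	d.UnhealthyReason = reason
}

func main() {
	d := &Device{ID: "GPU-0", Healthy: true}
	d.MarkUnhealthy("XID-79")
	fmt.Println(d.Healthy, d.UnhealthyReason) // false XID-79
}
```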

return &x
}

func TestTriggerDeviceListUpdate_Phase2(t *testing.T) {
Member commented:

As a matter of interest, what is Phase2? (Were these tests generated?)

Collaborator (Author) replied:

Yes, it was generated; it has now been removed.

// nvmlHealthProvider encapsulates the state and logic for NVML-based GPU
// health monitoring. This struct groups related data and provides focused
// methods for device registration and event monitoring.
type nvmlHealthProvider struct {
Member commented:

Question: Why is the refactoring done AFTER the functional changes in this PR?

stats *healthCheckStats
}
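The `healthCheckStats` the provider carries (events processed, devices marked unhealthy, errors, XID distribution) might look roughly like this. A hypothetical sketch; counter names are assumptions, not the PR's exact fields.

```go
// Hypothetical sketch of health-check statistics for operational
// visibility: event counts, unhealthy transitions, errors, and a
// per-XID histogram, all guarded by a mutex for concurrent updates.
package main

import (
	"fmt"
	"sync"
)

type healthCheckStats struct {
	mu               sync.Mutex
	eventsProcessed  uint64
	devicesUnhealthy uint64
	errors           uint64
	xidCounts        map[uint64]uint64 // XID -> occurrence count
}

func newHealthCheckStats() *healthCheckStats {
	return &healthCheckStats{xidCounts: make(map[uint64]uint64)}
}

// recordXID counts one processed event and bumps that XID's histogram bucket.
func (s *healthCheckStats) recordXID(xid uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.eventsProcessed++
	s.xidCounts[xid]++
}

func main() {
	s := newHealthCheckStats()
	s.recordXID(79)
	s.recordXID(79)
	s.recordXID(48)
	fmt.Println(s.eventsProcessed, s.xidCounts[79]) // 3 2
}
```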

// registerDeviceEvents registers NVML event handlers for all devices in the
Member commented:

How is this actually different from the changes proposed in a6a9f18?

Comment on lines +198 to +200
if result.ret == nvml.ERROR_TIMEOUT {
continue
}
Member commented:

Why do we even send the event in the case of a timeout?
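One way to address this question, as a hypothetical sketch rather than the PR's code, is to drop timeout results at the producer so consumers never see them. The `eventResult` shape follows the snippet above; the `nvml` constants here are local stand-ins.

```go
// Hypothetical sketch: filter out NVML wait timeouts at the producer so
// the event channel only carries real events or real errors. A timeout is
// an expected outcome of a bounded Wait and carries no information.
package main

import "fmt"

type nvmlReturn int

const (
	nvmlSuccess      nvmlReturn = iota // stand-in for nvml.SUCCESS
	nvmlErrorTimeout                   // stand-in for nvml.ERROR_TIMEOUT
)

type eventResult struct {
	xid uint64
	ret nvmlReturn
}

// filterTimeouts forwards everything except timeouts, then closes out.
func filterTimeouts(in <-chan eventResult, out chan<- eventResult) {
	for r := range in {
		if r.ret == nvmlErrorTimeout {
			continue
		}
		out <- r
	}
	close(out)
}

func main() {
	in := make(chan eventResult, 2)
	out := make(chan eventResult, 2)
	in <- eventResult{xid: 79, ret: nvmlSuccess}
	in <- eventResult{ret: nvmlErrorTimeout}
	close(in)
	filterTimeouts(in, out)
	for r := range out {
		fmt.Println(r.xid) // only the real event (XID 79) survives
	}
}
```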

Comment on lines +175 to +180
// Try to send event result, but respect context cancellation
select {
case <-ctx.Done():
return
case eventChan <- eventResult{event: e, ret: ret}:
}
Member commented:

This seems like the wrong way to ensure that the context has not been closed before sending to the event channel. What are we concerned about here? Is there a better way to ensure that this goroutine terminates when the context is cancelled and doesn't block permanently on the send?

Member commented:

The commit message mentions adding tests, but I only see code being removed here.

Copilot AI left a comment

Pull request overview

This PR enhances the GPU health check system to improve robustness, observability, and graceful shutdown capabilities. The changes address production stability issues by implementing context-based shutdown coordination, non-blocking device reporting, granular error handling, and health check statistics tracking.

Changes:

  • Refactored health check system with structured error handling and observability
  • Added device health state tracking with timestamps and reasons
  • Implemented non-blocking unhealthy device reporting with buffered channel
  • Added comprehensive mock NVML infrastructure for testing

Reviewed changes

Copilot reviewed 6 out of 19 changed files in this pull request and generated 6 comments.

Show a summary per file:

  • internal/rm/health.go: Major refactoring: added stats tracking, context-based shutdown, structured health provider, and non-blocking device reporting
  • internal/rm/devices.go: Added health tracking fields (LastUnhealthyTime, UnhealthyReason) and helper methods
  • internal/plugin/server.go: Increased health channel buffer size to 64 to handle event bursts
  • internal/rm/health_test.go: Added test for checkHealth with mock NVML infrastructure
  • vendor/*: Added NVML mock packages for testing (generated code)


@ArangoGutierrez force-pushed the feature/modular-health-check branch 2 times, most recently from c8f5009 to 4cb9a80 on January 12, 2026 15:16
@ArangoGutierrez marked this pull request as ready for review on January 12, 2026 15:41
@ArangoGutierrez changed the title from [WIP] Enhance health check robustness and observability to Enhance health check robustness and observability on Jan 12, 2026
@ArangoGutierrez force-pushed the feature/modular-health-check branch from 4cb9a80 to ea630af on January 15, 2026 13:23
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 21 changed files in this pull request and generated 2 comments.



Comment on lines +273 to +275
r := &nvmlResourceManager{
nvml: server,
}
Copilot AI commented Jan 15, 2026:

The test creates an nvmlResourceManager without initializing the config field, but line 414 in health.go accesses *r.config.Flags.FailOnInitError. This will cause a nil pointer dereference when the test runs. Initialize the config field with appropriate test values.

unhealthySendTimeout = 30 * time.Second

// nvmlInvalidInstanceID represents an invalid/unset value for MIG GPU and
// Compute instance IDs. Used as a sentinel value for non-MIG devices.
Copilot AI commented Jan 15, 2026:

The constant nvmlInvalidInstanceID (0xFFFFFFFF) replaces the magic number used in the original code, but this value appears to be hardcoded in multiple places. Consider documenting why this specific value is used (is it from NVML specification?) or verifying if NVML provides a constant for this sentinel value.

Suggested change
// Compute instance IDs. Used as a sentinel value for non-MIG devices.
// Compute instance IDs. Used as a sentinel value for non-MIG devices.
// The value 0xFFFFFFFF matches the "invalid instance ID" sentinel defined
// by the NVML C API. The Go bindings do not currently expose a dedicated
// constant for this, so we centralize the literal here for clarity.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 21 changed files in this pull request and generated 3 comments.



eventSet nvml.EventSet,
handleError func(nvml.Return, Devices, chan<- *Device) bool,
) error {
// Event receive channel with buffer
Copilot AI commented Jan 15, 2026:

The hardcoded magic number 10 for the eventChan buffer size should be documented with a comment explaining the rationale, similar to how healthChannelBufferSize is documented in server.go.

Suggested change
// Event receive channel with buffer
// Event receive channel with buffer.
// Buffer size 10 is chosen to absorb short bursts of NVML events without
// blocking the eventSet.Wait goroutine, while keeping memory usage
// negligible for typical workloads.

Comment on lines +89 to +95
// Timeout - ListAndWatch is likely stalled
klog.Errorf("Timeout after %v sending device %s to unhealthy channel. "+
"ListAndWatch may be stalled. Device state updated directly but "+
"kubelet may not be notified.", unhealthySendTimeout, d.ID)
// Mark unhealthy directly as last resort - kubelet won't see this
// until ListAndWatch resumes, but at least internal state is correct
d.MarkUnhealthy("channel-timeout")
Copilot AI commented Jan 15, 2026:

In the sendUnhealthyDevice function's timeout case, calling d.MarkUnhealthy("channel-timeout") may cause confusion because this overwrites the original unhealthy reason (e.g., "XID-79") with "channel-timeout". The device was already marked unhealthy at line 303 with the actual XID reason, and this timeout scenario only represents a communication failure, not a new device failure. Consider either skipping the MarkUnhealthy call in the timeout case, or logging the original reason.

Suggested change
// Timeout - ListAndWatch is likely stalled
klog.Errorf("Timeout after %v sending device %s to unhealthy channel. "+
"ListAndWatch may be stalled. Device state updated directly but "+
"kubelet may not be notified.", unhealthySendTimeout, d.ID)
// Mark unhealthy directly as last resort - kubelet won't see this
// until ListAndWatch resumes, but at least internal state is correct
d.MarkUnhealthy("channel-timeout")
// Timeout - ListAndWatch is likely stalled; device has already been
// marked unhealthy with the original reason, so avoid overwriting it
// here. We only log the communication failure.
klog.Errorf("Timeout after %v sending device %s to unhealthy channel. "+
"ListAndWatch may be stalled. Device state may not be visible to "+
"kubelet until ListAndWatch resumes.", unhealthySendTimeout, d.ID)


// CheckHealth is disabled for the tegraResourceManager
func (r *tegraResourceManager) CheckHealth(stop <-chan interface{}, unhealthy chan<- *Device) error {
func (r *tegraResourceManager) CheckHealth(_ context.Context, _ <-chan interface{}, _ chan<- *Device) error {
Copilot AI commented Jan 15, 2026:

The context parameter name "_" should be given a meaningful name (e.g., "ctx") as this is a public interface. Underscore-prefixed parameter names typically indicate intentionally unused parameters, but here the parameter is part of the interface contract and represents a deliberate API change to support context-based cancellation.

Suggested change
func (r *tegraResourceManager) CheckHealth(_ context.Context, _ <-chan interface{}, _ chan<- *Device) error {
func (r *tegraResourceManager) CheckHealth(ctx context.Context, _ <-chan interface{}, _ chan<- *Device) error {

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 21 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

internal/rm/health.go:1

  • Race condition in UnhealthyDuration: The Health field is read without holding healthMu lock, but it's written in MarkUnhealthy under lock. This creates a data race. Move the Health check inside the locked section in UnhealthyDuration method.
/*


Comment on lines +319 to +320
close(unhealthy)
<-collectorDone
Copilot AI commented Jan 15, 2026:

Potential deadlock or test hang: The unhealthy channel is closed after checkHealth returns, but if checkHealth is still trying to send to the channel when the collector goroutine exits and closes the channel, this could cause a panic (send on closed channel). The test should ensure checkHealth has fully exited before closing the channel.

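The safe ordering this comment asks for can be sketched as follows: wait for the producer (standing in for `checkHealth`) to return, then close the channel, then wait for the collector. A hypothetical illustration; `runShutdown` and the string-typed channel are simplifications.

```go
// Hypothetical sketch of a panic-free test shutdown: no goroutine can send
// on the unhealthy channel after it is closed, because the close happens
// only once the producer has provably exited.
package main

import (
	"fmt"
	"sync"
)

// runShutdown returns every device the collector drained.
func runShutdown() []string {
	unhealthy := make(chan string, 4)
	collectorDone := make(chan struct{})
	var got []string

	// Collector drains until the channel is closed.
	go func() {
		defer close(collectorDone)
		for d := range unhealthy {
			got = append(got, d)
		}
	}()

	// Producer stands in for checkHealth reporting an unhealthy device.
	var producer sync.WaitGroup
	producer.Add(1)
	go func() {
		defer producer.Done()
		unhealthy <- "GPU-0"
	}()

	producer.Wait()  // producer has exited: no further sends possible
	close(unhealthy) // only now is it safe to close
	<-collectorDone  // collector has finished draining
	return got
}

func main() {
	fmt.Println(runShutdown()) // [GPU-0]
}
```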
Add mock packages from go-nvml to enable unit testing of health
monitoring code. Includes dgxa100 mock server for simulating
DGX A100 GPU configurations in tests.

New packages:
- github.com/NVIDIA/go-nvml/pkg/nvml/mock
- github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Improve GPU health monitoring with better testability, thread-safety,
and graceful shutdown support.

Changes:
- Extract nvmlHealthProvider struct for modular, testable health logic
- Add thread-safe device health tracking with sync.RWMutex
- Add buffered health channel (64) with timeout fallback to prevent
  goroutine blocking when ListAndWatch is slow
- Add context.Context propagation to CheckHealth for graceful shutdown
- Add healthCheckStats for observability (events, XIDs, errors)
- Add withDevicePlacements wrapper for testable device placement logic
- Document XID skip list aligned with k8s-dra-driver-gpu

The buffer size of 64 handles 8 GPUs with multiple events per GPU.
Devices marked unhealthy remain in that state until external
intervention (node drain, GPU reset, reboot).

Reference: http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add unit test for health check event processing using the dgxa100
mock server. The test validates:
- XID critical error events trigger unhealthy device reports
- Events for devices in the hardcoded ignore list are skipped
- Device state is correctly tracked through the health channel

Uses the newly vendored go-nvml mock packages to simulate
GPU events without requiring actual hardware.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Initialize config in TestCheckHealth to prevent nil pointer dereference
  if NVML Init() fails (addresses Copilot review comment)
- Expand nvmlInvalidInstanceID documentation to reference NVML C API origin
  (addresses Copilot suggestion for better clarity)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Remove redundant MarkUnhealthy call in sendUnhealthyDevice timeout case.
  The device is already marked unhealthy with the real reason (e.g., XID-79)
  before sendUnhealthyDevice is called; calling it again would overwrite
  the diagnostic information.
- Document eventChan buffer size rationale.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
3 participants