Conversation
internal/plugin/server.go (Outdated)
    }()

    // Start recovery worker to detect when unhealthy devices become healthy
    go plugin.runRecoveryWorker()
Can we split the refactoring (that doesn't add any new behaviour) into a different PR from the one that adds devices becoming healthy again?

Sounds like a good idea, even more so based on your other comment #1508 (review). I wanted a refactor, but that interface is a different conversation. Going to work on splitting this PR.
elezar left a comment
In the context of the k8s-dra-driver-gpu we discussed the interface that we would expect a DeviceHealthCheckProvider to have. Where is that considered here? From the perspective of the device plugin (or its associated ResourceManager), I would expect a DeviceHealthCheckProvider to be instantiated, and we would develop against this interface.
As I discussed in NVIDIA/k8s-dra-driver-gpu#689 I would expect this interface to look something like:
    type DeviceHealthCheckProvider interface {
        Start(context.Context) error
        Stop()
        Health() <-chan Device
    }
(Alternatively, one could split the Health channel into Healthy() and Unhealthy().)
internal/plugin/server.go (Outdated)
    // If health provider not available, wait for context cancellation
    if plugin.healthProvider == nil {
        <-plugin.ctx.Done()
        return nil
    }
Under which conditions is the healthProvider nil? Could we not rather ALWAYS use at least a "no-op" healthProvider to ensure that we don't need to special case this here or at any point where we call Start or Stop?
thanks for the suggestion, adopted
internal/rm/health.go (Outdated)
    - // envDisableHealthChecks defines the environment variable that is checked to determine whether healthchecks
    - // should be disabled. If this envvar is set to "all" or contains the string "xids", healthchecks are
    - // disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that
    - // this is in addition to the Application errors that are already ignored.
    + // envDisableHealthChecks defines the environment variable that is
    + // checked to determine whether healthchecks should be disabled. If
    + // this envvar is set to "all" or contains the string "xids",
    + // healthchecks are disabled entirely. If set, the envvar is treated
    + // as a comma-separated list of Xids to ignore. Note that this is in
    + // addition to the Application errors that are already ignored.
This is a nit: for complex refactorings, keeping changes to a minimum is important, as it reduces the noise and lets us focus on the changes. In cases like these, we should update these comments as a separate commit.

Now in a [no-relnote] commit.
internal/rm/health.go (Outdated)
    nvml:       nvml,
    config:     config,
    devices:    devices,
    healthChan: make(chan *Device, 64),
The size would be len(devices) × 4, but I thought 64 was a safe hard-coded number as it covers all possible len(devices) sizes.
internal/rm/health.go (Outdated)
    if p.started {
        p.mu.Unlock()
        return fmt.Errorf("health provider already started")
    }
    p.started = true
    p.mu.Unlock()
Any reason to not defer p.mu.Unlock() instead?

defer would be simpler but slower: using defer would hold the mutex during NVML initialization, event set creation, and device registration, blocking other operations (like Stop()).

I don't agree that "slowness" is something we should optimize for. Start() and Stop() should not be running concurrently. As implemented, because we release the lock before Start() has completed (and similarly for Stop()), the remaining code may end up overlapping, which is not what we want.
Also note that we set started to true before the health monitor is actually ready, and this is not reset in the event of an error.

Suggestion adopted.
internal/rm/health.go (Outdated)
    wg sync.WaitGroup

    // State guards
    mu sync.Mutex
We could use an is-a relationship to simplify taking and releasing the lock:
    - mu sync.Mutex
    + sync.Mutex
Thanks for the suggestion, adopted.
internal/rm/health.go (Outdated)
    ret := p.nvml.Init()
    if ret != nvml.SUCCESS {
    -     if *r.config.Flags.FailOnInitError {
    +     if *p.config.Flags.FailOnInitError {
Nit: let's not rename r to p in a single commit (see comment on managing diffs).

Thanks for the suggestion, adopted; now an independent commit.
internal/rm/health.go (Outdated)
    p.xidsDisabled = getDisabledHealthCheckXids()
    if p.xidsDisabled.IsAllDisabled() {
        klog.Info("Health checks disabled via DP_DISABLE_HEALTHCHECKS")
        return nil
    }
This should happen at construction and not as Start is called. If all health checks are disabled, we should return a no-op HealthProvider.

Thanks for the suggestion, adopted.

I see that you have pulled this up to the func (r *nvmlResourceManager) HealthProvider() HealthProvider implementation. This means that we have to construct the list of disabled XIDs twice. Why not handle this (and other static config) in the NewNVMLHealthProvider constructor instead?

Moved to NewNVMLHealthProvider now.
    klog.Warningf("NVML init failed: %v; health checks disabled", ret)
    return nil
    }
    defer func() {
Could you explain the move away from a deferred shutdown?

All error paths after Init() now have individual cleanup logic.

I think this is more error-prone than it needs to be. Note that we may want to distinguish between acceptable errors and non-acceptable ones. As was the case for the related k8s-dra-driver PR, we may want to call Init() in the constructor and handle these errors (with deferred Shutdown()) separately from errors that are triggered during Start.
For example, what about updating the constructor to:
    // newNVMLHealthProvider creates a new health provider for NVML devices.
    // It does not start monitoring - the caller must call Start().
    func newNVMLHealthProvider(nvmllib nvml.Interface, config *spec.Config, devices Devices) (HealthProvider, error) {
        xids := getDisabledHealthCheckXids()
        if xids.IsAllDisabled() {
            return &noopHealthProvider{}, nil
        }
        ret := nvmllib.Init()
        if ret != nvml.SUCCESS {
            if *config.Flags.FailOnInitError {
                return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
            }
            klog.Warningf("NVML init failed: %v; health checks disabled", ret)
            return &noopHealthProvider{}, nil
        }
        defer func() {
            ret := nvmllib.Shutdown()
            if ret != nvml.SUCCESS {
                klog.Infof("Error shutting down NVML: %v", ret)
            }
        }()
        klog.Infof("Ignoring the following XIDs for health checks: %v", xids)
        p := &nvmlHealthProvider{
            nvml:         nvmllib,
            config:       config,
            devices:      devices,
            unhealthy:    make(chan *Device, 64),
            xidsDisabled: xids,
        }
        return p, nil
    }
Note that we return a noopHealthProvider if we are tolerant of Init errors at this point. However, we update Start() to look something like:
    // Start initializes NVML, registers event handlers, and starts the
    // monitoring goroutine. It blocks until initialization completes.
    func (r *nvmlHealthProvider) Start(ctx context.Context) (rerr error) {
        r.Lock()
        defer r.Unlock()
        if r.started {
            // TODO: Is this an error condition? Could we just return?
            return fmt.Errorf("health provider already started")
        }
        // Initialize NVML
        ret := r.nvml.Init()
        if ret != nvml.SUCCESS {
            return fmt.Errorf("failed to initialize NVML: %v", ret)
        }
        defer func() {
            if rerr != nil {
                _ = r.nvml.Shutdown()
            }
        }()
        // Create event set
        eventSet, ret := r.nvml.EventSetCreate()
        if ret != nvml.SUCCESS {
            return fmt.Errorf("failed to create event set: %v", ret)
        }
        defer func() {
            if rerr != nil {
                _ = eventSet.Free()
            }
        }()
        // Register devices
        if err := r.registerDevices(eventSet); err != nil {
            return fmt.Errorf("failed to register devices: %w", err)
        }
        klog.Infof("Health monitoring started for %d devices", len(r.devices))
        // Create child context
        r.ctx, r.cancel = context.WithCancel(ctx)
        // Start monitoring goroutine
        r.wg.Add(1)
        go r.runEventMonitor(eventSet)
        r.started = true
        return nil
    }
Where we take actions immediately in the event of failure so that we don't have to rely on cleanup() being called eventually for these resources.
Thanks for the detailed suggestion, adopted.
    }
    return fmt.Errorf("failed to create event set: %v", ret)
    }
    defer func() {
Is there a reason that we don't use the deferred cleanup here?

All error paths after Init() now have individual cleanup logic.
internal/rm/tegra_manager.go (Outdated)
    }

    func (n *noopHealthProvider) Start(context.Context) error {
        n.healthChan = make(chan *Device)
Why not just do this at construction?
Also, do we need to actually create a channel? Can we not leave it nil?
    r.Lock()
    if r.started {
        r.Unlock()
        return fmt.Errorf("health provider already started")
Why is this an error? Assuming we only set started once the HealthProvider / HealthMonitor has been successfully started, is there any harm in calling it again?

Agreed; it now returns nil.
internal/rm/health.go (Outdated)
    // Get XID filter configuration
    r.xidsDisabled = getDisabledHealthCheckXids()
This should be moved to construction.
internal/rm/health.go (Outdated)
    - e, ret := eventSet.Wait(5000)
    + // Wait for NVML event (5 second timeout)
    + event, ret := r.eventSet.Wait(5000)
internal/rm/health.go (Outdated)
    continue
    }
    // Create child context
    r.ctx, r.cancel = context.WithCancel(ctx)
Could you comment on whether we need to add WithCancel to the context if it already has this done for the plugin?

We now use the plugin context.
internal/plugin/server.go (Outdated)
    }()
    // Initialize and start health provider
    plugin.ctx, plugin.cancel = context.WithCancel(context.Background())
    plugin.healthProvider = plugin.rm.HealthProvider()
This should be moved to the constructor.
internal/plugin/server.go (Outdated)
    socket: getPluginSocketPath(resourceManager.Resource()),
    - // These will be reinitialized every
    - // time the plugin server is restarted.
    + // server and healthProvider will be reinitialized every time
Why can't we instantiate the healthProvider here? Why is it required to reinitialize it on every start?
Would the following be valid?
    - // server and healthProvider will be reinitialized every time
    + healthProvider: resourceManager.HealthProvider(),
    + // server and healthProvider will be reinitialized every time
internal/plugin/server.go (Outdated)
    }
    }()
    // Initialize and start health provider
    plugin.ctx, plugin.cancel = context.WithCancel(context.Background())
At which point do we rather pass in a ctx and handle the Cancel() call externally?

We now use the plugin context.
Extract device health checking logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
- Add HealthProvider interface (Start/Stop/Health methods)
- Implement nvmlHealthProvider with WaitGroup coordination
- Update ResourceManager to return HealthProvider instead of CheckHealth
- Update device plugin to use HealthProvider
- Add no-op implementation for Tegra devices
This refactoring improves code modularity and testability without changing existing behavior. It prepares the foundation for future device recovery features.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This PR is stale because it has been open 90 days with no activity. This PR will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
This patch refactors the device health check system by extracting the logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
No behavior changes - this is a pure refactoring to improve code modularity and testability.