
Add GPU health check#689

Merged
guptaNswati merged 1 commit into NVIDIA:main from guptaNswati:device-health-check
Dec 4, 2025

Conversation

@guptaNswati
Contributor

@guptaNswati guptaNswati commented Oct 17, 2025

Addressing #360 to add a preliminary health check similar to the one in https://github.com/NVIDIA/k8s-device-plugin.

  • Clean follow-up of GPU health check #545
  • Addresses review comments
  • Republishes the ResourceSlice on a health event
  • Feature gate: --set featureGates.DeviceHealthCheck=true
  • XIDs: --set kubeletPlugin.gpus.additionalXidsToIgnore="n1,n2"

Test logs:

I1017 20:16:19.799738       1 device_health.go:179] Processing event {Device:{Handle:0xe4aea6b2fef0} EventType:8 EventData:43 GpuInstanceId:7 ComputeInstanceId:0}
I1017 20:16:19.799821       1 device_health.go:192] Event for mig device: &{<nil> 0x40006a0070 Healthy}
I1017 20:16:19.799843       1 device_health.go:202] Sending unhealthy notification for device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 due to event type: 8 and event data: 43
W1017 20:16:19.799870       1 driver.go:219] Received unhealthy notification for device: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485
I1017 20:16:19.799884       1 device_state.go:558] Update device sattus:MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 healthstatus
I1017 20:16:19.799891       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c150 Healthy}
I1017 20:16:19.799955       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0ee0 Healthy}
I1017 20:16:19.799966       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0f50 Healthy}
I1017 20:16:19.799974       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0230 Healthy}
I1017 20:16:19.799983       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a02a0 Healthy}
I1017 20:16:19.799992       1 driver.go:230] Device is healthy, added to resoureslice: &{0x40002ae000 <nil> Healthy}
I1017 20:16:19.800000       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c0e0 Healthy}
W1017 20:16:19.800009       1 driver.go:233] Device:MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 with uuid:&{%!s(*main.GpuInfo=<nil>) %!s(*main.MigDeviceInfo=&{MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 1g.12gb 0x40002503c0 0x400049c130 0x4000610090 0x400045ce40 0x4000610150 0x40004840f0 0009:01:00.0 0x40002fbf50}) Unhealthy} is unhealthy
I1017 20:16:19.800021       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a00e0 Healthy}
I1017 20:16:19.800031       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c1c0 Healthy}
I1017 20:16:19.800043       1 driver.go:230] Device is healthy, added to resoureslice: &{0x40002503c0 <nil> Healthy}
I1017 20:16:19.800050       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c000 Healthy}
I1017 20:16:19.800056       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0310 Healthy}
I1017 20:16:19.800063       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0150 Healthy}
I1017 20:16:19.800069       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c070 Healthy}
I1017 20:16:19.800076       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a01c0 Healthy}
I1017 20:16:19.800084       1 driver.go:237] [Rebulishing resourceslice with healthy devices
I1017 20:16:19.800142       1 driver.go:247] Successfully republished resources without unhealthy device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485:
I1017 20:16:19.800178       1 resourceslicecontroller.go:647] "Existing slices" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" obsolete=[] current=["sc-starwars-mab9-b00-gpu.nvidia.com-kmcts"]
I1017 20:16:19.800209       1 resourceslicecontroller.go:724] "Need to update slice" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" slice="sc-starwars-mab9-b00-gpu.nvidia.com-kmcts" matchIndex=0
I1017 20:16:19.800225       1 resourceslicecontroller.go:727] "Completed comparison" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" numObsolete=0 numMatchedSlices=1 numChangedMatchedSlices=1 numNewSlices=0
I1017 20:16:19.800230       1 resourceslicecontroller.go:743] "Kept generation because at most one update API call is necessary" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" generation=1
I1017 20:16:19.805795       1 round_trippers.go:632] "Response" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" verb="PUT" url="https://10.96.0.1:443/apis/resource.k8s.io/v1beta1/resourceslices/sc-starwars-mab9-b00-gpu.nvidia.com-kmcts" status="200 OK" milliseconds=5
I1017 20:16:19.806290       1 resourceslicecontroller.go:779] "Updated existing resource slice" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" slice="sc-starwars-mab9-b00-gpu.nvidia.com-kmcts"
I1017 20:16:19.807922       1 resourceslicecontroller.go:500] "ResourceSlice update" logger="ResourceSlice controller" slice="sc-starwars-mab9-b00-gpu.nvidia.com-kmcts" diff=<
	@@ -3,8 +3,8 @@
	   "name": "sc-starwars-mab9-b00-gpu.nvidia.com-kmcts",
	   "generateName": "sc-starwars-mab9-b00-gpu.nvidia.com-",
	   "uid": "7184f664-55c1-412d-bd99-4b46e7c23846",
	-  "resourceVersion": "59000652",
	-  "generation": 1,
	+  "resourceVersion": "59001011",
	+  "generation": 2,
	   "creationTimestamp": "2025-10-17T20:14:42Z",
	   "ownerReferences": [
	    {
	@@ -20,7 +20,7 @@
	     "manager": "gpu-kubelet-plugin",
	     "operation": "Update",
	     "apiVersion": "resource.k8s.io/v1beta1",
	-    "time": "2025-10-17T20:14:42Z",
	+    "time": "2025-10-17T20:16:19Z
	.....
	.....
	
		     "name": "gpu-1",
	     "attributes": {
	      "architecture": {
	@@ -161,7 +496,7 @@
	     }
	    },
	    {
	-    "name": "gpu-0-mig-19-0-1",
	+    "name": "gpu-0-mig-19-1-1",
	     "attributes": {
	      "architecture": {
	       "string": "Hopper"
	@@ -197,56 +532,230 @@
	       "string": "mig"
	      },
	      "uuid": {
	-      "string": "MIG-4d806f22-346a-5a1d-ac01-86b505cdf485"
	-     }
	-    },

TL;DR: this device had an event and is not added: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485

The device is picked back up when the driver is restarted.

@copy-pr-bot

copy-pr-bot bot commented Oct 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@guptaNswati guptaNswati mentioned this pull request Oct 17, 2025
klog.Infof("Processing event %+v", event)
eventUUID, ret := event.Device.GetUUID()
if ret != nvml.SUCCESS {
klog.Infof("Failed to determine uuid for event %v: %v; Marking all devices as unhealthy.", event, ret)
Contributor

This seems a bit aggressive, to mark all devices as unhealthy on one invalid event. Should we log this as an error and continue watching? cc @klueska

Contributor Author

@guptaNswati guptaNswati Oct 21, 2025

Collaborator

@jgehrcke jgehrcke Oct 25, 2025

I'd also say we should log an error and otherwise proceed. Even if what you've shown here is currently done in the device plugin.

By the way, this would have been a perfect opportunity for a better code comment in the legacy code:

[screenshot of the legacy code comment]

No blame, no emotions -- but this code comment does not add information in addition to the code. The interesting bit would be if there is a specific, non-obvious reason / relevance for this style of treatment.

For example, I wonder if this code was introduced to fix a bug. I wonder if it is even ever exercised.

The way it's written and with the git blame history, it seems like it was potentially added initially (defensively) and may never have been exercised in production.

}

if err := d.pluginhelper.PublishResources(ctx, resources); err != nil {
klog.Errorf("Failed to publish resources after device health status update: %v", err)
Collaborator

@jgehrcke jgehrcke Oct 23, 2025

Naturally, I wonder why this error is only handled by logging a message. This might be the correct (or currently best) decision. But please walk the reader of the code through the arguments for ending up with that decision, using a brief code comment.

I'd like to understand thoughts here in the lines of "do not retry, because" or "this is implicitly retried later, because" or "we could crash the plugin here, but" or "the old resource slice state remains published, which is good enough", and so on. I am sure you've thought through all this.

None of this is obvious to the reader of the code, and I'd really love to have some help here to convince myself that this is the right way to handle this error.

(as always, it will pay off to document the current argumentation for our future selves, even if it's incomplete or so)

Contributor Author

Retrying makes sense. And if the retries also fail, it should be a fatal error, as it means the existing ResourceSlice is outdated.

Collaborator

Jan-Philip's comment is a request for a code doc comment, to record why retrying makes sense for future code maintainers.

Collaborator

After going over the entire loop I must agree with Jan-P here: to the external reader, it reads as if the action is the same, a log, for both the error and the non-error case. As a reader, I would like to better understand why this error is being handled as is.

Contributor Author

klog.Warningf("Received unhealthy notification for device: %s", uuid)

if !device.IsHealthy() {
klog.V(6).Infof("Device: %s is already marked unhealthy. Skip republishing resourceslice", uuid)
Collaborator

In practice, how often could we see a log message like this?

What I see here right now: we can get the d.deviceHealthMonitor.Unhealthy() event multiple times, even if we had already processed that device before. I wonder how often we should expect that to happen.

Contributor Author

There can be event bursts. @lalitadithya showed me logs of device-plugin XID errors in a cluster which clearly showed the same event logged multiple times.

@lalitadithya is it possible to share the log here?

Collaborator

showed same event logged multiple times

Interesting. Thanks for the feedback. Let's keep that in mind, this is really important detail to know. Maybe we need some kind of dedup in the future. Nothing to do here before landing this patch.
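A dedup along those lines could be sketched as below. This is a hypothetical illustration of the idea, not the driver's implementation; the function name and the channel-of-UUID shape are assumptions.

```go
package main

import "fmt"

// dedupUnhealthy drains raw unhealthy notifications and forwards each device
// UUID at most once, absorbing event bursts that repeat the same device.
// Illustrative sketch only; a real version would also need a way to reset
// entries once a device recovers.
func dedupUnhealthy(in <-chan string, out chan<- string) {
	seen := make(map[string]bool)
	for uuid := range in {
		if seen[uuid] {
			continue // drop burst duplicates for an already-reported device
		}
		seen[uuid] = true
		out <- uuid
	}
	close(out)
}

func main() {
	in := make(chan string, 8)
	out := make(chan string, 8)
	for _, u := range []string{"GPU-a", "GPU-a", "GPU-b", "GPU-a"} {
		in <- u
	}
	close(in)
	dedupUnhealthy(in, out)
	for u := range out {
		fmt.Println(u)
	}
}
```

Here the burst of four notifications collapses to one message per unique device.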

Member

@elezar elezar left a comment

Thanks for the patience in waiting on a review @guptaNswati.

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
if err != nil {
klog.Errorf("error acquiring prep/unprep lock for health status update: %v", err)
continue
Member

So this means that we don't mark the device as unhealthy in this case. Is that the intended behaviour?

Contributor Author

We should not abort the event. We should probably just log the error and update the device status anyway.

Collaborator

This thread is not relevant anymore, correct?

Contributor Author

yes.

Comment on lines +151 to +157
&cli.StringFlag{
Name: "additional-xids-to-ignore",
Usage: "A comma-separated list of additional XIDs to ignore.",
Value: "",
Destination: &flags.additionalXidsToIgnore,
EnvVars: []string{"ADDITIONAL_XIDs_TO_IGNORE"},
},
Member

In NVIDIA/k8s-device-plugin#1443 we added a list of EXPLICIT XIDs to consider fatal. This allows a user to:

  1. Specify ignored XIDs (including all)
  2. Specify SPECIFIC XIDs that are considered fatal (including all).

The important thing here is that it allows users to override the list of hard-coded XIDs that we currently track.

Contributor Author

I am aware of this and was planning to do it as a follow-up, as it only recently got merged.
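For illustration, the additional-xids-to-ignore value above might be parsed into a set roughly as follows. parseXids is a hypothetical helper, not the driver's actual parsing code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseXids parses a comma-separated list such as "13,31,43" into a set.
// Hypothetical helper for illustration; the driver's real parsing may differ.
func parseXids(s string) (map[uint64]bool, error) {
	xids := make(map[uint64]bool)
	for _, field := range strings.Split(s, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue // tolerate empty fields, e.g. trailing commas
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid XID %q: %w", field, err)
		}
		xids[xid] = true
	}
	return xids, nil
}

func main() {
	xids, err := parseXids("13, 31,43")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(xids)) // 3
}
```

Using a set makes the later "is this XID ignored?" check a single map lookup.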

continue
}

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
Collaborator

Oh! Acquiring this lock here is a big decision.

Here, I really expect a concise / precise code comment explaining convincingly

  • why this lock must be acquired
  • how we guarantee that release() is always called

Maybe start by explaining what you think will go wrong when we do not acquire this lock here at this point.

Contributor Author

This lock will prevent an unhealthy device from being allocated in a simultaneous NodePrepare() call.

Contributor Author

I need to double check that the lock is released on any failure.

Collaborator

@jgehrcke jgehrcke Oct 24, 2025

We should be mindful of acquiring this lock. Let's do it only for a strong reason. When we introduced that lock we named it prepare/unprepare lock because it's meant for that purpose.

Maybe we should use it here, too -- but let's pretty please thoroughly identify that strong reason, and put it into a few English sentences that are convincing.

I am not yet satisfied here by our arguments. We need to discuss the alternatives considered; I need more help please to understand why this is the correct approach (I really mean that -- it's not that I know what we should do -- but I sense that we don't, as a collective, understand yet what we really want to do here).

Collaborator

@jgehrcke jgehrcke Oct 24, 2025

I am thinking about the reason that you vaguely describe:

This lock will prevent unhealthy device to be allocated in a simultaneous NodePrepare call().

And I wonder: will it?

Can you describe a sequence of events where acquiring the lock would actually make that incoming nodePrepareResources() call not allocate an unhealthy device?

Here, we only update the ResourceSlice to un-announce any unhealthy device, right?

The moment we're done with that, we release the lock and the unchanged nodePrepareResources() call (that was waiting for us, hanging in lock acquisition) proceeds, trying to get what it wanted to get anyway.

When the kubelet already wants us to allocate an unhealthy device, then updating the ResourceSlice won't undo that (that gun is "fired" so to say), at least not reliably. Is that correct?

Let's say kubelet sends a PrepareResourceClaims() call our way and that may potentially result in allocating an unhealthy device (because the unhealthy notification arrived very recently).

Then I believe if we want a safe method to prevent that from happening we need to have logic within nodePrepareResource(). Does that make sense? (I quite literally don't know yet, you all have thought more about this than I did).

Specifically, I am wondering: do we need to call this new IsHealthy() somewhere in

func (s *DeviceState) prepareDevices(ctx context.Context, claim *resourceapi.ResourceClaim) (PreparedDevices, error) {
?

We can always actively fail a prepareDevices() if the only matching device turns out to be unhealthy right before we would have allocated it.

Maybe this is already similar?

		device, exists := s.allocatable[result.Device]
		if !exists {
			return nil, fmt.Errorf("requested device is not allocatable: %v", result.Device)
		}


I might be completely off in what I say here. The point is that I need to convince myself that really we know what we're doing here and that we only acquire the PU lock if we absolutely have to -- because it has potentially devastating downsides to do it unnecessarily often.

This discussion is really important to align on, and I'd love for you all to help me understand why what we're doing here is the right thing.
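The guard suggested above could look like the following sketch. DeviceState, AllocatableDevice, and prepareDevice here are simplified illustrations of the idea, not the driver's actual types or method signatures.

```go
package main

import "fmt"

// AllocatableDevice is a simplified stand-in for the driver's device record.
type AllocatableDevice struct {
	UUID    string
	Healthy bool
}

// DeviceState is a simplified stand-in for the driver's per-node state.
type DeviceState struct {
	allocatable map[string]*AllocatableDevice
}

// prepareDevice sketches the idea of failing preparation for a device that
// is already known to be unhealthy, right before it would be handed out.
func (s *DeviceState) prepareDevice(name string) (*AllocatableDevice, error) {
	device, exists := s.allocatable[name]
	if !exists {
		return nil, fmt.Errorf("requested device is not allocatable: %v", name)
	}
	// Last-moment health check: the unhealthy notification may have arrived
	// after scheduling but before preparation, so check again here.
	if !device.Healthy {
		return nil, fmt.Errorf("requested device is unhealthy: %v", name)
	}
	return device, nil
}

func main() {
	s := &DeviceState{allocatable: map[string]*AllocatableDevice{
		"gpu-0": {UUID: "GPU-a", Healthy: true},
		"gpu-1": {UUID: "GPU-b", Healthy: false},
	}}
	if _, err := s.prepareDevice("gpu-1"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The point of checking inside preparation, rather than relying on the lock, is that it catches the race window regardless of when the event arrived.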

Collaborator

acquiring the nodePrepare lock will make sure we are simultaneously not updating the health status and also allocating the same device.

But, does it? I might just not see it -- which sequence of events did you imagine, maybe?

Let me try to re-frame the problem space that I believe you are thinking about when you say "simultaneously not updating the health status and also allocating the same device".

You want to make sure that we don't allocate a device that's knowingly unhealthy. Does that sound about right?

I would agree -- let's try to do that :)

What do you think about the following mental model?

  1. Let's agree that this is generally a best-effort problem space -- between the device becoming unhealthy and us knowing there is unpredictable amount of time passing.
  2. Let's agree that this is an event propagation problem space.

I think we can also say:

  • Deep within func (s *DeviceState) Prepare() we must know, as early as possible, when a device is unhealthy. That's our final event consumer.
  • Event producer and event propagation pipeline are orthogonal to that.

The best we can do here is that we perform event propagation at minimal latency, towards the consumer.

Detour on latency

Because I find it interesting.

Between a GPU actually becoming unhealthy and us calling UpdateDeviceHealthStatus() as little time as possible should pass.

That is something that we can do and should do in this PR: minimize the fraction of event propagation latency that we control here.

Zooming out, this is always going to be a best effort strategy. Right now we seem to subscribe to GPU events more or less directly (but already now the event propagates through layers: there's the physical device, there's NVML, and then there's our process, and other layers that I am not even aware of). In the future, with NVSentinel, we're talking about event propagation across even more components.

Generally, there's a timeline attached that is unpredictable and we want to make sure we minimize latency at all steps. Here is one way to maybe think about that timeline:

T_1) a GPU actually becoming unhealthy (the 'physical' event)
T_2) us detecting it in component A
T_3) emitting an event in component A towards component B
T_4) potential-black-box-event-propagation -> after all emitted towards our GPU kubelet plugin
T_5) responding to that incoming event in our GPU kubelet plugin

Let's agree on the following: there's always a chance that we call func (s *DeviceState) Prepare() for a device after T_1 and before T_5.

We may just want to make sure we respond to the unhealthy event in the moment we receive it. We want to make sure we propagate to all its consumers ASAP.

That propagation itself does not need to be lock-protected; it just must happen fast.

Misc

Tangential, but potentially a helpful perspective: the "protected" data structure here (for now) is just device.Health and it's only mutated from within UpdateDeviceHealthStatus().

Then, also tangential, I notice that there already is a bit of a synchronization in your current patch proposal: UpdateDeviceHealthStatus() acquires the mutex on the DeviceState singleton:

func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, hs HealthStatus) {
	s.Lock()
	defer s.Unlock()
...

We acquire the same mutex in func (s *DeviceState) Prepare().

Contributor Author

This is very descriptive. Thank you for the effort, JP. Yes, this is the flow of events I imagined when I was convinced that we needed to hold the lock when updating the device health status and republishing the ResourceSlice.

unhealthy event -> lock to prevent any other operation on the device (mark device unhealthy + republish RS) -> unlock -> device unhealthy = my logic

But as you pointed out, I already acquire the lock when updating the device status, so the above lock is not really needed for the republish, and this is all best effort anyway in avoiding a potential race from T_1 to T_5.

Collaborator

@guptaNswati if you updated the code in response to this thread; please describe the update briefly towards concluding this thread.

Contributor Author

@jgehrcke I removed the lock before updating device status and republishing the ResourceSlice

Screenshot 2025-12-01 at 11 18 35 AM

1a950ca#diff-4015ee3b913a68a047b03392c3b18fbae672a6fdc26f6f76b52abd6ada951317

Collaborator

Okay! I don't want to forget about our lovely detour here and hence I leave this open. But we can consider this 'done'.

if err := d.pluginhelper.PublishResources(ctx, resources); err != nil {
klog.Errorf("Failed to publish resources after device health status update: %v", err)
} else {
klog.V(6).Info("Successfully republished resources without unhealthy device")
Collaborator

Notes:

  • I think we should log this on level 0 or 1.
  • Maybe add more detail -- all UUIDs of all currently unhealthy devices: "without unhealthy device(s) %s"
  • I am never quite a fan of "Successfully" -- it does not add information :) Let's remove that, I will try to push for that elsewhere too.

Contributor Author

I purposely didn't add the UUIDs of the unhealthy devices, as they are already logged multiple times, and when the ResourceSlice is republished it logs the diff, which shows them too.

Collaborator

Nothing to do here anymore before merging, but I'll want to leave the thread open for visibility.

@guptaNswati
Contributor Author

Thanks for the patience in waiting on a review @guptaNswati.

Thanks to you for the review @elezar.

Thanks to @jgehrcke also.

The PR looks more lively :-D

case m.unhealthy <- dev:
klog.V(6).Infof("Marked device %s as unhealthy", dev.UUID())
default:
klog.Errorf("Unhealthy channel full. Dropping unhealthy notification for device %s", dev.UUID())
Collaborator

Is this worth logging as an error? Is this a problem, really?

Are there different types of "unhealthy"? Do we drop valuable information here? If one unhealthy event is just like the other then dropping "duplicates" seems completely reasonable along happy path.

Contributor Author

I think so, or else we will miss the unique device UUID in case of non-duplicate events. What else can we do here?

Member

Dropping events here indicates a problem processing the events at the other end of the unhealthy channel. If we don't want to block here, we may want to increase the size of the channel buffer. At the moment it is set to the number of allocatable devices, which is reasonable. Do we want to add some "safety factor" to decrease the likelihood of the channel buffer being full? The other option is to block here. As mentioned, with buffered channels, we should only block if we're not receiving the unhealthy events fast enough.

Member

One thought I had, is whether it makes sense to introduce the concept of an "all" message so that the receiver could process this instead of having to send individual messages. (This would only be required if we foresee blocking / dropping messages being an actual issue in practice).

I don't think blocking this PR on deciding / implementing this is required.
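The "all" message idea could be sketched as a small message type that the receiver handles uniformly. healthEvent and applyEvent are hypothetical names for illustration; the driver currently sends individual device values instead.

```go
package main

import "fmt"

// healthEvent sketches a channel message that can mark either one device or
// all devices unhealthy, so a bulk condition needs only one message.
type healthEvent struct {
	All  bool   // when true, every known device is affected
	UUID string // otherwise, the single affected device
}

// applyEvent mutates a UUID -> healthy map according to one event.
func applyEvent(health map[string]bool, ev healthEvent) {
	if ev.All {
		for uuid := range health {
			health[uuid] = false
		}
		return
	}
	health[ev.UUID] = false
}

func main() {
	health := map[string]bool{"GPU-a": true, "GPU-b": true}
	applyEvent(health, healthEvent{UUID: "GPU-a"})
	fmt.Println(health["GPU-a"], health["GPU-b"])
	applyEvent(health, healthEvent{All: true})
	fmt.Println(health["GPU-b"])
}
```

With such a type, a "mark everything unhealthy" condition no longer risks overflowing a buffer sized to the number of devices.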

Contributor Author

I like both ideas but need to think more about them: https://github.com/NVIDIA/k8s-dra-driver-gpu/pull/689/files/c575c63685bf6f92f71972f5c5fc737517dc77ed#r2586170468

I will add this as a follow-up to-do.

Contributor Author

Added

                       // TODO: The non-blocking send protects the health-monitor goroutine from deadlocks,
                        // but dropping an unhealthy notification means the device's health transition may
                        // never reach the consumer. Consider follow-up improvements:
                        //   - increase the channel buffer beyond len(allocatable) to reduce backpressure;
                        //   - introduce a special "all devices unhealthy" message when bulk updates occur;
                        //   - or revisit whether blocking briefly here is acceptable.

// only log the error on publish failure as this is the not action we intend to keep in the long run.
// this is a temporary solution while waiting for device taints and tolerations support
// KEP: https://github.com/kubernetes/enhancements/issues/5055
// as an alternative optimize this to do a patch update rather than full republish or retry on failure
Collaborator

Can you give this code comment more love? Let's try to write correct English here (including capitalization) that is easy to understand even for external reader of the code.

Collaborator

only log the error on publish failure as this is the not action we intend to keep in the long run

You can and should use the opportunity here to express in the code comment what the potential set of problems is with not retrying on failure.

Think: we mark a device as unhealthy in an internal data structure, but then fail to re-publish the resource slice. What's the fallout? (I don't know, and I hope for the code to tell me via code comment).

Contributor Author

updated the comment to

                        // NOTE: We only log an error on publish failure and do not retry.
                        // If this publish fails, our in-memory health update succeeds but the
                        // ResourceSlice in the API server remains stale and still advertises the
                        // now-unhealthy device as allocatable. Until a later publish succeeds,
                        // the scheduler and other consumers will continue to see the unhealthy
                        // device as available, and new pods may be placed onto hardware we know
                        // is unusable. If publishes continue to fail (e.g., API server issues),
                        // the cluster can remain in this inconsistent state indefinitely.
                        // This is a temporary compromise while device taints/tolerations (KEP-5055)
                        // are available as a Beta feature. An interim improvement could be adding
                        // a retry/backoff or switch to patch updates instead of full republish.

Collaborator

@jgehrcke jgehrcke left a comment

Hey! Thanks for all the work that has happened here in the meantime. I want to leave an intermediate review comment, pointing out what I think should be major goals at this point:

  1. I think we should go the extra mile here and make sure that we never mark a device as unhealthy that isn't actually unhealthy. Before merging this patch, I'd like us to describe how we tried making sure of that.

  2. I understand that we have a major customer in mind with this patch. What do they care most about, and how much are we able to test that? (think: if there's one thing to get right here -- what would that be?)

  3. I know it's sometimes daunting, but let's try to emit quality log messages in the context of this new behavior. The mindset should be: we probably make some mistakes here, and they might manifest very rarely (might be hard to reproduce). That means: when something goes wrong in production, we should be able to debug what happened based on the logs we got on the default log level.

We will not achieve perfection along each of these three dimensions, and that's fine. But we should be focusing our energy on these points before releasing this.

@guptaNswati (Contributor, Author)

2. I understand that we have a major customer in mind with this patch. What do they care most about, and how much are we able to test that? (think: if there's one thing to get right here -- what would that be?)

I think XID errors. And re-reviewing the code, it is reporting critical XID errors. I was never directly involved in the conversations with the customer on their requirements, so I cannot answer this well. My limited direction was to follow #360. I would defer this to @klueska.

@guptaNswati (Contributor, Author)

  1. I think we should go the extra mile here and make sure that we never mark a device as unhealthy that isn't actually unhealthy. Before merging this patch, I'd like us to describe how we tried making sure of that.

In summary, these are the scenarios in which a device is marked unhealthy:

  • on a critical XID error, while explicitly skipping known benign XIDs.
  • more conservatively, when an NVML call is unsuccessful, which may mean the library can no longer see or identify GPUs: during event registration (DeviceGetHandleByUUID, GetSupportedEventTypes, RegisterEvents) or during event processing (event.Device.GetUUID).
  • on event wait errors (timeouts or non-GPU_IS_LOST), we only log and retry.

@guptaNswati (Contributor, Author)

3. I know it's sometimes daunting, but let's try to emit quality log messages in the context of this new behavior. The mindset should be: we probably make some mistakes here, and they might manifest very rarely (might be hard to reproduce). That means: when something goes wrong in production, we should be able to debug what happened based on the logs we got on the default log level.

I agree. I will update the logging and see if I can improve it further. #689 (comment)

}

klog.V(4).Infof("Sending unhealthy notification for device %s due to event type:%v and event data:%d", affectedDevice.UUID(), eType, xid)
m.unhealthy <- affectedDevice
Member

Note that although we use a non-blocking send when marking ALL devices as unhealthy, we use a blocking send for a single device. Since this is the only loop where we actually send these events, does it make sense to switch to blocking sends in all cases for simplicity (at least as a starting point)?

Contributor Author

mmmm the single-device path is driven by individual XID events. I would assume XID errors to be relatively infrequent, and if the consumer is actually slow, it's more reasonable to block here so that we don't silently drop a critical event.

in the all-devices path, the failures are mostly library and driver communication issues, which are handled best-effort by logging an error; we still want to keep the monitor alive.

@guptaNswati guptaNswati force-pushed the device-health-check branch 4 times, most recently from 9202c43 to da6cad9 on December 3, 2025 at 23:39
// the cluster can remain in this inconsistent state indefinitely.
// This is a temporary compromise while device taints/tolerations (KEP-5055)
// are available as a Beta feature. An interim improvement could be adding
// a retry/backoff or switch to patch updates instead of full republish.
Collaborator

Thanks for adding a more exhaustive and clear comment here.

@jgehrcke (Collaborator)

jgehrcke commented Dec 4, 2025

Thanks for all the work, Swati. Towards merging, can you now do the following?

  • Rebase again on current HEAD of main, and then do the /ok-to-test dance (to see if things build, and also if the test suite passes).
  • Before merge, also give the "rebase and address review comments" commit a more expressive commit message or squash it into the previous commit.
  • Describe generally how and where you tested that this new code roughly works! (sorry if I missed that somewhere)
  • Do that confirmation-testing once again for the to-be-merged commit, and give an explicit "ack I've seen this to work!" :).

Signed-off-by: Swati Gupta <swatig@nvidia.com>

rebase and address review comments

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@guptaNswati (Contributor, Author)

/ok-to-test

@copy-pr-bot
copy-pr-bot bot commented Dec 4, 2025

/ok-to-test

@guptaNswati, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@jgehrcke (Collaborator)

jgehrcke commented Dec 4, 2025

/ok-to-test d1d4eb5

@guptaNswati (Contributor, Author)

Tested on a GH200 1.32 cluster by triggering an XID 43 error using a null-pointer access:

cat trigger_xid.cu 
#include <stdio.h>

// This kernel will cause a GPU page fault by writing to a null pointer.
__global__ void illegal_memory_kernel() {
    // Create a null pointer and try to write to it.
    int* bad_ptr = NULL;
    *bad_ptr = 123;
}

int main() {
    printf("Starting kernel with illegal memory access to trigger Xid...\n");

    // Launch the kernel
    illegal_memory_kernel<<<1, 1>>>();
    
    // This synchronize will now fail with an error for the bad memory access.
    cudaError_t err = cudaDeviceSynchronize();
    
    printf("Kernel finished (or was killed). Error: %s\n", cudaGetErrorString(err));
    return 0;
}

Logs:

I1204 20:42:19.608540       1 health.go:100] starting healthcheck service at [::]:51516
I1204 20:42:19.611278       1 device_health.go:78] creating NVML events for device health monitor
I1204 20:42:19.611301       1 device_health.go:86] registering NVML events for device health monitor
I1204 20:42:19.675749       1 device_health.go:95] started device health monitoring
I1204 20:42:19.675841       1 driver.go:228] Starting to watch for device health notifications


I1204 20:46:14.089191       1 health.go:140] Successfully invoked NodePrepareResources
I1204 20:46:19.838884       1 device_health.go:184] Processing event XID=43 event
I1204 20:46:19.838936       1 device_health.go:200] Sending unhealthy notification for device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 due to event type:8 and event data:43
W1204 20:46:19.839016       1 driver.go:242] Received unhealthy notification for device: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485
I1204 20:46:19.839029       1 device_state.go:714] Updated device: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 health status to Unhealthy
I1204 20:46:19.839041       1 driver.go:259] Device: MIG-5ff5c47b-80e1-525d-bb61-5eef5ce0441a is healthy, added to ResoureSlice
I1204 20:46:19.839113       1 driver.go:259] Device: MIG-48e730aa-3a45-5cf9-b1f3-07d2916e9d83 is healthy, added to ResoureSlice
I1204 20:46:19.839122       1 driver.go:259] Device: MIG-d256cb8d-de32-55bf-833a-ef8a59112a84 is healthy, added to ResoureSlice
I1204 20:46:19.839129       1 driver.go:259] Device: MIG-8949ea7d-9166-53ce-9bac-6c58262addf9 is healthy, added to ResoureSlice
I1204 20:46:19.839136       1 driver.go:259] Device: MIG-24c6a0e5-fe28-53ac-9537-c4e23104f30f is healthy, added to ResoureSlice
I1204 20:46:19.839144       1 driver.go:259] Device: MIG-9b33318c-1fa4-58eb-a6f5-18273c6db866 is healthy, added to ResoureSlice
I1204 20:46:19.839152       1 driver.go:259] Device: MIG-d0c7cbd9-070b-58c6-9f87-cb8dd99a7214 is healthy, added to ResoureSlice
I1204 20:46:19.839159       1 driver.go:259] Device: GPU-9e6df7cb-64d4-5e53-2b1d-cee9e58aeb94 is healthy, added to ResoureSlice
I1204 20:46:19.839165       1 driver.go:259] Device: MIG-643fb61f-fb9a-553e-87d8-627fa535d3cb is healthy, added to ResoureSlice
I1204 20:46:19.839172       1 driver.go:259] Device: MIG-81544523-8683-58cd-8b72-5d797baf35de is healthy, added to ResoureSlice
W1204 20:46:19.839179       1 driver.go:262] Device: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 is unhealthy, will be removed from ResoureSlice
I1204 20:46:19.839182       1 driver.go:259] Device: MIG-ddcdfb36-d773-524c-9dc8-4075877523e6 is healthy, added to ResoureSlice
I1204 20:46:19.839192       1 driver.go:259] Device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45 is healthy, added to ResoureSlice
I1204 20:46:19.839198       1 driver.go:259] Device: MIG-4e9fed73-e2a4-512f-ad8e-e7bb75aa920e is healthy, added to ResoureSlice
I1204 20:46:19.839206       1 driver.go:259] Device: MIG-c7cbc24b-a93f-5ff3-b388-deba1f3ac17e is healthy, added to ResoureSlice
I1204 20:46:19.839214       1 driver.go:259] Device: MIG-59e178f7-0c5e-52e3-902c-bf2f31c2275f is healthy, added to ResoureSlice
I1204 20:46:19.839221       1 driver.go:266] Rebulishing resourceslice with healthy devices
I1204 20:46:19.839267       1 driver.go:287] Successfully republished resources without unhealthy device

Quick check that the unhealthy device entry is gone from the ResourceSlice while a healthy one remains:

$ kubectl get resourceslice sc-starwars-mab9-b00-gpu.nvidia.com-6h999 -o yaml | grep MIG-4d806f22-346a-5a1d-ac01-86b505cdf485
$  kubectl get resourceslice sc-starwars-mab9-b00-gpu.nvidia.com-6h999 -o yaml | grep MIG-5ff5c47b-80e1-525d-bb61-5eef5ce0441a
          string: MIG-5ff5c47b-80e1-525d-bb61-5eef5ce0441a

@jgehrcke (Collaborator)

jgehrcke commented Dec 4, 2025

triggering an xid 43 error using null pointer access:

Nice. Can you also show a snippet for how you built that? That would make it easier when I want to try this out too!

@jgehrcke jgehrcke (Collaborator) left a comment

🚀 What a journey :-). Thanks everyone for all the help and input and ideas that we threw together. And thank you Swati for bearing with us.

I just saw green CI. Let's now merge this, and then let's iterate after merge.

@guptaNswati (Contributor, Author)

triggering an xid 43 error using null pointer access:

Nice. Can you also show a snippet for how you build(t) that? So that I have that easier when I also want to try this out!

Definitely, it's very handy for testing. I used to use it for DCGM testing too.

I use nvcc and this command:
nvcc -o trigger_xid trigger_xid.cu

@guptaNswati (Contributor, Author)

🚀 What a journey :-). Thanks everyone for all the help and input and ideas that we threw together. And thank you Swati for bearing with us.

I just saw green CI. Let's now merge this, and then let's iterate after merge.

Oh I agree. Thank you @jgehrcke and thank you everyone. A lot of learnings here, and follow-up improvements to come. Let's prioritize these early next year. With all the discussions, I realized how important this feature is and that it needs to be done more thoroughly.

@guptaNswati guptaNswati merged commit 64b6718 into NVIDIA:main Dec 4, 2025
16 checks passed
@elezar (Member)

elezar commented Dec 5, 2025

Thanks @guptaNswati!

Just a note on the XID 43 that we're triggering with the example. According to the documentation this XID is a ROBUST_CHANNEL_RESETCHANNEL_VERIF_ERROR and should be ignored.

"This event is logged when a user application hits a software induced fault and must terminate. The GPU remains in a healthy state.

In most cases, this is not indicative of a driver bug but rather a user application error."

It is also in the list of XIDs that we explicitly skip. Did you run the test with some other configuration to trigger this?

@guptaNswati (Contributor, Author)

@elezar Yes, that's true. This is the only safe XID to trigger without actually breaking the GPU, so I comment out 43 from the skip list before building for testing.


Labels

feature issue/PR that proposes a new feature or functionality
