Conversation
```go
klog.Infof("Processing event %+v", event)
eventUUID, ret := event.Device.GetUUID()
if ret != nvml.SUCCESS {
	klog.Infof("Failed to determine uuid for event %v: %v; Marking all devices as unhealthy.", event, ret)
```
This seems a bit aggressive: marking all devices as unhealthy on one invalid event. Should we log this as an error and continue the watch? cc @klueska

It's how it's done in the device plugin: https://github.com/NVIDIA/k8s-device-plugin/blob/main/internal/rm/health.go#L147
I'd also say we should log an error and otherwise proceed, even if what you've shown here is currently done in the device plugin.
By the way, this would have been a perfect opportunity for a better code comment in the legacy code:
No blame, no emotions -- but this code comment does not add information beyond the code itself. The interesting bit would be whether there is a specific, non-obvious reason for, or relevance to, this style of treatment.
For example, I wonder if this code was introduced to fix a bug. I wonder if it is even ever exercised.
The way it's written and with the git blame history, it seems like it was potentially added initially (defensively) and may never have been exercised in production.
```go
}

if err := d.pluginhelper.PublishResources(ctx, resources); err != nil {
	klog.Errorf("Failed to publish resources after device health status update: %v", err)
```
Naturally, I wonder why this error is only handled by logging a message. This might be the correct (or currently best) decision. But please walk the reader of the code through the arguments for ending up with that decision, using a brief code comment.

I'd like to understand the thoughts here, along the lines of "do not retry, because...", or "this is implicitly retried later, because...", or "we could crash the plugin here, but...", or "the old resource slice state remains published, which is good enough", and so on. I am sure you've thought through all this.
None of this is obvious to the reader of the code, and I'd really love to have some help here to convince myself that this is the right way to handle this error.
(as always, it will pay off to document the current argumentation for our future selves, even if it's incomplete or so)
Retrying makes sense. And if the retries also fail, it should be a fatal error, as it means the existing ResourceSlice is outdated.
Jan-Philip's comment is a request for a code doc-comment, to record why retrying makes sense for future code maintainers.
After going over the entire loop I must agree with Jan-P here: to an external reader, it reads as if the action is the same for both the error and non-error cases -- a log line. As a reader, I would like to better understand why this error is being handled as it is.
cmd/gpu-kubelet-plugin/driver.go
Outdated
```go
klog.Warningf("Received unhealthy notification for device: %s", uuid)

if !device.IsHealthy() {
	klog.V(6).Infof("Device: %s is aleady marked unhealthy. Skip republishing resourceslice", uuid)
```
In practice, how often could we see a log message like this?
What I see here right now: we can get the d.deviceHealthMonitor.Unhealthy() event multiple times, even if we had already processed that device before. I wonder how often we should expect that to happen.
There can be event bursts. @lalitadithya showed me device-plugin logs of XID errors in a cluster which clearly showed the same event logged multiple times.
@lalitadithya is it possible to share the log here?
> showed same event logged multiple times
Interesting. Thanks for the feedback. Let's keep that in mind, this is really important detail to know. Maybe we need some kind of dedup in the future. Nothing to do here before landing this patch.
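Such a future dedup could be sketched as follows. The `dedup` helper is hypothetical and not part of this PR; in practice the `seen` set would also need to be reset when a device transitions back to healthy.

```go
package main

import "fmt"

// dedup forwards each device UUID at most once, dropping the duplicate
// notifications produced by an event burst.
func dedup(in <-chan string, out chan<- string) {
	seen := make(map[string]bool)
	for uuid := range in {
		if seen[uuid] {
			continue // duplicate notification from a burst
		}
		seen[uuid] = true
		out <- uuid
	}
	close(out)
}

func main() {
	in := make(chan string, 5)
	for _, u := range []string{"GPU-1", "GPU-1", "GPU-2", "GPU-1", "GPU-2"} {
		in <- u
	}
	close(in)

	out := make(chan string, 5)
	dedup(in, out)
	for u := range out {
		fmt.Println(u) // prints GPU-1, then GPU-2
	}
}
```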
elezar
left a comment
Thanks for the patience in waiting on a review @guptaNswati.
```go
release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
if err != nil {
	klog.Errorf("error acquiring prep/unprep lock for health status update: %v", err)
	continue
```
So this means that we don't mark the device as unhealthy in this case. Is that the intended behaviour?
We should not abort the event. We should probably just log the error and update the device status anyway.
This thread is not relevant anymore, correct?
```go
&cli.StringFlag{
	Name:        "additional-xids-to-ignore",
	Usage:       "A comma-separated list of additional XIDs to ignore.",
	Value:       "",
	Destination: &flags.additionalXidsToIgnore,
	EnvVars:     []string{"ADDITIONAL_XIDs_TO_IGNORE"},
},
```
In NVIDIA/k8s-device-plugin#1443 we added a list of EXPLICIT XIDs to consider fatal. This allows a user to:

- Specify ignored XIDs (including `all`)
- Specify SPECIFIC XIDs that are considered fatal (including `all`)

The important thing here is that it allows users to override the list of hard-coded XIDs that we currently track.
I am aware of this and was planning to do it as a follow-up, as it was only recently merged.
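For illustration, a comma-separated XID flag like the one above could be parsed along these lines. `parseXIDList` is a hypothetical sketch, not the plugin's actual parser; the special `all` value mirrors the semantics discussed for the device plugin.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseXIDList parses a comma-separated XID list such as "13,31,43".
// It returns the parsed set, whether the special value "all" was given,
// and a parse error for malformed entries.
func parseXIDList(s string) (map[uint64]bool, bool, error) {
	xids := make(map[uint64]bool)
	if strings.TrimSpace(s) == "" {
		return xids, false, nil
	}
	for _, field := range strings.Split(s, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue // tolerate trailing commas
		}
		if strings.EqualFold(field, "all") {
			return xids, true, nil
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			return nil, false, fmt.Errorf("invalid XID %q: %w", field, err)
		}
		xids[xid] = true
	}
	return xids, false, nil
}

func main() {
	ignored, all, err := parseXIDList("13, 31,43")
	fmt.Println(ignored[13], ignored[43], all, err) // prints: true true false <nil>
}
```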
cmd/gpu-kubelet-plugin/driver.go
Outdated
```go
	continue
}

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
```
Oh! Acquiring this lock here is a big decision.
Here, I really expect a concise / precise code comment explaining convincingly
- why this lock must be acquired
- how we guarantee that `release()` is always called

Maybe start by explaining what you think will go wrong when we do not acquire this lock at this point.
This lock will prevent an unhealthy device from being allocated by a simultaneous NodePrepare() call.
I need to double-check that the lock is released on any failure.
We should be mindful of acquiring this lock. Let's do it only for a strong reason. When we introduced that lock we named it prepare/unprepare lock because it's meant for that purpose.
Maybe we should use it here, too -- but let's pretty please thoroughly identify that strong reason, and put it into a few English sentences that are convincing.
I am not yet satisfied here by our arguments. We need to discuss the alternatives considered. I need more help, please, to understand why this is the correct approach (I really mean that -- it's not that I know what we should do -- but I sense that we don't, as a collective, understand yet what we really want to do here).
I am thinking about the reason that you vaguely describe:

> This lock will prevent unhealthy device to be allocated in a simultaneous NodePrepare call().
And I wonder: will it?
Can you describe a sequence of events where acquiring the lock would actually make that incoming nodePrepareResources() call not allocate an unhealthy device?
Here, we only update the ResourceSlice to un-announce any unhealthy device, right?
The moment we're done with that, we release the lock and the unchanged nodePrepareResources() call (that was waiting for us, hanging in lock acquisition) proceeds, trying to get what it wanted to get anyway.
When the kubelet already wants us to allocate an unhealthy device, then updating the ResourceSlice won't undo that (that gun is "fired" so to say), at least not reliably. Is that correct?
Let's say kubelet sends a PrepareResourceClaims() call our way and that may potentially result in allocating an unhealthy device (because the unhealthy notification arrived very recently).
Then I believe if we want a safe method to prevent that from happening we need to have logic within nodePrepareResource(). Does that make sense? (I quite literally don't know yet, you all have thought more about this than I did).
Specifically, I am wondering: do we need to call this new IsHealthy() somewhere in
We can always actively fail a prepareDevices() if the only matching device turns out to be unhealthy right before we would have allocated it.
Maybe this is already similar?
```go
device, exists := s.allocatable[result.Device]
if !exists {
	return nil, fmt.Errorf("requested device is not allocatable: %v", result.Device)
}
```
I might be completely off in what I say here. The point is that I need to convince myself that really we know what we're doing here and that we only acquire the PU lock if we absolutely have to -- because it has potentially devastating downsides to do it unnecessarily often.
This discussion is really important to align on, and I'd love for you all to help me understand why what we're doing here is the right thing.
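To make the suggested guard concrete: failing prepare early for a known-unhealthy device could look roughly like this. The types here are simplified, hypothetical stand-ins (`checkAllocatable` is not a real function in the driver; the real guard would live inside the prepare path of `DeviceState`).

```go
package main

import "fmt"

// allocatableDevice is a simplified stand-in for the driver's device type.
type allocatableDevice struct {
	uuid    string
	healthy bool
}

func (d *allocatableDevice) IsHealthy() bool { return d.healthy }

type deviceState struct {
	allocatable map[string]*allocatableDevice
}

// checkAllocatable mirrors the existing "is it allocatable" guard and adds
// the health check discussed above: fail prepare early when the matching
// device is known-unhealthy, instead of relying on ResourceSlice updates
// alone to prevent the allocation.
func (s *deviceState) checkAllocatable(name string) (*allocatableDevice, error) {
	device, exists := s.allocatable[name]
	if !exists {
		return nil, fmt.Errorf("requested device is not allocatable: %v", name)
	}
	if !device.IsHealthy() {
		return nil, fmt.Errorf("requested device is unhealthy: %v", name)
	}
	return device, nil
}

func main() {
	s := &deviceState{allocatable: map[string]*allocatableDevice{
		"gpu-0": {uuid: "GPU-aaa", healthy: true},
		"gpu-1": {uuid: "GPU-bbb", healthy: false},
	}}
	for _, name := range []string{"gpu-0", "gpu-1", "gpu-2"} {
		_, err := s.checkAllocatable(name)
		fmt.Println(name, "->", err)
	}
}
```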
> acquiring the nodePrepare lock will make sure we are simultaneously not updating the health status and also allocating the same device.

But does it? I might just not see it -- which sequence of events did you imagine, maybe?
Let me try to re-frame the problem space that I believe you are thinking about when you say "simultaneously not updating the health status and also allocating the same device".
You want to make sure that we don't allocate a device that's knowingly unhealthy. Does that sound about right?
I would agree -- let's try to do that :)
What do you think about the following mental model?
- Let's agree that this is generally a best-effort problem space -- between the device becoming unhealthy and us knowing there is unpredictable amount of time passing.
- Let's agree that this is an event propagation problem space.
I think we can also say:
- Deep within `func (s *DeviceState) Prepare()` we must know, as early as possible, when a device is unhealthy. That's our final event consumer.
- Event producer and event propagation pipeline are orthogonal to that.
The best we can do here is that we perform event propagation at minimal latency, towards the consumer.
Detour on latency
Because I find it interesting.
Between a GPU actually becoming unhealthy and us calling UpdateDeviceHealthStatus() as little time as possible should pass.
That is something that we can do and should do in this PR: minimize the fraction of event propagation latency that we control here.
Zooming out, this is always going to be a best effort strategy. Right now we seem to subscribe to GPU events more or less directly (but already now the event propagates through layers: there's the physical device, there's NVML, and then there's our process, and other layers that I am not even aware of). In the future, with NVSentinel, we're talking about event propagation across even more components.
Generally, there's a timeline attached that is unpredictable and we want to make sure we minimize latency at all steps. Here is one way to maybe think about that timeline:
T_1) a GPU actually becoming unhealthy (the 'physical' event)
T_2) us detecting it in component A
T_3) emitting an event in component A towards component B
T_4) potential-black-box-event-propagation -> after all emitted towards our GPU kubelet plugin
T_5) responding to that incoming event in our GPU kubelet plugin
Let's agree on the following: there's always a chance that we call func (s *DeviceState) Prepare() for a device after T_1 and before T_5.
We may just want to make sure we respond to the unhealthy event in the moment we receive it. We want to make sure we propagate to all its consumers ASAP.
That propagation itself does not need to be lock-protected; it just must happen fast.
Misc
Tangential, but potentially a helpful perspective: the "protected" data structure here (for now) is just device.Health and it's only mutated from within UpdateDeviceHealthStatus().
Then, also tangential, I notice that there already is a bit of a synchronization in your current patch proposal: UpdateDeviceHealthStatus() acquires the mutex on the DeviceState singleton:
```go
func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, hs HealthStatus) {
	s.Lock()
	defer s.Unlock()
	...
```
We acquire the same mutex in func (s *DeviceState) Prepare().
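To make that tangential point concrete, here is a minimal, self-contained sketch of the shared mutex: health writes and `Prepare()` already serialize on the same lock. The `DeviceState` here is heavily simplified (the real type holds much more state, and the method signatures differ); only the synchronization pattern is the point.

```go
package main

import (
	"fmt"
	"sync"
)

// DeviceState is a simplified stand-in: health updates and Prepare() share
// one mutex, so no additional prepare/unprepare flock is needed just to
// update in-memory health state.
type DeviceState struct {
	sync.Mutex
	health map[string]string
}

func (s *DeviceState) UpdateDeviceHealthStatus(uuid, status string) {
	s.Lock()
	defer s.Unlock()
	s.health[uuid] = status
}

// Prepare observes health updates consistently because it acquires the
// same mutex, and can reject a known-unhealthy device.
func (s *DeviceState) Prepare(uuid string) error {
	s.Lock()
	defer s.Unlock()
	if s.health[uuid] == "Unhealthy" {
		return fmt.Errorf("device %s is unhealthy", uuid)
	}
	return nil
}

func main() {
	s := &DeviceState{health: map[string]string{}}
	s.UpdateDeviceHealthStatus("GPU-1", "Unhealthy")
	fmt.Println(s.Prepare("GPU-1")) // non-nil error
	fmt.Println(s.Prepare("GPU-2")) // <nil>
}
```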
This is very descriptive. Thank you for the effort, JP. Yes, this is the flow of events I imagined when I was convinced that we needed to hold the lock when updating the device health status and republishing the ResourceSlice:

unhealthy event -> lock to prevent any other operation on the device (mark device unhealthy + republish RS) -> unlock -> device unhealthy = my logic

But as you pointed out, I already acquire the lock when updating the device status, so the above lock is not really needed for the republish, and this is all best effort anyway in avoiding a potential race from T_1 to T_5.
@guptaNswati if you updated the code in response to this thread, please describe the update briefly towards concluding this thread.
@jgehrcke I removed the lock before updating the device status and republishing the ResourceSlice:
1a950ca#diff-4015ee3b913a68a047b03392c3b18fbae672a6fdc26f6f76b52abd6ada951317
Okay! I don't want to forget about our lovely detour here and hence I leave this open. But we can consider this 'done'.
cmd/gpu-kubelet-plugin/driver.go
Outdated
```go
if err := d.pluginhelper.PublishResources(ctx, resources); err != nil {
	klog.Errorf("Failed to publish resources after device health status update: %v", err)
} else {
	klog.V(6).Info("Successfully republished resources without unhealthy device")
```
Notes:
- I think we should log this on level 0 or 1.
- Maybe add more detail -- all UUIDs of all currently unhealthy devices: "without unhealthy device(s) %s"
- I am never quite a fan of "Successfully" -- it does not add information :) Let's remove that, I will try to push for that elsewhere too.
I intentionally didn't add the UUIDs of the unhealthy devices, as they are already logged multiple times, and when the ResourceSlice is republished it logs the diff, which shows them too.
Nothing to do here anymore before merging, but I'll want to leave the thread open for visibility.
Thanks to you for the review @elezar. Thanks to @jgehrcke also. The PR looks more lively :-D
```go
case m.unhealthy <- dev:
	klog.V(6).Infof("Marked device %s as unhealthy", dev.UUID())
default:
	klog.Errorf("Unhealthy channel full. Dropping unhealthy notification for device %s", dev.UUID())
```
Is this worth logging as an error? Is this a problem, really?
Are there different types of "unhealthy"? Do we drop valuable information here? If one unhealthy event is just like the other then dropping "duplicates" seems completely reasonable along happy path.
I think so, or else we will miss the unique device UUID in the case of non-duplicate events. What else can we do here?
Dropping events here indicates a problem processing the events at the other end of the unhealthy channel. If we don't want to block here, we may want to increase the size of the channel buffer. At the moment it is set to the number of allocatable devices, which is reasonable. Do we want to add some "safety factor" to decrease the likelihood of the channel buffer being full? The other option is to block here. As mentioned, with buffered channels, we should only block if we're not receiving the unhealthy events fast enough.
One thought I had, is whether it makes sense to introduce the concept of an "all" message so that the receiver could process this instead of having to send individual messages. (This would only be required if we foresee blocking / dropping messages being an actual issue in practice).
I don't think blocking this PR on deciding / implementing this is required.
I like both ideas but need to think more about them: https://github.com/NVIDIA/k8s-dra-driver-gpu/pull/689/files/c575c63685bf6f92f71972f5c5fc737517dc77ed#r2586170468
I will add this as a follow-up To-do
Added
```go
// TODO: The non-blocking send protects the health-monitor goroutine from deadlocks,
// but dropping an unhealthy notification means the device's health transition may
// never reach the consumer. Consider follow-up improvements:
//   - increase the channel buffer beyond len(allocatable) to reduce backpressure;
//   - introduce a special "all devices unhealthy" message when bulk updates occur;
//   - or revisit whether blocking briefly here is acceptable.
```
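The trade-off the TODO describes can be seen in a tiny sketch of the non-blocking-send pattern. `trySend` is a hypothetical helper, not driver code; it keeps the sender from ever blocking, at the cost of dropping a notification when the buffer is full.

```go
package main

import "fmt"

// trySend attempts a non-blocking send; it returns false (dropping the
// notification) when the channel buffer is full.
func trySend(ch chan string, uuid string) bool {
	select {
	case ch <- uuid:
		return true
	default:
		return false
	}
}

func main() {
	// Buffer sized like len(allocatable); a "safety factor" would
	// simply multiply this capacity.
	unhealthy := make(chan string, 2)
	fmt.Println(trySend(unhealthy, "GPU-1")) // true
	fmt.Println(trySend(unhealthy, "GPU-2")) // true
	fmt.Println(trySend(unhealthy, "GPU-3")) // false: buffer full, dropped
}
```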
cmd/gpu-kubelet-plugin/driver.go
Outdated
```go
// only log the error on publish failure as this is the not action we intend to keep in the long run.
// this is a temporary solution while waiting for device taints and tolerations support
// KEP: https://github.com/kubernetes/enhancements/issues/5055
// as an alternative optimize this to do a patch update rather than full republish or retry on failure
```
Can you give this code comment more love? Let's try to write correct English here (including capitalization) that is easy to understand even for external reader of the code.
> only log the error on publish failure as this is the not action we intend to keep in the long run
You can and should use the opportunity here to express in the code comment what the potential set of problems is with not retrying on failure.
Think: we mark a device as unhealthy in an internal data structure, but then fail to re-publish the resource slice. What's the fallout? (I don't know, and I hope for the code to tell me via code comment).
Updated the comment to:
```go
// NOTE: We only log an error on publish failure and do not retry.
// If this publish fails, our in-memory health update succeeds but the
// ResourceSlice in the API server remains stale and still advertises the
// now-unhealthy device as allocatable. Until a later publish succeeds,
// the scheduler and other consumers will continue to see the unhealthy
// device as available, and new pods may be placed onto hardware we know
// is unusable. If publishes continue to fail (e.g., API server issues),
// the cluster can remain in this inconsistent state indefinitely.
// This is a temporary compromise while device taints/tolerations (KEP-5055)
// are available as a Beta feature. An interim improvement could be adding
// a retry/backoff or switch to patch updates instead of full republish.
```
jgehrcke
left a comment
Hey! Thanks for all the work that has happened here in the meantime. I want to leave an intermediate review comment, pointing out what I think should be major goals at this point:

- I think we should go the extra mile here and make sure that we never mark a device as unhealthy that isn't actually unhealthy. Before merging this patch, I'd like us to describe how we tried making sure of that.
- I understand that we have a major customer in mind with this patch. What do they care most about, and how much are we able to test that? (Think: if there's one thing to get right here -- what would that be?)
- I know it's sometimes daunting, but let's try to emit quality log messages in the context of this new behavior. The mindset should be: we probably make some mistakes here, and they might manifest very rarely (might be hard to reproduce). That means: when something goes wrong in production, we should be able to debug what happened based on the logs we got on the default log level.

We will not achieve perfection along each of these three dimensions, and that's fine. But we should be focusing our energy on these points before releasing this.
I think XID errors. And re-reviewing the code, it is reporting critical XID errors. I was never directly involved in the conversations with the customer on their requirements, so I cannot answer this well. My limited direction was to follow #360. I would defer this to @klueska.
In summary, these are the scenarios in which a device is marked unhealthy:

I agree. I will update the logging and see if I can improve it further. #689 (comment)
```go
}

klog.V(4).Infof("Sending unhealthy notification for device %s due to event type:%v and event data:%d", affectedDevice.UUID(), eType, xid)
m.unhealthy <- affectedDevice
```
Note that although we use a non-blocking send when marking ALL devices as unhealthy, we use a blocking send for a single device. Since this is the only loop where we actually send these events, does it make sense to switch to blocking sends in all cases for simplicity (at least as a starting point)?
Mmmm, the single-device path is driven by individual XID events. I would assume XID errors to be relatively infrequent, and in case the consumer is actually slow, it's more reasonable to block here so that we don't silently drop any critical event.

In the all-devices path, these are mostly library and driver communication issues, which are handled as best effort by logging an error, but we still want to keep the monitor alive.
```go
// the cluster can remain in this inconsistent state indefinitely.
// This is a temporary compromise while device taints/tolerations (KEP-5055)
// are available as a Beta feature. An interim improvement could be adding
// a retry/backoff or switch to patch updates instead of full republish.
```
Thanks for adding a more exhaustive and clear comment here.
Thanks for all the work, Swati. Towards merging, can you now do the following?
Signed-off-by: Swati Gupta <swatig@nvidia.com> rebase and address review comments Signed-off-by: Swati Gupta <swatig@nvidia.com>
/ok-to-test

@guptaNswati, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok-to-test d1d4eb5
Tested. A quick check of the logs, unhealthy device entry vs. healthy:

Nice. Can you also show a snippet for how you built that? So that I have it easier when I also want to try this out!
jgehrcke
left a comment
🚀 What a journey :-). Thanks everyone for all the help and input and ideas that we threw together. And thank you Swati for bearing with us.
I just saw green CI. Let's now merge this, and then let's iterate after merge.
Definitely, it's very handy for the test. I used to use it for DCGM testing also. I use nvcc and this command:
Oh, I agree. Thank you @jgehrcke, and thank you everyone. A lot of learnings here and follow-up improvements. Let's prioritize these early next year. With all the discussions, I realized how important this feature is and that it needs to be done more thoroughly.
Thanks @guptaNswati! Just a note on the XID 43 that we're triggering with the example. According to the documentation this XID is an

It is also in the list of XIDs that we explicitly skip. Did you run the test with some other configuration to trigger this?
@elezar Yes, that's true. This is the only safe XID to trigger without actually breaking the GPU, so I comment out 43 before building for testing.
Addressing #360 to add a preliminary health check similar to https://github.com/NVIDIA/k8s-device-plugin.
Test logs:
TL;DR: this device had an event and is not added: `MIG-4d806f22-346a-5a1d-ac01-86b505cdf485`. The device is picked back up when the driver is restarted.