feat: implement Device Binding Conditions for ComputeDomain #855
ttsuuubasa wants to merge 1 commit into NVIDIA:main
Conversation
**Reviewer:** First high-level comment before I dig into the details: this needs a feature gate, so that users can opt in to using this feature only on k8s clusters that also have the binding conditions feature enabled.

**Reviewer:** Another high-level comment after reading the PR description but before looking into the details. As mentioned in this comment, the recommended value for … This essentially makes the overall ComputeDomain … Instead, one can/should rely on the … This may already be what you are doing, but from the PR description this isn't very clear.
```go
m.mutationCache = cache.NewIntegerResourceVersionMutationCache(
	klog.Background(),
	m.informer.GetStore(),
	m.informer.GetIndexer(),
	mutationCacheTTL,
	true,
)
```
**Reviewer:** We are never writing anything into these daemonset pods, so why do we need a mutation cache?

**Author:** As you already noted in the following comment, this was needed in order to execute `getByComputeDomainUID`.
```go
}

func (m *DaemonSetPodManager) Get(ctx context.Context, cdUID string, nodeName string) (*corev1.Pod, error) {
	pods, err := getByComputeDomainUID[*corev1.Pod](ctx, m.mutationCache, cdUID)
```
**Reviewer:** I see, the mutation cache is added so that you can do this. `getByComputeDomainUID` should probably take an `Indexer()` as a parameter rather than a `MutationCache` to avoid this.

**Author:** I replaced the existing `getByComputeDomainUID` with a new `getByComputeDomainUIDAndNode` function that uses an indexer based on cdUID + nodeName, and updated the call sites to pass `m.informer.GetIndexer()` instead of the `MutationCache`.
```go
for _, p := range pods {
	if p.Spec.NodeName == nodeName {
		return p, nil
	}
}
```
**Reviewer:** Instead of looping like this, can you add an indexer that combines cdUID+nodeName (rather than just cdUID), since this is the combo you always want to look up with?

**Author:** I added a new indexer named "computeDomainNode" that uses "<cdUID>/<nodeName>" as its key. To support this, I defined the `addComputeDomainNodePodIndexer` and `getByComputeDomainUIDAndNode` functions in `indexers.go`.
```go
var candidates []corev1.Node
if len(nodeNames) == 0 {
	candidates = nodes.Items
} else {
	for _, nodeName := range nodeNames {
		for _, node := range nodes.Items {
			if node.Name == nodeName {
				candidates = append(candidates, node)
			}
		}
	}
}
```
**Reviewer:** When do you ever want to remove the label from some nodes but not all of them for a given ComputeDomain?

**Author:** You would want to remove the label from only some nodes when the IMEX Daemon Pod processing on a specific node fails or times out. In such cases, removing the label helps clean up the node where the failure occurred. After that, by allowing the workload pod to be rescheduled through mechanisms like BindingFailureConditions, the IMEX Daemon Pod processing can be retried on another node.
```go
func (m *ComputeDomainManager) AssertComputeDomainNamespace(ctx context.Context, claimNamespace, cdUID string) error {
	cd, err := m.GetComputeDomain(ctx, cdUID)
	if err != nil {
		return fmt.Errorf("error getting ComputeDomain: %w", err)
	}
	if cd == nil {
		return fmt.Errorf("ComputeDomain not found: %s", cdUID)
	}

	if cd.Namespace != claimNamespace {
		return fmt.Errorf("the ResourceClaim's namespace is different than the ComputeDomain's namespace")
	}

	return nil
}
```
**Reviewer:** We can't just remove these functions. The default should still be to do things the "old" way. Your new feature needs to be put behind a feature gate and only used when that feature gate is enabled.

**Author:** I have restored these functions.
```go
if err := s.computeDomainManager.AddNodeLabel(ctx, config.DomainID); err != nil {
	return nil, fmt.Errorf("error adding Node label for ComputeDomain: %w", err)
}
```
**Author:** I've updated these functions to be skipped when the ComputeDomainBindingConditions feature gate is enabled. By default, they still run.
```go
BindingConditions:        []string{nvapi.ComputeDomainBindingConditions},
BindingFailureConditions: []string{nvapi.ComputeDomainBindingFailureConditions},
```
**Reviewer:** When were these fields added to the API? Do we need a version check here to ensure that these fields are available? In any case, setting them should (at a minimum) be protected by your new feature gate.

**Author:** These fields have been added since v1.34 across resource/v1beta1, v1beta2, and v1. If we want to perform version checking, would it be better to use `ServerVersion()` and ensure that the API server is at least v1.34.0? For now, I've updated the implementation so that these fields are added only when the ComputeDomainBindingConditions feature gate is enabled.
```yaml
- apiGroups: [""]
  resources: ["events"]
  verbs: ["get", "list", "watch"]
```
**Author:** It is needed because the ResourceClaimManager watches Pod events to determine whether a binding timeout has occurred during scheduling. Since the binding timeout is triggered by the scheduler, it is not recorded in the ResourceClaim. Instead, it appears in the Pod events with `Event.reason = "SchedulerError"` and `Event.message` including "binding timeout" text. To detect this, we grant list permissions on Events.
**Reviewer:** How long until a binding timeout occurs?
```go
m.nodeManager = NewNodeManager(config, getComputeDomain)
m.daemonSetPodManager = NewDaemonSetPodManager(config)
```
**Reviewer:** We should not be instantiating another set of instances of these components. If we need them to do something in response to events happening in this component, then some other method must be used to plumb the appropriate functions between the already-created instances of these components into this one (or vice versa).

**Author:** I was thinking of resolving this by passing references to `m.nodeManager` and `m.daemonSetPodManager` into the ResourceClaimManager. Then I would implement the function `(m *MultiNamespaceDaemonSetManager) GetDaemonSetPod()`, which would invoke `(m *DaemonSetPodManager) Get` in a nested, hierarchical manner. However, when calling `(m *MultiNamespaceDaemonSetManager) GetDaemonSetPod()` from the ResourceClaimManager, how should the namespace be specified? Also, if a DaemonSet exists across multiple namespaces, will the IMEX Daemon Pod be started multiple times? I'm not entirely sure how this behavior works.
```go
informer := cache.NewSharedIndexInformer(
	&cache.ListWatch{
		ListWithContextFunc: func(ctx context.Context, options metav1.ListOptions) (runtime.Object, error) {
			return config.clientsets.Resource.ResourceClaims("").List(ctx, options)
		},
		WatchFuncWithContext: func(ctx context.Context, options metav1.ListOptions) (watch.Interface, error) {
			return config.clientsets.Resource.ResourceClaims("").Watch(ctx, options)
		},
	},
	&resourcev1.ResourceClaim{},
	informerResyncPeriod,
	cache.Indexers{},
)
```
**Reviewer:** I'm not a huge fan of this component having to watch all ResourceClaims just to look for those that have requests for ComputeDomainChannels. Is there a better way to filter these at the informer level?
/cc @pohly

**Reviewer:** Actually, thinking about this more: is there a way to have each kubelet plugin process the binding conditions for the claim destined for its node (instead of doing this in the centralized controller)? I don't want each kubelet plugin to have to watch all ResourceClaims in order to do this, but if there is a way for the kubelet plugin to know that a given claim is bound to a pod that is destined for its node (and waiting for a binding condition), that would be ideal. This way the centralized controller does not become a bottleneck for all workload pods to start. The work instead gets distributed to the local node where the workload will eventually start.
**Author:**
> I'm not a huge fan of this component having to watch all ResourceClaims just to look for those that have requests for ComputeDomainChannels. Is there a better way to filter these at the informer level?

I thought it would be ideal if the labels attached to the ResourceClaimTemplate were propagated to the ResourceClaim, just like with DaemonSet, but since that's not the case, I couldn't come up with a good approach.

> is there a way to have each kubelet plugin process the binding conditions for the claim destined for its node (instead of doing this in the centralized controller)?

As far as I understand, there is no such mechanism. The reason is that the kubelet plugin operates after the Pod has been scheduled, whereas the BindingConditions are conditions that must be set before scheduling in order for the scheduler to make a decision. The kubelet plugin also functions as the controller for ResourceSlices, but it does not watch ResourceClaims. Therefore, they cannot be processed on the kubelet side. To support this behavior, I believe the DRA framework itself would require significant changes.
**Reviewer:** I really don't like the idea of the controller being a bottleneck for releasing each workload pod after its corresponding compute-domain-daemon pod becomes ready. We have some customers running compute domains across 1000+ nodes. With the current, distributed solution we are able to get all workload pods up and running in under 12s. With this new, centralized solution it will take over 3 min (with the default QPS and burst settings) to update all 1000 resource claims with the binding condition. Given this, I'm not comfortable moving forward with this PR until we can find an elegant way of distributing these binding condition updates to each node.
**Author:** Even if we take this approach, wouldn't the controller still need to filter through all ResourceClaims in order to figure out which ResourceClaim is the target and which node it is assigned to? Is my understanding correct?

**Reviewer:** Yes, but then it's only using this information to update one object, the ComputeDomain (and can have read-only access to ResourceClaims). This isn't a long-term solution because it still leaves the controller as a bottleneck for triggering workloads to be started, but it's closer to what we want in terms of the kubelet plugins only requesting access to the one-and-only ResourceClaim they care about updating.

**Author:** Thank you for clarifying. With this approach, I believe the kubelet plugin would need to add an `updateFunc` to its ComputeDomain informer in order to detect when the controller writes to the ComputeDomain. Is that correct?
**Author:** In addition, I'd like to discuss under what circumstances we should write BindingFailureConditions. By writing BindingFailureConditions, we can trigger Pod rescheduling, which is useful in cases such as when processing in the IMEX DaemonSet Pod fails. Initially, we were considering writing BindingFailureConditions when we detect that the ComputeDomain's node status is "NotReady". However, in what situations does a node actually become NotReady? The IMEX DaemonSet Pod checks the status of its own Pod via the PodManager, but if the Pod cannot start, wouldn't it be unable to write NotReady in the first place? In other words, it seems there are only two possibilities:

- Pod startup fails → nothing is written
- Pod startup succeeds → `Ready` is written

Also, looking at the code, it appears that when the IMEX DaemonSet Pod starts up, the ComputeDomainStatusManager first writes NotReady, and then the PodManager writes Ready afterward. Because of this, I think it's not possible to tell from NotReady alone whether the Pod is still in the process of becoming ready or has actually failed. Is my understanding of the specification correct? For this reason, we were considering introducing `checkDaemonSetPodStatus()` to check the DaemonSet Pod's status externally. Given this, what kinds of cases do you think should be used as triggers for writing BindingFailureConditions?
**Author:** @klueska
I have pushed an implementation that moves most of the BindingConditions update logic to the kubelet plugin. In the controller, the ResourceClaim manager filters ResourceClaims and writes the node name and ResourceClaim name into the ComputeDomain status. In your proposal, the `computedomain.status.nodes` field was suggested as the appropriate place for this information. However, this field is synchronized by cdstatusmanager according to the startup state of the IMEX daemon pod, which means it cannot be written to before the IMEX daemon pod starts. For this reason, I added a new field, `computedomain.status.resourceclaim`. Additionally, for the NotReady issue, we found that `computedomain.status.nodes.status` is also updated by the controller's cdstatusmanager. I introduced a new Failed status in addition to NotReady and Ready as a condition for setting BindingFailureConditions. Based on these changes, I would appreciate another round of review.
```go
}

// Checks if ResourceClaim is Eligible for Compute Domain Labeling and returns domainID
func (m *ResourceClaimManager) checkClaimEligibleForComputeDomainLabeling(rc *resourcev1.ResourceClaim) (bool, string) {
```
**Reviewer:** This function should have more comments throughout it, explaining what it is doing.

**Reviewer:** Also, see my comment above about changing its name / what it returns.

**Author:** I've added the comments, and I've also updated the function name and its return value.
```go
ComputeDomainBindingConditions        = "IMEXDaemonSettingsDone"
ComputeDomainBindingFailureConditions = "IMEXDaemonSettingsFailed"
```
**Reviewer:** These variable names / their values do not seem to match their intent.

**Author:** I updated them as follows:

```go
ComputeDomainBindingConditions        = "ComputeDomainReady"
ComputeDomainBindingFailureConditions = "ComputeDomainNotReady"
```
```go
// Check namespace
err = m.AssertComputeDomainNamespace(rc.Namespace, domainID)
if err != nil {
	return fmt.Errorf("failed to assert Namespace for computeDomain with domainID %s and ResourceClaim %s/%s: %w", domainID, rc.Namespace, rc.Name, err)
}
```
**Reviewer:** This should be done as soon as we know what the cdUID is.

**Author:** I modified it so that the `AssertComputeDomainNamespace()` function is executed immediately after obtaining the domainID.
```go
// Check the allocationResult of ResourceClaim to determine whether it should be monitored.
isEligible, domainID := m.checkClaimEligibleForComputeDomainLabeling(rc)
if !isEligible {
	return nil
}

if domainID == "" {
	return fmt.Errorf("matching ResourceClaim %s/%s has no domainID in allocation config", rc.Namespace, rc.Name)
}
```
**Reviewer:** I'd rather see this as something like:

```go
// Check the allocationResult of ResourceClaim to determine whether it should be monitored.
req, err := m.getComputeDomainChannelRequestConfig(rc)
if err != nil {
	return fmt.Errorf("error getting config for ComputeDomainChannel request from ResourceClaim %s/%s: %w", rc.Namespace, rc.Name, err)
}
if req == nil {
	return nil
}
// Use req.DomainID later on, where the type of `req` is a `ComputeDomainChannelConfig`.
```

**Author:** I updated the code as you pointed out.
```go
// get info for map keys
rcUID := string(rc.UID)
rcMonitors, err := m.loadResourceClaimMonitor(rcUID)
if err != nil {
	return err
}

// Cancel a monitor that has already timed out.
m.cancelTimeoutedMonitor(ctx, rc, rcMonitors, currentAllocationTimestamp)
```
**Reviewer:** I haven't looked into the details of what these "monitors" are. Can you explain what you are trying to do here (and why it's needed)?

**Author:** These monitors refer to the ComputeDomain status monitors executed by `PollUntilContextCancel` running in separate goroutines. Since multiple ResourceClaims can be created for a single ComputeDomain, I launch a dedicated goroutine for each one so that a single controller can monitor them collectively.

The reason the ResourceClaimManager holds a cancelable context is to support "binding timeouts". If the goroutine monitoring the ComputeDomain fails to detect Ready or NotReady, the BindingConditions will never be written. If this state persists, the scheduler will, by default, trigger a timeout after 10 minutes. Once the timeout occurs, the scheduler will attempt to reschedule the Pod, which may cause the ResourceClaim to be assigned to a different node. In such cases, monitoring the `ComputeDomain.status.nodes.status` for the original node is no longer necessary, so the corresponding monitoring goroutine should be canceled. This is why the monitor includes cancellation handling.
```go
ComputeDomainChannelAllocationModeAll = "All"

ComputeDomainBindingConditions        = "IMEXDaemonSettingsDone"
ComputeDomainBindingFailureConditions = "IMEXDaemonSettingsFailed"
```
**Reviewer:** When does one enter the failed condition? Is it after some timeout? If so, what is the benefit of timing out rather than just waiting indefinitely?

**Author:** It enters the failure condition in the two failure cases described in the PR. The timeout behavior is part of the BindingConditions specification, but users can configure the scheduler to enable or disable this timeout. By default, a timeout occurs after 10 minutes. Also, in cases where a timeout happens, the BindingFailureConditions are not written.
```go
// Start the polling loop to monitor the ComputeDomain node status.
err := wait.PollUntilContextCancel(monitorCtx, pollInterval, false, pollCondition)
```
**Reviewer:** We do not want to be blocking in the body of `onAddOrUpdate`. This will block the entire state machine of the ComputeDomain controller and no other incoming tasks will be processed until this function returns.

**Author:** Because `PollUntilContextCancel` is executed in a separate goroutine, it does not block the controller's overall state machine.
```go
// Launch the monitoring goroutine
m.waitGroup.Add(1)
go func() {
	defer m.waitGroup.Done()
	defer close(doneChannel)
	pollCondition := func(pollCtx context.Context) (bool, error) {
		cd, err := m.getComputeDomain(domainID)
		if err != nil {
			return false, err
		}

		var foundNodesStatus bool
		if cd.Status.Nodes != nil {
			for _, node := range cd.Status.Nodes {
				if node.Name == nodeName {
					foundNodesStatus = true
					// Check ComputeDomain node status
					switch node.Status {
					// Ready
					case nvapi.ComputeDomainStatusReady:
						return true, nil
					// NotReady
					case nvapi.ComputeDomainStatusNotReady:
						return true, fmt.Errorf("%w: binding failed - IMEX daemon failed", ErrBindingFailure)
					}
				}
			}
		}
		// If the ComputeDomain node status has not been written.
		if !foundNodesStatus {
			// Check if IMEX Daemon Pod was started correctly
			err := m.checkDaemonSetPodStatus(pollCtx, domainID, nodeName)
			if err != nil {
				return true, err
			}
		}
		return false, nil
	}
```
**Reviewer:** We already have the daemonsetpodmanager reporting when each of its daemonset pods from a given compute domain moves in/out of the ready state. Can that not be leveraged here instead of starting our own polling loop?

**Author:** Sorry, but I couldn't understand what you meant. As far as I can see, the DaemonSetPodManager only implements the `Start`, `Stop`, and `List` functions. What do you mean when you say it reports the ready state?
```go
Device: allocationDevice.Device,
Pool:   allocationDevice.Pool,
```
**Reviewer:** I only made it part of the way through this file (there is a lot to digest in here). Please address the existing comments, and then I'll come back to this file and look at it in its new state.

**Author:** I've addressed the existing comments. There were a few points where I wasn't sure how to implement the requested changes or how to interpret the comments, so I'd appreciate it if you could review them again.
```go
for _, term := range rc.Status.Allocation.NodeSelector.NodeSelectorTerms {
	for _, field := range term.MatchFields {
		if field.Key == "metadata.name" &&
			field.Operator == "In" &&
			len(field.Values) > 0 &&
			field.Values[0] != "" {
			return field.Values[0], true
		}
	}
}
```
**Reviewer:** Is this guaranteed to be the format / way the node selector will be set for binding conditions?

**Author:** As I understand it, regardless of the presence of BindingConditions, in DRA's specification, cases where a device in a ResourceSlice specifies a nodeName (like the NVIDIA DRA Driver does) are guaranteed to follow this format according to the code below. Is my understanding correct? -> @pohly
**Author:** I introduced a feature gate called ComputeDomainBindingConditions.

**Author:** This behavior was already implemented, but I had omitted it from the PR description for the sake of simplicity. I have now added an explanation that the ResourceClaimManager watches the field …
**Author:** @klueska cc: @pohly

Short-term: …

Long-term: …

For the part about "using filtering functionality to retrieve ResourceClaims assigned to the local node", …
(force-pushed: bfecd6c to 3b78f15)
**Author:** I have two additional questions. First, currently the controller's … Second, regarding the implementation of BindingConditions in ComputeDomain: you suggested adding fields to the ComputeDomain status to store the ResourceClaim …
**Author:** I want to generalize the use of BindingConditions to reach agreement on the design for the ComputeDomain case.

[Required functionality when implementing BindingConditions in vendor DRA drivers] …

[Regarding Functionality 2] …

[Where should the four features be implemented? (ControlPlane or each Node)]
2. Functionality to detect ResourceClaims where resources with BindingConditions are allocated
3. Functionality to perform processing to satisfy BindingConditions
4. Functionality to write the result (success or failure) of whether BindingConditions was satisfied

Based on this approach, the ComputeDomain case would be as follows: …
Added a feature gate for ComputeDomainBindingConditions. When this feature gate is enabled, the following functionality is activated:

- Publish BindingConditions on channel devices in ComputeDomain ResourceSlices
- Record assigned node and ResourceClaim name/namespace into ComputeDomain via ResourceClaimManager
- Update ResourceClaims when ComputeDomain nodes become Ready
- Move namespace assertion logic from kubelet-plugin to controller

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
(force-pushed: 3b78f15 to 0444278)
**Author:** @klueska @shivamerla With your existing distributed solution, you can bring up all workload pods in under 12 seconds when running compute domains across 1,000+ nodes. We would like to see how this result differs with our approach. Previously, you mentioned that updating all 1,000 resource claims with binding conditions "will take" more than 3 min. At least since then, we have moved the binding condition update logic to the kubelet plugin, so we expect to see some performance improvements.
Summary

This PR introduces Device Binding Conditions into the ComputeDomain. This implementation allows scheduling of workload pods with channel devices to be delayed by `BindingConditions` until the IMEX Daemon Pods complete their processing. Based on the status of the IMEX Daemon Pod, the ComputeDomain Controller can determine whether the workload pod should be scheduled on the same node or rescheduled to another node. To accomplish this, we introduced the `ResourceClaimManager` into the `compute-domain-controller`.

Workflow
1. A ResourceSlice is published with its channel devices having `BindingConditions` applied.
2. The ComputeDomain Controller detects a ResourceClaim that subscribes to a channel device, then labels the node where the workload pod is scheduled and triggers the launch of the IMEX DaemonSet pod. ResourceClaims to be monitored must satisfy the following conditions:
   - the `driver` is `"compute-domain.nvidia.com"`,
   - the allocation carries a ComputeDomain channel config (`ComputeDomainChannelConfig`),
   - the allocated device has `BindingConditions`.
3. BindingConditions/BindingFailureConditions: the ComputeDomain Controller monitors the status of the ComputeDomain resource and checks the processing status of the IMEX DaemonSet pod per node (the field is `ComputeDomain.status.nodes[*].status`). This causes the ComputeDomain Controller to branch into three processing paths, "success", "failure", and "timeout", based on this status.
   - success: the controller writes `BindingConditions` and proceeds to launch the workload pod.
   - failure: the controller writes `BindingFailureConditions` and prompts the workload pod to be rescheduled.
   - timeout: if neither `BindingConditions` nor `BindingFailureConditions` are written within the timeout window, the scheduler triggers a BindingTimeout and reschedules the workload pod. The controller detects the timeout and starts monitoring the new assignment. The status monitoring information for each ResourceClaim is stored in a map indexed by a UUID x timestamp. When a timeout triggers a new assignment, any monitoring entry associated with an older timestamp is canceled.
4. cleanup: the ComputeDomain Controller performs cleanup actions, such as deleting the node labels, to terminate the IMEX Daemon Pod.
Comparison with the Existing Implementation

We newly introduced the `ResourceClaimManager` into the `compute-domain-controller` in this implementation, and we have migrated several existing functionalities that previously resided in the `compute-domain-kubelet-plugin` (e.g., handling based on the `Ready` status of the ComputeDomain).

Test
We have already completed testing for all cases except failure-1 in the above flow. We will also push the test scripts later.

Fixes #653