feat: implement Device Binding Conditions for ComputeDomain #855
ttsuuubasa wants to merge 1 commit into NVIDIA:main
Conversation
**Reviewer:** First high-level comment before I dig into the details: this needs a feature gate, so that users can opt in to using this feature only on k8s clusters that also have the binding conditions feature enabled.

**Reviewer:** Another high-level comment after reading the PR description but before looking into the details. As mentioned in this comment, the recommended value for … This essentially makes the overall ComputeDomain … Instead, one can/should rely on the … This may already be what you are doing, but from the PR description this isn't very clear.
```go
m.mutationCache = cache.NewIntegerResourceVersionMutationCache(
	klog.Background(),
	m.informer.GetStore(),
	m.informer.GetIndexer(),
	mutationCacheTTL,
	true,
)
```
**Reviewer:** We are never writing anything into these daemonset pods, so why do we need a mutation cache?

**Author:** As you already noted in the following comment, this was needed in order to execute `getByComputeDomainUID`.
```go
}

func (m *DaemonSetPodManager) Get(ctx context.Context, cdUID string, nodeName string) (*corev1.Pod, error) {
	pods, err := getByComputeDomainUID[*corev1.Pod](ctx, m.mutationCache, cdUID)
```
**Reviewer:** I see, the mutation cache is added so that you can do this. `getByComputeDomainUID` should probably take an `Indexer()` as a parameter rather than a `MutationCache` to avoid this.

**Author:** I replaced the existing `getByComputeDomainUID` with a new `getByComputeDomainUIDAndNode` function that uses an indexer based on cdUID + nodeName, and updated the call sites to pass `m.informer.GetIndexer()` instead of the `MutationCache`.
```go
for _, p := range pods {
	if p.Spec.NodeName == nodeName {
		return p, nil
	}
}
```
**Reviewer:** Instead of looping like this, can you add an indexer that combines cdUID+nodeName (rather than just cdUID), since this is the combo you always want to look up with?

**Author:** I added a new indexer named "computeDomainNode" that uses "<cdUID>/<nodeName>" as its key. To support this, I defined the `addComputeDomainNodePodIndexer` and `getByComputeDomainUIDAndNode` functions in `indexers.go`.
```go
var candidates []corev1.Node
if len(nodeNames) == 0 {
	candidates = nodes.Items
} else {
	for _, nodeName := range nodeNames {
		for _, node := range nodes.Items {
			if node.Name == nodeName {
				candidates = append(candidates, node)
			}
		}
	}
}
```
**Reviewer:** When do you ever want to remove the label from some nodes but not all of them for a given ComputeDomain?

**Author:** You would want to remove the label from only some nodes when the IMEX Daemon Pod processing on a specific node fails or times out. In such cases, removing the label helps clean up the node where the failure occurred. After that, by allowing the workload pod to be rescheduled through mechanisms like BindingFailureConditions, the IMEX Daemon Pod processing can be retried on another node.
```go
func (m *ComputeDomainManager) AssertComputeDomainNamespace(ctx context.Context, claimNamespace, cdUID string) error {
	cd, err := m.GetComputeDomain(ctx, cdUID)
	if err != nil {
		return fmt.Errorf("error getting ComputeDomain: %w", err)
	}
	if cd == nil {
		return fmt.Errorf("ComputeDomain not found: %s", cdUID)
	}

	if cd.Namespace != claimNamespace {
		return fmt.Errorf("the ResourceClaim's namespace is different than the ComputeDomain's namespace")
	}

	return nil
}
```
**Reviewer:** We can't just remove these functions. The default should still be to do things the "old" way. Your new feature needs to be put behind a feature gate and only used when that feature gate is enabled.

**Author:** I have restored these functions.
```go
if err := s.computeDomainManager.AddNodeLabel(ctx, config.DomainID); err != nil {
	return nil, fmt.Errorf("error adding Node label for ComputeDomain: %w", err)
}
```
**Author:** I've updated these functions to be skipped when the ComputeDomainBindingConditions feature gate is enabled. By default, they still run.
```go
BindingConditions:        []string{nvapi.ComputeDomainBindingConditions},
BindingFailureConditions: []string{nvapi.ComputeDomainBindingFailureConditions},
```
**Reviewer:** When were these fields added to the API? Do we need a version check here to ensure that these fields are available? In any case, setting them should (at a minimum) be protected by your new feature gate.

**Author:** These fields have been added since v1.34 across resource/v1beta1, v1beta2, and v1. If we want to perform version checking, would it be better to use `ServerVersion()` and ensure that the API server is at least v1.34.0? For now, I've updated the implementation so that these fields are added only when the ComputeDomainBindingConditions feature gate is enabled.
```yaml
- apiGroups: [""]
  resources: ["events"]
  verbs: ["get", "list", "watch"]
```
**Author:** It is needed because the ResourceClaimManager watches Pod events to determine whether a binding timeout has occurred during scheduling. Since the binding timeout is triggered by the scheduler, it is not recorded in the ResourceClaim. Instead, it appears in the Pod events with `Event.reason = "SchedulerError"` and `Event.message` including "binding timeout" text. To detect this, we grant list permissions on Events.
**Reviewer:** How long until a binding timeout occurs?
```go
m.nodeManager = NewNodeManager(config, getComputeDomain)
m.daemonSetPodManager = NewDaemonSetPodManager(config)
```
**Reviewer:** We should not be instantiating another set of instances of these components. If we need them to do something in response to events happening in this component, then some other method must be used to plumb the appropriate functions between the already-created instances of these components into this one (or vice versa).

**Author:** I was thinking of resolving this by passing references to `m.nodeManager` and `m.daemonSetPodManager` into the ResourceClaimManager. Then I would implement the function `(m *MultiNamespaceDaemonSetManager) GetDaemonSetPod()`, which would invoke `(m *DaemonSetPodManager) Get` in a nested, hierarchical manner. However, when calling `(m *MultiNamespaceDaemonSetManager) GetDaemonSetPod()` from the ResourceClaimManager, how should the namespace be specified? Also, if a DaemonSet exists across multiple namespaces, will the IMEX Daemon Pod be started multiple times? I'm not entirely sure how this behavior works.
```go
informer := cache.NewSharedIndexInformer(
	&cache.ListWatch{
		ListWithContextFunc: func(ctx context.Context, options metav1.ListOptions) (runtime.Object, error) {
			return config.clientsets.Resource.ResourceClaims("").List(ctx, options)
		},
		WatchFuncWithContext: func(ctx context.Context, options metav1.ListOptions) (watch.Interface, error) {
			return config.clientsets.Resource.ResourceClaims("").Watch(ctx, options)
		},
	},
	&resourcev1.ResourceClaim{},
	informerResyncPeriod,
	cache.Indexers{},
)
```
**Reviewer:** I'm not a huge fan of this component having to watch all ResourceClaims just to look for those that have requests for ComputeDomainChannels. Is there a better way to filter these at the informer level?
/cc @pohly

**Reviewer:** Actually, thinking about this more: is there a way to have each kubelet plugin process the binding conditions for the claim destined for its node (instead of doing this in the centralized controller)? I don't want each kubelet plugin to have to watch all ResourceClaims in order to do this, but if there is a way for the kubelet plugin to know that a given claim is bound to a pod that is destined for its node (and waiting for a binding condition), that would be ideal. This way the centralized controller does not become a bottleneck for all workload pods to start. The work instead gets distributed to the local node where the workload will eventually start.
**Author:**
> I'm not a huge fan of this component having to watch all ResourceClaims just to look for those that have requests for ComputeDomainChannels. Is there a better way to filter these at the informer level?

I thought it would be ideal if the labels attached to the ResourceClaimTemplate were propagated to the ResourceClaim, just like with DaemonSet, but since that's not the case, I couldn't come up with a good approach.

> is there a way to have each kubelet plugin process the binding conditions for the claim destined for its node (instead of doing this in the centralized controller)?

As far as I understand, there is no such mechanism. The reason is that the kubelet plugin operates after the Pod has been scheduled, whereas the BindingConditions are conditions that must be set before scheduling in order for the scheduler to make a decision. The kubelet plugin also functions as the controller for ResourceSlices, but it does not watch ResourceClaims. Therefore, they cannot be processed on the kubelet side. To support this behavior, I believe the DRA framework itself would require significant changes.
**Reviewer:** I really don't like the idea of the controller being a bottleneck for releasing each workload pod after its corresponding compute-domain-daemon pod becomes ready. We have some customers running compute domains across 1000+ nodes. With the current, distributed solution we are able to get all workload pods up and running in under 12s. With this new, centralized solution it will take over 3 min (with the default QPS and burst settings) to update all 1000 resource claims with the binding condition. Given this, I'm not comfortable moving forward with this PR until we can find an elegant way of distributing these binding condition updates to each node.
**Author:** Even if we take this approach, wouldn't the controller still need to filter through all ResourceClaims in order to figure out which ResourceClaim is the target and which node it is assigned to? Is my understanding correct?

**Reviewer:** Yes, but then it's only using this information to update one object, the ComputeDomain (and can have read-only access to ResourceClaims). This isn't a long-term solution because it still leaves the controller as a bottleneck for triggering workloads to be started, but it's closer to what we want in terms of the kubelet plugins only requesting access to the one-and-only ResourceClaim they care about updating.

**Author:** Thank you for clarifying. With this approach, I believe the kubelet plugin would need to add an `updateFunc` to its ComputeDomain informer in order to detect when the controller writes to the ComputeDomain. Is that correct?
**Author:** In addition, I'd like to discuss under what circumstances we should write BindingFailureConditions. By writing BindingFailureConditions, we can trigger Pod rescheduling, which is useful in cases such as when processing in the IMEX DaemonSet Pod fails. Initially, we were considering writing BindingFailureConditions when we detect that the ComputeDomain's node status is "NotReady". However, in what situations does a node actually become NotReady? The IMEX DaemonSet Pod checks the status of its own Pod via the PodManager, but if the Pod cannot start, wouldn't it be unable to write NotReady in the first place? In other words, it seems there are only two possibilities:

- Pod startup fails → nothing is written
- Pod startup succeeds → `Ready` is written

Also, looking at the code, it appears that when the IMEX DaemonSet Pod starts up, the ComputeDomainStatusManager first writes NotReady, and then the PodManager writes Ready afterward. Because of this, I think it's not possible to tell from NotReady alone whether the Pod is still in the process of becoming ready or has actually failed. Is my understanding of the specification correct? For this reason, we were considering introducing `checkDaemonSetPodStatus()` to check the DaemonSet Pod's status externally. Given this, what kinds of cases do you think should be used as triggers for writing BindingFailureConditions?
**Author:** @klueska
I have pushed an implementation that moves most of the BindingConditions update logic to the kubelet plugin. In the controller, the ResourceClaim manager filters ResourceClaims and writes the node name and ResourceClaim name into the ComputeDomain status. In your proposal, the `computedomain.status.nodes` field was suggested as the appropriate place for this information. However, this field is synchronized by cdstatusmanager according to the startup state of the IMEX daemon pod, which means it cannot be written to before the IMEX daemon pod starts. For this reason, I added a new field, `computedomain.status.resourceclaim`. Additionally, for the NotReady issue, we found that `computedomain.status.nodes.status` is also updated by the controller's cdstatusmanager. I introduced a new Failed status in addition to NotReady and Ready as a condition for setting BindingFailureConditions. Based on these changes, I would appreciate another round of review.
```go
}

// Checks if ResourceClaim is Eligible for Compute Domain Labeling and returns domainID
func (m *ResourceClaimManager) checkClaimEligibleForComputeDomainLabeling(rc *resourcev1.ResourceClaim) (bool, string) {
```
**Reviewer:** This function should have more comments throughout it, explaining what it is doing.

**Reviewer:** Also, see my comment above about changing its name / what it returns.

**Author:** I've added the comments, and I've also updated the function name and its return value.
```go
ComputeDomainBindingConditions        = "IMEXDaemonSettingsDone"
ComputeDomainBindingFailureConditions = "IMEXDaemonSettingsFailed"
```
**Reviewer:** These variable names / their values do not seem to match their intent.

**Author:** I updated them as follows:

```go
ComputeDomainBindingConditions        = "ComputeDomainReady"
ComputeDomainBindingFailureConditions = "ComputeDomainNotReady"
```
```go
// Check namespace
err = m.AssertComputeDomainNamespace(rc.Namespace, domainID)
if err != nil {
	return fmt.Errorf("failed to assert Namespace for computeDomain with domainID %s and ResourceClaim %s/%s: %w", domainID, rc.Namespace, rc.Name, err)
}
```
**Reviewer:** This should be done as soon as we know what the cdUID is.

**Author:** I modified it so that the `AssertComputeDomainNamespace()` function is executed immediately after obtaining the domainID.
```go
// Check the allocationResult of ResourceClaim to determine whether it should be monitored.
isEligible, domainID := m.checkClaimEligibleForComputeDomainLabeling(rc)
if !isEligible {
	return nil
}

if domainID == "" {
	return fmt.Errorf("matching ResourceClaim %s/%s has no domainID in allocation config", rc.Namespace, rc.Name)
}
```
**Reviewer:** I'd rather see this as something like:

```go
// Check the allocationResult of ResourceClaim to determine whether it should be monitored.
req, err := m.getComputeDomainChannelRequestConfig(rc)
if err != nil {
	return fmt.Errorf("error getting config for ComputeDomainChannel request from ResourceClaim %s/%s: %w", rc.Namespace, rc.Name, err)
}
if req == nil {
	return nil
}
// Use req.DomainID later on, where the type of `req` is a `ComputeDomainChannelConfig`.
```

**Author:** I updated the code as you pointed out.
```go
// get info for map keys
rcUID := string(rc.UID)
rcMonitors, err := m.loadResourceClaimMonitor(rcUID)
if err != nil {
	return err
}

// Cancel a monitor that has already timed out.
m.cancelTimeoutedMonitor(ctx, rc, rcMonitors, currentAllocationTimestamp)
```
**Reviewer:** I haven't looked into the details of what these "monitors" are. Can you explain what you are trying to do here (and why it's needed)?

**Author:** These monitors refer to the ComputeDomain status monitors executed by `PollUntilContextCancel` running in separate goroutines. Since multiple ResourceClaims can be created for a single ComputeDomain, I launch a dedicated goroutine for each one so that a single controller can monitor them collectively.

The reason the ResourceClaimManager holds a cancelable context is to support "binding timeouts". If the goroutine monitoring the ComputeDomain fails to detect Ready or NotReady, the BindingConditions will never be written. If this state persists, the scheduler will, by default, trigger a timeout after 10 minutes. Once the timeout occurs, the scheduler will attempt to reschedule the Pod, which may cause the ResourceClaim to be assigned to a different node. In such cases, monitoring the `ComputeDomain.status.nodes.status` for the original node is no longer necessary, so the corresponding monitoring goroutine should be canceled. This is why the monitor includes cancellation handling.
```go
ComputeDomainChannelAllocationModeAll = "All"

ComputeDomainBindingConditions        = "IMEXDaemonSettingsDone"
ComputeDomainBindingFailureConditions = "IMEXDaemonSettingsFailed"
```
**Reviewer:** When does one enter the failed condition? Is it after some timeout? If so, what is the benefit of timing out rather than just waiting indefinitely?

**Author:** It enters the failure condition in the two failure cases described in the PR. The timeout behavior is part of the BindingConditions specification, but users can configure the scheduler to enable or disable this timeout. By default, a timeout occurs after 10 minutes. Also, in cases where a timeout happens, the BindingFailureConditions are not written.
```go
// Start the polling loop to monitor the ComputeDomain node status.
err := wait.PollUntilContextCancel(monitorCtx, pollInterval, false, pollCondition)
```
**Reviewer:** We do not want to be blocking in the body of `onAddOrUpdate`. This will block the entire state machine of the ComputeDomain controller and no other incoming tasks will be processed until this function returns.

**Author:** Because `PollUntilContextCancel` is executed in a separate goroutine, it does not block the controller's overall state machine.
```go
// Launch the monitoring goroutine
m.waitGroup.Add(1)
go func() {
	defer m.waitGroup.Done()
	defer close(doneChannel)
	pollCondition := func(pollCtx context.Context) (bool, error) {
		cd, err := m.getComputeDomain(domainID)
		if err != nil {
			return false, err
		}

		var foundNodesStatus bool
		if cd.Status.Nodes != nil {
			for _, node := range cd.Status.Nodes {
				if node.Name == nodeName {
					foundNodesStatus = true
					// Check ComputeDomain node status
					switch node.Status {
					// Ready
					case nvapi.ComputeDomainStatusReady:
						return true, nil
					// NotReady
					case nvapi.ComputeDomainStatusNotReady:
						return true, fmt.Errorf("%w: binding failed - IMEX daemon failed", ErrBindingFailure)
					}
				}
			}
		}
		// If the ComputeDomain node status has not been written.
		if !foundNodesStatus {
			// Check if IMEX Daemon Pod was started correctly
			err := m.checkDaemonSetPodStatus(pollCtx, domainID, nodeName)
			if err != nil {
				return true, err
			}
		}
		return false, nil
	}
```
**Reviewer:** We already have the daemonsetpodmanager reporting when each of its daemonset pods from a given compute domain moves in/out of the ready state. Can that not be leveraged here instead of starting our own polling loop?

**Author:** Sorry, but I couldn't understand what you meant. As far as I can see, the DaemonSetPodManager only implements the `Start`, `Stop`, and `List` functions. What do you mean when you say it reports the ready state?
```go
Device: allocationDevice.Device,
Pool:   allocationDevice.Pool,
```
**Reviewer:** I only made it part of the way through this file (there is a lot to digest in here). Please address the existing comments, and then I'll come back to this file and look at it in its new state.

**Author:** I've addressed the existing comments. There were a few points where I wasn't sure how to implement the requested changes or how to interpret the comments, so I'd appreciate it if you could review them again.
```go
for _, term := range rc.Status.Allocation.NodeSelector.NodeSelectorTerms {
	for _, field := range term.MatchFields {
		if field.Key == "metadata.name" &&
			field.Operator == "In" &&
			len(field.Values) > 0 &&
			field.Values[0] != "" {
			return field.Values[0], true
		}
	}
}
```
**Reviewer:** Is this guaranteed to be the format / way the node selector will be set for binding conditions?

**Author:** As I understand it, regardless of the presence of BindingConditions, in DRA's specification, cases where a device in a ResourceSlice specifies a nodeName (like the NVIDIA DRA Driver does) are guaranteed to follow this format according to the code below. Is my understanding correct? -> @pohly
**Author:** I introduced a feature gate called ComputeDomainBindingConditions.

**Author:** This behavior was already implemented, but I had omitted it from the PR description for the sake of simplicity. I have now added an explanation that the ResourceClaimManager watches the field …
**Author:** @klueska cc: @pohly

Short-term: …

Long-term: …

For the part about "using filtering functionality to retrieve ResourceClaims assigned to the local node", …
(force-pushed: bfecd6c to 3b78f15)
**Author:** I have two additional questions. First, currently the controller's … Second, regarding the implementation of BindingConditions in ComputeDomain: you suggested adding fields to the ComputeDomain status to store the ResourceClaim …
**Author:** I want to generalize the use of BindingConditions to reach agreement on the design for the ComputeDomain case.

[Required functionality when implementing BindingConditions in vendor DRA drivers] …

[Regarding Functionality 2] …

[Where should the four features be implemented? (ControlPlane or each Node)]
2. Functionality to detect ResourceClaims where resources with BindingConditions are allocated
3. Functionality to perform processing to satisfy BindingConditions
4. Functionality to write the result (success or failure) of whether BindingConditions was satisfied

Based on this approach, the ComputeDomain case would be as follows: …
Added a feature gate for ComputeDomainBindingConditions. When this feature gate is enabled, the following functionality is activated:

- Publish BindingConditions on channel devices in ComputeDomain ResourceSlices
- Record assigned node and ResourceClaim name/namespace into ComputeDomain via ResourceClaimManager
- Update ResourceClaims when ComputeDomain nodes become Ready
- Move namespace assertion logic from kubelet-plugin to controller

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
(force-pushed: 3b78f15 to 0444278)
**Author:** @klueska @shivamerla With your existing distributed solution, you can bring up all workload pods in under 12 seconds when running compute domains across 1,000+ nodes. We would like to see how this result differs with our approach. Previously, you mentioned that updating all 1,000 resource claims with binding conditions "will take" more than 3 min. At least since then, we have moved the binding condition update logic to the kubelet plugin, so we expect to see some performance improvements.
Summary

This PR introduces Device Binding Conditions into the ComputeDomain. This implementation allows scheduling of workload pods with channel devices to be delayed by `BindingConditions` until the IMEX Daemon Pods complete their processing. Based on the status of the IMEX Daemon Pod, the ComputeDomain Controller can determine whether the workload pod should be scheduled on the same node or rescheduled to another node. To accomplish this, we introduced the `ResourceClaimManager` into the `compute-domain-controller`.

Workflow
1. A ResourceSlice is published with its channel devices having `BindingConditions` applied.
2. The ComputeDomain Controller detects a ResourceClaim that subscribes to a channel device, then labels the node where the workload pod is scheduled and triggers the launch of the IMEX DaemonSet pod. ResourceClaims to be monitored must satisfy the following conditions:
   - the `driver` is `"compute-domain.nvidia.com"`,
   - the allocation carries a ComputeDomain channel config (`ComputeDomainChannelConfig`),
   - the allocated device has `BindingConditions`.
3. BindingConditions/BindingFailureConditions: the ComputeDomain Controller monitors the status of the ComputeDomain resource and checks the processing status of the IMEX DaemonSet pod per node (the field is `ComputeDomain.status.nodes[*].status`). This causes the ComputeDomain Controller to branch into three processing paths, "success", "failure", and "timeout", based on this status.
   - success: the controller writes `BindingConditions` and proceeds to launch the workload pod.
   - failure: the controller writes `BindingFailureConditions` and prompts the workload pod to be rescheduled.
   - timeout: if neither `BindingConditions` nor `BindingFailureConditions` are written within the timeout window, the scheduler triggers a BindingTimeout and reschedules the workload pod. The controller detects the timeout and starts monitoring the new assignment. The status monitoring information for each ResourceClaim is stored in a map indexed by a UUID x timestamp. When a timeout triggers a new assignment, any monitoring entry associated with an older timestamp is canceled.
4. cleanup: the ComputeDomain Controller performs cleanup actions, such as deleting the node labels, to terminate the IMEX Daemon Pod.
Comparison with the Existing Implementation

We newly introduced the `ResourceClaimManager` into the `compute-domain-controller` in this implementation, and we have migrated several existing functionalities that previously resided in the `compute-domain-kubelet-plugin` (e.g., handling based on the `Ready` status of the ComputeDomain).

Test
We have already completed testing for all cases except failure-1 in the above flow. We will also push the test scripts later.

Fixes #653