
feat: implement Device Binding Conditions for ComputeDomain #855

Open
ttsuuubasa wants to merge 1 commit into NVIDIA:main from ttsuuubasa:binding-conditions-in-compute-domain

Conversation

@ttsuuubasa

@ttsuuubasa ttsuuubasa commented Feb 6, 2026

Summary

This PR introduces Device Binding Conditions into the ComputeDomain.
With this change, scheduling of workload pods that request channel devices is delayed by BindingConditions until the IMEX Daemon Pods complete their processing. Based on the status of the IMEX Daemon Pod, the ComputeDomain Controller determines whether the workload pod should be scheduled on the same node or rescheduled to another node. To accomplish this, we introduced a ResourceClaimManager into the compute-domain-controller.

Workflow

  • ResourceSlice is published with its channel devices having BindingConditions applied.

  • ComputeDomain Controller detects a ResourceClaim that subscribes to a channel device, labels the node where the workload pod is scheduled, and triggers the launch of the IMEX DaemonSet pod.

  • ResourceClaims to be monitored must satisfy the following conditions:

    • The driver is "compute-domain.nvidia.com".
    • The device is a channel device (determined by whether its corresponding config is ComputeDomainChannelConfig).
    • The device has BindingConditions declared.
    • No BindingConditions/BindingFailureConditions have yet been set (resolved) for the device in the claim's status.
  • The ComputeDomain Controller monitors the status of the ComputeDomain resource and checks the processing status of the IMEX DaemonSet pod per node - the field is ComputeDomain.status.nodes[*].status.

  • The ComputeDomain Controller branches into three processing paths based on this status: "success", "failure", and "timeout".

    success

    • Ready - IMEX DaemonSet pod completes successfully and the ComputeDomain status is updated to "Ready". ComputeDomain Controller sets the BindingConditions and proceeds to launch the workload pod.

    failure

    1. NotReady - The IMEX DaemonSet Pod fails and the ComputeDomain status is updated to "NotReady"; the ComputeDomain Controller sets the BindingFailureConditions and prompts the workload pod to be rescheduled.
    2. The IMEX Daemon Pod fails and no status is written to the ComputeDomain status; the ComputeDomain Controller checks the status of the IMEX Daemon Pod directly. Once it confirms that startup has failed (for example, a CrashLoopBackOff state), it sets the BindingFailureConditions and prompts the workload pod to be rescheduled.

    timeout

    • If the ComputeDomain Controller is in neither the success nor the failure state, it keeps monitoring. If no BindingConditions / BindingFailureConditions are written within the timeout window, the scheduler triggers a BindingTimeout and reschedules the workload pod. The controller detects the timeout and starts monitoring the new assignment. The monitoring state for each ResourceClaim is stored in a map keyed by its UID and allocation timestamp; when a timeout triggers a new assignment, any monitoring entry associated with an older timestamp is canceled.
  • Cleanup: the ComputeDomain Controller performs cleanup actions, such as deleting the node labels, to terminate the IMEX Daemon Pod.
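The claim-filtering conditions above can be sketched as a single predicate. This is an illustrative stand-in with invented type and field names, not the PR's actual code; the real types live in the Kubernetes resource API:

```go
package main

import "fmt"

// AllocatedDevice is a simplified stand-in for the DRA allocation-result
// types referenced in the description (hypothetical fields for illustration).
type AllocatedDevice struct {
	Driver                   string
	IsChannelConfig          bool     // true if its config decodes to ComputeDomainChannelConfig
	BindingConditions        []string // conditions declared on the device in the ResourceSlice
	BindingFailureConditions []string
	ConditionsResolved       bool // true once conditions have been set on the claim's status
}

// shouldMonitor mirrors the four eligibility conditions from the PR
// description: correct driver, channel device, BindingConditions declared,
// and no conditions resolved on the claim yet.
func shouldMonitor(d AllocatedDevice) bool {
	if d.Driver != "compute-domain.nvidia.com" {
		return false
	}
	if !d.IsChannelConfig {
		return false
	}
	if len(d.BindingConditions) == 0 {
		return false
	}
	if d.ConditionsResolved {
		return false
	}
	return true
}

func main() {
	d := AllocatedDevice{
		Driver:            "compute-domain.nvidia.com",
		IsChannelConfig:   true,
		BindingConditions: []string{"ComputeDomainReady"},
	}
	fmt.Println(shouldMonitor(d)) // true
}
```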

Comparison with the Existing Implementation

In this implementation we newly introduced the ResourceClaimManager into the compute-domain-controller and migrated several functionalities that previously resided in the compute-domain-kubelet-plugin:

  • Labeling nodes to enable the startup of the IMEX DaemonSet Pods
  • Assertion of the ComputeDomain and ResourceClaim namespaces
  • Monitoring the Ready status of the ComputeDomain

Test

We have already completed testing for all cases except failure‑1 in the above flow.

We will also push the test scripts later.

Fixes #653

@copy-pr-bot

copy-pr-bot bot commented Feb 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@klueska klueska modified the milestones: Backlog, v26.4.0 Feb 8, 2026
@klueska klueska added the feature issue/PR that proposes a new feature or functionality label Feb 8, 2026
@klueska klueska self-assigned this Feb 8, 2026
@klueska
Collaborator

klueska commented Feb 8, 2026

First high level comment before I dig into the details -- this needs a feature gate, so that users can opt-in to using this feature only on k8s clusters that also have the binding conditions feature enabled.

@klueska
Collaborator

klueska commented Feb 8, 2026

Another high-level comment after reading the PR description but before looking into the details.

As mentioned in this comment, the recommended value for NumNodes is 0 since the 2.5.8.0 release.

This essentially makes the overall ComputeDomain Ready status meaningless. If this PR is relying on the overall ComputeDomain Ready status to set a binding condition for all workload pods, then this needs to be changed, since it will always be flipped to Ready immediately when NumNodes=0.

Instead, one can/should rely on the Ready status of each individual node in the ComputeDomain status to set the binding condition for the workload pod running on that node.

This may already be what you are doing, but from the PR description this isn't very clear.

Comment on lines +92 to +98
m.mutationCache = cache.NewIntegerResourceVersionMutationCache(
klog.Background(),
m.informer.GetStore(),
m.informer.GetIndexer(),
mutationCacheTTL,
true,
)
Collaborator

@klueska klueska Feb 8, 2026

We are never writing anything into these daemonset pods, so why do we need a mutation cache?

Author

As you already noted in the following comment, this was needed in order to execute getByComputeDomainUID.

}

func (m *DaemonSetPodManager) Get(ctx context.Context, cdUID string, nodeName string) (*corev1.Pod, error) {
pods, err := getByComputeDomainUID[*corev1.Pod](ctx, m.mutationCache, cdUID)
Collaborator

I see, the mutation cache is added so that you can do this. The getByComputeDomainUID should probably take an Indexer() as a parameter rather than a MutationCache to avoid this.

Author

I replaced the existing getByComputeDomainUID with a new getByComputeDomainUIDAndNode function that uses an indexer based on cdUID + nodeName, and updated the call sites to pass m.informer.GetIndexer() instead of the MutationCache.

Comment on lines +132 to +136
for _, p := range pods {
if p.Spec.NodeName == nodeName {
return p, nil
}
}
Collaborator

Instead of looping like this, can you add an indexer that combines cdUID+nodeName (rather than just cdUID), since this is the combo you always want to look up with?

Author

I added a new indexer named "computeDomainNode" that uses "<cdUID>/<nodeName>" as its key. To support this, I defined the addComputeDomainNodePodIndexer and getByComputeDomainUIDAndNode functions in indexers.go.
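The combined-key lookup being described can be sketched as follows. This is a toy in-memory index standing in for the client-go Indexer the real code uses; only the "<cdUID>/<nodeName>" key scheme is taken from the comment:

```go
package main

import "fmt"

// computeDomainNodeKey builds the "<cdUID>/<nodeName>" key used by the
// "computeDomainNode" indexer described above.
func computeDomainNodeKey(cdUID, nodeName string) string {
	return fmt.Sprintf("%s/%s", cdUID, nodeName)
}

// podIndex is a toy map-backed index; the real implementation would
// register an IndexFunc on the informer and call Indexer.ByIndex.
type podIndex map[string][]string // key -> pod names

func (idx podIndex) add(cdUID, nodeName, podName string) {
	k := computeDomainNodeKey(cdUID, nodeName)
	idx[k] = append(idx[k], podName)
}

// byComputeDomainUIDAndNode replaces the per-pod loop with a single
// keyed lookup.
func (idx podIndex) byComputeDomainUIDAndNode(cdUID, nodeName string) []string {
	return idx[computeDomainNodeKey(cdUID, nodeName)]
}

func main() {
	idx := podIndex{}
	idx.add("uid-123", "node-a", "imex-daemon-xyz")
	fmt.Println(idx.byComputeDomainUIDAndNode("uid-123", "node-a")) // [imex-daemon-xyz]
}
```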

Comment on lines +131 to +142
var candidates []corev1.Node
if len(nodeNames) == 0 {
candidates = nodes.Items
} else {
for _, nodeName := range nodeNames {
for _, node := range nodes.Items {
if node.Name == nodeName {
candidates = append(candidates, node)
}
}
}
}
Collaborator

When do you ever want to remove the label from some nodes but not all of them for a given computeDomain?

Author

You would want to remove the label from only some nodes when the IMEX Daemon Pod processing on a specific node fails or times out. In such cases, removing the label helps clean up the node where the failure occurred. After that, by allowing the workload pod to be re‑scheduled through mechanisms like BindingFailureConditions, the IMEX Daemon Pod processing can be retried on another node.

Comment on lines -296 to -310
func (m *ComputeDomainManager) AssertComputeDomainNamespace(ctx context.Context, claimNamespace, cdUID string) error {
cd, err := m.GetComputeDomain(ctx, cdUID)
if err != nil {
return fmt.Errorf("error getting ComputeDomain: %w", err)
}
if cd == nil {
return fmt.Errorf("ComputeDomain not found: %s", cdUID)
}

if cd.Namespace != claimNamespace {
return fmt.Errorf("the ResourceClaim's namespace is different than the ComputeDomain's namespace")
}

return nil
}
Collaborator

We can't just remove these functions. The default should still be to do things the "old" way. Your new feature needs to be put behind a feature gate and only used when that feature gate is enabled.

Author

I have restored these functions.

if err := s.computeDomainManager.AddNodeLabel(ctx, config.DomainID); err != nil {
return nil, fmt.Errorf("error adding Node label for ComputeDomain: %w", err)
}

Collaborator

Likewise here.

Author

I’ve updated these functions to be skipped when the ComputeDomainBindingConditions feature gate is enabled. By default, they still run.

Comment on lines +62 to +63
BindingConditions: []string{nvapi.ComputeDomainBindingConditions},
BindingFailureConditions: []string{nvapi.ComputeDomainBindingFailureConditions},
Collaborator

When were these fields added to the API? Do we need a version check here to ensure that these fields are available? In any case, setting them should (at a minimum) be protected by your new feature gate.

Author

These fields have been added since v1.34 across resource/v1beta1, v1beta2, and v1.
If we want to perform version checking, would it be better to use ServerVersion() and ensure that the API server is at least v1.34.0?
For now, I’ve updated the implementation so that these fields are added only when the ComputeDomainBindingConditions feature gate is enabled.
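The version check being discussed could look roughly like this. A real implementation would compare the discovery client's `ServerVersion()` result using `k8s.io/apimachinery/pkg/util/version`; this is a minimal stand-in for the comparison itself (note that some servers report suffixed minors like "34+", which this sketch does not handle):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeastV134 reports whether a server GitVersion such as "v1.34.2" is at
// least v1.34, the release the author says added the BindingConditions
// fields to resource/v1beta1, v1beta2, and v1.
func atLeastV134(gitVersion string) bool {
	v := strings.TrimPrefix(gitVersion, "v")
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return false
	}
	major, err1 := strconv.Atoi(parts[0])
	minor, err2 := strconv.Atoi(parts[1])
	if err1 != nil || err2 != nil {
		return false
	}
	return major > 1 || (major == 1 && minor >= 34)
}

func main() {
	fmt.Println(atLeastV134("v1.34.0")) // true
	fmt.Println(atLeastV134("v1.33.5")) // false
}
```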

Comment on lines +29 to +31
- apiGroups: [""]
resources: ["events"]
verbs: ["get", "list", "watch"]
Collaborator

Why is this necessary?

Author

@ttsuuubasa ttsuuubasa Feb 9, 2026

It is needed because the ResourceClaimManager watches Pod events to determine whether a binding timeout has occurred during scheduling. Since the binding timeout is triggered by the scheduler, it is not recorded in the ResourceClaim. Instead, it appears in the Pod events with Event.reason = "SchedulerError" and Event.message including "binding timeout" text. To detect this, we grant list permissions on Events.
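The event check described here can be sketched as a small predicate. The reason and message strings are taken from the comment above, not independently verified against the scheduler source:

```go
package main

import (
	"fmt"
	"strings"
)

// isBindingTimeoutEvent mirrors the detection described above: the
// scheduler records a binding timeout as a Pod event with
// reason "SchedulerError" and a message containing "binding timeout".
func isBindingTimeoutEvent(reason, message string) bool {
	return reason == "SchedulerError" && strings.Contains(message, "binding timeout")
}

func main() {
	fmt.Println(isBindingTimeoutEvent("SchedulerError", "running Bind plugin: binding timeout")) // true
}
```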

Collaborator

How long until a binding timeout occurs?

Comment on lines +104 to +105
m.nodeManager = NewNodeManager(config, getComputeDomain)
m.daemonSetPodManager = NewDaemonSetPodManager(config)
Collaborator

We should not be instantiating another set of instances for these components. if we need them to do something in response to events happening in this component, then some other method must be used to plumb the appropriate functions between the already-created instances of these components into this one (or vice versa).

Author

@ttsuuubasa ttsuuubasa Feb 9, 2026

I was thinking of resolving this by passing references to m.nodeManager and m.daemonSetPodManager into the ResourceClaimManager.

Then I would implement the function (m *MultiNamespaceDaemonSetManager) GetDaemonSetPod(), which would invoke (m *DaemonSetPodManager) Get in a nested, hierarchical manner.

However, when calling (m *MultiNamespaceDaemonSetManager) GetDaemonSetPod() from the ResourceClaimManager, how should the namespace be specified?

Also, if a DaemonSet exists across multiple namespaces, will the IMEX Daemon Pod be started multiple times?
I'm not entirely sure how this behavior works.

Comment on lines +84 to +96
informer := cache.NewSharedIndexInformer(
&cache.ListWatch{
ListWithContextFunc: func(ctx context.Context, options metav1.ListOptions) (runtime.Object, error) {
return config.clientsets.Resource.ResourceClaims("").List(ctx, options)
},
WatchFuncWithContext: func(ctx context.Context, options metav1.ListOptions) (watch.Interface, error) {
return config.clientsets.Resource.ResourceClaims("").Watch(ctx, options)
},
},
&resourcev1.ResourceClaim{},
informerResyncPeriod,
cache.Indexers{},
)
Collaborator

I'm not a huge fan of this component having to watch all ResourceClaims just to look for those that have requests for ComputeDomainChannels. Is there a better way to filter these at the informer level?

/cc @pohly

Collaborator

@klueska klueska Feb 8, 2026

Actually, thinking about this more -- is there a way to have each kubelet plugin process the binding conditions for the claim destined for its node (instead of doing this in the centralized controller)?

I don't want each kubelet plugin to have to watch all ResourceClaims in order to do this, but if there is a way for the kubelet plugin to know that a given claim is bound to a pod that is destined for its node (and waiting for a binding condition), that would be ideal.

This way the centralized controller does not become a bottleneck for all workload pods to start. The work instead gets distributed to the local node where the workload will eventually start.

Author

I'm not a huge fan of this component having to watch all ResourceClaims just to look for those that have requests for ComputeDomainChannels. Is there a better way to filter these at the informer level?

I thought it would be ideal if the labels attached to the ResourceClaimTemplate were propagated to the ResourceClaim - just like with DaemonSet - but since that’s not the case, I couldn’t come up with a good approach.

Author

is there a way to have each kubelet plugin process the binding conditions for the claim destined for its node (instead of doing this in the centralized controller)?

As far as I understand, there is no such mechanism.
The reason is that the kubelet plugin operates after the Pod has been scheduled, whereas the BindingConditions are conditions that must be set before scheduling in order for the scheduler to make a decision.

The kubelet plugin also functions as the controller for ResourceSlices, but it does not watch ResourceClaims.

Therefore, they cannot be processed on the kubelet side.
To support this behavior, I believe the DRA framework itself would require significant changes.

Collaborator

I really don't like the idea of the controller being a bottleneck for releasing each workload pod after its corresponding compute-domain-daemon pod becomes ready.

We have some customers running compute domains across 1000+ nodes. With the current, distributed solution we are able to get all workload pods up and running in under 12s. With this new, centralized solution it will take over 3min (with the default QPS and burst settings) to update all 1000 resource claims with the binding condition.
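The rough arithmetic behind the 3min estimate, assuming the common client-go default of 5 QPS:

```
1000 ResourceClaim updates / 5 requests per second (default client QPS) = 200 s ≈ 3.3 min
```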

Given this, I'm not comfortable moving forward with this PR until we can find an elegant way of distributing these binding condition updates to each node.

Author

Even if we take this approach, wouldn’t the controller still need to filter through all ResourceClaims in order to figure out which ResourceClaim is the target and which node it is assigned to? Is my understanding correct?

Collaborator

Yes, but then its only using this information to update one object -- the ComputeDomain (and can have read-only access to ResourceClaims). This isn't a long-term solution because it still leaves the controller as a bottleneck for triggering workloads to be started, but it's closer to what we want in terms of the kubelet-plugins only requesting access to the one-and-only ResourceClaim they care about updating.

Author

Thank you for clarifying.
With this approach, I believe the kubelet plugin would need to add an updateFunc to its ComputeDomain informer in order to detect when the controller writes to the ComputeDomain. Is that correct?

Author

In addition, I’d like to discuss under what circumstances we should write BindingFailureConditions.
By writing BindingFailureConditions, we can trigger Pod rescheduling, which is useful in cases such as when processing in the IMEX DaemonSet Pod fails.

Initially, we were considering writing BindingFailureConditions when we detect that the ComputeDomain’s node status is "NotReady". However, in what situations does a node actually become NotReady?

The IMEX DaemonSet Pod checks the status of its own Pod via the PodManager, but if the Pod cannot start, wouldn’t it be unable to write NotReady in the first place? In other words, it seems there are only two possibilities:

  • Pod startup fails → nothing is written
  • Pod startup succeeds → Ready is written

Also, looking at the code, it appears that when the IMEX DaemonSet Pod starts up, the ComputeDomainStatusManager first writes NotReady, and then the PodManager writes Ready afterward. Because of this, I think it’s not possible to tell from NotReady alone whether the Pod is still in the process of becoming ready or has actually failed. Is my understanding of the specification correct?

For this reason, we were considering introducing checkDaemonSetPodStatus() to check the DaemonSet Pod’s status externally.

Given this, what kinds of cases do you think should be used as triggers for writing BindingFailureConditions?

Author

@klueska
I have pushed an implementation that moves most of the BindingConditions update logic to the kubelet plugin.
In the controller, the ResourceClaim manager filters ResourceClaims and writes the node name and ResourceClaim name into the ComputeDomain status.

In your proposal, the computedomain.status.nodes field was suggested as the appropriate place for this information. However, this field is synchronized by cdstatusmanager according to the startup state of the imex daemon pod, which means it cannot be written to before the imex daemon pod starts.
For this reason, I added a new field, computedomain.status.resourceclaim.

Additionally, for the NotReady issue, we found that computedomain.status.nodes.status is also updated by the controller’s cdstatusmanager. I introduced a new Failed status in addition to NotReady and Ready as a condition for setting BindingFailureConditions.

Based on these changes, I would appreciate another round of review.

}

// Checks if ResourceClaim is Eligible for Compute Domain Labeling and returns domainID
func (m *ResourceClaimManager) checkClaimEligibleForComputeDomainLabeling(rc *resourcev1.ResourceClaim) (bool, string) {
Collaborator

@klueska klueska Feb 8, 2026

This function should have more comments throughout it, explaining what it is doing.

Collaborator

Also, see my comment above about changing its name / what it returns.

Author

I’ve added the comments, and I’ve also updated the function name and its return value.

Comment on lines +30 to +32

ComputeDomainBindingConditions = "IMEXDaemonSettingsDone"
ComputeDomainBindingFailureConditions = "IMEXDaemonSettingsFailed"
Collaborator

These variable names / their values do not seems to match their intent.

Author

I updated it as follows.

	ComputeDomainBindingConditions        = "ComputeDomainReady"
	ComputeDomainBindingFailureConditions = "ComputeDomainNotReady"

Comment on lines +201 to +205
//Check namespace
err = m.AssertComputeDomainNamespace(rc.Namespace, domainID)
if err != nil {
return fmt.Errorf("failed to assert Namespace for computeDomain with domainID %s and ResourceClaim %s/%s: %w", domainID, rc.Namespace, rc.Name, err)
}
Collaborator

This should be done as soon as we know what the cdUID is.

Author

I modified it so that the AssertComputeDomainNamespace() function is executed immediately after obtaining the domainID.

Comment on lines +170 to +178
// Check the allocationResult of ResourceClaim to determine whether it should be monitored.
isEligible, domainID := m.checkClaimEligibleForComputeDomainLabeling(rc)
if !isEligible {
return nil
}

if domainID == "" {
return fmt.Errorf("matching ResourceClaim %s/%s has no domainID in allocation config", rc.Namespace, rc.Name)
}
Collaborator

@klueska klueska Feb 8, 2026

I'd rather see this as something like:

// Check the allocationResult of ResourceClaim to determine whether it should be monitored.
	req, err := m.getComputeDomainChannelRequestConfig(rc)
	if err != nil {
		return fmt.Errorf("error getting config for ComputeDomainChannel request from ResourceClaim %s/%s: %w", rc.Namespace, rc.Name, err)
	}

	if req == nil {
		return nil
	}

// use req.DomainID later on, where the type of `req` is a `ComputeDomainChannelConfig`.

Author

I updated the code as you pointed out.

Comment on lines +191 to +199
// get info for map keys
rcUID := string(rc.UID)
rcMonitors, err := m.loadResourceClaimMonitor(rcUID)
if err != nil {
return err
}

// Cancel a monitor that has already timed out.
m.cancelTimeoutedMonitor(ctx, rc, rcMonitors, currentAllocationTimestamp)
Collaborator

I haven't looked into the details of what these "monitors" are. Can you explain what you are trying to do here (and why it's needed).?

Author

@ttsuuubasa ttsuuubasa Feb 9, 2026

These monitors refer to the ComputeDomain status monitors executed by PollUntilContextCancel running in separate goroutines. Since multiple ResourceClaims can be created for a single ComputeDomain, I launch a dedicated goroutine for each one so that a single controller can monitor them collectively.

The reason the ResourceClaimManager holds a cancelable context is to support "binding timeouts". If the goroutine monitoring the ComputeDomain fails to detect Ready or NotReady, the BindingConditions will never be written. If this state persists, the scheduler will, by default, trigger a timeout after 10 minutes. Once the timeout occurs, the scheduler will attempt to reschedule the Pod, which may cause the ResourceClaim to be assigned to a different node.

In such cases, monitoring the ComputeDomain.status.nodes.status for the original node is no longer necessary, so the corresponding monitoring goroutine should be canceled. This is why the monitor includes cancellation handling.

ComputeDomainChannelAllocationModeAll = "All"

ComputeDomainBindingConditions = "IMEXDaemonSettingsDone"
ComputeDomainBindingFailureConditions = "IMEXDaemonSettingsFailed"
Collaborator

When does one enter the failed condition? Is it after some timeout? If so, what is the benefit to timing out rather than just waiting indefinitely?

Author

@ttsuuubasa ttsuuubasa Feb 9, 2026

It enters the failure condition in the two failure cases described in the PR.
The timeout behavior is part of the BindingConditions specification, but users can configure the scheduler to enable or disable this timeout. By default, a timeout occurs after 10 minutes.
Also, in cases where a timeout happens, the BindingFailureConditions are not written.

Comment on lines +273 to +274
// Start the polling loop to monitor the ComputeDomain node status.
err := wait.PollUntilContextCancel(monitorCtx, pollInterval, false, pollCondition)
Collaborator

We do not want to be blocking in the body of onAddOrUpdate. This will block the entire state machine of the ComputeDomain controller and no other incoming tasks will be processed until this function returns.

Author

Because PollUntilContextCancel is executed in a separate goroutine, it does not block the controller’s overall state machine.

Comment on lines +234 to +271
// Launch the monitoring goroutine
m.waitGroup.Add(1)
go func() {
defer m.waitGroup.Done()
defer close(doneChannel)
pollCondition := func(pollCtx context.Context) (bool, error) {
cd, err := m.getComputeDomain(domainID)
if err != nil {
return false, err
}

var foundNodesStatus bool
if cd.Status.Nodes != nil {
for _, node := range cd.Status.Nodes {
if node.Name == nodeName {
foundNodesStatus = true
// Check ComputeDomain node status
switch node.Status {
// Ready
case nvapi.ComputeDomainStatusReady:
return true, nil
// NotReady
case nvapi.ComputeDomainStatusNotReady:
return true, fmt.Errorf("%w: binding failed — IMEX daemon failed", ErrBindingFailure)
}
}
}
}
// If the ComputeDomain node status has not been written.
if !foundNodesStatus {
// Check if IMEX Daemon Pod was started correctly
err := m.checkDaemonSetPodStatus(pollCtx, domainID, nodeName)
if err != nil {
return true, err
}
}
return false, nil
}
Collaborator

We already have the daemonsetpodmanager reporting when each of its daemonset pods from a given compute domain move in/out of the ready state, can that not be leveraged here instead of starting our own polling loop?

Author

Sorry, but I couldn’t understand what you meant.
As far as I can see, the DaemonSetPodManager only implements the Start, Stop, and List functions.
What do you mean when you say it reports the ready state?

Comment on lines +421 to +422
Device: allocationDevice.Device,
Pool: allocationDevice.Pool,
Collaborator

Swap device / pool

Author

Done.

Collaborator

I only made it part of the way through this file (there is a lot to digest in here). Please address the existing comments, and then I'll come back to this file and look at it in its new state.

Author

I’ve addressed the existing comments. There were a few points where I wasn’t sure how to implement the requested changes or how to interpret the comments, so I’d appreciate it if you could review them again.

Comment on lines +706 to +715
for _, term := range rc.Status.Allocation.NodeSelector.NodeSelectorTerms {
for _, field := range term.MatchFields {
if field.Key == "metadata.name" &&
field.Operator == "In" &&
len(field.Values) > 0 &&
field.Values[0] != "" {
return field.Values[0], true
}
}
}
Collaborator

Is this guaranteed to be the format / way the node selector will be set for binding conditions?

Author

As I understand it, regardless of whether BindingConditions are present, the DRA specification guarantees this format whenever a device in a ResourceSlice specifies a nodeName (as the NVIDIA DRA driver does), according to the code below. Is my understanding correct? -> @pohly

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/dynamic-resource-allocation/structured/internal/stable/allocator_stable.go#L1400-L1407
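For reference, the per-node case in the linked allocator code produces a selector of exactly the shape the extraction loop above assumes. Here is a minimal, self-contained sketch using simplified stand-in types (not the real k8s.io/api ones) of both the construction the allocator performs and the extraction loop quoted in the diff:

```go
package main

import "fmt"

// Simplified stand-ins for the core/v1 NodeSelector types.
type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

type NodeSelectorTerm struct {
	MatchFields []NodeSelectorRequirement
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm
}

// pinnedSelector mirrors what the stable allocator emits when an
// allocated device's ResourceSlice pins it to a single node.
func pinnedSelector(nodeName string) *NodeSelector {
	return &NodeSelector{
		NodeSelectorTerms: []NodeSelectorTerm{{
			MatchFields: []NodeSelectorRequirement{{
				Key:      "metadata.name",
				Operator: "In",
				Values:   []string{nodeName},
			}},
		}},
	}
}

// nodeNameFromSelector is the extraction loop from the diff, applied
// to the stand-in types.
func nodeNameFromSelector(sel *NodeSelector) (string, bool) {
	for _, term := range sel.NodeSelectorTerms {
		for _, field := range term.MatchFields {
			if field.Key == "metadata.name" &&
				field.Operator == "In" &&
				len(field.Values) > 0 &&
				field.Values[0] != "" {
				return field.Values[0], true
			}
		}
	}
	return "", false
}

func main() {
	name, ok := nodeNameFromSelector(pinnedSelector("gpu-node-17"))
	fmt.Println(name, ok) // gpu-node-17 true
}
```

The round-trip only demonstrates internal consistency with the allocator snippet linked above; it is not a guarantee for selectors produced by other allocators.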

@ttsuuubasa
Author

First high level comment before I dig into the details -- this needs a feature gate, so that users can opt-in to using this feature only on k8s clusters that also have the binding conditions feature enabled.

I introduced a feature gate called ComputeDomainBindingConditions, and the BindingConditions logic is executed only when this gate is enabled. Specifically, the assignment of BindingConditions in ResourceSlice and the Start/Stop operations of the ResourceClaimManager will not run without this gate.

@ttsuuubasa
Author

ttsuuubasa commented Feb 9, 2026

Instead, one can/should rely on the Ready status of each individual node in the ComputeDomain status to set the binding condition for the workload pod running on that node.

This behavior was already implemented, but I had omitted it from the PR description for the sake of simplicity. I have now added an explanation that the ResourceClaimManager watches the field ComputeDomain.status.nodes[*].status. From this point forward, please interpret the term “ComputeDomain status” as referring specifically to this field.

The ComputeDomain Controller monitors the status of the ComputeDomain resource and checks the processing status of the IMEX DaemonSet pod per node — the field is ComputeDomain.status.nodes[*].status.

@ttsuuubasa
Author

@klueska cc: @pohly
Toward introducing BindingConditions into ComputeDomain, I would like to summarize the overall approach.
Is it reasonable to think about the implementation strategy as having both short-term and long-term phases as follows?

Short-term

  • Implement a ResourceClaim informer "on the ComputeDomain controller", retrieve all ResourceClaims, and perform filtering within the controller.
  • Record the ResourceClaim name/namespace associated with each node in ComputeDomain, and use this to notify each node’s kubelet plugin which ResourceClaims it should process.
  • Based on this information, the kubelet plugin retrieves only the targeted ResourceClaims and performs labeling.

Long-term

  • Implement a ResourceClaim informer "on the kubelet plugin side", and use filtering functionality to retrieve only the ResourceClaims assigned to the local node.
  • Based on this information, the kubelet plugin retrieves only the targeted ResourceClaims and performs labeling.

For the part about "using filtering functionality to retrieve only the ResourceClaims assigned to the local node",
would it be reasonable to first evaluate its feasibility and, if it looks viable, pursue the long-term approach by proposing a KEP or similar?
And for BindingConditions feedback, is it acceptable to evaluate based on the short-term implementation?
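To make the short-term plan concrete, here is a minimal sketch of the lookup the kubelet plugin would perform. The field names are illustrative only, not the actual ComputeDomain API; it assumes the controller has already recorded, per node, which ResourceClaims that node must process:

```go
package main

import "fmt"

// ClaimRef identifies a ResourceClaim by namespace and name.
// These types are hypothetical stand-ins for fields the controller
// would record in the ComputeDomain status under the short-term plan.
type ClaimRef struct {
	Namespace string
	Name      string
}

type NodeStatus struct {
	NodeName string
	Claims   []ClaimRef
}

type ComputeDomainStatus struct {
	Nodes []NodeStatus
}

// claimsForNode is what a kubelet plugin would call to learn which
// ResourceClaims to fetch and label, instead of listing all claims
// in the cluster.
func claimsForNode(st ComputeDomainStatus, nodeName string) []ClaimRef {
	for _, n := range st.Nodes {
		if n.NodeName == nodeName {
			return n.Claims
		}
	}
	return nil
}

func main() {
	st := ComputeDomainStatus{Nodes: []NodeStatus{
		{NodeName: "node-a", Claims: []ClaimRef{{"team-1", "claim-0"}}},
		{NodeName: "node-b", Claims: []ClaimRef{{"team-2", "claim-1"}}},
	}}
	fmt.Println(claimsForNode(st, "node-b")) // [{team-2 claim-1}]
}
```

Under the long-term plan, this lookup would disappear: a field selector on the ResourceClaim informer would hand each node only its own claims directly.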

@ttsuuubasa force-pushed the binding-conditions-in-compute-domain branch from bfecd6c to 3b78f15 on February 19, 2026 at 14:38
@ttsuuubasa
Author

I have two additional questions.

First, currently the controller’s ComputeDomainStatusManager is responsible for watching the IMEX DaemonSet Pod status and updating ComputeDomain.status.nodes.status to Ready/NotReady. However, considering your concern that controller-side processing could become a performance bottleneck, if we were to move ResourceClaim updates to the kubelet plugin, wouldn’t it be more architecturally appropriate to also move the monitoring of the IMEX DaemonSet Pods into the kubelet plugin? In that case, both ResourceClaim updates and ComputeDomain updates would be handled by the kubelet plugin. This approach could address the performance concerns you mentioned and might also result in a simpler overall architecture.

Second, regarding the implementation of BindingConditions in ComputeDomain: you suggested adding fields to the ComputeDomain status to store the ResourceClaim name and namespace. However, if a fieldSelector is eventually implemented for the ResourceClaim informer, those fields may no longer be necessary. I am concerned that this could lead to ComputeDomain API fields becoming deprecated. Do you think this would be an issue?

@hase1128

I want to generalize the use of BindingConditions to reach agreement on the design for the ComputeDomain case.

[Required functionality when implementing BindingConditions in vendor DRA drivers]
The following four functions are required:

  1. Functionality to expose resources with BindingConditions as ResourceSlices
  2. Functionality to detect ResourceClaims where resources with BindingConditions are allocated
  3. Functionality to perform processing to satisfy BindingConditions
  4. Functionality to write the result (success or failure) of whether BindingConditions was satisfied

[Regarding Functionality 2]

  • More specifically, functionality is needed to detect only the required ResourceClaims.
  • Implementing a field selector in the ResourceClaim informer would be extremely helpful for vendors implementing BindingConditions.
  • If this is not available, a method like writing ResourceClaim information to the ComputeDomain resource is necessary (this should be acceptable if there are no issues with performance/scalability).
  • Since the current focus is on gathering developers' feedback for the BindingConditions beta promotion, such as whether additional APIs are needed, I don't believe the field selector needs to be implemented immediately.

[Where should the four features be implemented? (ControlPlane or each Node)]
1. Functionality to expose resources with BindingConditions as ResourceSlices
This depends on the vendor. For example:

  • If exposing resources per node: Exposed by the kubelet plugin on each node (as in the ComputeDomain case)
  • If exposing cluster-wide resources: Exposed by a controller on the Control Plane (as in our ComposableDisaggregatedInfrastructure case)

2. Functionality to detect ResourceClaims where resources with BindingConditions are allocated

  • Ideally, this should reside in the same location as the functionality described in 3 below.
  • However, if the ResourceClaim Informer lacks a field selector, it is preferable to implement this on the Control Plane.

3. Functionality to perform processing to satisfy BindingConditions

  • This depends on the vendor.

4. Functionality to write the result (success or failure) of whether BindingConditions was satisfied

  • Ideally, this should be implemented in the same location as the functionality described in 3 above.
  • If 3 should primarily be processed on each node, a kubelet plugin on each node would be preferable.
  • Alternatively, if processing spans multiple nodes, the Control Plane would be preferable (because each node alone cannot determine whether the BindingConditions are satisfied).

Based on this approach, the ComputeDomain case would be as follows:
1: Kubelet plugin on each Node
2: Control Plane (because we have not implemented a field selector in ResourceClaim informer)
3: Kubelet plugin on each Node
4: Kubelet plugin on each Node
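Functionality 2 above reduces to a small predicate over allocated devices. Here is a sketch using simplified stand-in types for the relevant resource.k8s.io allocation fields (not the real API types), following the detection criteria from the PR description: the driver is "compute-domain.nvidia.com" and the allocated device carries BindingConditions. The condition name used in the demo is illustrative only:

```go
package main

import "fmt"

// AllocatedDevice is a simplified stand-in for a device result in a
// ResourceClaim's allocation status.
type AllocatedDevice struct {
	Driver                   string
	Pool                     string
	Device                   string
	BindingConditions        []string
	BindingFailureConditions []string
}

type Claim struct {
	Name    string
	Devices []AllocatedDevice
}

const driverName = "compute-domain.nvidia.com"

// needsBindingHandling implements functionality 2: detect claims that
// were allocated a device of our driver with BindingConditions
// attached, i.e. claims this driver must act on before binding.
func needsBindingHandling(c Claim) bool {
	for _, d := range c.Devices {
		if d.Driver == driverName && len(d.BindingConditions) > 0 {
			return true
		}
	}
	return false
}

func main() {
	claims := []Claim{
		{Name: "plain-gpu", Devices: []AllocatedDevice{{Driver: "gpu.nvidia.com"}}},
		{Name: "channel", Devices: []AllocatedDevice{{
			Driver: driverName,
			// Illustrative condition name, not the driver's actual one.
			BindingConditions: []string{"example.com/imex-daemon-ready"},
		}}},
	}
	for _, c := range claims {
		fmt.Println(c.Name, needsBindingHandling(c))
	}
}
```

Wherever the predicate runs (Control Plane today, kubelet plugin once a field selector exists), its logic is location-independent; only the set of claims it is fed changes.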

Added a feature gate for ComputeDomainBindingConditions.
When this feature gate is enabled, the following functionality is activated.

  - Publish BindingConditions on channel devices in ComputeDomain ResourceSlices
  - Record assigned node and ResourceClaim name/namespace into ComputeDomain via ResourceClaimManager
  - Update ResourceClaims when ComputeDomain nodes become Ready
  - Move namespace assertion logic from kubelet-plugin to controller

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
@ttsuuubasa force-pushed the binding-conditions-in-compute-domain branch from 3b78f15 to 0444278 on March 4, 2026 at 14:35
@ttsuuubasa
Author

@klueska @shivamerla
Thank you for your feedback during yesterday’s meeting!!
As discussed, would it be possible to evaluate how the performance changes with our implementation?

With your existing distributed solution, you can bring up all workload pods in under 12 seconds when a ComputeDomain spans 1,000+ nodes. We would like to see how this result differs with our approach.

Previously, you mentioned that updating all 1,000 resource claims with binding conditions "will take" more than 3 minutes.
Was that estimate based on actually running our code on real machines?

At least since then, we have moved the binding condition update logic to the kubelet plugin, so we expect to see some performance improvements.

Labels

feature issue/PR that proposes a new feature or functionality

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

Reimplement release of workload pods in ComputeDomain in terms of BindingConditions

4 participants