KEP-5007: DRA Device Binding Conditions beta in 1.36 #5846

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from
ttsuuubasa:dra-device-binding-conditions
Feb 10, 2026

Conversation

@ttsuuubasa
Contributor

  • One-line PR description: updating KEP docs for promotion to beta
  • Other comments:
    This PR promotes the DRA Device Binding Conditions feature from alpha to beta for Kubernetes v1.36, with enhancements based on DRA driver developer's feedback and metrics.

    1. Stage Promotion to Beta

    • Updated stage: alpha → stage: beta in kep.yaml
    • Updated latest-milestone: "v1.35" → "v1.36"

    2. Enhanced DRA Driver Developer's Feedback

    • CoHDI: Added comprehensive feedback from CoHDI (Composable Hardware Device Infrastructure) testing, including scenarios for device pool changes and external controller bug identification
    • NVIDIA DRA Driver: Added feedback from NVIDIA's k8s-dra-driver-gpu showcasing ComputeDomain support with Multi-Node NVLink and IMEX technology

    3. Improved Monitoring & Observability

    • New Metrics: Introduced two new metrics for better operational visibility:
      • scheduler_dra_bindingconditions_allocations_total:
        tracks scheduling attempts with success/failure/timeout status
      • scheduler_dra_bindingconditions_prebind_duration_seconds:
        measures PreBind phase duration with detailed labels
    • Enhanced Detection: Replaced event log-based monitoring with metric-based detection for better automation
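As a rough illustration of the metric-based detection mentioned above, the following sketch parses a Prometheus text-exposition scrape and computes the share of allocations that did not succeed. It is only a sketch under stated assumptions: the label name `status` and its values `success`, `failure`, and `timeout` are inferred from the description and are not confirmed by this PR, and the helper names are hypothetical.

```python
import re

# One sample line of the Prometheus text exposition format, e.g.
#   metric_name{label="value"} 12
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Metric name taken from the PR description.
METRIC = "scheduler_dra_bindingconditions_allocations_total"


def allocations_by_status(exposition: str) -> dict[str, float]:
    """Sum the counter per value of the (assumed) `status` label."""
    totals: dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if not match or match.group("name") != METRIC:
            continue  # unrelated metric or unsupported line shape
        labels = dict(LABEL.findall(match.group("labels")))
        status = labels.get("status", "unknown")  # assumption: label is "status"
        totals[status] = totals.get(status, 0.0) + float(match.group("value"))
    return totals


def non_success_ratio(totals: dict[str, float]) -> float:
    """Fraction of allocations whose status is anything other than success."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return (total - totals.get("success", 0.0)) / total
```

An automation job could alert when `non_success_ratio` stays above a chosen threshold, instead of scanning scheduler event logs.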

    4. Clarified Feature Scope

    • Added explicit non-goal: device pool migration as happy-path flow (deferred to separate KEP)

    NOTE:
    I addressed comments and suggestions from @johnbelamaric during the v1.35 review cycle:
    KEP-5007: DRA Device Binding Conditions alpha in 1.35 #5487

/wg device-management
/sig scheduling
/cc @pohly @johnbelamaric @dom4ha

@k8s-ci-robot k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jan 28, 2026
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 28, 2026
@k8s-ci-robot
Contributor

Hi @ttsuuubasa. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 28, 2026
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jan 28, 2026
Contributor

@kannon92 kannon92 left a comment

Please make sure to request a PRR review for the beta promotion.

@github-project-automation github-project-automation bot moved this to Needs Review in SIG Scheduling Jan 29, 2026
@kannon92
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 29, 2026
@ttsuuubasa ttsuuubasa force-pushed the dra-device-binding-conditions branch from cbc56b6 to 55e918d Compare January 29, 2026 06:54
@ttsuuubasa
Contributor Author

@kannon92
I received ack from @johnbelamaric to be the PRR review approver, and updated prod-readiness/sig-scheduling/5007.yaml accordingly.
Please let me know if this is the correct procedure.

- Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down
- Additional tests are in Testgrid and linked in KEP
- Scheduler supports timeout configuration via command-line argument
- In this use case, the attachment scenario for moving devices between different pools is achieved through re-scheduling triggered by BindingFailureConditions. However, there remains an issue that device migration needs to be implemented using BindingConditions as a happy‑path flow. This will be addressed in a separate KEP and will be considered out of scope for the beta-graduation criteria.
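The re-scheduling flow described in the last bullet could be sketched roughly as below. This is a simplified model, not the scheduler's implementation: the function name, the order of the checks, and the condition names used in the example are assumptions made for illustration. Only the terms BindingConditions, BindingFailureConditions, and the binding timeout come from the discussion.

```python
from enum import Enum


class Decision(Enum):
    BIND = "bind"              # all binding conditions are met
    RESCHEDULE = "reschedule"  # a failure condition is set, or the wait timed out
    WAIT = "wait"              # keep waiting in PreBind


def evaluate(conditions, binding_conditions, binding_failure_conditions,
             elapsed_s, timeout_s):
    """Decide what to do with a pod that is waiting on device conditions.

    conditions maps a condition type to True/False as reported for the
    allocated device; missing types are treated as not yet True.
    """
    # Assumption: a failure condition takes precedence over everything else.
    if any(conditions.get(t, False) for t in binding_failure_conditions):
        return Decision.RESCHEDULE
    if all(conditions.get(t, False) for t in binding_conditions):
        return Decision.BIND
    if elapsed_s >= timeout_s:
        return Decision.RESCHEDULE
    return Decision.WAIT


# Hypothetical condition names, for illustration only.
decision = evaluate(
    conditions={"DeviceAttached": False},
    binding_conditions=["DeviceAttached"],
    binding_failure_conditions=["DeviceAttachFailed"],
    elapsed_s=30,
    timeout_s=600,
)
```

In this model, moving a device between pools only ever succeeds through the `RESCHEDULE` branch, which is why the discussion treats a happy-path migration flow as separate work.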
Member

+1 for decoupling Binding Conditions from the attachment, but I'm a bit skeptical whether the problem can be fixed easily (see discussion in kubernetes/kubernetes#135473 (comment)), so the question is whether the happy-path is really good enough and proved working?

@wojtek-t @sanposhiho @macsko WDYT?

Contributor

Personally, I consider having at least one user of the happy path at the prototype stage (= PR fully implemented and reviewed, but maybe not merged because of release timing) sufficient for beta. But we should have one.

Contributor

With "happy path" I meant the one we have right now, i.e. without updating the allocation.

Contributor

The happy path is planned to be implemented in NVIDIA's ComputeDomain case.
My team members are currently working on the implementation. We plan to submit a pull request to NVIDIA's DRA GitHub within the next two or three days.

Contributor

I am on vacation through the end of this week. If you have a PR ready this week I will make a point to review it on Monday, Tuesday, and Wednesday before the feature freeze. Please be ready to respond to review comments daily so we can get it in good shape by your EOD on Wednesday.

Contributor

Thank you very much!
All members of my team (implementation team) will ensure they can address any review comments you provide promptly.

Contributor

@klueska
We created the PR. Please review it.

Contributor

Ack. Will start looking on Monday.

@ttsuuubasa ttsuuubasa mentioned this pull request Feb 3, 2026
20 tasks
@ttsuuubasa
Contributor Author

@johnbelamaric
We have agreement from @pohly and @dom4ha that the happy path for binding conditions specifically in the device‑attachment scenario is out of scope for this KEP, and that the feature can still proceed to beta graduation.
Given that, I’d like to hear your opinion as well.

Member

@dom4ha dom4ha left a comment

Looks good to me; waiting for @johnbelamaric before I approve.

- "@macsko"
- "@sanposhiho"
approvers:
- "@alculquicondor"
Member

You can put me as approver.

Contributor Author

@dom4ha
Thank you for the review.
I’ve added you as an approver and pushed the latest changes.

Member

@johnbelamaric johnbelamaric left a comment

/lgtm

Also approved for PRR. I will add the Prow command after SIG approval.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2026
- Updated the Production Readiness Review questionnaire
  and introduced metrics for troubleshooting and operations.
- Addressed review comments from the v1.35 PR kubernetes#5487.
- Added Graduation Criteria for beta.
- Clarify that happy-path device migration is out of scope for beta criteria

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
@ttsuuubasa ttsuuubasa force-pushed the dra-device-binding-conditions branch from 55e918d to a89021d Compare February 10, 2026 04:30
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2026
@ttsuuubasa
Contributor Author

ttsuuubasa commented Feb 10, 2026

@johnbelamaric
Thank you for the review.
I pushed an update to add the approver, so could you please run the LGTM command again?
Just to confirm, does “SIG approval” mean an LGTM from @dom4ha?

@dom4ha
Member

dom4ha commented Feb 10, 2026

/approve

@johnbelamaric
Member

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dom4ha, johnbelamaric, ttsuuubasa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2026
@k8s-ci-robot k8s-ci-robot merged commit 834293c into kubernetes:master Feb 10, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 10, 2026
@github-project-automation github-project-automation bot moved this from Needs Review to Done in SIG Scheduling Feb 10, 2026
@ttsuuubasa
Contributor Author

@johnbelamaric @dom4ha
Thank you for the approval.

@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Feb 16, 2026
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status: ✅ Done
Status: Done

8 participants