feat(trainer): Support NPU labels in TrainJob device #336

Open
sujalshah-bit wants to merge 1 commit into kubeflow:main from sujalshah-bit:support_npu_label

@sujalshah-bit

What this PR does / why we need it:
Add support for NPU resource labels in container resource limits,
resolving the existing TODO to support additional accelerator types
(e.g., NPUs).

  • Detect NPU resource keys using strict /npu suffix matching
  • Extract and return the corresponding resource value
  • Fail fast when multiple NPU resource types are present
  • Maintain consistency with existing GPU/MIG validation behavior
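
The behavior listed above can be sketched roughly as follows. This is a minimal, standalone illustration, not the code from this PR: the function name `find_npu_device`, the constant name `NPU_SUFFIX`, the `example.com/npu` resource key, and the choice to derive the device name from the part after the slash (mirroring the MIG branch) are all assumptions.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical name; the PR itself defines NPU_LABEL = "/npu".
NPU_SUFFIX = "/npu"


def find_npu_device(limits: Mapping[str, object]) -> Optional[Tuple[str, str]]:
    """Return (device, count) for the single NPU key in limits, or None."""
    # Strict suffix match: "example.com/npu" matches, "example.com/npu-mem" does not.
    npu_keys = [key for key in limits if key.endswith(NPU_SUFFIX)]
    if not npu_keys:
        return None
    # Fail fast instead of silently picking one of several NPU resource types.
    if len(npu_keys) > 1:
        raise ValueError(f"Multiple NPU resource types are not supported: {npu_keys}")
    npu_key = npu_keys[0]
    # Assumption: the device name is the part after the slash, as in the MIG branch.
    device = npu_key.split("/")[1]
    return device, str(limits[npu_key])
```

With a limits mapping such as `{"cpu": "4", "example.com/npu": 2}`, this sketch returns the device name and the count as a string; a mapping with two different `*/npu` keys raises `ValueError`.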

Fixes NONE

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings February 27, 2026 19:45
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment

Pull request overview

Adds NPU extended-resource detection to the Kubernetes backend so TrainJob device metadata can be derived from container resource limits when an NPU resource key is present.

Changes:

  • Introduces an NPU-related resource constant in constants.py.
  • Extends get_container_devices() to detect */npu resource keys and fail fast on multiple NPU types.
  • Adds unit tests covering single-NPU and multiple-NPU-key scenarios.
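
The single-NPU and multiple-NPU-key scenarios could be exercised with a table-driven check along these lines. This is only a sketch: `npu_count` is a hypothetical stand-in for the NPU branch of get_container_devices(), and the `Case` dataclass, the resource keys, and the expected values are illustrative assumptions, not the tests added in this PR.

```python
from dataclasses import dataclass
from typing import Optional


def npu_count(limits: dict) -> Optional[str]:
    """Hypothetical stand-in for the NPU branch of the device extraction."""
    keys = [k for k in limits if k.endswith("/npu")]
    if len(keys) > 1:
        raise ValueError(f"Multiple NPU resource types are not supported: {keys}")
    return str(limits[keys[0]]) if keys else None


@dataclass
class Case:
    name: str
    limits: dict
    expected: Optional[str] = None
    expected_error: Optional[type] = None


CASES = [
    Case(name="single NPU key", limits={"example.com/npu": 4}, expected="4"),
    Case(name="no NPU key", limits={"cpu": "2"}, expected=None),
    Case(
        name="multiple NPU keys",
        limits={"a.example/npu": 1, "b.example/npu": 2},
        expected_error=ValueError,
    ),
]


def run_case(case: Case) -> bool:
    """Return True when the outcome matches the case's expectation."""
    try:
        result = npu_count(case.limits)
    except Exception as exc:
        # An exception passes only if the case expected that error type.
        return case.expected_error is not None and isinstance(exc, case.expected_error)
    return case.expected_error is None and result == case.expected
```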

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description
kubeflow/trainer/constants/constants.py | Adds an NPU-related constant for resource identification.
kubeflow/trainer/backends/kubernetes/utils.py | Adds NPU resource key detection and validation in device extraction.
kubeflow/trainer/backends/kubernetes/utils_test.py | Adds test cases validating NPU detection and multi-key failure behavior.

Comment on lines 107 to 109
# The label for NPU in the container resources.
NPU_LABEL = "/npu"

Copilot AI Feb 27, 2026

NPU_LABEL is defined as the suffix string "/npu" (not a full resource label like the other *_LABEL constants) and is currently unused, which is likely to confuse future callers. Rename it to something like NPU_RESOURCE_SUFFIX or NPU_SUFFIX (or document it explicitly as a suffix) and update usages accordingly.

Copilot generated this review using guidance from repository custom instructions.
mig_key = mig_keys[0]
device = mig_key.split("/")[1]
device_count = resources.limits[mig_key].actual_instance
elif npu_keys := [k for k in resources.limits if k.endswith("/npu")]:
Copilot AI Feb 27, 2026

The NPU detection uses the hard-coded string literal "/npu" even though an NPU constant was added in constants. Use the constant here (and/or rename it to a suffix constant) so the matching rule is defined in one place and can't drift.

Suggested change
- elif npu_keys := [k for k in resources.limits if k.endswith("/npu")]:
+ elif npu_keys := [k for k in resources.limits if k.endswith(constants.NPU_SUFFIX)]:
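
The naming concern behind this suggestion can be illustrated with a small sketch. It assumes the existing `*_LABEL` constants hold full resource keys that are compared for equality (the `GPU_LABEL` value shown is an assumption), while the NPU constant is a suffix that only makes sense with `endswith`. `NPU_SUFFIX`, `is_gpu_key`, and `is_npu_key` are hypothetical names.

```python
# Assumed shape of an existing constant: a full resource key, matched exactly.
GPU_LABEL = "nvidia.com/gpu"

# Suffix constant, as proposed in the review: matched with endswith().
NPU_SUFFIX = "/npu"


def is_gpu_key(key: str) -> bool:
    """Exact match against a full resource key."""
    return key == GPU_LABEL


def is_npu_key(key: str) -> bool:
    """Suffix match: any vendor prefix is accepted, but the key must end in /npu."""
    return key.endswith(NPU_SUFFIX)
```

Keeping the suffix in one named constant means the detection code and any tests reference the same rule, and the name signals that an equality comparison against it would be a mistake.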

Implement support for NPU resource labels in resource limit validation,
resolving the existing TODO to support additional accelerator types.

Signed-off-by: Sujal Shah <sujalshah28092004@gmail.com>

2 participants