feat(trainer): Support NPU labels in TrainJob device#336
feat(trainer): Support NPU labels in TrainJob device#336sujalshah-bit wants to merge 1 commit intokubeflow:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
Adds NPU extended-resource detection to the Kubernetes backend so TrainJob device metadata can be derived from container resource limits when an NPU resource key is present.
Changes:
- Introduces an NPU-related resource constant in
constants.py. - Extends
get_container_devices()to detect*/npuresource keys and fail fast on multiple NPU types. - Adds unit tests covering single-NPU and multiple-NPU-key scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| kubeflow/trainer/constants/constants.py | Adds an NPU-related constant for resource identification. |
| kubeflow/trainer/backends/kubernetes/utils.py | Adds NPU resource key detection and validation in device extraction. |
| kubeflow/trainer/backends/kubernetes/utils_test.py | Adds test cases validating NPU detection and multi-key failure behavior. |
| # The label for NPU in the container resources. | ||
| NPU_LABEL = "/npu" | ||
|
|
There was a problem hiding this comment.
NPU_LABEL is defined as the suffix string "/npu" (not a full resource label like the other *_LABEL constants) and is currently unused, which is likely to confuse future callers; rename it to something like NPU_RESOURCE_SUFFIX/NPU_SUFFIX (or document it explicitly as a suffix) and update usages accordingly.
| mig_key = mig_keys[0] | ||
| device = mig_key.split("/")[1] | ||
| device_count = resources.limits[mig_key].actual_instance | ||
| elif npu_keys := [k for k in resources.limits if k.endswith("/npu")]: |
There was a problem hiding this comment.
The NPU detection uses a hard-coded string literal "/npu" even though an NPU constant was added in constants; use the constant here (and/or rename it to a suffix constant) so the matching rule is defined in one place and can’t drift.
| elif npu_keys := [k for k in resources.limits if k.endswith("/npu")]: | |
| elif npu_keys := [k for k in resources.limits if k.endswith(constants.NPU_SUFFIX)]: |
Implement support for NPU resource labels in resource limit validation, resolving the existing TODO to support additional accelerator types. Signed-off-by: Sujal Shah <sujalshah28092004@gmail.com>
f4fa8fe to
6975af2
Compare
What this PR does / why we need it:
Add support for NPU resource labels in container resource limits,
resolving the existing TODO to support additional accelerator types
(e.g., NPUs).
/npusuffix matchingFixes NONE
Checklist: