cmd/gpu-kubelet-plugin: wrap MkdirAll errors with context#860
Open
AkshatDudeja77 wants to merge 249 commits intoNVIDIA:release-25.8from
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…struction README: refer to external install instructions
This captures the state at 59a01fde91a53105a6a183a2e8a86f7f16b54622 Signed-off-by: Evan Lezar <elezar@nvidia.com>
Bumps nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev. --- updated-dependencies: - dependency-name: nvidia/distroless/cc dependency-version: v3.2.0-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Eric Stroczynski <estroczynski@nvidia.com>
Bumps [github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) from 1.18.0-rc.6 to 1.18.0. - [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases) - [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/CHANGELOG.md) - [Commits](NVIDIA/nvidia-container-toolkit@v1.18.0-rc.6...v1.18.0) --- updated-dependencies: - dependency-name: github.com/NVIDIA/nvidia-container-toolkit dependency-version: 1.18.0 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Eric Stroczynski <estroczynski@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…github.com/NVIDIA/nvidia-container-toolkit-1.18.0 build(deps): bump github.com/NVIDIA/nvidia-container-toolkit from 1.18.0-rc.6 to 1.18.0
…ts/container/main/nvidia/distroless/cc-v3.2.0-dev build(deps): bump nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev in /deployments/container
Increment version to 25.12.0-dev
Commit messages of squashed commits: wip: state Kubecon NA 2025 .gitignore: add top-level code files and archives gpu plugin: log kubelet registration status in health check gpu pluggin: announce devices in slice in predictable order gpu plugin: make counter set construction more concise gpu plugin: introduce DeviceName, use more common code gpu plugin: fix announced names, misc .gitignore: fix wildcards gpu plugin: dynamically enable MIG mode (pragmatic) gpu plugin: fix typo in cdi.go wip: dynamically delete deviceinfo.go: add docstring to String() wip: disable MIG mode after deleting last device create mig device: enrich log messages with device details gpu plugin: introduce common migppCanonicalName() gpu plugin: add ResourceClaimToString() gpu plugin: prepared: add GetDeviceNames() gpu plugin: driver minor log verb change gpu plugin: make MIG dev deletion work device_state: only claim-specific devices cdi: only claim-specific devices device_state: delete MIG dev as part of unprepare device_state: improve log msgs deviceinfo: fix types especially for dynamig MIG deletion gpu plugin: conditionally enable nvcdi logger change note about memory unit (misc) cdioptions: add WithLogger() device_state: better logging, more commentary nvlib: add cleanup comment, capitalize Placement deviceinfo: capitalize Placement cdi: use per-claim mode everywhere, disable nvsandboxutils This fixes device injection for now. 
tests: add dynamic MIG device allocation test tests: temporary changes for test dev pkg/flags/utils: fix gpu plugin: manually create per-MIG devnode CDI inject and misc (comments, cleanup) tests: add spec files, and work in progress tests: reduce code duplication, introduce common setup tests: add more tests for basic GPU allocation allocatable.go: refine commentary cdi.go: rename to cdiCharDevNode(), improve comments deviceinfo.go: update comments for dev/testing: decrease kp health check freq to reduce verbosity README: add section for first-class dev cmds (flesh out) print-debug in nvtk around container edit creation for mig devs nvlib: fix cleanup when CI was torn down previously driver: go enrich claim logging on prep allocatable: fix migppCanonicalName in unprep path add to previous commit device_state: improve logging around unprep err remove debug-log statements implement one resource slice per GPU, fix index/minor allocatable.go: add note about unit MB vs MiB -- gpu plugin: PU lock: change timeout from 10 to 300 seconds Under alloc/dealloc pressure in the context of dynamic MIG device allocation, it is apparent that requests line up behind this lock. With four physical GPUs and ~7 MIG devices per GPU, there are 28 devices to be managed. If each of these devices runs a job that is expected to have a duration of ~1 minute, then there are ~30 Prepare()s per minute and ~30 Unprepare()s per minute. That leads up to one required Prep/Unprep operation per second. Now, it becomes apparent that each of these operations may last longer than a second. In any case, I have, under pressure, seen the PU lock acquisition frequently time out, in which case the same action is going to be retried _later_, potentially much later. The system converges faster if we just leave these requests _lined up_ (in order) and process them as quickly as we can. 
Hence, I believe it certainly makes sense to bump this timeout constant to northwards the retrying constant at which the kubelet would retry the Prepare() request anyway. Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> -- gpus: DestroyUnknownMIGDevices() upon startup gpu plugin: fix check for 'already prepared' (do earlier) Seen in practice: I1106 20:55:46.273327 1 device_state.go:174] checkpoint updated for claim 2b267b75-227f-4e7a-92a1-14b37e15a595 I1106 20:55:46.273337 1 device_state.go:181] skip prepare: claim 2b267b75-227f-4e7a-92a1-14b37e15a595 found in checkpoint The first log line made this claim look like only partially prepared. Later, unprepare then failed with: I1106 20:56:11.318763 1 device_state.go:284] unprepare noop: claim preparation started but not completed for claim '2b267b75-227f-4e7a-92a1-14b37e15a595' wip: changes in tests/ wip: memsat not so important golangci lint cfg changes gpu: log-based pragmatic timing metrics around prep/unprep, and better logging gpu: rm PU lock from prepare, lock checkpoint mutation, add timer log msgs create mig device: read UUID directly, do not scan through all devices gpus: CDI spec: cache specs per UUID (physical GPUs) and common edits cdi spec cache: return copy (fix mutation bug), initialize NVML less frequently gpus: use long-lived NVML state, re-use handles (reduce latency) memsat: 160 jobs 90 seconds vs 6 min 30 flock: reduce polling period (protects cp updates now) memsat: gpu vs mig kubecon demo state V1 cleanup for demo memsat: demo as performed at kubecon NA fix device_state 204 Revert debug changes to vendor directory dynmig: fixes after vfio conflict resolution (tests pass) Squash-merge upstream/main and fix conflicts (Jan 21/27/28) Squash merge & conflict fix (Jan 27) Squash merge & conflict fix (Jan 28) dynmig: dyn/static distinction in AllocatableDevices, refactor & cleanup - fix: dynmig fg disabled, regular gpu: set perGPUAllocatable[gpuInfo.minor] - Rename MigInfo to MigSpec one-line 
change to driver.go (squash me) --- Start introducing Mig[Dynamic|Static]DeviceType. This is as part of a bug fix. Saw a new test failure. panic: unexpected type for AllocatableDevice in cmd/gpu-kubelet-plugin/allocatable.go:266 +0x164 and as part of fixing that it really asks for using two different types of allocatable devices. --- - Rename migpp to migspec - dynmig: dynamic/static distinction in AllocatableDevices, cleanup - Fix code paths in regular GPU allocation and static MIG dev allocation along the way. - fix a type check bug (oh.. linting? compiler?) - fix bug: missing RequestedCanonicalName prop - fix inverse boolean expression bug comment cleanup comment cleanup remove unrelated changes (potentially goodies, such as lint config) cleanup: remove commented code, unused code, memsat, etc Squash merge upstream/main, fix conflicts (Jan 29/30) dynmig: add partitions.go, re-enable Passthrough/MPS/TimeSlicing, cleanup gpu plugin: tweak config type validation err msgs Re-enable AllocatableDevices as UUIDProvider, re-enable MPS and TimeSlicing Clean up diff: comments, newlines, etc Move code to partitions.go, misc Tune comments, re-enable PassthroughSupport Move PartGetDevice(), comment cleanup dynmig: type work, change mig dev name to contain profile ID, cleanup minor comment fixes Improve MIG type comments Improve log messages and comments dynmig: introduce Mig[Live/Spec]Tuple, refactor, polish Misc cleanup / comment improvements in cdi.go Adjust to more polished MIG types gpu cdi: tune log msgs, use new types dynmig: mutex w/ PassthroughSupport/NVMLDeviceHealthCheck/MPSSupport Just for now -- what's not tested is broken :-). And these combinations are entirely untested, and code paths not well reviewed. dynmig: tune timing-related log messages and comments more timing log message tuning dynmig: periodic stale-claim cleanup, rollback support in prepare() This implements two critical cleanup strategies. 
Add code comment about cleanup (not yet claiming correctness) dynmig: cleanup, minimize allocatable.go diff allocatable: move code to make diff simpler allocatable: minimize diff Remove accidentally commited markdown file dynmig: remove outcommented code, tweak comments More code comment cleanup More code comment cleanup dynmig: re-enable DestroyUnknownMIGDevices(ctx) upon startup dynmig: larger cleanup (nvlib.go, comments, ...) improve log message Many cleanup commit squashed into one more cleanup dynmig: comment cleanup Minor comment cleanup more comment cleanup comment cleanup more comment cleanup comment cleanup comment cleanup lint fixes: exhaustruct, forcetypeassert, int parse/cast -- dynmig: JSON-annotate MIG types, rename property While `uuid` itself on any MIG-related type should be rather obviously the UUID of the MIG device, I have noticed that when reading code it often helps to see something like `dev.miguuid` instead of `dev.uuid` because then it takes just a microsecond to understand what is being referred to. This commit also annotated types for JSON (de)serialization. Not yet used. -- dynmig: put long-lived NVML session behind feature gate Improve log messages around NVML init/shutdown -- dynmig: re-activate PU lock & DS lock - Did brief perf testing, with pulock with DynamicMIG fg enabled. - Use PU lock also for DynamicMIG again for now (perf looks OK). - Also use DeviceStae Prepare() lock again in DynamicMIG mode. -- Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Marco Ebert <marco_ebert@icloud.com>
Signed-off-by: Marco Ebert <marco_ebert@icloud.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Bumps nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev. --- updated-dependencies: - dependency-name: nvidia/distroless/cc dependency-version: v3.2.1-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> misc fixes Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> remove cdi spec removal again Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…ts/container/main/nvidia/distroless/cc-v3.2.1-dev build(deps): bump nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev in /deployments/container
kubelet plugins: add /opt/bin to binary search paths
tests: cover basic GPU allocation, misc improvements
* Add separate make targets to run GPU and CD specific tests * Add a stress test for GPU allocation * Refactor Makefile to share common docker setup between targets Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
tests: Add separate targets for GPU plugin tests + add stress tests
Bumps golang from 1.25.3 to 1.25.4. --- updated-dependencies: - dependency-name: golang dependency-version: 1.25.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> use chroot to run modprobe Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> deadvertise sibling devices on preparation Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> soft check for VFs before attempting unbind Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> address review comments Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> address comments (2) Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> use fuser to check if gpu is free Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> remove unnecessary securityContext Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> don't mix vfio and mig devices Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
Support VFIO passthrough
Bumps [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) from 8 to 9. - [Release notes](https://github.com/golangci/golangci-lint-action/releases) - [Commits](golangci/golangci-lint-action@v8...v9) --- updated-dependencies: - dependency-name: golangci/golangci-lint-action dependency-version: '9' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
squashed from: tests: account for SchedulingDisabled in node iterator tests: remove migutils.sh and unused helpers tests: remove unused helpers tests: fix cleanup tests: print cwd from test_cd_nvb_failover.sh Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> tests: fix bad conflict resolution Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> tests: fix bad conflict resolution Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…ts/container/main/nvidia/distroless/cc-v4.0.1-dev build(deps): bump nvidia/distroless/cc from v4.0.0-dev to v4.0.1-dev in /deployments/container
tests: add tests/bats/specs/gpu-simple-mig-ts.yaml Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> remove focus tag Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Messages of squashed commits: gpu plugin: fix nil pointer deref on static MIG unprep Found by adding pod deletion (+wait) to test `static MIG: allocate (1 cnt)` gpu plugin: tweak health check log msg content & verbosity gpu plugin: revert checkpoint schema change dynmig: remove RequestedCanonicalName from PreparedMigDevice again gpu plugin: do not store Health in gpu/mig info in checkpoint dynmig: remove MigCapable bool (squash) Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Also - run CD upgrade test in CI - debuggability improvements, fixes Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
minor change to tests (squash me) Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…dits The common edits returned by the nvcdi API already include NVIDIA_VISIBLE_DEVICES=void. This change removes the code that sets this envvar again, but ensures that it is explicitly set for the vfio common edits. Signed-off-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
This change applies the ConfigState.ContainerEdits (i.e. associated with MPS sharing) directly to the container edits for claim devices instead of introducing a new named CDI device. This ensures that there is no need to also include the name of this device in any NodePrepareResource responses. Signed-off-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
We only instantiate a cdi.Cache to write / remove generated CDI specs to / from the CDI spec directory mounted into the plugin container. This change switches to writing CDI specs using the nvcdi/spec.Interface (which also handles minimum version detection) and removing the generated file(s) directly. This aligns spec generation with other tools such as the GPU Device Plugin and NVIDIA Container Toolkit. Signed-off-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
The vendor and class for GPU CDI device specs can be predefined. This change removes the ability to set them in the CDIHandler. This means that the CDI kind k8s.gpu.nvidia.com/claim is always used. Signed-off-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
dynmig: change a timing log msg to level 7 dynmig: remove outdated log message dynmig: tweak logging around disabling mig mode dynmig: add comment about SetMigMode err handling Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> tests: increase tail size (debug log) Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Fix mig_ensure_teardown_on_all_nodes to poke every GPU. Add grouping prefixes to test names: This is for better readability in the CI log, where the reporter picks a different format from local execution. In CI, for example, the output shows
ok 19 Stress: shared ResourceClaim across 20 pods x 1 repetitions in 8882ms
ok 20 IMEX channel injection (all) in 14851ms
where the first line refers to a GPU test and the second line refers to a ComputeDomain test. This patch is a pragmatic step toward making that distinction visible. Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Introduce dynamic MIG device management
Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>
Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>
71640c0 to
e4830a5
Wrap MkdirAll errors in gpu-kubelet-plugin with contextual information to improve diagnosability.
No functional changes intended.