cmd/gpu-kubelet-plugin: wrap MkdirAll errors with context #860

Open
AkshatDudeja77 wants to merge 249 commits into NVIDIA:release-25.8 from AkshatDudeja77:wrap-mkdirall-errors

Conversation

@AkshatDudeja77

Wrap MkdirAll errors in gpu-kubelet-plugin with contextual information to improve diagnosability.

No functional changes intended.

jgehrcke and others added 30 commits October 8, 2025 16:33
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…struction

README: refer to external install instructions
This captures the state at 59a01fde91a53105a6a183a2e8a86f7f16b54622

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Bumps nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev.

---
updated-dependencies:
- dependency-name: nvidia/distroless/cc
  dependency-version: v3.2.0-dev
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Eric Stroczynski <estroczynski@nvidia.com>
Bumps [github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) from 1.18.0-rc.6 to 1.18.0.
- [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases)
- [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/CHANGELOG.md)
- [Commits](NVIDIA/nvidia-container-toolkit@v1.18.0-rc.6...v1.18.0)

---
updated-dependencies:
- dependency-name: github.com/NVIDIA/nvidia-container-toolkit
  dependency-version: 1.18.0
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Eric Stroczynski <estroczynski@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…github.com/NVIDIA/nvidia-container-toolkit-1.18.0

build(deps): bump github.com/NVIDIA/nvidia-container-toolkit from 1.18.0-rc.6 to 1.18.0
…ts/container/main/nvidia/distroless/cc-v3.2.0-dev

build(deps): bump nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev in /deployments/container
Commit messages of squashed commits:

wip: state Kubecon NA 2025
.gitignore: add top-level code files and archives
gpu plugin: log kubelet registration status in health check
gpu plugin: announce devices in slice in predictable order
gpu plugin: make counter set construction more concise
gpu plugin: introduce DeviceName, use more common code
gpu plugin: fix announced names, misc
.gitignore: fix wildcards
gpu plugin: dynamically enable MIG mode (pragmatic)
gpu plugin: fix typo in cdi.go
wip: dynamically delete
deviceinfo.go: add docstring to String()
wip: disable MIG mode after deleting last device
create mig device: enrich log messages with device details
gpu plugin: introduce common migppCanonicalName()
gpu plugin: add ResourceClaimToString()
gpu plugin: prepared: add GetDeviceNames()
gpu plugin: driver minor log verb change
gpu plugin: make MIG dev deletion work
device_state: only claim-specific devices
cdi: only claim-specific devices
device_state: delete MIG dev as part of unprepare
device_state: improve log msgs
deviceinfo: fix types
especially for dynamic MIG deletion
gpu plugin: conditionally enable nvcdi logger
change note about memory unit (misc)
cdioptions: add WithLogger()
device_state: better logging, more commentary
nvlib: add cleanup comment, capitalize Placement
deviceinfo: capitalize Placement
cdi: use per-claim mode everywhere, disable nvsandboxutils
This fixes device injection for now.
tests: add dynamic MIG device allocation test
tests: temporary changes for test dev
pkg/flags/utils: fix
gpu plugin: manually create per-MIG devnode CDI inject and misc (comments, cleanup)
tests: add spec files, and work in progress
tests: reduce code duplication, introduce common setup
tests: add more tests for basic GPU allocation
allocatable.go: refine commentary
cdi.go: rename to cdiCharDevNode(), improve comments
deviceinfo.go: update comments
for dev/testing: decrease kp health check freq to reduce verbosity
README: add section for first-class dev cmds (flesh out)
print-debug in nvtk around container edit creation for mig devs
nvlib: fix cleanup when CI was torn down previously
driver: go enrich claim logging on prep
allocatable: fix migppCanonicalName in unprep path
add to previous commit
device_state: improve logging around unprep err
remove debug-log statements
implement one resource slice per GPU, fix index/minor
allocatable.go: add note about unit MB vs MiB
--
gpu plugin: PU lock: change timeout from 10 to 300 seconds

Under alloc/dealloc pressure in the context of dynamic MIG
device allocation, it is apparent that requests line up behind
this lock.

With four physical GPUs and ~7 MIG devices per GPU, there are 28
devices to be managed. If each of these devices runs a job that is
expected to have a duration of ~1 minute, then there are

~30 Prepare()s per minute
~30 Unprepare()s per minute

That adds up to roughly one required Prep/Unprep operation per second.

Now, it becomes apparent that each of these operations may last
longer than a second.

In any case, under pressure I have seen PU lock acquisition
time out frequently, in which case the same action is
retried _later_, potentially much later. The system converges
faster if we just leave these requests _lined up_ (in order) and
process them as quickly as we can. Hence, I believe it makes
sense to raise this timeout constant above the interval at which
the kubelet would retry the Prepare() request anyway.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
--
gpus: DestroyUnknownMIGDevices() upon startup
gpu plugin: fix check for 'already prepared' (do earlier)

Seen in practice:

I1106 20:55:46.273327       1 device_state.go:174] checkpoint updated for claim 2b267b75-227f-4e7a-92a1-14b37e15a595
I1106 20:55:46.273337       1 device_state.go:181] skip prepare: claim 2b267b75-227f-4e7a-92a1-14b37e15a595 found in checkpoint

The first log line made this claim look like only partially prepared.

Later, unprepare then failed with:

I1106 20:56:11.318763       1 device_state.go:284] unprepare noop: claim preparation started but not completed for claim '2b267b75-227f-4e7a-92a1-14b37e15a595'

wip: changes in tests/
wip: memsat
not so important golangci lint cfg changes
gpu: log-based pragmatic timing metrics around prep/unprep, and better logging
gpu: rm PU lock from prepare, lock checkpoint mutation, add timer log msgs
create mig device: read UUID directly, do not scan through all devices
gpus: CDI spec: cache specs per UUID (physical GPUs) and common edits
cdi spec cache: return copy (fix mutation bug), initialize NVML less frequently
gpus: use long-lived NVML state, re-use handles (reduce latency)
memsat: 160 jobs 90 seconds vs 6 min 30
flock: reduce polling period (protects cp updates now)
memsat: gpu vs mig kubecon demo state V1
cleanup for demo
memsat: demo as performed at kubecon NA
fix device_state 204
Revert debug changes to vendor directory
dynmig: fixes after vfio conflict resolution (tests pass)
Squash-merge upstream/main and fix conflicts (Jan 21/27/28)
Squash merge & conflict fix (Jan 27)
Squash merge & conflict fix (Jan 28)
dynmig: dyn/static distinction in AllocatableDevices, refactor & cleanup
- fix: dynmig fg disabled, regular gpu: set perGPUAllocatable[gpuInfo.minor]
- Rename MigInfo to MigSpec
one-line change to driver.go (squash me)
---
Start introducing Mig[Dynamic|Static]DeviceType.
This is as part of a bug fix. Saw a new test failure.
panic: unexpected type for AllocatableDevice
in
cmd/gpu-kubelet-plugin/allocatable.go:266 +0x164
and as part of fixing that it really asks for using two different types
of allocatable devices.
---

- Rename migpp to migspec
- dynmig: dynamic/static distinction in AllocatableDevices, cleanup
- Fix code paths in regular GPU allocation and static MIG dev allocation
  along the way.
- fix a type check bug (oh.. linting? compiler?)
- fix bug: missing RequestedCanonicalName prop
- fix inverse boolean expression bug

comment cleanup
comment cleanup
remove unrelated changes (potentially goodies, such as lint config)
cleanup: remove commented code, unused code, memsat, etc
Squash merge upstream/main, fix conflicts (Jan 29/30)
dynmig: add partitions.go, re-enable Passthrough/MPS/TimeSlicing, cleanup
gpu plugin: tweak config type validation err msgs
Re-enable AllocatableDevices as UUIDProvider, re-enable MPS and TimeSlicing
Clean up diff: comments, newlines, etc
Move code to partitions.go, misc
Tune comments, re-enable PassthroughSupport
Move PartGetDevice(), comment cleanup
dynmig: type work, change mig dev name to contain profile ID, cleanup
minor comment fixes
Improve MIG type comments
Improve log messages and comments
dynmig: introduce Mig[Live/Spec]Tuple, refactor, polish
Misc cleanup / comment improvements in cdi.go
Adjust to more polished MIG types
gpu cdi: tune log msgs, use new types
dynmig: mutex w/ PassthroughSupport/NVMLDeviceHealthCheck/MPSSupport
 Just for now -- what's not tested is broken :-). And these
 combinations are entirely untested, and code paths not well
 reviewed.
dynmig: tune timing-related log messages and comments
more timing log message tuning
dynmig: periodic stale-claim cleanup, rollback support in prepare()
This implements two critical cleanup strategies.
Add code comment about cleanup (not yet claiming correctness)
dynmig: cleanup, minimize allocatable.go diff
allocatable: move code to make diff simpler
allocatable: minimize diff
Remove accidentally committed markdown file
dynmig: remove commented-out code, tweak comments
More code comment cleanup
More code comment cleanup
dynmig: re-enable DestroyUnknownMIGDevices(ctx) upon startup
dynmig: larger cleanup (nvlib.go, comments, ...)
improve log message
Many cleanup commits squashed into one
more cleanup
dynmig: comment cleanup
Minor comment cleanup
more comment cleanup
comment cleanup
more comment cleanup
comment cleanup
comment cleanup
lint fixes: exhaustruct, forcetypeassert, int parse/cast
--
dynmig: JSON-annotate MIG types, rename property

While `uuid` itself on any MIG-related type should
be rather obviously the UUID of the MIG device, I have
noticed that when reading code it often helps to see
something like `dev.miguuid` instead of `dev.uuid` because
then it takes just a microsecond to understand what is
being referred to.

This commit also annotates types for JSON (de)serialization.
Not yet used.
--
dynmig: put long-lived NVML session behind feature gate
Improve log messages around NVML init/shutdown
--
dynmig: re-activate PU lock & DS lock

- Did brief perf testing, with pulock with DynamicMIG fg
  enabled.
- Use PU lock also for DynamicMIG again for now (perf
  looks OK).
- Also use DeviceState Prepare() lock again in DynamicMIG
  mode.
--

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Marco Ebert <marco_ebert@icloud.com>
Signed-off-by: Marco Ebert <marco_ebert@icloud.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Bumps nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev.

---
updated-dependencies:
- dependency-name: nvidia/distroless/cc
  dependency-version: v3.2.1-dev
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

misc fixes

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

remove cdi spec removal again

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…ts/container/main/nvidia/distroless/cc-v3.2.1-dev

build(deps): bump nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev in /deployments/container
kubelet plugins: add /opt/bin to binary search paths
tests: cover basic GPU allocation, misc improvements
* Add separate make targets to run GPU and CD specific tests
* Add a stress test for GPU allocation
* Refactor Makefile to share common docker setup between targets

Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
tests: Add separate targets for GPU plugin tests + add stress tests
Bumps golang from 1.25.3 to 1.25.4.

---
updated-dependencies:
- dependency-name: golang
  dependency-version: 1.25.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>

use chroot to run modprobe

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>

deadvertise sibling devices on preparation

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>

soft check for VFs before attempting unbind

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>

address review comments

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>

address comments (2)

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>

use fuser to check if gpu is free

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>

remove unnecessary securityContext

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>

don't mix vfio and mig devices

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
Bumps [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) from 8 to 9.
- [Release notes](https://github.com/golangci/golangci-lint-action/releases)
- [Commits](golangci/golangci-lint-action@v8...v9)

---
updated-dependencies:
- dependency-name: golangci/golangci-lint-action
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
jgehrcke and others added 23 commits February 2, 2026 07:32
squashed from:

tests: account for SchedulingDisabled in node iterator
tests: remove migutils.sh and unused helpers
tests: remove unused helpers
tests: fix cleanup
tests: print cwd from test_cd_nvb_failover.sh

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: fix bad conflict resolution

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: fix bad conflict resolution

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…ts/container/main/nvidia/distroless/cc-v4.0.1-dev

build(deps): bump nvidia/distroless/cc from v4.0.0-dev to v4.0.1-dev in /deployments/container
tests: add tests/bats/specs/gpu-simple-mig-ts.yaml

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

remove focus tag

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Messages of squashed commits:

gpu plugin: fix nil pointer deref on static MIG unprep

Found by adding pod deletion (+wait) to test
`static MIG: allocate (1 cnt)`

gpu plugin: tweak health check log msg content & verbosity

gpu plugin: revert checkpoint schema change

dynmig: remove RequestedCanonicalName from PreparedMigDevice again

gpu plugin: do not store Health in gpu/mig info in checkpoint

dynmig: remove MigCapable bool (squash)

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Also
- run CD upgrade test in CI
- debuggability improvements, fixes

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
minor change to tests (squash me)

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…dits

The common edits returned by the nvcdi API already include
NVIDIA_VISIBLE_DEVICES=void. This change removes the code that sets this
envvar again, but ensures that it is explicitly set for the vfio common
edits.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
This change applies the ConfigState.ContainerEdits (i.e. associated with
MPS sharing) directly to the container edits for claim devices instead
of introducing a new named CDI device. This ensures that there is no
need to also include the name of this device in any NodePrepareResource
responses.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
We only instantiate a cdi.Cache to write / remove generated CDI specs
to / from the CDI spec directory mounted into the plugin container.
This change switches to writing CDI specs using the nvcdi/spec.Interface
(which also handles minimum version detection) and removing the generated
file(s) directly. This aligns spec generation with other tools such as the
GPU Device Plugin and NVIDIA Container Toolkit.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
The vendor and class for GPU CDI device specs can be predefined. This
change removes the ability to set them in the CDIHandler. This means that
the CDI kind k8s.gpu.nvidia.com/claim is always used.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
dynmig: change a timing log msg to level 7
dynmig: remove outdated log message
dynmig: tweak logging around disabling mig mode
dynmig: add comment about SetMigMode err handling

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: increase tail size (debug log)

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Fix mig_ensure_teardown_on_all_nodes to poke every GPU.

Add grouping prefixes to test names:

This is for better readability in the CI log where the
reporter picks a different format from local execution.

In CI, for example the output shows

ok 19 Stress: shared ResourceClaim across 20 pods x 1 repetitions in 8882ms
ok 20 IMEX channel injection (all) in 14851ms

where the first line refers to a GPU test, and the
second line refers to a ComputeDomain test.

This patch is a pragmatic change to make that distinction visible.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>
Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>
@klueska klueska added this to the Backlog milestone Feb 8, 2026