cmd/gpu-kubelet-plugin: wrap MkdirAll errors with context by AkshatDudeja77 · Pull Request #860 · NVIDIA/k8s-dra-driver-gpu

AkshatDudeja77 · 2026-02-08T09:12:34Z

Wrap MkdirAll errors in gpu-kubelet-plugin with contextual information to improve diagnosability.

No functional changes intended.

Bumps nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev.
---
updated-dependencies:
- dependency-name: nvidia/distroless/cc
  dependency-version: v3.2.0-dev
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>

Bumps [github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) from 1.18.0-rc.6 to 1.18.0.
- [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases)
- [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/CHANGELOG.md)
- [Commits](NVIDIA/nvidia-container-toolkit@v1.18.0-rc.6...v1.18.0)
---
updated-dependencies:
- dependency-name: github.com/NVIDIA/nvidia-container-toolkit
  dependency-version: 1.18.0
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>

Commit messages of squashed commits: wip: state Kubecon NA 2025 .gitignore: add top-level code files and archives gpu plugin: log kubelet registration status in health check gpu pluggin: announce devices in slice in predictable order gpu plugin: make counter set construction more concise gpu plugin: introduce DeviceName, use more common code gpu plugin: fix announced names, misc .gitignore: fix wildcards gpu plugin: dynamically enable MIG mode (pragmatic) gpu plugin: fix typo in cdi.go wip: dynamically delete deviceinfo.go: add docstring to String() wip: disable MIG mode after deleting last device create mig device: enrich log messages with device details gpu plugin: introduce common migppCanonicalName() gpu plugin: add ResourceClaimToString() gpu plugin: prepared: add GetDeviceNames() gpu plugin: driver minor log verb change gpu plugin: make MIG dev deletion work device_state: only claim-specific devices cdi: only claim-specific devices device_state: delete MIG dev as part of unprepare device_state: improve log msgs deviceinfo: fix types especially for dynamig MIG deletion gpu plugin: conditionally enable nvcdi logger change note about memory unit (misc) cdioptions: add WithLogger() device_state: better logging, more commentary nvlib: add cleanup comment, capitalize Placement deviceinfo: capitalize Placement cdi: use per-claim mode everywhere, disable nvsandboxutils This fixes device injection for now. 
tests: add dynamic MIG device allocation test tests: temporary changes for test dev pkg/flags/utils: fix gpu plugin: manually create per-MIG devnode CDI inject and misc (comments, cleanup) tests: add spec files, and work in progress tests: reduce code duplication, introduce common setup tests: add more tests for basic GPU allocation allocatable.go: refine commentary cdi.go: rename to cdiCharDevNode(), improve comments deviceinfo.go: update comments for dev/testing: decrease kp health check freq to reduce verbosity README: add section for first-class dev cmds (flesh out) print-debug in nvtk around container edit creation for mig devs nvlib: fix cleanup when CI was torn down previously driver: go enrich claim logging on prep allocatable: fix migppCanonicalName in unprep path add to previous commit device_state: improve logging around unprep err remove debug-log statements implement one resource slice per GPU, fix index/minor allocatable.go: add note about unit MB vs MiB
--
gpu plugin: PU lock: change timeout from 10 to 300 seconds
Under alloc/dealloc pressure in the context of dynamic MIG device allocation, it is apparent that requests line up behind this lock. With four physical GPUs and ~7 MIG devices per GPU, there are 28 devices to be managed. If each of these devices runs a job that is expected to have a duration of ~1 minute, then there are ~30 Prepare()s per minute and ~30 Unprepare()s per minute. That adds up to about one required Prep/Unprep operation per second. Each of these operations may well last longer than a second. In any case, under pressure I have seen the PU lock acquisition time out frequently, in which case the same action is retried _later_, potentially much later. The system converges faster if we just leave these requests _lined up_ (in order) and process them as quickly as we can.
Hence, I believe it makes sense to bump this timeout constant above the interval at which the kubelet would retry the Prepare() request anyway.
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
--
gpus: DestroyUnknownMIGDevices() upon startup
gpu plugin: fix check for 'already prepared' (do earlier)
Seen in practice:
I1106 20:55:46.273327 1 device_state.go:174] checkpoint updated for claim 2b267b75-227f-4e7a-92a1-14b37e15a595
I1106 20:55:46.273337 1 device_state.go:181] skip prepare: claim 2b267b75-227f-4e7a-92a1-14b37e15a595 found in checkpoint
The first log line made this claim look only partially prepared. Later, unprepare failed with:
I1106 20:56:11.318763 1 device_state.go:284] unprepare noop: claim preparation started but not completed for claim '2b267b75-227f-4e7a-92a1-14b37e15a595'
wip: changes in tests/ wip: memsat not so important golangci lint cfg changes gpu: log-based pragmatic timing metrics around prep/unprep, and better logging gpu: rm PU lock from prepare, lock checkpoint mutation, add timer log msgs create mig device: read UUID directly, do not scan through all devices gpus: CDI spec: cache specs per UUID (physical GPUs) and common edits cdi spec cache: return copy (fix mutation bug), initialize NVML less frequently gpus: use long-lived NVML state, re-use handles (reduce latency) memsat: 160 jobs 90 seconds vs 6 min 30 flock: reduce polling period (protects cp updates now) memsat: gpu vs mig kubecon demo state V1 cleanup for demo memsat: demo as performed at kubecon NA fix device_state 204 Revert debug changes to vendor directory dynmig: fixes after vfio conflict resolution (tests pass) Squash-merge upstream/main and fix conflicts (Jan 21/27/28) Squash merge & conflict fix (Jan 27) Squash merge & conflict fix (Jan 28) dynmig: dyn/static distinction in AllocatableDevices, refactor & cleanup - fix: dynmig fg disabled, regular gpu: set perGPUAllocatable[gpuInfo.minor] - Rename MigInfo to MigSpec one-line
change to driver.go (squash me) --- Start introducing Mig[Dynamic|Static]DeviceType. This is as part of a bug fix. Saw a new test failure. panic: unexpected type for AllocatableDevice in cmd/gpu-kubelet-plugin/allocatable.go:266 +0x164 and as part of fixing that it really asks for using two different types of allocatable devices. --- - Rename migpp to migspec - dynmig: dynamic/static distinction in AllocatableDevices, cleanup - Fix code paths in regular GPU allocation and static MIG dev allocation along the way. - fix a type check bug (oh.. linting? compiler?) - fix bug: missing RequestedCanonicalName prop - fix inverse boolean expression bug comment cleanup comment cleanup remove unrelated changes (potentially goodies, such as lint config) cleanup: remove commented code, unused code, memsat, etc Squash merge upstream/main, fix conflicts (Jan 29/30) dynmig: add partitions.go, re-enable Passthrough/MPS/TimeSlicing, cleanup gpu plugin: tweak config type validation err msgs Re-enable AllocatableDevices as UUIDProvider, re-enable MPS and TimeSlicing Clean up diff: comments, newlines, etc Move code to partitions.go, misc Tune comments, re-enable PassthroughSupport Move PartGetDevice(), comment cleanup dynmig: type work, change mig dev name to contain profile ID, cleanup minor comment fixes Improve MIG type comments Improve log messages and comments dynmig: introduce Mig[Live/Spec]Tuple, refactor, polish Misc cleanup / comment improvements in cdi.go Adjust to more polished MIG types gpu cdi: tune log msgs, use new types dynmig: mutex w/ PassthroughSupport/NVMLDeviceHealthCheck/MPSSupport Just for now -- what's not tested is broken :-). And these combinations are entirely untested, and code paths not well reviewed. dynmig: tune timing-related log messages and comments more timing log message tuning dynmig: periodic stale-claim cleanup, rollback support in prepare() This implements two critical cleanup strategies. 
Add code comment about cleanup (not yet claiming correctness) dynmig: cleanup, minimize allocatable.go diff allocatable: move code to make diff simpler allocatable: minimize diff Remove accidentally committed markdown file dynmig: remove outcommented code, tweak comments More code comment cleanup More code comment cleanup dynmig: re-enable DestroyUnknownMIGDevices(ctx) upon startup dynmig: larger cleanup (nvlib.go, comments, ...) improve log message Many cleanup commit squashed into one more cleanup dynmig: comment cleanup Minor comment cleanup more comment cleanup comment cleanup more comment cleanup comment cleanup comment cleanup lint fixes: exhaustruct, forcetypeassert, int parse/cast
--
dynmig: JSON-annotate MIG types, rename property
While `uuid` itself on any MIG-related type should be rather obviously the UUID of the MIG device, I have noticed that when reading code it often helps to see something like `dev.miguuid` instead of `dev.uuid` because then it takes just a microsecond to understand what is being referred to. This commit also annotates types for JSON (de)serialization. Not yet used.
--
dynmig: put long-lived NVML session behind feature gate
Improve log messages around NVML init/shutdown
--
dynmig: re-activate PU lock & DS lock
- Did brief perf testing with the PU lock and the DynamicMIG feature gate enabled.
- Use PU lock also for DynamicMIG again for now (perf looks OK).
- Also use DeviceState Prepare() lock again in DynamicMIG mode.
--
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

Bumps nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev.
---
updated-dependencies:
- dependency-name: nvidia/distroless/cc
  dependency-version: v3.2.1-dev
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>

* Add separate make targets to run GPU and CD specific tests
* Add a stress test for GPU allocation
* Refactor Makefile to share common docker setup between targets
Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>

Bumps golang from 1.25.3 to 1.25.4.
---
updated-dependencies:
- dependency-name: golang
  dependency-version: 1.25.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> use chroot to run modprobe Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> deadvertise sibling devices on preparation Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> soft check for VFs before attempting unbind Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> address review comments Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> address comments (2) Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> use fuser to check if gpu is free Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> remove unnecessary securityContext Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com> don't mix vfio and mig devices Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>

Bumps [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) from 8 to 9.
- [Release notes](https://github.com/golangci/golangci-lint-action/releases)
- [Commits](golangci/golangci-lint-action@v8...v9)
---
updated-dependencies:
- dependency-name: golangci/golangci-lint-action
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>

squashed from: tests: account for SchedulingDisabled in node iterator tests: remove migutils.sh and unused helpers tests: remove unused helpers tests: fix cleanup tests: print cwd from test_cd_nvb_failover.sh Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> tests: fix bad conflict resolution Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> tests: fix bad conflict resolution Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

Messages of squashed commits: gpu plugin: fix nil pointer deref on static MIG unprep Found by adding pod deletion (+wait) to test `static MIG: allocate (1 cnt)` gpu plugin: tweak health check log msg content & verbosity gpu plugin: revert checkpoint schema change dynmig: remove RequestedCanonicalName from PreparedMigDevice again gpu plugin: do not store Health in gpu/mig info in checkpoint dynmig: remove MigCapable bool (squash) Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

…dits The common edits returned by the nvcdi API already include NVIDIA_VISIBLE_DEVICES=void. This change removes the code that sets this envvar again, but ensures that it is explicitly set for the vfio common edits. Signed-off-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

This change applies the ConfigState.ContainerEdits (i.e. associated with MPS sharing) directly to the container edits for claim devices instead of introducing a new named CDI device. This ensures that there is no need to also include the name of this device in any NodePrepareResource responses. Signed-off-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

We only instantiate a cdi.Cache to write / remove generated CDI specs to / from the CDI spec directory mounted into the plugin container. This change switches to writing CDI specs using the nvcdi/spec.Interface (which also handles minimum version detection) and removing the generated file(s) directly. This aligns spec generation with other tools such as the GPU Device Plugin and NVIDIA Container Toolkit. Signed-off-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

The vendor and class for GPU CDI device specs can be predefined. This change removes the ability to set them in the CDIHandler. This means that the CDI kind k8s.gpu.nvidia.com/claim is always used. Signed-off-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
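As a small illustration of what a fixed CDI kind implies: a fully qualified CDI device name has the form `vendor/class=name`, so with the kind pinned to `k8s.gpu.nvidia.com/claim` only the device name part varies. The sketch below is an assumption-laden example and not code from the driver; the constant names, the `qualifiedName` helper, and the device name `example-device` are all hypothetical.

```go
package main

import "fmt"

// The vendor and class are fixed, matching the CDI kind named in the
// commit message above. The constant names are hypothetical.
const (
	cdiVendor = "k8s.gpu.nvidia.com"
	cdiClass  = "claim"
)

// qualifiedName is a hypothetical helper that builds a fully qualified
// CDI device name of the form vendor/class=name.
func qualifiedName(device string) string {
	return fmt.Sprintf("%s/%s=%s", cdiVendor, cdiClass, device)
}

func main() {
	// "example-device" is a placeholder, not a name the driver generates.
	fmt.Println(qualifiedName("example-device"))
	// prints: k8s.gpu.nvidia.com/claim=example-device
}
```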

dynmig: change a timing log msg to level 7 dynmig: remove outdated log message dynmig: tweak logging around disabling mig mode dynmig: add comment about SetMigMode err handling Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

Fix mig_ensure_teardown_on_all_nodes to poke every GPU. Add grouping prefixes to test names: this improves readability in the CI log, where the reporter picks a different format from local execution. In CI, for example, the output shows
ok 19 Stress: shared ResourceClaim across 20 pods x 1 repetitions in 8882ms
ok 20 IMEX channel injection (all) in 14851ms
where the first line refers to a GPU test and the second line refers to a ComputeDomain test. This patch is a pragmatic way to improve that. Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

copy-pr-bot · 2026-02-08T09:12:39Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>

jgehrcke and others added 30 commits October 8, 2025 16:33

CD health check: adjust naming and log messages for clarity

74b60da

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

README: refer to external install instructions

70fbda6

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

Merge pull request NVIDIA#699 from jgehrcke/jp/readme-installation-in…

7f591c2

…struction README: refer to external install instructions

[no-relnote] Add cherrypick workflow from gpu-operator repo

05c2d13

This captures the state at 59a01fde91a53105a6a183a2e8a86f7f16b54622 Signed-off-by: Evan Lezar <elezar@nvidia.com>

fix: lazy featuregate init in callers

665653d

Signed-off-by: Eric Stroczynski <estroczynski@nvidia.com>

chore: rename GetFeatureGates -> FeatureGates, nil check

45967ec

Signed-off-by: Eric Stroczynski <estroczynski@nvidia.com>

Increment version to 25.12.0-dev

9b20929

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

Merge pull request NVIDIA#705 from NVIDIA/dependabot/go_modules/main/…

a772441

…github.com/NVIDIA/nvidia-container-toolkit-1.18.0 build(deps): bump github.com/NVIDIA/nvidia-container-toolkit from 1.18.0-rc.6 to 1.18.0

Merge pull request NVIDIA#703 from NVIDIA/dependabot/docker/deploymen…

89c8258

…ts/container/main/nvidia/distroless/cc-v3.2.0-dev build(deps): bump nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev in /deployments/container

Merge pull request NVIDIA#707 from jgehrcke/jp/version25120

f4d11e3

Increment version to 25.12.0-dev

kubelet plugins: add /opt/bin to binary search paths

e8fa8e6

Signed-off-by: Marco Ebert <marco_ebert@icloud.com>

chart: add network policies

245564a

Signed-off-by: Marco Ebert <marco_ebert@icloud.com>

tests: parallelize per-node state dir cleanup

1c2da2c

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: per-user tmp dir (relevant on shared machines)

977f421

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: add nvmm helper

fcd74d1

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: cover basic GPU allocation

1e79179

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> misc fixes Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> remove cdi spec removal again Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

Merge pull request NVIDIA#710 from NVIDIA/dependabot/docker/deploymen…

1ee1b4a

…ts/container/main/nvidia/distroless/cc-v3.2.1-dev build(deps): bump nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev in /deployments/container

Merge pull request NVIDIA#706 from Gacko/vkptt

852b56f

kubelet plugins: add /opt/bin to binary search paths

Merge pull request NVIDIA#709 from jgehrcke/jp/basic-gpu-tests

59d775b

tests: cover basic GPU allocation, misc improvements

tests: Use BATS_TEST_TMPDIR and failfast on errors during cleanup

3babfe5

Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>

Merge pull request NVIDIA#711 from shivamerla/add_gpu_stress_tests

5443e0f

tests: Add separate targets for GPU plugin tests + add stress tests

Merge pull request NVIDIA#668 from varunrsekar/vfio-support-1.33

55fc7b0

Support VFIO passthrough

jgehrcke and others added 23 commits February 2, 2026 07:32

Merge pull request NVIDIA#847 from NVIDIA/dependabot/docker/deploymen…

7d3693f

…ts/container/main/nvidia/distroless/cc-v4.0.1-dev build(deps): bump nvidia/distroless/cc from v4.0.0-dev to v4.0.1-dev in /deployments/container

tests: add MIG+timeslicing test

57bdc5b

tests: add tests/bats/specs/gpu-simple-mig-ts.yaml Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: reduce stress test duration in CI from 3 min to 15 s

c935442

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> remove focus tag Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: add cleanup and comment

e5f8cdf

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: run more GPU tests in GHA CI (fastfeedback subset)

04ac013

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: cover GPU plugin upgrade (and fix it)

a463315

Also - run CD upgrade test in CI - debuggability improvements, fixes Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

dynmig: conditional NVML shutdown upon driver shutdown

cf50d41

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

gpu plugin: store MigLiveTuple in checkpoint as Concrete

a8be7fc

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

dynmig: review feedback: more panic and docstrings

0ba21b5

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: increase gpu test debuggability

b08cc87

minor change to tests (squash me) Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

dynmig: remove (cuda)driverVersion from dynmig partition attributes

486fe0a

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

dynmig: disable MIG mode in obliterateStaleMIGDevices()

89a994e

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: more selective debug log output

55d979f

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> tests: increase tail size (debug log) Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: nvmm: use per-kubeconfig cache

734dc0d

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

Merge pull request NVIDIA#852 from jgehrcke/jp/dynamic-mig-26-forward

4440980

Introduce dynamic MIG device management

github-project-automation bot added this to Planning Board: k8s-dra-driver-gpu Feb 8, 2026

github-project-automation bot moved this to Backlog in Planning Board: k8s-dra-driver-gpu Feb 8, 2026

AkshatDudeja77 added 2 commits February 8, 2026 14:45

cmd/gpu-kubelet-plugin: wrap MkdirAll errors with context

c421861

Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>

chore: DCO sign-off

e4830a5

Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>

AkshatDudeja77 force-pushed the wrap-mkdirall-errors branch from 71640c0 to e4830a5 Compare February 8, 2026 09:19

klueska added this to the Backlog milestone Feb 8, 2026

AkshatDudeja77 wants to merge 249 commits into NVIDIA:release-25.8 from AkshatDudeja77:wrap-mkdirall-errors
