WIP: Refactor health checks#1538

Draft
elezar wants to merge 2 commits into NVIDIA:main from elezar:refactor-health

Conversation

@elezar
Member

@elezar elezar commented Nov 26, 2025

Before starting a more serious refactor as in #1508, it may be useful to:

  1. Add some basic unit testing for health checking behaviour.
  2. Perform minor code reorganisation to move from what we currently have towards what we want.

This is a demonstrator and would still need additional work to reach the same point as #1508. The idea is that we get there more gradually, with fewer unrelated changes.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change is a minor refactor of the NVML health monitor.
It groups similar functionality together so as to make further
extensions / changes simpler.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@ArangoGutierrez
Collaborator

Nice draft, LGTM. Looking forward to when it's ready for review.

@uristernik
Contributor

Combining health checks with a startup probe could help with the deadlock I described in #1540 and fixed in #1541.

@ArangoGutierrez
Collaborator

I put together #1554, a combination of this PR and #1508. I have it as a Draft for now while we discuss it.

ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 12, 2026
…heckHealth

- Add withDevicePlacements wrapper struct for testable device placement
- Add TestCheckHealth test using dgxa100 mock server
- Update checkHealth to use the wrapper pattern
- Vendor go-nvml mock packages including dgxa100

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
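The wrapper pattern named in the commit message can be sketched roughly as follows. The actual `withDevicePlacements` type is not shown in this thread, so everything here is an assumption: `deviceLib`, `withPlacements`, `pluginID`, and `mockLib` are hypothetical names, and the mock is hand-written rather than the vendored go-nvml `dgxa100` mock.

```go
package main

import "fmt"

// deviceLib is a hypothetical stand-in for the library interface the health
// check depends on; the real code talks to NVML.
type deviceLib interface {
	DeviceCount() int
}

// withPlacements is a hypothetical wrapper: it embeds a deviceLib and adds a
// lookup from a device UUID to the ID under which the plugin advertises it.
type withPlacements struct {
	deviceLib
	placements map[string]string
}

// pluginID returns the advertised ID for a UUID, and false if it is unknown.
func (w withPlacements) pluginID(uuid string) (string, bool) {
	id, ok := w.placements[uuid]
	return id, ok
}

// mockLib is a minimal test double with a fixed device count.
type mockLib struct{ count int }

func (m mockLib) DeviceCount() int { return m.count }

func main() {
	w := withPlacements{
		deviceLib:  mockLib{count: 8},
		placements: map[string]string{"GPU-aaaa": "0"},
	}
	id, ok := w.pluginID("GPU-aaaa")
	fmt.Println(w.DeviceCount(), id, ok)
}
```

Because the wrapper embeds the interface, a test can swap in a mock library and supply placements directly, without enumerating real devices.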
3 participants