Conversation
internal/plugin/server.go (Outdated)
    }()

    // Start recovery worker to detect when unhealthy devices become healthy
    go plugin.runRecoveryWorker()
Can we split the refactoring (that doesn't add any new behaviour) into a different PR from the one that adds devices becoming healthy again?

Sounds like a good idea, even more so based on your other comment #1508 (review). I wanted a refactor, but that interface is a different conversation. Going to work on splitting this PR.
elezar left a comment
In the context of the k8s-dra-driver-gpu we discussed the interface that we would expect a DeviceHealthCheckProvider to have. Where is that considered here? From the perspective of the device plugin (or its associated ResourceManager), I would expect a DeviceHealthCheckProvider to be instantiated, and we would develop against this interface.
As I discussed in NVIDIA/k8s-dra-driver-gpu#689 I would expect this interface to look something like:
    type DeviceHealthCheckProvider interface {
        Start(context.Context) error
        Stop()
        Health() <-chan Device
    }
(Alternatively, one could split the Health channel into Healthy() and Unhealthy().)
internal/plugin/server.go (Outdated)
    // If health provider not available, wait for context cancellation
    if plugin.healthProvider == nil {
        <-plugin.ctx.Done()
        return nil
    }
Under which conditions is the healthProvider nil? Could we not rather ALWAYS use at least a "no-op" healthProvider to ensure that we don't need to special case this here or at any point where we call Start or Stop?
thanks for the suggestion, adopted
internal/rm/health.go (Outdated)
    - // envDisableHealthChecks defines the environment variable that is checked to determine whether healthchecks
    - // should be disabled. If this envvar is set to "all" or contains the string "xids", healthchecks are
    - // disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that
    - // this is in addition to the Application errors that are already ignored.
    + // envDisableHealthChecks defines the environment variable that is
    + // checked to determine whether healthchecks should be disabled. If
    + // this envvar is set to "all" or contains the string "xids",
    + // healthchecks are disabled entirely. If set, the envvar is treated
    + // as a comma-separated list of Xids to ignore. Note that this is in
    + // addition to the Application errors that are already ignored.
This is a nit: for complex refactorings, keeping changes to a minimum is important, as it reduces the noise and lets us focus on the changes. In cases like these, we should update these comments as a separate commit.

Now in a [no-relnote] commit.
internal/rm/health.go (Outdated)
    nvml:       nvml,
    config:     config,
    devices:    devices,
    healthChan: make(chan *Device, 64),
The size would be len(devices) × 4, but I thought 64 was a safe hard-coded number as it covers all possible len(devices) sizes.
internal/rm/health.go (Outdated)
    if p.started {
        p.mu.Unlock()
        return fmt.Errorf("health provider already started")
    }
    p.started = true
    p.mu.Unlock()
Any reason to not defer p.mu.Unlock() instead?

defer would be simpler but slower: using defer would hold the mutex during NVML initialization, event set creation, and device registration, blocking other operations (like Stop()).

I don't agree that "slowness" is something we should optimize for. Start() and Stop() should not be running concurrently. As implemented, because we release the lock before Start() has completed (and similarly for Stop()), the remaining code may end up overlapping, which is not what we want.
Also note that we set started to true before the health monitor is actually ready, and this is not reset in the event of an error.

Suggestion adopted.
internal/rm/health.go (Outdated)
    wg sync.WaitGroup

    // State guards
    mu sync.Mutex
We could use an is-a relationship to simplify taking and releasing the lock:
    - mu sync.Mutex
    + sync.Mutex
Thanks for the suggestion, adopted.
internal/rm/health.go (Outdated)
    ret := p.nvml.Init()
    if ret != nvml.SUCCESS {
    -     if *r.config.Flags.FailOnInitError {
    +     if *p.config.Flags.FailOnInitError {
Nit: let's not rename r to p in a single commit (see comment on managing diffs).

Thanks for the suggestion, adopted; now an independent commit.
internal/rm/health.go (Outdated)
    p.xidsDisabled = getDisabledHealthCheckXids()
    if p.xidsDisabled.IsAllDisabled() {
        klog.Info("Health checks disabled via DP_DISABLE_HEALTHCHECKS")
        return nil
    }
This should happen at construction and not as Start is called. If all health checks are disabled, we should return a no-op HealthProvider.

Thanks for the suggestion, adopted.

I see that you have pulled this up to the func (r *nvmlResourceManager) HealthProvider() HealthProvider implementation. This means that we have to construct the list of disabled XIDs twice. Why not handle this (and other static config) in the NewNVMLHealthProvider constructor instead?

Moved to NewNVMLHealthProvider now.
    klog.Warningf("NVML init failed: %v; health checks disabled", ret)
    return nil
    }
    defer func() {
Could you explain the move away from a deferred shutdown?

All error paths after Init() now have individual cleanup logic.

I think this is more error-prone than it needs to be. Note that we may want to distinguish between acceptable errors and non-acceptable ones. As was the case for the related k8s-dra-driver PR, we may want to call Init() in the constructor and handle these errors (with deferred Shutdown()) separately from errors that are triggered during Start.
For example, what about updating the constructor to:
    // newNVMLHealthProvider creates a new health provider for NVML devices.
    // It does not start monitoring - the caller must call Start().
    func newNVMLHealthProvider(nvmllib nvml.Interface, config *spec.Config, devices Devices) (HealthProvider, error) {
        xids := getDisabledHealthCheckXids()
        if xids.IsAllDisabled() {
            return &noopHealthProvider{}, nil
        }
        ret := nvmllib.Init()
        if ret != nvml.SUCCESS {
            if *config.Flags.FailOnInitError {
                return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
            }
            klog.Warningf("NVML init failed: %v; health checks disabled", ret)
            return &noopHealthProvider{}, nil
        }
        defer func() {
            ret := nvmllib.Shutdown()
            if ret != nvml.SUCCESS {
                klog.Infof("Error shutting down NVML: %v", ret)
            }
        }()
        klog.Infof("Ignoring the following XIDs for health checks: %v", xids)
        p := &nvmlHealthProvider{
            nvml:         nvmllib,
            config:       config,
            devices:      devices,
            unhealthy:    make(chan *Device, 64),
            xidsDisabled: xids,
        }
        return p, nil
    }
Note that we return a noopHealthProvider if we are tolerant of Init errors at this point. However, we update Start() to look something like:
    // Start initializes NVML, registers event handlers, and starts the
    // monitoring goroutine. It blocks until initialization completes.
    func (r *nvmlHealthProvider) Start(ctx context.Context) (rerr error) {
        r.Lock()
        defer r.Unlock()
        if r.started {
            // TODO: Is this an error condition? Could we just return?
            return fmt.Errorf("health provider already started")
        }
        // Initialize NVML
        ret := r.nvml.Init()
        if ret != nvml.SUCCESS {
            return fmt.Errorf("failed to initialize NVML: %v", ret)
        }
        defer func() {
            if rerr != nil {
                _ = r.nvml.Shutdown()
            }
        }()
        // Create event set
        eventSet, ret := r.nvml.EventSetCreate()
        if ret != nvml.SUCCESS {
            return fmt.Errorf("failed to create event set: %v", ret)
        }
        defer func() {
            if rerr != nil {
                _ = eventSet.Free()
            }
        }()
        // Register devices
        if err := r.registerDevices(eventSet); err != nil {
            return fmt.Errorf("failed to register devices: %w", err)
        }
        klog.Infof("Health monitoring started for %d devices", len(r.devices))
        // Create child context
        r.ctx, r.cancel = context.WithCancel(ctx)
        // Start monitoring goroutine
        r.wg.Add(1)
        go r.runEventMonitor(eventSet)
        r.started = true
        return nil
    }
Where we take actions immediately in the event of failure so that we don't have to rely on cleanup() being called eventually for these resources.
Thanks for the detailed suggestion, adopted.
    }
    return fmt.Errorf("failed to create event set: %v", ret)
    }
    defer func() {
Is there a reason that we don't use the deferred cleanup here?

All error paths after Init() now have individual cleanup logic.
internal/rm/tegra_manager.go (Outdated)
    }

    func (n *noopHealthProvider) Start(context.Context) error {
        n.healthChan = make(chan *Device)
Why not just do this at construction?
Also, do we need to actually create a channel? Can we not leave it nil?
    r.Lock()
    if r.started {
        r.Unlock()
        return fmt.Errorf("health provider already started")
Why is this an error? Assuming we only set started once the HealthProvider / HealthMonitor has been successfully started, is there any harm in calling it again?

Agreed; it now returns nil.
internal/rm/health.go (Outdated)
    // Get XID filter configuration
    r.xidsDisabled = getDisabledHealthCheckXids()
This should be moved to construction.
internal/rm/health.go (Outdated)
    - e, ret := eventSet.Wait(5000)
    + // Wait for NVML event (5 second timeout)
    + event, ret := r.eventSet.Wait(5000)
internal/rm/health.go (Outdated)
    continue
    }
    // Create child context
    r.ctx, r.cancel = context.WithCancel(ctx)
Could you comment on whether we need to add WithCancel to the context if it already has this done for the plugin?

We now use the plugin context.
internal/plugin/server.go (Outdated)
    }()
    // Initialize and start health provider
    plugin.ctx, plugin.cancel = context.WithCancel(context.Background())
    plugin.healthProvider = plugin.rm.HealthProvider()
This should be moved to the constructor.
internal/plugin/server.go (Outdated)
    socket: getPluginSocketPath(resourceManager.Resource()),
    - // These will be reinitialized every
    - // time the plugin server is restarted.
    + // server and healthProvider will be reinitialized every time
Why can't we instantiate the healthProvider here? Why is it required to reinitialize it on every start?
Would the following be valid?
    - // server and healthProvider will be reinitialized every time
    + healthProvider: resourceManager.HealthProvider(),
    + // server and healthProvider will be reinitialized every time
internal/plugin/server.go (Outdated)
    }
    }()
    // Initialize and start health provider
    plugin.ctx, plugin.cancel = context.WithCancel(context.Background())
At which point do we rather pass in a ctx and handle the Cancel() call externally?

We now use the plugin context.
Extract device health checking logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
- Add HealthProvider interface (Start/Stop/Health methods)
- Implement nvmlHealthProvider with WaitGroup coordination
- Update ResourceManager to return HealthProvider instead of CheckHealth
- Update device plugin to use HealthProvider
- Add no-op implementation for Tegra devices
This refactoring improves code modularity and testability without changing existing behavior. It prepares the foundation for future device recovery features.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This PR is stale because it has been open 90 days with no activity. This PR will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
This patch refactors the device health check system by extracting the logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
No behavior changes - this is a pure refactoring to improve code modularity and testability.