[GFD] Prevent GFD from crashing when device is unhealthy#1510
[GFD] Prevent GFD from crashing when device is unhealthy#1510ArangoGutierrez wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
elezar
left a comment
There was a problem hiding this comment.
How is what you're doing here different to just ignoring errors during labelling?
1461168 to
db81e51
Compare
PTAL |
db81e51 to
8769dd8
Compare
Make Merge() resilient to individual labeler failures by logging errors as warnings and continuing with remaining labelers. This prevents GFD from crashing when devices go unhealthy (e.g., XID errors) and allows partial label generation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
8769dd8 to
df34cb8
Compare
| labels, err := labeler.Labels() | ||
| if err != nil { | ||
| return nil, fmt.Errorf("error generating labels: %v", err) | ||
| klog.Warningf("Labeler at index %d failed, skipping: %v", i, err) |
There was a problem hiding this comment.
Just a nit: is the index useful in this case? Should we add a Name() or String() function to the labeler?
| labels, err := labeler.Labels() | ||
| if err != nil { | ||
| return nil, fmt.Errorf("error generating labels: %v", err) | ||
| klog.Warningf("Labeler at index %d failed, skipping: %v", i, err) |
There was a problem hiding this comment.
Also, does it make sense to use:
| klog.Warningf("Labeler at index %d failed, skipping: %v", i, err) | |
| klog.ErrorS(err, "skipping failed labeller", "index", i) |
insread?
| ) | ||
|
|
||
| // mockLabeler is a test double for the Labeler interface | ||
| type mockLabeler struct { |
There was a problem hiding this comment.
Question: Why not use moq to generatre an fully-featured mock?
elezar
left a comment
There was a problem hiding this comment.
Note that since the errors are thrown when the NVML labeller is constructed this is not sufficient to address the issue that we are seeing. To confirm, use a mock NVML that raises errors in any of the functions that are used.
|
Closing in favor of #1561 |
Modified Merge() to log and skip failed labelers instead of aborting the entire pipeline. When a device fails (XID error, GPU lost, etc.), GFD continues with healthy devices and generates partial labels.