fix(trainer): return TRAINJOB_COMPLETE when all steps are done#340
fix(trainer): return TRAINJOB_COMPLETE when all steps are done#340priyank766 wants to merge 1 commit intokubeflow:mainfrom
Conversation
…w#338) Signed-off-by: priyank <priyank8445@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
This PR fixes a one-line bug in LocalProcessBackend.__get_job_status() where the else branch (reached when all steps are in TRAINJOB_COMPLETE state) incorrectly returned TRAINJOB_CREATED instead of TRAINJOB_COMPLETE. This caused wait_for_job_status() to always time out (after 600 seconds) on the local backend, even for successfully completed jobs.
Changes:
- Fix the
elsebranch of__get_job_statusto returnTRAINJOB_COMPLETEinstead ofTRAINJOB_CREATEDwhen all steps have finished successfully.
| status = constants.TRAINJOB_CREATED | ||
| else: | ||
| status = constants.TRAINJOB_CREATED | ||
| status = constants.TRAINJOB_COMPLETE |
There was a problem hiding this comment.
The fix correctly addresses the bug, but no test case covers the scenario where all steps reach TRAINJOB_COMPLETE status — which is exactly the path being fixed. The existing test_wait_for_job_status only tests the non-existent job error case. A test that mocks all step statuses as TRAINJOB_COMPLETE and asserts that __get_job_status (or indirectly get_job) returns TRAINJOB_COMPLETE would prevent this regression from recurring.
What this PR does / why we need it:
LocalProcessBackend.__get_job_status()returnsTRAINJOB_CREATEDwhen all steps have finished, instead ofTRAINJOB_COMPLETE. This causes wait_for_job_status() to always timeout (600s) on the local backend even when jobs complete successfully. This is a one-line fix in theelsebranch to return the correct status.Which issue(s) this PR fixes:
Fixes #338, #315
Checklist: