fix: LocalProcessBackend.__get_job_status never returns Complete#316
fix: LocalProcessBackend.__get_job_status never returns Complete#316kevo-1 wants to merge 5 commits intokubeflow:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
andreyvelich
left a comment
There was a problem hiding this comment.
Thanks for this fix @kevo-1!
Please can you add unit tests to verify that Job has the correct statuses after running get_job(): https://github.com/kevo-1/sdk/blob/2427e0b30a6c79e82b9489478aa1923eee583d46/kubeflow/trainer/backends/localprocess/backend_test.py#L399
|
Thank you for the review @andreyvelich, I have just added the tests. |
382bdf9 to
29ae3d0
Compare
Signed-off-by: kevo-1 <kevin.bastawrous@gmail.com>
Signed-off-by: kevo-1 <kevin.bastawrous@gmail.com>
29ae3d0 to
2eea1b9
Compare
| if mock_step_statuses: | ||
| # Replace first step and add extra steps with proper LocalJob mocks | ||
| first_mock = create_autospec(LocalJob, instance=True) | ||
| first_mock.status = mock_step_statuses[0] | ||
| registered_job.steps[0] = LocalBackendStep( | ||
| step_name=registered_job.steps[0].step_name, | ||
| job=first_mock, | ||
| ) | ||
| for i, step_status in enumerate(mock_step_statuses[1:], start=1): | ||
| extra_mock = create_autospec(LocalJob, instance=True) | ||
| extra_mock.status = step_status | ||
| registered_job.steps.append( | ||
| LocalBackendStep(step_name=f"extra-step-{i}", job=extra_mock) | ||
| ) | ||
| else: | ||
| # Replace the single step's job with a properly-typed mock | ||
| mock_job = create_autospec(LocalJob, instance=True) | ||
| mock_job.status = single_status | ||
| registered_job.steps[0] = LocalBackendStep( | ||
| step_name=registered_job.steps[0].step_name, | ||
| job=mock_job, | ||
| ) |
There was a problem hiding this comment.
LocalBackendStep(job=first_mock/extra_mock/mock_job) passes unittest.mock objects where the field is typed as LocalJob, and Pydantic will validate this as “instance of LocalJob”, so this is likely to raise a validation error / be brittle; prefer mutating the existing LocalJob instance’s internal status (e.g., _status) or creating real LocalJob instances for additional steps instead of using mocks for the job field.
Signed-off-by: kevo-1 <kevin.bastawrous@gmail.com>
astefanutti
left a comment
There was a problem hiding this comment.
Thanks @kevo-1!
/ok-to-test
|
@kevo-1 could you check code formatting please? |
Signed-off-by: kevo-1 <kevin.bastawrous@gmail.com>
|
My bad on this one @astefanutti , I ran the required formats. |
Signed-off-by: Kevin <154337845+kevo-1@users.noreply.github.com>
Summary:
Fixes a fallthrough issue in
LocalProcessBackend.__get_job_statuswhere the status resolution logic had no condition for when all steps areComplete, causing theelsebranch to incorrectly returnTRAINJOB_CREATEDas a fallback.Problem:
When all steps report
TRAINJOB_COMPLETE, none of theif/elifbranches matched, sowait_for_job_status()would never receive theCompletestatus and would always time out after 600 seconds on local development runs.Fix:
Added an explicit
elifcheck before theelsefallback:Testing:
Fixes: #315