fix: LocalProcessBackend.__get_job_status never returns Complete #316

Open
kevo-1 wants to merge 5 commits into kubeflow:main from kevo-1:fix/job-status-fallback

kevo-1 commented Feb 22, 2026

Summary:
Fixes a fallthrough issue in LocalProcessBackend.__get_job_status where the status resolution logic had no condition for when all steps are Complete, causing the else branch to incorrectly return TRAINJOB_CREATED as a fallback.

Problem:
When all steps report TRAINJOB_COMPLETE, none of the if/elif branches matched, so wait_for_job_status() would never receive the Complete status and would always time out after 600 seconds on local development runs.
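To illustrate why a status that is never reported turns into a timeout, here is a minimal polling sketch. It is not the SDK's `wait_for_job_status()` implementation: the function name `wait_for_status`, its parameters, and the status strings are assumptions made for illustration only.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    expected: str,
    timeout: float = 600.0,
    polling_interval: float = 2.0,
) -> str:
    """Hypothetical polling loop: return once get_status() yields `expected`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == expected:
            return status
        time.sleep(polling_interval)
    # If the status source can never produce `expected`, this is the only exit.
    raise TimeoutError(f"status {expected!r} not reached within {timeout} seconds")
```

With a status source that keeps answering "Created" even though every step has finished, a loop of this shape can only leave through the `TimeoutError` path.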

Fix:
Added an explicit elif check before the else fallback:

elif all(s == constants.TRAINJOB_COMPLETE for s in statuses):
    status = constants.TRAINJOB_COMPLETE
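
For context, the resolution logic with the new branch can be sketched roughly as follows. This is a self-contained stand-in, not the SDK code: the constant values, the function name `resolve_job_status`, and the Failed/Running branches that precede the new check are assumptions. Only the all-complete `elif` and the `TRAINJOB_CREATED` fallback come from the description above.

```python
# Stand-in constants; the string values are assumptions, not taken from the SDK.
TRAINJOB_CREATED = "Created"
TRAINJOB_RUNNING = "Running"
TRAINJOB_COMPLETE = "Complete"
TRAINJOB_FAILED = "Failed"


def resolve_job_status(statuses: list[str]) -> str:
    """Hypothetical stand-in for the job status resolution."""
    # Assumed precedence: a failed step wins, then a running step.
    if TRAINJOB_FAILED in statuses:
        return TRAINJOB_FAILED
    elif TRAINJOB_RUNNING in statuses:
        return TRAINJOB_RUNNING
    # The branch this PR adds: every step reports Complete.
    elif all(s == TRAINJOB_COMPLETE for s in statuses):
        return TRAINJOB_COMPLETE
    # Fallback that previously also caught the all-complete case.
    else:
        return TRAINJOB_CREATED
```

One property of this sketch worth keeping in mind: `all()` over an empty list is `True`, so a job with no steps would resolve to Complete unless that case is handled separately.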

Testing:

  • Existing tests pass

Fixes: #315

Copilot AI review requested due to automatic review settings February 22, 2026 12:42
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@kevo-1 kevo-1 changed the title from "ix: LocalProcessBackend.__get_job_status never returns Complete" to "Fix: LocalProcessBackend.__get_job_status never returns Complete" Feb 22, 2026
@kevo-1 kevo-1 marked this pull request as draft February 22, 2026 12:45
@kevo-1 kevo-1 marked this pull request as ready for review February 22, 2026 12:45
@kevo-1 kevo-1 changed the title from "Fix: LocalProcessBackend.__get_job_status never returns Complete" to "fix: LocalProcessBackend.__get_job_status never returns Complete" Feb 22, 2026
Member

@andreyvelich andreyvelich left a comment

Thanks for this fix, @kevo-1!
Could you please add unit tests to verify that the Job has the correct status after running get_job(): https://github.com/kevo-1/sdk/blob/2427e0b30a6c79e82b9489478aa1923eee583d46/kubeflow/trainer/backends/localprocess/backend_test.py#L399

@google-oss-prow google-oss-prow bot added size/L and removed size/XS labels Feb 23, 2026
@kevo-1
Author

kevo-1 commented Feb 23, 2026

Thank you for the review @andreyvelich, I have just added the tests.

@kevo-1 kevo-1 force-pushed the fix/job-status-fallback branch from 382bdf9 to 29ae3d0 February 23, 2026 06:09
Signed-off-by: kevo-1 <kevin.bastawrous@gmail.com>
Signed-off-by: kevo-1 <kevin.bastawrous@gmail.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines 475 to 496
if mock_step_statuses:
    # Replace first step and add extra steps with proper LocalJob mocks
    first_mock = create_autospec(LocalJob, instance=True)
    first_mock.status = mock_step_statuses[0]
    registered_job.steps[0] = LocalBackendStep(
        step_name=registered_job.steps[0].step_name,
        job=first_mock,
    )
    for i, step_status in enumerate(mock_step_statuses[1:], start=1):
        extra_mock = create_autospec(LocalJob, instance=True)
        extra_mock.status = step_status
        registered_job.steps.append(
            LocalBackendStep(step_name=f"extra-step-{i}", job=extra_mock)
        )
else:
    # Replace the single step's job with a properly-typed mock
    mock_job = create_autospec(LocalJob, instance=True)
    mock_job.status = single_status
    registered_job.steps[0] = LocalBackendStep(
        step_name=registered_job.steps[0].step_name,
        job=mock_job,
    )

Copilot AI Feb 23, 2026

LocalBackendStep(job=first_mock/extra_mock/mock_job) passes unittest.mock objects where the field is typed as LocalJob. Pydantic will validate this as “instance of LocalJob”, so this is likely to raise a validation error or be brittle. Prefer mutating the existing LocalJob instance’s internal status (e.g., _status), or creating real LocalJob instances for additional steps, instead of using mocks for the job field.
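
The concern above hinges on whether an autospec'd mock satisfies an "instance of LocalJob" check. As a small standard-library-only experiment, the sketch below shows how `create_autospec(..., instance=True)` responds to a plain `isinstance()` test. It does not exercise Pydantic, and `FakeLocalJob` is a hypothetical stand-in for `LocalJob`, so it says nothing definitive about how the real `LocalBackendStep` model validates its `job` field.

```python
from unittest.mock import create_autospec


class FakeLocalJob:
    """Hypothetical stand-in for LocalJob; not the SDK class."""

    status: str = "Created"


# An autospec'd instance reports the spec class as its __class__,
# so a plain isinstance() check accepts it.
mock_job = create_autospec(FakeLocalJob, instance=True)
mock_job.status = "Complete"

is_instance = isinstance(mock_job, FakeLocalJob)
```

Here `is_instance` is `True` and `mock_job.status` reads back as the assigned value. Whether that is enough for the real model depends on its validation configuration, which is worth confirming against the actual test run.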

Signed-off-by: kevo-1 <kevin.bastawrous@gmail.com>
Contributor

@astefanutti astefanutti left a comment

Thanks @kevo-1!

/ok-to-test

@astefanutti
Contributor

@kevo-1 could you check code formatting please?

Signed-off-by: kevo-1 <kevin.bastawrous@gmail.com>
@kevo-1
Author

kevo-1 commented Feb 23, 2026

My bad on this one, @astefanutti. I have run the required formatting.

@kevo-1 kevo-1 requested a review from andreyvelich February 23, 2026 18:15
Signed-off-by: Kevin <154337845+kevo-1@users.noreply.github.com>