fix(trainer): return TRAINJOB_COMPLETE when all steps are done by priyank766 · Pull Request #340 · kubeflow/sdk

priyank766 · 2026-02-28T06:00:36Z

What this PR does / why we need it:
LocalProcessBackend.__get_job_status() returns TRAINJOB_CREATED when all steps have finished, instead of TRAINJOB_COMPLETE. As a result, wait_for_job_status() always times out (after 600s) on the local backend, even when jobs complete successfully. This PR is a one-line fix in the else branch so that it returns the correct status.
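To illustrate the failure mode, here is a minimal sketch of the status aggregation. The constants are stand-ins and `aggregate_status` is a hypothetical helper. The branch order and string values are assumptions rather than the SDK's actual code. Only the final fall-through mirrors the change in this PR.

```python
# Stand-in constants (assumed values); the real ones live in the SDK's constants module.
TRAINJOB_CREATED = "Created"
TRAINJOB_RUNNING = "Running"
TRAINJOB_COMPLETE = "Complete"
TRAINJOB_FAILED = "Failed"


def aggregate_status(step_statuses, fixed=True):
    """Derive a job status from per-step statuses (hypothetical helper)."""
    if any(s == TRAINJOB_FAILED for s in step_statuses):
        return TRAINJOB_FAILED
    if any(s == TRAINJOB_RUNNING for s in step_statuses):
        return TRAINJOB_RUNNING
    if any(s == TRAINJOB_CREATED for s in step_statuses):
        return TRAINJOB_CREATED
    # Fall-through: no step is failed, running, or created, so all are complete.
    # fixed=False reproduces the old behaviour described above.
    return TRAINJOB_COMPLETE if fixed else TRAINJOB_CREATED


steps = [TRAINJOB_COMPLETE, TRAINJOB_COMPLETE]
before = aggregate_status(steps, fixed=False)  # old behaviour
after = aggregate_status(steps, fixed=True)    # behaviour with this fix
```

With every step complete, the old fall-through reports the job as created, so a caller waiting for completion never sees it.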

Which issue(s) this PR fixes:

Fixes #338, #315

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2026-02-28T06:00:42Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-02-28T06:00:46Z

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Slack: Join our #kubeflow-ml-experience and #kubeflow-trainer Slack channels
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

This PR fixes a one-line bug in LocalProcessBackend.__get_job_status() where the else branch (reached when all steps are in TRAINJOB_COMPLETE state) incorrectly returned TRAINJOB_CREATED instead of TRAINJOB_COMPLETE. This caused wait_for_job_status() to always time out (after 600 seconds) on the local backend, even for successfully completed jobs.

Changes:

Fix the else branch of __get_job_status to return TRAINJOB_COMPLETE instead of TRAINJOB_CREATED when all steps have finished successfully.
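The timeout described in the overview can be reproduced with a generic polling loop. The sketch below is not the SDK's wait_for_job_status(). `wait_for_status` and its parameters are hypothetical, and the timeout is shortened so the example finishes quickly.

```python
import time


def wait_for_status(get_status, wanted, timeout=600.0, polling_interval=2.0):
    """Poll get_status() until it returns a value in `wanted` or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in wanted:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status stayed at {status!r} for {timeout}s")
        time.sleep(polling_interval)


# A finished job that still reports "Created" can never satisfy the wait.
try:
    wait_for_status(lambda: "Created", {"Complete"}, timeout=0.05, polling_interval=0.01)
    timed_out = False
except TimeoutError:
    timed_out = True

# Once the status source reports "Complete", the wait returns immediately.
reached = wait_for_status(lambda: "Complete", {"Complete"}, timeout=0.05, polling_interval=0.01)
```

The loop itself is fine in both cases. The wait only fails because the status source never produces the value being waited on.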

Copilot · 2026-02-28T06:02:06Z

kubeflow/trainer/backends/localprocess/backend.py

            status = constants.TRAINJOB_CREATED
        else:
-            status = constants.TRAINJOB_CREATED
+            status = constants.TRAINJOB_COMPLETE


The fix correctly addresses the bug, but no test case covers the scenario where all steps reach TRAINJOB_COMPLETE status — which is exactly the path being fixed. The existing test_wait_for_job_status only tests the non-existent job error case. A test that mocks all step statuses as TRAINJOB_COMPLETE and asserts that __get_job_status (or indirectly get_job) returns TRAINJOB_COMPLETE would prevent this regression from recurring.
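A regression test along these lines might look like the sketch below. `FakeLocalBackend`, the `steps` attribute, and the constant values are hypothetical stand-ins so the example runs without the SDK. A real test would exercise LocalProcessBackend directly, or go through get_job. One detail to keep in mind is that a double-underscore method is name-mangled, so outside the class it is reached as `_<ClassName>__get_job_status`.

```python
from types import SimpleNamespace

# Stand-in constants (assumed values).
TRAINJOB_CREATED = "Created"
TRAINJOB_COMPLETE = "Complete"


class FakeLocalBackend:
    """Stand-in for the real backend; only the all-complete path is modelled."""

    def __get_job_status(self, job):
        if any(step.status != TRAINJOB_COMPLETE for step in job.steps):
            return TRAINJOB_CREATED
        return TRAINJOB_COMPLETE


def test_all_steps_complete_reports_complete():
    # Build a job whose steps are all complete.
    job = SimpleNamespace(
        steps=[SimpleNamespace(status=TRAINJOB_COMPLETE) for _ in range(3)]
    )
    backend = FakeLocalBackend()
    # Private methods are name-mangled to _<ClassName>__<method>.
    status = backend._FakeLocalBackend__get_job_status(job)
    assert status == TRAINJOB_COMPLETE
    return status


result = test_all_steps_complete_reports_complete()
```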

priyank766 · 2026-02-28T06:15:26Z

@andreyvelich
@astefanutti
@jaiakash

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

fix(local): return TRAINJOB_COMPLETE when all steps are done (kubeflo…

8a2611e

…w#338) Signed-off-by: priyank <priyank8445@gmail.com>

Copilot AI review requested due to automatic review settings February 28, 2026 06:00

google-oss-prow bot requested review from Electronic-Waste, kramaranya and szaher February 28, 2026 06:00

google-oss-prow bot added the size/XS label Feb 28, 2026

Copilot started reviewing on behalf of priyank766 February 28, 2026 06:00

priyank766 changed the title ~~fix(local): return TRAINJOB_COMPLETE when all steps are done (#338)~~ fix(local): return TRAINJOB_COMPLETE when all steps are done Feb 28, 2026

Copilot AI reviewed Feb 28, 2026

priyank766 changed the title ~~fix(local): return TRAINJOB_COMPLETE when all steps are done~~ fix(local): return TRAINJOB_COMPLETE when all steps are done Feb 28, 2026

priyank766 changed the title ~~fix(local): return TRAINJOB_COMPLETE when all steps are done~~ fix(trainer): return TRAINJOB_COMPLETE when all steps are done Feb 28, 2026

priyank766 requested a review from Copilot February 28, 2026 16:50

Copilot started reviewing on behalf of priyank766 February 28, 2026 16:50

Copilot AI reviewed Feb 28, 2026

priyank766 mentioned this pull request Mar 1, 2026

LocalProcessBackend.__get_job_status never returns Complete #315

Open

priyank766 wants to merge 1 commit into kubeflow:main from priyank766:fix/local-job-status-338

2 participants
