Skip to content

fix(trainer): make get_job_logs optionally fail fast when pod is missing#339

Open
sjiang83 wants to merge 3 commits intokubeflow:mainfrom
sjiang83:fix-317-get-job-logs-strict
Open

fix(trainer): make get_job_logs optionally fail fast when pod is missing#339
sjiang83 wants to merge 3 commits intokubeflow:mainfrom
sjiang83:fix-317-get-job-logs-strict

Conversation

@sjiang83
Copy link

Fixes #317

The Issue
Currently, when calling TrainerClient.get_job_logs() using the Kubernetes backend, it silently returns an empty iterator if a pod isn't found for the requested step. This usually happens if the step is still Pending or the selector just comes up empty. This behavior creates a lot of ambiguity, especially for MCP tools and AI agents that rely on clear feedback to know what is actually going on.

What this PR does
I've added an opt-in strict flag to the method:

  • strict=False (the default): Keeps the current behavior exactly as is, which might yield nothing.
  • strict=True: Explicitly raises a PodNotFoundError if no pod is found for the requested TrainJob or step.

I also updated the signatures across the other Trainer backends to accept the strict argument so the interface stays consistent (it simply ignores the flag for non-Kubernetes backends). I applied this same strict logic to the Optimizer Kubernetes backend log retrieval as well.

Why this helps MCP and Agents
For agents to debug effectively, they absolutely need to be able to tell the difference between three things:

  • The job exists but the logs just aren't ready or outputting yet.
  • The pod actually cannot be found (which we now catch explicitly with strict=True).
  • Standard permission or RBAC errors (which still bubble up from the Kubernetes client just like before).

By opting into strict=True, we get a much more deterministic observability flow without forcing a breaking change on existing SDK users.

Testing and Compatibility

  • Tests: I added a unit test to verify that strict=True correctly raises an error when all steps are still in the Pending state.
  • Backward compatibility: 100% safe. Since the default behavior remains unchanged, existing users won't notice any difference.

Copilot AI review requested due to automatic review settings February 28, 2026 02:59
@google-oss-prow
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Copy link
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@sjiang83 sjiang83 changed the title trainer: make get_job_logs optionally fail fast when pod is missing fix(trainer): make get_job_logs optionally fail fast when pod is missing Feb 28, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses issue #317 where get_job_logs() silently returned an empty iterator when no pod was found for the requested step, providing no way to distinguish "logs not ready" from "pod not found." The fix adds an opt-in strict flag to get_job_logs() across all backends.

Changes:

  • Adds strict: bool = False parameter to get_job_logs in all backends and TrainerClient, raising PodNotFoundError when strict=True and no pod is found.
  • Introduces a new PodNotFoundError(RuntimeError) class in the Kubernetes trainer backend.
  • Adds a unit test verifying that strict=True raises PodNotFoundError when all steps are pending.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
kubeflow/trainer/backends/kubernetes/backend.py Adds PodNotFoundError class and strict flag to get_job_logs
kubeflow/trainer/backends/base.py Adds strict to the abstract get_job_logs signature
kubeflow/trainer/backends/localprocess/backend.py Adds strict to get_job_logs signature (no-op)
kubeflow/trainer/backends/container/backend.py Adds strict to get_job_logs signature (no-op)
kubeflow/trainer/api/trainer_client.py Exposes strict on the public get_job_logs API
kubeflow/optimizer/backends/kubernetes/backend.py Adds strict flag and corresponding RuntimeError raise to optimizer's get_job_logs
kubeflow/trainer/backends/kubernetes/backend_test.py Adds unit test for strict=True behavior with pending pods

Signed-off-by: Shanhuizi Jiang <sjiang83@fordham.edu>
@sjiang83 sjiang83 force-pushed the fix-317-get-job-logs-strict branch from 6fa100c to f928e74 Compare February 28, 2026 03:04
Signed-off-by: Shanhuizi Jiang <sjiang83@fordham.edu>
@sjiang83 sjiang83 force-pushed the fix-317-get-job-logs-strict branch from 4e3170a to b4acf43 Compare February 28, 2026 03:32
@sjiang83
Copy link
Author

Just a quick heads-up that the CI workflows are paused waiting for a maintainer's approval.

Whenever someone has a second to kick off the unit and E2E tests, I'd really appreciate it!

@sjiang83
Copy link
Author

Just a quick heads-up that the CI workflows are paused waiting for a maintainer's approval.

Whenever someone has a second to kick off the unit and E2E tests, I'd really appreciate it!

Hey @andreyvelich, looks like this needs a quick /ok-to-test to kick off the checks. No rush, thanks!

@jaiakash
Copy link
Member

/ok-to-test

Signed-off-by: sjiang83 <sjiang83@fordham.edu>
@sjiang83
Copy link
Author

All the CI tests are green across the board now.

Just waiting on an LGTM and an approve label whenever a maintainer has a moment to take a look. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get_job_logs silently returns empty iterator when no pod is found

3 participants