Conversation

@Vasuk12 (Contributor) commented Dec 12, 2025

Issue #397

  • Add ServerShutdownError exception for graceful shutdown handling
  • Detect server shutdown via health check and set _server_shutting_down flag
  • Convert ServerDisconnectedError to ServerShutdownError during shutdown
  • Catch ServerShutdownError in runner and log at debug level (no traceback)

Signed-off-by: Vasu <vasukhare88@gmail.com>
Copilot AI review requested due to automatic review settings December 12, 2025 03:30
Copilot AI left a comment

Pull request overview

This PR implements graceful handling of server shutdown scenarios to eliminate traceback spam in logs. It introduces a new ServerShutdownError exception that is raised when the server is detected to be shutting down, allowing the runner to handle these situations gracefully with debug-level logging instead of full exception traces.

Key Changes:

  • Introduced ServerShutdownError exception for distinguishing shutdown scenarios from other errors
  • Added server health monitoring via _server_shutting_down flag that is set when health checks fail
  • Implemented conversion from ServerDisconnectedError to ServerShutdownError during detected shutdowns
  • Added exception handlers in the runner to catch ServerShutdownError and log at debug level without tracebacks
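
A minimal sketch of the client-side flow the list above describes, assuming an aiohttp-based client. The names `_wait_until_healthy`, `_request_json`, and `_server_shutting_down` come from the PR description; the surrounding class, URLs, and log messages are illustrative, not the actual diff:

```python
import logging

import aiohttp

logger = logging.getLogger(__name__)


class ServerShutdownError(Exception):
    """Raised when a request fails because the store server is shutting down."""


class StoreClient:
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url
        self._server_shutting_down: bool = False

    async def _wait_until_healthy(self, session: aiohttp.ClientSession) -> None:
        try:
            async with session.get(f"{self.base_url}/health") as resp:
                resp.raise_for_status()
        except aiohttp.ClientError:
            # Health check failed: remember that the server is going down and
            # log a warning instead of an error.
            self._server_shutting_down = True
            logger.warning("Store server health check failed; assuming shutdown.")

    async def _request_json(self, session: aiohttp.ClientSession, path: str) -> dict:
        try:
            async with session.get(f"{self.base_url}{path}") as resp:
                return await resp.json()
        except aiohttp.ServerDisconnectedError as exc:
            if self._server_shutting_down:
                # Convert to the typed error so callers can handle it quietly.
                raise ServerShutdownError("store server is shutting down") from exc
            raise
```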

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| agentlightning/store/client_server.py | Defines the ServerShutdownError exception; adds a _server_shutting_down flag to track server state; updates _wait_until_healthy to set the shutdown flag and log a warning instead of an error; modifies _request_json to convert ServerDisconnectedError to ServerShutdownError when a shutdown is detected |
| agentlightning/runner/agent.py | Imports ServerShutdownError; adds exception handling in _post_process_rollout_result, _step_impl, and iter to catch and log ServerShutdownError at debug level without tracebacks |
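
The runner-side pattern described in the second row might look roughly like this; the import path follows the file listed in the table, but the `AgentRunner` class and the method bodies are illustrative rather than the actual implementation:

```python
import logging

# Import path follows the file listed above; verify against the actual module.
from agentlightning.store.client_server import ServerShutdownError

logger = logging.getLogger(__name__)


class AgentRunner:
    async def _post_process_rollout_result(self, rollout) -> None:
        ...  # writes spans back to the store; may raise ServerShutdownError

    async def _step_impl(self, rollout) -> None:
        try:
            await self._post_process_rollout_result(rollout)
        except ServerShutdownError:
            # Expected during shutdown: log at debug level, no traceback.
            logger.debug("Store server is shutting down; skipping post-processing.")
```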

@Vasuk12 (Contributor, Author) commented Dec 12, 2025

This fix handles both ServerDisconnectedError and CancelledError during server shutdown...

@ultmaster (Contributor) commented:

Thanks for taking a look at this issue. I think we need more discussion on how to correctly solve it.

Option 1: We can add a multiprocess-shared value/flag called server_online. It is set to true only after the server has launched, and set back to false before the server is stopped. Whenever a client request fails, the client checks this value; if it says the server should not be online right now, the client should not be surprised and can silently ignore the error.
Pros: I think we can get an accurate picture of the server status.
Cons: It only works with the client-server execution strategy. For example, the isolated store mode won't benefit from this solution.
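
A rough sketch of what Option 1 could look like with multiprocessing.Value; all the names here (server_online, run_server, serve_forever, on_request_error) are hypothetical illustrations, not proposed API:

```python
import multiprocessing as mp

# lock=False keeps reads cheap: there is a single writer (the server process)
# and clients only ever read the byte. The flag must be created in the parent
# and passed to child processes so they all share the same memory.
server_online = mp.Value("b", 0, lock=False)


def serve_forever() -> None:
    """Placeholder for the real server loop."""


def run_server(flag) -> None:
    # Set to true only after the server has actually launched...
    flag.value = 1
    try:
        serve_forever()
    finally:
        # ...and back to false before the server stops accepting requests.
        flag.value = 0


def on_request_error(flag, exc: Exception) -> None:
    if not flag.value:
        # The server is expected to be offline: silently ignore the error.
        return
    raise exc
```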

Option 2: Cleverly handle all the exception types within the client, and tell the expected issues apart from the real ones. This requires a good understanding of the aiohttp implementation, which, unfortunately, I don't have yet.
Pros: Sounds perfect.
Cons: Not sure whether it's doable.
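
For reference, a hedged sketch of the kind of classification Option 2 would need; whether this exact set of aiohttp exception types is sufficient (or too broad) is precisely the open question:

```python
import asyncio

import aiohttp

# Exceptions a client typically sees when the server goes away mid-flight.
# Treating these as "expected during shutdown" is the judgment call Option 2
# has to get right; this list is an assumption, not a verified taxonomy.
EXPECTED_DURING_SHUTDOWN = (
    aiohttp.ServerDisconnectedError,  # server closed the connection
    aiohttp.ClientConnectorError,     # could not connect at all
    asyncio.CancelledError,           # in-flight task cancelled on teardown
)


def is_expected_shutdown_error(exc: BaseException) -> bool:
    # Note: CancelledError usually still needs to be re-raised after cleanup.
    return isinstance(exc, EXPECTED_DURING_SHUTDOWN)
```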

So what do you think?

```python
    await store.add_otel_span(rollout.rollout_id, rollout.attempt.attempt_id, reward_span)
except (ServerShutdownError, asyncio.CancelledError):
    # Server is shutting down or request was cancelled - handle gracefully without traceback
    logger.debug(
```

I think this isn't the only place with such an issue. We'd better handle it in the store client.

```python
self._dequeue_first_unsuccessful: bool = True

# Track server shutdown state to handle errors gracefully
self._server_shutting_down: bool = False
```

How about having a _server_online shared flag, so we can sunset hacky flags like _dequeue_first_unsuccessful?

@Vasuk12 (Contributor, Author) commented Dec 12, 2025

> Thanks for taking a look at this issue. I think we need more discussion on how to correctly solve it.
>
> Option 1: We can add a multiprocess-shared value/flag called server_online. It is set to true only after the server has launched, and set back to false before the server is stopped. Whenever a client request fails, the client checks this value; if it says the server should not be online right now, the client should not be surprised and can silently ignore the error. Pros: I think we can get an accurate picture of the server status. Cons: It only works with the client-server execution strategy. For example, the isolated store mode won't benefit from this solution.
>
> Option 2: Cleverly handle all the exception types within the client, and tell the expected issues apart from the real ones. This requires a good understanding of the aiohttp implementation, which, unfortunately, I don't have yet. Pros: Sounds perfect. Cons: Not sure whether it's doable.
>
> So what do you think?

I think Option 1 is more feasible for client-server mode: instead of handling all the exception types, we only need to check a shared flag. I've added Option 2, which works for both client-server execution and external store servers, but Option 1 would be faster and more accurate for client-server execution. I think we should keep Option 2 as the fallback for external store servers (where a shared flag isn't available) and add Option 1 as an enhancement for client-server execution mode. Is that alright?
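
A sketch of how the two options could combine, with the shared flag taking precedence when available; should_suppress_error and the flag plumbing are hypothetical names for illustration:

```python
import asyncio

import aiohttp


def is_expected_shutdown_error(exc: BaseException) -> bool:
    # Option 2 fallback: classify by exception type (same idea as the sketch above).
    return isinstance(exc, (aiohttp.ServerDisconnectedError, asyncio.CancelledError))


def should_suppress_error(exc: BaseException, server_online=None) -> bool:
    # server_online is the shared flag from Option 1 (e.g. a multiprocessing
    # Value), or None when running against an external store server.
    if server_online is not None:
        # Option 1: in client-server mode, trust the shared flag.
        return not server_online.value
    # External store servers have no shared flag, so fall back to Option 2.
    return is_expected_shutdown_error(exc)
```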

@ultmaster (Contributor) commented:

You can try to implement Option 1. Be mindful of performance issues, as all processes need to synchronously read a flag.

@Vasuk12 (Contributor, Author) commented Dec 14, 2025

> You can try to implement Option 1. Be mindful of performance issues, as all processes need to synchronously read a flag.

Okay, give me some time to implement Option 1!
