-
Notifications
You must be signed in to change notification settings - Fork 785
Handle server shutdown gracefully to prevent traceback spam #408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add ServerShutdownError exception for graceful shutdown handling - Detect server shutdown via health check and set _server_shutting_down flag - Convert ServerDisconnectedError to ServerShutdownError during shutdown - Catch ServerShutdownError in runner and log at debug level (no traceback) Signed-off-by: Vasu <vasukhare88@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements graceful handling of server shutdown scenarios to eliminate traceback spam in logs. It introduces a new ServerShutdownError exception that is raised when the server is detected to be shutting down, allowing the runner to handle these situations gracefully with debug-level logging instead of full exception traces.
Key Changes:
- Introduced
ServerShutdownErrorexception for distinguishing shutdown scenarios from other errors - Added server health monitoring via
_server_shutting_downflag that is set when health checks fail - Implemented conversion from
ServerDisconnectedErrortoServerShutdownErrorduring detected shutdowns - Added exception handlers in the runner to catch
ServerShutdownErrorand log at debug level without tracebacks
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| agentlightning/store/client_server.py | Defines ServerShutdownError exception; adds _server_shutting_down flag to track server state; updates _wait_until_healthy to set shutdown flag and log as warning instead of error; modifies _request_json to convert ServerDisconnectedError to ServerShutdownError when shutdown is detected |
| agentlightning/runner/agent.py | Imports ServerShutdownError; adds exception handling in _post_process_rollout_result, _step_impl, and iter methods to catch and log ServerShutdownError at debug level without tracebacks |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…rmatting Signed-off-by: Vasu <vasukhare88@gmail.com>
Signed-off-by: Vasu <vasukhare88@gmail.com>
|
This fix handles both ServerDisconnectedError and CancelledError during server shutdown... |
|
Thanks for taking a look at this issue. I think we would need more discussions on how to correct solve it. Option 1: We can put a multiprocess-shared value/flag called server_online. It's only set to true after the server has launched, and set to false before server is stopped. When client makes the request yet failed, it always check the value. If the value says server should be not online now, the client should not be surprised and silently ignore the error. Option 2: Cleverly handle all the exception types within the client, and tell the expected issues from the real issues. This requires a good understanding of aiohttp implementation, which, unfortunately, I don't have yet. So what do you think? |
| await store.add_otel_span(rollout.rollout_id, rollout.attempt.attempt_id, reward_span) | ||
| except (ServerShutdownError, asyncio.CancelledError): | ||
| # Server is shutting down or request was cancelled - handle gracefully without traceback | ||
| logger.debug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this isn't the only place with such issue. We'd better handle it in store client.
| self._dequeue_first_unsuccessful: bool = True | ||
|
|
||
| # Track server shutdown state to handle errors gracefully | ||
| self._server_shutting_down: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about having a _server_online shared flag and we can sunset hacky flags like _dequeue_first_unsuccessful.
I think Option 1 is more feasible for client-server mode. Instead of handling all exception types, we only need to check a shared flag. I've added Option 2 which works for client-server execution and external store servers. But Option 1 would be faster and more accurate for client-server execution. I think we should keep Option 2 as the fallback for external store servers (where shared flag isn't available), and add Option 1 as an enhancement for client-server execution mode. Is that alright? |
|
You can try to implement Option 1. Be mindful of performance issues as all processes need to synchronously read a flag. |
Okay.. Give me some time to implement option 1! |
Issue #397