Remove asyncio usage in jobs monitoring #796
base: main
Conversation
Most of the calls inside the async wrappers are blocking: sync REST calls and system calls. So asyncio only adds a level of complexity without bringing any benefit.
📝 Walkthrough

The codebase transitions from asynchronous to synchronous execution patterns across multiple subsystems. Async methods are converted to sync equivalents using time.sleep instead of asyncio.sleep, WebSocket handling shifts to synchronous clients, and dev dependencies are updated to remove pytest-asyncio. Public method signatures throughout the codebase change from async to sync.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
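An illustrative before/after of the async→sync conversion the walkthrough describes; the class, method, and attribute names here (SyncJobMonitor, check_statuses, monitor_interval) are placeholders, not the project's actual API:

```python
import time


class SyncJobMonitor:
    """Sketch of the sync polling loop (names are illustrative)."""

    def __init__(self, jobs, monitor_interval: float = 5.0):
        self.jobs = list(jobs)
        self.monitor_interval = monitor_interval

    def check_statuses(self) -> None:
        # Blocking REST/system calls would go here; completed jobs get dropped.
        self.jobs = [job for job in self.jobs if not job.get("done")]

    def monitor_jobs(self) -> None:
        # Previously this loop lived in an `async def` and used
        # `await asyncio.sleep(...)`; now it runs synchronously, since every
        # call in the body is blocking anyway.
        while self.jobs:
            self.check_statuses()
            time.sleep(self.monitor_interval)
```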
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/_core/base_runner.py (1)
120-130: ⚠️ Potential issue | 🟠 Major

`delayed_submit_test` now blocks the monitor loop for its entire delay.

Previously with asyncio, `delayed_submit_test` was scheduled as a concurrent task: the monitor loop continued while the delay elapsed. Now `time.sleep(delay)` blocks the calling thread. This matters when multiple dependent tests are triggered (e.g., in `handle_dependencies` at line 346 or `check_and_schedule_start_post_init_dependent_tests` at line 179): each call blocks for `delay` seconds sequentially, stalling the entire monitoring loop for N × 5 seconds. If this is acceptable given the workloads, no change is needed, but it is a behavioral regression from the async version worth acknowledging.
Alternative: submit immediately with an initial grace period tracked per job

One option is to remove the sleep entirely and track a `not_before` timestamp on the test run, deferring actual submission until the next monitor iteration after the delay has elapsed. This preserves the delay semantics without blocking the loop; a rough sketch follows.
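A minimal sketch of that alternative, assuming a hypothetical pending-submission list and `submit_test` method; none of these names are taken from the codebase:

```python
import time


class MonitorSketch:
    def __init__(self):
        self.pending_submissions = []  # (not_before_timestamp, test_run) pairs

    def delayed_submit_test(self, tr, delay: float = 5.0) -> None:
        # Record when the test run becomes eligible instead of sleeping here.
        self.pending_submissions.append((time.monotonic() + delay, tr))

    def process_pending_submissions(self) -> None:
        # Called once per monitor iteration; submits anything whose delay elapsed.
        now = time.monotonic()
        still_waiting = []
        for not_before, tr in self.pending_submissions:
            if not_before <= now:
                self.submit_test(tr)
            else:
                still_waiting.append((not_before, tr))
        self.pending_submissions = still_waiting

    def submit_test(self, tr) -> None:
        # The actual (blocking) submission call would go here.
        print(f"submitting {tr}")
```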
Greptile Overview

This PR removes asyncio usage from the job monitoring path, converting the async wrappers to plain synchronous methods. Blocking issues to fix before merge are covered in the review comments below.
Confidence Score: 2/5
12 files reviewed, 5 comments
    def shutdown(self):
        """Gracefully shut down the runner, terminating all outstanding jobs."""
        self.shutting_down = True
        logging.info("Terminating all jobs...")
        for job in self.jobs:
            logging.info(f"Terminating job {job.id} for test {job.test_run.name}")
            self.system.kill(job)
        logging.info("All jobs have been killed.")
        logging.info("Waiting for all jobs to be killed.")
Misleading shutdown completion log
`shutdown()` logs "Waiting for all jobs to be killed." but the implementation does not wait for anything: it only loops over `self.jobs` and calls `self.system.kill(job)`. This is misleading during failures/cancellation, and it can also mask cases where `kill()` is async/eventually-consistent and jobs remain running. Either restore the previous "All jobs have been killed." wording, or implement an actual wait/poll loop until `is_job_completed`/`is_job_running` indicates termination; a sketch of the latter follows.
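A rough sketch of the wait/poll variant, assuming the system exposes an `is_job_completed(job)` check; the timeout and interval values are made up for illustration:

```python
import logging
import time


def shutdown(self, poll_interval: float = 5.0, timeout: float = 300.0) -> None:
    """Kill all outstanding jobs, then poll until the system reports them gone."""
    self.shutting_down = True
    logging.info("Terminating all jobs...")
    for job in self.jobs:
        logging.info(f"Terminating job {job.id} for test {job.test_run.name}")
        self.system.kill(job)

    # Poll until every job is reported completed, or give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    pending = list(self.jobs)
    while pending and time.monotonic() < deadline:
        # is_job_completed(...) is an assumed API for this sketch.
        pending = [job for job in pending if not self.system.is_job_completed(job)]
        if pending:
            time.sleep(poll_interval)

    if pending:
        logging.warning(f"{len(pending)} job(s) still running after {timeout}s; giving up.")
    else:
        logging.info("All jobs have been killed.")
```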
      def cancel_on_signal(
          self,
          signum: int,
          frame: Optional[FrameType],  # noqa: Vulture
      ):
          logging.info(f"Signal {signum} received, shutting down...")
-         asyncio.get_running_loop().call_soon_threadsafe(self._cancel_all)
+         self.runner.shutdown()
Blocking work in signal handler

`cancel_on_signal()` now calls `self.runner.shutdown()` directly from the Python signal handler. Since `shutdown()` loops over jobs and calls `system.kill(job)` (which can involve blocking subprocess/HTTP calls), this can run for a long time inside the signal handler and delay process responsiveness/termination. Previously this was queued onto the asyncio loop (`call_soon_threadsafe`). With the sync design, it is safer to make the handler set a flag and let the main loop perform shutdown outside the signal handler context, as in the sketch below.
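A sketch of the flag-based approach the comment suggests; the class name and `poll_jobs_once` are illustrative, not part of the codebase:

```python
import logging
import signal
from types import FrameType
from typing import Optional


class ShutdownCoordinator:
    def __init__(self, runner):
        self.runner = runner
        self.shutdown_requested = False
        signal.signal(signal.SIGINT, self.cancel_on_signal)
        signal.signal(signal.SIGTERM, self.cancel_on_signal)

    def cancel_on_signal(self, signum: int, frame: Optional[FrameType]) -> None:
        # Keep the handler trivial: only record that shutdown was requested.
        logging.info(f"Signal {signum} received, requesting shutdown...")
        self.shutdown_requested = True

    def monitor_loop(self) -> None:
        while not self.shutdown_requested and self.runner.jobs:
            self.runner.poll_jobs_once()  # hypothetical per-iteration work
        if self.shutdown_requested:
            # Heavy, blocking work (kill calls, REST requests) happens here,
            # outside the signal handler context.
            self.runner.shutdown()
```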
@@ -512,9 +512,11 @@ async def fetch_training_logs(
          }

          ssl_context = ssl._create_unverified_context()
-         async with websockets.connect(url, extra_headers=headers, ssl=ssl_context) as websocket:
-             with output_file_path.open("w") as log_file:
-                 async for message in websocket:
-                     if isinstance(message, bytes):
-                         message = message.decode("utf-8")
-                     log_file.write(str(message))
+         with (
+             ws_connect(url, additional_headers=headers, ssl=ssl_context) as websocket,
+             output_file_path.open("w") as log_file,
+         ):
+             for message in websocket:
+                 if isinstance(message, bytes):
+                     message = message.decode("utf-8")
+                 log_file.write(str(message))
WebSocket can hang indefinitely

`fetch_training_logs()` now uses `websockets.sync.client.connect` and iterates `for message in websocket:` with no timeout or termination condition. If the server keeps the socket open (or stalls without closing), this can block forever and prevent `on_job_completion()` from returning (the RunAI runner calls `store_logs()` synchronously). Consider adding a receive timeout / max duration, or otherwise ensuring the loop terminates; one possible shape is sketched below.

This is new behavior due to the async→sync refactor and can deadlock the overall runner completion path.
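One possible shape for bounding the stream, assuming a recent `websockets` release whose sync client supports `recv(timeout=...)`; the deadline values and function name are illustrative:

```python
import time

from websockets.exceptions import ConnectionClosed
from websockets.sync.client import connect as ws_connect

RECV_TIMEOUT_SECONDS = 30     # give up if the server goes silent
MAX_STREAM_SECONDS = 15 * 60  # hard cap on total streaming time


def fetch_logs_bounded(url, headers, ssl_context, output_file_path) -> None:
    deadline = time.monotonic() + MAX_STREAM_SECONDS
    with (
        ws_connect(url, additional_headers=headers, ssl=ssl_context) as websocket,
        output_file_path.open("w") as log_file,
    ):
        while time.monotonic() < deadline:
            try:
                # recv(timeout=...) raises TimeoutError if nothing arrives in time.
                message = websocket.recv(timeout=RECV_TIMEOUT_SECONDS)
            except TimeoutError:
                break  # server stalled without closing; stop instead of hanging
            except ConnectionClosed:
                break  # normal or abnormal close ends the stream
            if isinstance(message, bytes):
                message = message.decode("utf-8")
            log_file.write(str(message))
```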
Additional Comments (2)
Also appears in the same flow when
This affects log collection on job completion (src/cloudai/systems/runai/runai_runner.py:44-50).
Summary
Most of the calls inside the async wrappers are blocking: sync REST calls and system calls. So `asyncio` only adds a level of complexity without bringing any benefits.

Test Plan

`conf/common` + some private ones.

Additional Notes
–