[BugFix][Async] Clear spec tokens for requests re-entering input batch #30796
Conversation
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Code Review
This pull request addresses a bug in asynchronous scheduling with speculative decoding where preempted or resumed requests could have invalid speculative tokens. These requests don't have pre-step draft tokens, so any scheduled speculative tokens are incorrect.
The change correctly identifies these requests (those not in the persistent batch) and, when in async scheduling mode, clears any associated speculative tokens from the scheduler_output. This is done by removing the request ID from scheduled_spec_decode_tokens and adjusting total_num_scheduled_tokens and num_scheduled_tokens accordingly.
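The clearing logic described above can be sketched as follows. This is a simplified, hypothetical stand-in, not the real vLLM code: the field names (scheduled_spec_decode_tokens, num_scheduled_tokens, total_num_scheduled_tokens) follow this review, but the SchedulerOutput class and clear_spec_tokens helper are illustrative assumptions.

```python
# Hedged sketch: clear invalid speculative tokens for a request that is
# re-entering the input batch under async scheduling. SchedulerOutput here
# is a simplified stand-in for vLLM's scheduler output structure.
from dataclasses import dataclass, field


@dataclass
class SchedulerOutput:
    # request_id -> list of draft token ids scheduled for verification
    scheduled_spec_decode_tokens: dict = field(default_factory=dict)
    # request_id -> number of tokens scheduled for that request this step
    num_scheduled_tokens: dict = field(default_factory=dict)
    total_num_scheduled_tokens: int = 0


def clear_spec_tokens(output: SchedulerOutput, req_id: str) -> None:
    """Drop placeholder draft tokens for a request not in the persistent batch."""
    spec_tokens = output.scheduled_spec_decode_tokens.pop(req_id, None)
    if spec_tokens:
        # The request's scheduled token budget included the (invalid) draft
        # tokens, so both per-request and total counts must shrink with them.
        n = len(spec_tokens)
        output.num_scheduled_tokens[req_id] -= n
        output.total_num_scheduled_tokens -= n


# Example: req-1 re-enters the batch with 2 stale draft tokens scheduled.
out = SchedulerOutput(
    scheduled_spec_decode_tokens={"req-1": [101, 102]},
    num_scheduled_tokens={"req-1": 3, "req-2": 1},
    total_num_scheduled_tokens=4,
)
clear_spec_tokens(out, "req-1")
print(out.num_scheduled_tokens["req-1"], out.total_num_scheduled_tokens)  # 1 2
```

The key invariant is that the two token counters stay consistent with the spec-token map after the removal, so downstream batch preparation never sizes buffers for draft tokens that no longer exist.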
The implementation is clean and directly solves the described problem, preventing unnecessary computation and potential downstream errors. The logic appears sound and consistent with the existing codebase. I have no further suggestions.
@njhill @benchislett Could you please review this PR when you have time?
I don't fully understand the fix. What scenario is causing all these conditions to trigger? What behaviour is changing with this patch? |
Hi @benchislett, thanks for the review. This PR addresses a corner case in Async Scheduling + Speculative Decoding. I observed this behavior sporadically in my own deployment. Since a deterministic reproduction script is difficult to provide due to the specific timing and load conditions required, I will use a concrete example to illustrate the scenario.

Configuration:

Timeline:

This PR:

Why this PR is safe:
Update: This PR could cause the scheduler's and worker's cached states for a request to become inconsistent, making issues more likely. Since leaving the tokens in place does not actually cause any functional problems, we will abandon this PR for now.
Purpose
In async scheduling + spec decoding, requests re-entering the input batch do not have pre-step draft tokens (since they were not running in the previous step). Therefore, any scheduled_spec_decode_tokens assigned by the scheduler for these requests are essentially invalid placeholders. Retaining them leads to unnecessary computation and potential unexpected behavior.