
Conversation

@izhuhaoran (Contributor) commented Dec 16, 2025

Purpose

In async scheduling + speculative decoding, requests re-entering the input batch do not have pre-step draft tokens (since they were not running in the previous step). Therefore, any scheduled_spec_decode_tokens the scheduler assigns to these requests are essentially invalid placeholders. Retaining them leads to unnecessary computation and potentially unexpected behavior.

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>

@izhuhaoran izhuhaoran changed the title [BugFix][Async] clear spec tokens for preempted or resumed reqs in async + spec decode [BugFix][Async] clear spec tokens for preempted or resumed reqs in async Dec 16, 2025
@mergify mergify bot added the v1 label Dec 16, 2025

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a bug in asynchronous scheduling with speculative decoding where preempted or resumed requests could have invalid speculative tokens. These requests don't have pre-step draft tokens, so any scheduled speculative tokens are incorrect.

The change correctly identifies these requests (those not in the persistent batch) and, when in async scheduling mode, clears any associated speculative tokens from the scheduler_output. This is done by removing the request ID from scheduled_spec_decode_tokens and adjusting total_num_scheduled_tokens and num_scheduled_tokens accordingly.

The implementation is clean and directly solves the described problem, preventing unnecessary computation and potential downstream errors. The logic appears sound and consistent with the existing codebase. I have no further suggestions.
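A minimal sketch of the clearing step described above, assuming it runs where GPUModelRunner detects a request missing from the persistent batch; the SchedulerOutput field names follow the vLLM v1 codebase, while the helper name and call site are illustrative:

```python
# Illustrative sketch only: the helper name and placement are hypothetical;
# scheduled_spec_decode_tokens, num_scheduled_tokens, and
# total_num_scheduled_tokens follow vLLM v1's SchedulerOutput fields.
def _clear_stale_spec_tokens(scheduler_output, req_id: str) -> None:
    """Drop placeholder draft tokens for a request re-entering the batch."""
    stale = scheduler_output.scheduled_spec_decode_tokens.pop(req_id, None)
    if stale:
        # The dropped draft tokens were counted against both the per-request
        # and the global token budgets, so decrement each by the same amount.
        scheduler_output.num_scheduled_tokens[req_id] -= len(stale)
        scheduler_output.total_num_scheduled_tokens -= len(stale)
```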

@izhuhaoran (Contributor, Author)

@njhill @benchislett Could you please review this PR when you have time?

@benchislett (Collaborator)

I don't fully understand the fix. What scenario is causing all these conditions to trigger? What behaviour is changing with this patch?

@izhuhaoran (Contributor, Author)

> I don't fully understand the fix. What scenario is causing all these conditions to trigger? What behaviour is changing with this patch?

Hi @benchislett, thanks for the review.

This PR addresses a corner case in Async Scheduling + Speculative Decoding. I observed this behavior sporadically in my own deployment. Since it is difficult to provide a deterministic reproduction script due to the specific timing and load conditions required, I will use a concrete example to illustrate the scenario.

Configuration:

  • max_num_batched_tokens = 40
  • num_spec_tokens = 3

Timeline:

  • Step N: Requests 0-10 are in the running queue.
  • Step N+1:
    • The scheduler processes the running queue [0, 1, ..., 9, 10].
    • Requests 0-9 are scheduled. They consume 10 reqs * 4 tokens (1 decode token + 3 draft tokens) = 40 tokens, so the budget is full (see the toy walkthrough after this timeline).
    • Request 10 is skipped (unscheduled) due to the budget limit.
    • Consequence: In GPUModelRunner, Request 10 is removed from the input_batch because it wasn't scheduled for this step. It loses its "active" status in the worker.
  • Step N+2:
    • Suppose Request 0 finishes (e.g., reaches max length), freeing up budget.
    • Request 10 is now scheduled.
    • The Scheduler assigns scheduled_spec_decode_tokens to Request 10 (as it is technically a "running" request).
    • The Conflict: In GPUModelRunner, Request 10 is treated as a "resumed" request (req_index is None) because it was missing from the input_batch in Step N+1. Since it didn't run in Step N+1, it has no cached draft tokens to verify.
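To make the budget arithmetic in Step N+1 concrete, here is a toy, self-contained walkthrough (pure illustration, not vLLM code):

```python
# Toy reproduction of the Step N+1 scheduling math from the timeline above.
max_num_batched_tokens = 40
num_spec_tokens = 3
tokens_per_req = 1 + num_spec_tokens  # 1 decode token + 3 draft tokens

budget = max_num_batched_tokens
scheduled = []
for req_id in range(11):  # requests 0-10 in the running queue
    if budget < tokens_per_req:
        break  # request 10 hits this: the budget is exhausted
    scheduled.append(req_id)
    budget -= tokens_per_req

print(scheduled)  # [0, 1, ..., 9]; request 10 is skipped this step
```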

This PR:
We can safely clear the scheduled_spec_decode_tokens for Request 10 at this step. Of course, even if we don’t clear it, nothing will break — it would only result in incorrect draft token IDs being computed (which would still be properly rejected by the rejection sampler). However, clearing it avoids unnecessary computation and helps prevent any unexpected bugs later on.

Why this PR is safe:
Since req_index is None and the request ID is still present in scheduled_spec_decode_tokens, we know the request is re-entering the execution batch and has no valid draft tokens from the immediately preceding step to verify, so clearing the spec tokens is safe. This change acts as a defensive measure: in normal continuous decoding this branch is never taken, so standard behavior is unaffected.

@izhuhaoran izhuhaoran changed the title [BugFix][Async] clear spec tokens for preempted or resumed reqs in async [BugFix][Async] Clear spec tokens for requests re-entering input batch in Async Dec 17, 2025
@izhuhaoran izhuhaoran changed the title [BugFix][Async] Clear spec tokens for requests re-entering input batch in Async [BugFix][Async] Clear spec tokens for requests re-entering input batch Dec 17, 2025
@izhuhaoran izhuhaoran marked this pull request as draft December 17, 2025 18:31
@izhuhaoran (Contributor, Author)

Update:
Currently, we only deepcopy the scheduler_output when the previous step's _draft_token_ids is None. Removing entries from scheduled_spec_decode_tokens here may therefore mutate the scheduler-side scheduled_spec_decode_tokens, which would in turn affect how update_from_output updates fields like num_computed_tokens and num_output_placeholders.
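A toy illustration of the aliasing hazard described above (not vLLM code): when no deepcopy is taken, the worker and the scheduler share the same dict, so a pop on the worker side changes what the scheduler later reads:

```python
# Scheduler-side state (toy stand-in for scheduler_output contents).
scheduled_spec_decode_tokens = {"req-10": [101, 102, 103]}

# Async path without a deepcopy: the worker holds the very same object.
worker_view = scheduled_spec_decode_tokens

# The worker clears the "stale" entry...
worker_view.pop("req-10")

# ...and the scheduler's own state has silently changed too.
assert "req-10" not in scheduled_spec_decode_tokens
```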

This PR could cause the scheduler’s and worker’s cached states for a request to become inconsistent, making issues more likely.

Since leaving the tokens in place does not cause any functional problems, we will drop this PR for now.

@izhuhaoran izhuhaoran closed this Dec 18, 2025