Pull request overview
This pull request updates the Ascend backend to use the "TND" (Total tokens, Num_heads, head Dim) layout instead of the "BSH" (Batch, Sequence, Hidden) layout for the paged decode attention operation. The changes also update the PyTorch version requirements to support newer releases.
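For context, a minimal sketch of the shape difference between the two layouts; the concrete dimensions below are illustrative assumptions, not values from the PR:

```python
import torch

# Illustrative decode shapes only; these values are assumptions.
batch, seq, num_heads, head_dim = 4, 1, 8, 128

# "BSH": [Batch, Sequence, Hidden], with Hidden = num_heads * head_dim
q_bsh = torch.empty(batch, seq, num_heads * head_dim)

# "TND": [Total tokens, Num_heads, head Dim], batch and sequence flattened
q_tnd = q_bsh.view(batch * seq, num_heads, head_dim)

assert q_tnd.shape == (4, 8, 128)
```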
Changes:
- Updated the torch, torch-npu, and torchvision version constraints to allow releases up to 2.10.0 and 2.25.0
- Changed the attention layout from "BSH" to "TND" in the Ascend NPU operations
- Added a `q_seqlens` parameter to the `paged_decode_attention` function signature across the codebase (a sketch of the extended signature follows this list)
- Modified cudagraph buffer initialization and handling for the new `q_seqlens` parameter
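A hypothetical sketch of the extended signature: only the `q_seqlens` addition is described by this PR; the surrounding parameter names are assumptions modeled on typical paged-attention APIs, not dlinfer's actual interface.

```python
from typing import Optional
from torch import Tensor

# Hypothetical sketch: only the q_seqlens addition is described by this
# PR; the other parameters are assumptions, not dlinfer's real API.
def paged_decode_attention(
    query: Tensor,
    key_cache: Tensor,
    value_cache: Tensor,
    block_table: Tensor,
    block_size: int,
    kv_seqlens: Tensor,
    num_q_heads: int,
    num_kv_heads: int,
    q_seqlens: Optional[Tensor] = None,  # new parameter added by this PR
) -> Tensor:
    ...
```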
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| requirements/ascend/torch.txt | Updated version constraints for torch, torch-npu, and torchvision to support newer releases |
| dlinfer/vendor/ascend/torch_npu_ops.py | Changed attention layout to TND, added q_seqlens parameter, refactored control flow, and updated API call to npu_fused_infer_attention_score |
| dlinfer/ops/llm.py | Added q_seqlens parameter to function signature and forwarded it to vendor implementation |
| dlinfer/framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py | Modified buffer initialization for q_seqlens and added buffer filling logic |
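Since the `torch_npu_ops.py` row mentions the updated API call, here is a rough sketch of a TND-layout invocation of `npu_fused_infer_attention_score`. The keyword names follow the public torch_npu documentation, but the exact wiring used in this PR is not visible here and should be treated as an assumption.

```python
import torch_npu  # Ascend PyTorch adapter


def tnd_paged_decode(query, key_cache, value_cache, block_table, block_size,
                     q_seqlens, kv_seqlens, num_q_heads, num_kv_heads, scale):
    # Rough sketch only: keyword names follow the public torch_npu docs
    # for npu_fused_infer_attention_score; verify against the installed
    # version. The op returns (attention_out, softmax_lse).
    attn_output, _ = torch_npu.npu_fused_infer_attention_score(
        query,  # TND: [total_tokens, num_q_heads, head_dim]
        key_cache,
        value_cache,
        input_layout="TND",
        num_heads=num_q_heads,
        num_key_value_heads=num_kv_heads,
        scale=scale,
        actual_seq_lengths=q_seqlens,     # see the review comment below on
                                          # per-request vs. cumulative values
        actual_seq_lengths_kv=kv_seqlens,
        block_table=block_table,
        block_size=block_size,
    )
    return attn_output
```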
Comments suppressed due to low confidence (1)
dlinfer/framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py:104
- Missing buffer fill operation: The `q_seqlens` value from `attn_metadata` is extracted at line 91 but is never copied into `input_buffers["q_seqlens"]`. This is inconsistent with how other metadata fields like `kv_seqlens` (line 103), `block_offsets` (line 102), and `kv_start_indices` (line 104) are handled. Add `input_buffers["q_seqlens"][:batch_size] = q_seqlens` after line 104 to properly fill the buffer.
```python
q_seqlens: Tensor = attn_metadata.q_seqlens
kv_start_indices: Tensor = attn_metadata.kv_start_indices
input_buffers: BuffType = graph_meta.input_buffers
batch_size, num_blocks = block_offsets.size()
num_tokens = input_ids.size(-1)
# fill buffer
input_buffers["input_ids"][:, :num_tokens] = input_ids
input_buffers["position_ids"][:, :num_tokens] = position_ids
input_buffers["block_offsets"][:batch_size, :num_blocks] = block_offsets
input_buffers["kv_seqlens"][:batch_size] = kv_seqlens
input_buffers["kv_start_indices"][:batch_size] = kv_start_indices
```
| input_buffers["q_seqlens"] = torch.arange( | ||
| 1, max_batches + 1, dtype=torch.int32 | ||
| ) | ||
|
|
||
| input_buffers["kv_seqlens"] = torch.ones(max_batches, dtype=torch.int32) |
Potentially incorrect initialization values: The `q_seqlens` buffer is initialized with `torch.arange(1, max_batches + 1)`, creating the values `[1, 2, 3, ..., max_batches]`. However, according to the comment, this represents "cu_attn actual_seq_lengths" for the decoding phase. In decoding, each request typically processes exactly 1 token, so all values should be 1 (similar to how camb_cudagraph.py initializes it with `torch.ones`). The incrementing sequence suggests this might be confusing `q_seqlens` with cumulative sequence lengths (which is what `q_start_loc` represents). Verify that this initialization is correct for the TND layout and the `actual_seq_lengths` parameter of `npu_fused_infer_attention_score`.
| input_buffers["q_seqlens"] = torch.arange( | |
| 1, max_batches + 1, dtype=torch.int32 | |
| ) | |
| input_buffers["kv_seqlens"] = torch.ones(max_batches, dtype=torch.int32) | |
| input_buffers["q_seqlens"] = torch.ones( | |
| max_batches, dtype=torch.int32, device=device | |
| ) | |
| input_buffers["kv_seqlens"] = torch.ones( | |
| max_batches, dtype=torch.int32, device=device | |
| ) |
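To make the reviewer's distinction concrete, a small pure-PyTorch illustration of the two interpretations, independent of the Ascend API (the `max_batches` value is arbitrary):

```python
import torch

max_batches = 4

# Per-request query lengths for decode: one token per request.
per_request = torch.ones(max_batches, dtype=torch.int32)          # [1, 1, 1, 1]

# Cumulative lengths: what torch.arange(1, max_batches + 1) produces
# here, and what cu_seqlens/q_start_loc-style parameters expect.
cumulative = torch.cumsum(per_request, dim=0, dtype=torch.int32)  # [1, 2, 3, 4]

assert torch.equal(
    cumulative, torch.arange(1, max_batches + 1, dtype=torch.int32)
)
```

Which form `actual_seq_lengths` expects in TND mode determines which initialization is correct, hence the reviewer's request to verify.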