
TND instead of BSH #299

Open

jinminxi104 wants to merge 3 commits into DeepLink-org:main from jinminxi104:pa-tnd

Conversation

@jinminxi104
Collaborator

No description provided.


Copilot AI left a comment


Pull request overview

This pull request updates the Ascend backend to use the "TND" (Token, Num_heads, Dim) layout instead of the "BSH" (Batch, Sequence, Hidden) layout for the paged decode attention operation. It also relaxes the PyTorch version requirements to support newer releases.
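For intuition, here is a minimal generic PyTorch sketch (not code from this PR) of how a BSH tensor maps onto TND:

```python
import torch

batch, seq, num_heads, head_dim = 2, 4, 8, 64

# BSH: (Batch, Sequence, Hidden), where Hidden = num_heads * head_dim
bsh = torch.randn(batch, seq, num_heads * head_dim)

# TND: (Total tokens, Num_heads, Dim) -- batch and sequence flatten into a
# single token axis, and the hidden axis splits into per-head slices
tnd = bsh.reshape(batch * seq, num_heads, head_dim)

assert tnd.shape == (batch * seq, num_heads, head_dim)
```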

Changes:

  • Updated torch and related package version constraints to support versions up to 2.10.0 and 2.25.0
  • Changed the attention layout from "BSH" to "TND" in the Ascend NPU operations
  • Added a q_seqlens parameter to the paged_decode_attention function signature across the codebase (see the sketch after this list)
  • Modified cudagraph buffer initialization and handling for the new q_seqlens parameter
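
A rough sketch of how the new parameter could thread through the public op. Only paged_decode_attention and q_seqlens are named in this PR; the remaining parameter names, types, and defaults are assumptions for illustration:

```python
from typing import Optional
from torch import Tensor

def paged_decode_attention(
    query: Tensor,                  # TND layout: (total_tokens, num_heads, head_dim)
    key_cache: Tensor,
    value_cache: Tensor,
    block_table: Tensor,            # maps each request to its KV-cache blocks
    kv_seqlens: Tensor,             # per-request KV lengths
    q_seqlens: Optional[Tensor] = None,  # new: per-request query lengths
) -> Tensor:
    # dlinfer/ops/llm.py forwards q_seqlens to the vendor implementation
    # in dlinfer/vendor/ascend/torch_npu_ops.py; body elided in this sketch.
    raise NotImplementedError
```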

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| requirements/ascend/torch.txt | Updated version constraints for torch, torch-npu, and torchvision to support newer releases |
| dlinfer/vendor/ascend/torch_npu_ops.py | Changed the attention layout to TND, added the q_seqlens parameter, refactored control flow, and updated the API call to npu_fused_infer_attention_score (sketched below) |
| dlinfer/ops/llm.py | Added the q_seqlens parameter to the function signature and forwarded it to the vendor implementation |
| dlinfer/framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py | Modified buffer initialization for q_seqlens and added buffer-filling logic |
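
For orientation, a minimal sketch of what the updated vendor call might look like. Everything here is an assumption layered on the file descriptions above: the wrapper name tnd_paged_decode is hypothetical, and the keyword names follow torch_npu's published npu_fused_infer_attention_score interface rather than this PR's actual diff.

```python
import torch_npu
from torch import Tensor

def tnd_paged_decode(
    query: Tensor,          # (total_tokens, num_heads, head_dim)
    key_cache: Tensor,
    value_cache: Tensor,
    q_seqlens: Tensor,      # query-side actual sequence lengths
    kv_seqlens: Tensor,     # KV-side actual sequence lengths
    block_table: Tensor,
    block_size: int,
    num_heads: int,
    scale: float,
):
    # Keyword names below are assumptions based on torch_npu's documented
    # npu_fused_infer_attention_score interface; whether actual_seq_lengths
    # should hold per-request or cumulative values is exactly what the
    # review comment further down questions.
    return torch_npu.npu_fused_infer_attention_score(
        query,
        key_cache,
        value_cache,
        num_heads=num_heads,
        input_layout="TND",                 # was "BSH" before this change
        scale=scale,
        actual_seq_lengths=q_seqlens,
        actual_seq_lengths_kv=kv_seqlens,
        block_table=block_table,
        block_size=block_size,
    )
```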
Comments suppressed due to low confidence (1)

dlinfer/framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py:104

  • Missing buffer fill operation: The q_seqlens value from attn_metadata is extracted at line 91 but is never copied into input_buffers["q_seqlens"]. This is inconsistent with how other metadata fields like kv_seqlens (line 103), block_offsets (line 102), and kv_start_indices (line 104) are handled. Add 'input_buffers["q_seqlens"][:batch_size] = q_seqlens' after line 104 to properly fill the buffer.
```python
q_seqlens: Tensor = attn_metadata.q_seqlens
kv_start_indices: Tensor = attn_metadata.kv_start_indices

input_buffers: BuffType = graph_meta.input_buffers

batch_size, num_blocks = block_offsets.size()
num_tokens = input_ids.size(-1)

# fill buffer
input_buffers["input_ids"][:, :num_tokens] = input_ids
input_buffers["position_ids"][:, :num_tokens] = position_ids
input_buffers["block_offsets"][:batch_size, :num_blocks] = block_offsets
input_buffers["kv_seqlens"][:batch_size] = kv_seqlens
input_buffers["kv_start_indices"][:batch_size] = kv_start_indices
```


Comment on lines +62 to 66
input_buffers["q_seqlens"] = torch.arange(
1, max_batches + 1, dtype=torch.int32
)

input_buffers["kv_seqlens"] = torch.ones(max_batches, dtype=torch.int32)

Copilot AI Jan 27, 2026


Potentially incorrect initialization values: the q_seqlens buffer is initialized with torch.arange(1, max_batches + 1), producing the values [1, 2, 3, ..., max_batches]. According to the comment, however, this buffer represents the "cu_attn actual_seq_lengths" for the decoding phase, and in decoding each request typically processes exactly one token, so all values should be 1 (as camb_cudagraph.py does with torch.ones). The incrementing sequence suggests q_seqlens may have been confused with cumulative sequence lengths, which is what q_start_loc represents. Verify that this initialization is correct for the TND layout and for the actual_seq_lengths parameter of npu_fused_infer_attention_score.
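
To see the distinction concretely (illustrative values only): torch.arange(1, max_batches + 1) reproduces the cumulative sum of per-request decode lengths rather than the lengths themselves.

```python
import torch

max_batches = 3
per_request = torch.ones(max_batches, dtype=torch.int32)           # [1, 1, 1]
cumulative = torch.cumsum(per_request, dim=0).to(torch.int32)      # [1, 2, 3]
arange_init = torch.arange(1, max_batches + 1, dtype=torch.int32)  # [1, 2, 3]

assert torch.equal(arange_init, cumulative)       # matches cumulative lengths
assert not torch.equal(arange_init, per_request)  # not per-request lengths
```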

Suggested change

```diff
-input_buffers["q_seqlens"] = torch.arange(
-    1, max_batches + 1, dtype=torch.int32
-)
-input_buffers["kv_seqlens"] = torch.ones(max_batches, dtype=torch.int32)
+input_buffers["q_seqlens"] = torch.ones(
+    max_batches, dtype=torch.int32, device=device
+)
+input_buffers["kv_seqlens"] = torch.ones(
+    max_batches, dtype=torch.int32, device=device
+)
```
