Pull request overview
This pull request updates the Ascend backend to use the "TND" (Total tokens, Num_heads, head Dim) layout instead of the "BSH" (Batch, Sequence, Hidden) layout for the paged decode attention operation. The changes also update the PyTorch version requirements to support newer releases.
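For context, a minimal sketch of the shape difference between the two layouts; the concrete dimensions below are illustrative assumptions, not values from the PR:

```python
import torch

# Illustrative decode shapes only; these values are assumptions.
batch, seq, num_heads, head_dim = 4, 1, 8, 128

# "BSH": [Batch, Sequence, Hidden], with Hidden = num_heads * head_dim
q_bsh = torch.empty(batch, seq, num_heads * head_dim)

# "TND": [Total tokens, Num_heads, head Dim], batch and sequence flattened
q_tnd = q_bsh.view(batch * seq, num_heads, head_dim)

assert q_tnd.shape == (4, 8, 128)
```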
Changes:
- Updated the torch, torch-npu, and torchvision version constraints to allow releases up to 2.10.0 and 2.25.0
- Changed the attention layout from "BSH" to "TND" in the Ascend NPU operations
- Added a `q_seqlens` parameter to the `paged_decode_attention` function signature across the codebase (a sketch of the extended signature follows this list)
- Modified cudagraph buffer initialization and handling for the new `q_seqlens` parameter
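A hypothetical sketch of the extended signature: only the `q_seqlens` addition is described by this PR; the surrounding parameter names are assumptions modeled on typical paged-attention APIs, not dlinfer's actual interface.

```python
from typing import Optional
from torch import Tensor

# Hypothetical sketch: only the q_seqlens addition is described by this
# PR; the other parameters are assumptions, not dlinfer's real API.
def paged_decode_attention(
    query: Tensor,
    key_cache: Tensor,
    value_cache: Tensor,
    block_table: Tensor,
    block_size: int,
    kv_seqlens: Tensor,
    num_q_heads: int,
    num_kv_heads: int,
    q_seqlens: Optional[Tensor] = None,  # new parameter added by this PR
) -> Tensor:
    ...
```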
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| requirements/ascend/torch.txt | Updated version constraints for torch, torch-npu, and torchvision to support newer releases |
| dlinfer/vendor/ascend/torch_npu_ops.py | Changed attention layout to TND, added q_seqlens parameter, refactored control flow, and updated API call to npu_fused_infer_attention_score |
| dlinfer/ops/llm.py | Added q_seqlens parameter to function signature and forwarded it to vendor implementation |
| dlinfer/framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py | Modified buffer initialization for q_seqlens and added buffer filling logic |
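Since the `torch_npu_ops.py` row mentions the updated API call, here is a rough sketch of a TND-layout invocation of `npu_fused_infer_attention_score`. The keyword names follow the public torch_npu documentation, but the exact wiring used in this PR is not visible here and should be treated as an assumption.

```python
import torch_npu  # Ascend PyTorch adapter


def tnd_paged_decode(query, key_cache, value_cache, block_table, block_size,
                     q_seqlens, kv_seqlens, num_q_heads, num_kv_heads, scale):
    # Rough sketch only: keyword names follow the public torch_npu docs
    # for npu_fused_infer_attention_score; verify against the installed
    # version. The op returns (attention_out, softmax_lse).
    attn_output, _ = torch_npu.npu_fused_infer_attention_score(
        query,  # TND: [total_tokens, num_q_heads, head_dim]
        key_cache,
        value_cache,
        input_layout="TND",
        num_heads=num_q_heads,
        num_key_value_heads=num_kv_heads,
        scale=scale,
        actual_seq_lengths=q_seqlens,     # see the review comment below on
                                          # per-request vs. cumulative values
        actual_seq_lengths_kv=kv_seqlens,
        block_table=block_table,
        block_size=block_size,
    )
    return attn_output
```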
Comments suppressed due to low confidence (1)
dlinfer/framework/lmdeploy_ext/cudagraph/ascend_cudagraph.py:104
- Missing buffer fill operation: The `q_seqlens` value from `attn_metadata` is extracted at line 91 but is never copied into `input_buffers["q_seqlens"]`. This is inconsistent with how other metadata fields like `kv_seqlens` (line 103), `block_offsets` (line 102), and `kv_start_indices` (line 104) are handled. Add `input_buffers["q_seqlens"][:batch_size] = q_seqlens` after line 104 to properly fill the buffer.
```python
q_seqlens: Tensor = attn_metadata.q_seqlens
kv_start_indices: Tensor = attn_metadata.kv_start_indices
input_buffers: BuffType = graph_meta.input_buffers
batch_size, num_blocks = block_offsets.size()
num_tokens = input_ids.size(-1)
# fill buffer
input_buffers["input_ids"][:, :num_tokens] = input_ids
input_buffers["position_ids"][:, :num_tokens] = position_ids
input_buffers["block_offsets"][:batch_size, :num_blocks] = block_offsets
input_buffers["kv_seqlens"][:batch_size] = kv_seqlens
input_buffers["kv_start_indices"][:batch_size] = kv_start_indices
```
| input_buffers["q_seqlens"] = torch.arange( | ||
| 1, max_batches + 1, dtype=torch.int32 | ||
| ) | ||
|
|
||
| input_buffers["kv_seqlens"] = torch.ones(max_batches, dtype=torch.int32) |
Potentially incorrect initialization values: The `q_seqlens` buffer is initialized with `torch.arange(1, max_batches + 1)`, creating the values `[1, 2, 3, ..., max_batches]`. However, according to the comment, this represents "cu_attn actual_seq_lengths" for the decoding phase. In decoding, each request typically processes exactly 1 token, so all values should be 1 (similar to how camb_cudagraph.py initializes it with `torch.ones`). The incrementing sequence suggests this might be confusing `q_seqlens` with cumulative sequence lengths (which is what `q_start_loc` represents). Verify that this initialization is correct for the TND layout and the `actual_seq_lengths` parameter of `npu_fused_infer_attention_score`.
| input_buffers["q_seqlens"] = torch.arange( | |
| 1, max_batches + 1, dtype=torch.int32 | |
| ) | |
| input_buffers["kv_seqlens"] = torch.ones(max_batches, dtype=torch.int32) | |
| input_buffers["q_seqlens"] = torch.ones( | |
| max_batches, dtype=torch.int32, device=device | |
| ) | |
| input_buffers["kv_seqlens"] = torch.ones( | |
| max_batches, dtype=torch.int32, device=device | |
| ) |
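To make the reviewer's distinction concrete, a small pure-PyTorch illustration of the two interpretations, independent of the Ascend API (the `max_batches` value is arbitrary):

```python
import torch

max_batches = 4

# Per-request query lengths for decode: one token per request.
per_request = torch.ones(max_batches, dtype=torch.int32)          # [1, 1, 1, 1]

# Cumulative lengths: what torch.arange(1, max_batches + 1) produces
# here, and what cu_seqlens/q_start_loc-style parameters expect.
cumulative = torch.cumsum(per_request, dim=0, dtype=torch.int32)  # [1, 2, 3, 4]

assert torch.equal(
    cumulative, torch.arange(1, max_batches + 1, dtype=torch.int32)
)
```

Which form `actual_seq_lengths` expects in TND mode determines which initialization is correct, hence the reviewer's request to verify.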