Support cross attention kv cache #187

Merged
larryliu0820 merged 1 commit into main from whisper_cond
Jan 7, 2026

Conversation

@larryliu0820
Collaborator

To avoid excessive computation, we want to support a KV cache for cross attention in Whisper.

Fundamentally, we only need to run `k_proj` and `v_proj` on the encoder output hidden states once, at the first token generation; after that we should keep the resulting `key_states` and `value_states` and reuse them for all subsequent token generations.
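
For illustration, here is a minimal sketch of what a per-layer cross-attention KV cache buffer could look like (the class and buffer names are hypothetical, not the PR's actual code); for whisper-large-v3-turbo the encoder emits 1500 positions of width 1280, split across 20 heads of size 64:

```python
import torch
from torch import nn


class CrossAttentionKVCache(nn.Module):
    """Holds the cross-attention key/value states for one decoder layer.

    Shapes assume whisper-large-v3-turbo: 20 heads x head_dim 64 (= 1280)
    over 1500 encoder positions.
    """

    def __init__(self, batch_size=1, num_heads=20, encoder_len=1500, head_dim=64):
        super().__init__()
        # Zero-initialized buffers; the first decoding step fills them and
        # every later step reuses them.
        self.register_buffer(
            "key_cache", torch.zeros(batch_size, num_heads, encoder_len, head_dim)
        )
        self.register_buffer(
            "value_cache", torch.zeros(batch_size, num_heads, encoder_len, head_dim)
        )
```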

For whisper-large-v3-turbo, which has 4 decoder layers:

```
WhisperDecoder(
  (embed_tokens): Embedding(51866, 1280, padding_idx=50257)
  (embed_positions): WhisperPositionalEmbedding(448, 1280)
  (layers): ModuleList(
    (0-3): 4 x WhisperDecoderLayer(
      (self_attn): WhisperAttention(
        (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
        (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
      )
      (activation_fn): GELUActivation()
      (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (encoder_attn): WhisperAttention(
        (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
        (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
      )
      (encoder_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (fc1): Linear(in_features=1280, out_features=5120, bias=True)
      (fc2): Linear(in_features=5120, out_features=1280, bias=True)
      (final_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
    )
  )
  (layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
)
```

Without a KV cache in `encoder_attn`, we do 2 matmuls against 1280x1280 projection weights in each layer, i.e. 8 such matmuls for every generated token across the 4 layers. This significantly hurts the tokens/sec number.
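
To put a rough number on it (assuming the full 1500-position encoder output Whisper produces for 30 s of audio; this estimate is ours, not from the PR):

```python
# Back-of-the-envelope FLOPs for the redundant cross-attention projections
# recomputed for every generated token.
enc_len, d_model, num_layers = 1500, 1280, 4
flops_per_matmul = 2 * enc_len * d_model * d_model  # 2 FLOPs per multiply-add
flops_per_token = flops_per_matmul * 2 * num_layers  # k_proj + v_proj, all layers
print(f"{flops_per_token / 1e9:.1f} GFLOP per token")  # ~39.3 GFLOP
```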

This PR replaces `encoder_attn` with a `WhisperCrossAttention` class, where we replace the Python `if` condition with `torch.cond` (a sketch follows below). The logic becomes:

- If the KV cache values are all zero:
  - Compute the KV projections.
- Otherwise:
  - Clone from the KV cache. Note that we can't directly return the KV cache here, due to the non-aliasing requirement on `torch.cond` branches.
- After `torch.cond`:
  - Write the values from either branch back to the KV cache.

Notice that we still pay 1 extra cache read and 1 extra cache write, but that should be much faster than the matmuls.
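
A minimal sketch of this `torch.cond` pattern (not the PR's actual `WhisperCrossAttention` code; the function signature and cache handling below are a hand-written approximation of the logic above):

```python
import torch


def cross_attention_kv(k_proj, v_proj, encoder_hidden_states,
                       k_cache, v_cache, num_heads, head_dim):
    """Compute cross-attention K/V once, then serve them from the cache."""
    bsz, enc_len, _ = encoder_hidden_states.shape

    def compute_kv(hidden, k_cache, v_cache):
        # First generated token: project the encoder output into K/V.
        k = k_proj(hidden).view(bsz, enc_len, num_heads, head_dim).transpose(1, 2)
        v = v_proj(hidden).view(bsz, enc_len, num_heads, head_dim).transpose(1, 2)
        return k, v

    def reuse_kv(hidden, k_cache, v_cache):
        # Later tokens: reuse the cache. torch.cond branches must not return
        # aliases of their inputs, hence the clone().
        return k_cache.clone(), v_cache.clone()

    cache_is_empty = (k_cache == 0).all()  # predicate: cache not yet filled
    key_states, value_states = torch.cond(
        cache_is_empty, compute_kv, reuse_kv,
        (encoder_hidden_states, k_cache, v_cache),
    )

    # Write the result of whichever branch ran back into the cache.
    k_cache.copy_(key_states)
    v_cache.copy_(value_states)
    return key_states, value_states
```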

Collaborator

@jackzhxng jackzhxng left a comment

Oh also run make style for formatting

@larryliu0820
Collaborator Author

This works and gives correct output, but we still eventually need to copy data from the GPU to the CPU just for the predicate. There's no way to work around it.

For whisper-large-v3-turbo, there are 4 decoder layers, so we see 4 cudaMemcpyAsync blocks in each token generation:
[profiler trace screenshot: four cudaMemcpyAsync blocks per generated token]

This is too expensive to be a good solution.
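
For context, a tiny illustration (ours, not from the PR) of why the predicate forces that copy: the branch decision has to be made on the host, so a GPU-resident boolean predicate must be transferred back before either branch can run.

```python
import torch


def true_fn(x):
    return x + 1


def false_fn(x):
    return x - 1


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, device=device)
pred = x.sum() > 0  # 0-dim bool tensor; lives on the GPU when device="cuda"

# Choosing a branch requires reading `pred` on the CPU, so a GPU predicate
# shows up in a profile as a device-to-host cudaMemcpyAsync plus a sync for
# every torch.cond call -- one per decoder layer in the Whisper case above.
out = torch.cond(pred, true_fn, false_fn, (x,))
```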

larryliu0820 merged commit 732b113 into main Jan 7, 2026
69 of 83 checks passed
larryliu0820 deleted the whisper_cond branch January 7, 2026 01:13
larryliu0820 added a commit to pytorch/executorch that referenced this pull request Jan 8, 2026