Why don't we use the pruned tokens to compute attention in the prefill stage? #51

@xinhaoH

Description

I noticed that in the prefill stage, although the tokens are pruned down to `max_capacity_prompt` (e.g., 2k), attention is still computed over the full sequence.
For example, if we input a 6k prompt to generate a response, the prefill stage caches only the 2k most important tokens.
However, the attention computation still uses all 6k tokens instead of the 2k kept ones.

Why don't we use the pruned 2k tokens to compute attention in the prefill stage?
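For context, here is a minimal, hypothetical sketch of the pattern I mean (single head; the window-based scoring and names like `max_capacity_prompt` are assumptions for illustration, not necessarily this repo's exact code): prefill attention runs over the full prompt, and only the KV cache kept for decoding is pruned.

```python
import torch
import torch.nn.functional as F


def prefill_with_cache_pruning(q, k, v, max_capacity_prompt=2048, window=32):
    # q, k, v: (seq_len, head_dim) for a single head; purely illustrative.
    seq_len, d = q.shape

    # 1) Prefill attention is computed over the FULL prompt (e.g., 6k x 6k), causal.
    scores = (q @ k.T) / d ** 0.5
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    attn = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
    prefill_out = attn @ v  # still uses all 6k tokens

    # 2) Only the KV cache kept for decoding is pruned: score earlier tokens by the
    #    attention they receive from the last `window` queries, keep the top ones,
    #    and always keep the most recent `window` tokens.
    if seq_len > max_capacity_prompt:
        votes = attn[-window:, : seq_len - window].sum(dim=0)
        top = votes.topk(max_capacity_prompt - window).indices.sort().values
        keep = torch.cat([top, torch.arange(seq_len - window, seq_len)])
        k, v = k[keep], v[keep]  # only 2k tokens are cached for decoding

    return prefill_out, (k, v)
```

In this sketch, with a 6k prompt the `attn` matrix is 6k x 6k even though only 2k KV pairs survive into the decode phase, which is exactly the behavior I am asking about.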
