Question about MaskGIT cross-attention with text tokens

In C-ViViT, you train on video clips with a fixed number of frames.
When training MaskGIT to cross-attend to text tokens, how did you cut (or otherwise select) the text tokens that correspond to the given frames?
Thank you!