I have a question about text embedding length and video length.
When pretraining C-ViViT, the video input has shape (1, 11, 3, 128, 128) = (batch size, frames, channels, H, W).
I would like to know the length of the text embedding that is cross-attended with the video tokens. Since you implemented this code, could you let me know?