$\infty$ -Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
Official implementation of the paper
Saul Santos, António Farinhas, Daniel McNamee and André Martins
**Abstract**: *Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, often leading to information loss. This paper introduces ∞-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by allowing them to process unbounded video contexts efficiently and without requiring additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance in video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable and training-free comprehension of long videos.*

If you use this code in your work, please cite our paper.
- Paper (arXiv)
All material is made available under the MIT license. You can use, redistribute, and adapt the material for non-commercial purposes, as long as you give appropriate credit by citing our paper and indicating any changes that you've made.
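The snippet below is a minimal, self-contained sketch of the general idea described in the abstract: an unbounded stream of frame features is compressed into a fixed number of basis coefficients, and a query then reads from that continuous memory through a Gaussian attention density. It is written from the abstract and the ∞-former line of work, not from this repository's code; the Gaussian RBF basis, the ridge regression, the way the query picks the attention density, and all function names are illustrative assumptions.

```python
# Illustrative sketch only -- NOT the implementation used in this repository.
import numpy as np

def rbf_design_matrix(t, centers, width):
    """Evaluate N Gaussian radial basis functions at time points t in [0, 1]."""
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))  # (T, N)

def consolidate(frame_feats, num_basis=64, width=0.02, ridge=1e-3):
    """Compress T frame features (T, D) into N coefficients (N, D) by
    ridge-regressing them onto a continuous signal over [0, 1]."""
    T = frame_feats.shape[0]
    t = np.linspace(0.0, 1.0, T)
    centers = np.linspace(0.0, 1.0, num_basis)
    Phi = rbf_design_matrix(t, centers, width)            # (T, N)
    G = Phi.T @ Phi + ridge * np.eye(num_basis)
    B = np.linalg.solve(G, Phi.T @ frame_feats)           # (N, D) coefficients
    return B, centers

def continuous_attention(query, B, centers, width, sigma=0.1):
    """Read from the continuous memory: place a Gaussian density over [0, 1]
    (here its mean is a toy function of the query) and return the expected
    value of the reconstructed signal under that density."""
    mu = 1.0 / (1.0 + np.exp(-query.mean()))              # squash into (0, 1)
    grid = np.linspace(0.0, 1.0, 1000)                    # dense grid approximation
    density = np.exp(-((grid - mu) ** 2) / (2 * sigma ** 2))
    density /= density.sum()
    signal = rbf_design_matrix(grid, centers, width) @ B  # (1000, D)
    return density @ signal                               # (D,)

# Usage: compress 10k frame features of dimension 256 into 64 coefficients, then query.
feats = np.random.randn(10_000, 256).astype(np.float32)
B, centers = consolidate(feats, num_basis=64)
context = continuous_attention(np.random.randn(256), B, centers, width=0.02)
print(B.shape, context.shape)  # (64, 256) (256,)
```

The "sticky" variant described in the abstract would additionally sample basis locations in proportion to past attention mass, so that the most relevant video segments receive finer temporal resolution; that step is omitted in this sketch.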
This code was tested on Python 3.10.10. To install, follow the installation steps of MovieChat (linked in the acknowledgements below).
1 - Run eval_code/extract_features on the intended dataset with the desired number of frames.
2 - Run each script in eval_code/eval with the hyperparameters mentioned in the paper:
Example:
python3 eval_code/eval/run_inference_inf_video_llama_nextqa.py --cfg-path eval_configs/infvideollama.yaml --num-beams 1 --temperature 1 --video-folder next_qa/features --q-folder /mnt/scratch-artemis/saul/next_qa/val.csv --output-dir /MovieChat/nextqa_val --max_int 256 --num_basis 256 --tau 0.75 --alpha 1.0 --task inf_video_llama --sticky
3 - For open-ended questions, run eval_code/validate/run_eval_qa_chatgpt.py with the output of the script run in step 2.
4 - For multiple-choice questions, we predict the answers as open-ended and use LangChain to select the most similar option (a sketch of this matching step is given below). Run eval_code/validate/run_eval_langchain.py with the output of the dataset script from step 2, for example:
python eval_code/validate/run_eval_langchain.py --pred_path egoschema/nframes_8_nchunks_256_moviechatplus/preds.json --num_tasks 100
Then compute accuracy with run_eval.py
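For intuition, here is a small sketch of the matching-and-scoring idea behind steps 3–4: map each free-form answer to the closest multiple-choice option and count matches. It uses plain string similarity from the standard library rather than LangChain, and the preds.json layout it assumes (a list of records with "prediction", "options", and "answer" fields) is a guess, not necessarily the format produced by the scripts above.

```python
# Illustrative only; see run_eval_langchain.py / run_eval.py for the actual pipeline.
import json
from difflib import SequenceMatcher

def closest_option(prediction: str, options: list[str]) -> int:
    """Return the index of the option most similar to the open-ended prediction."""
    scores = [SequenceMatcher(None, prediction.lower(), opt.lower()).ratio()
              for opt in options]
    return max(range(len(options)), key=scores.__getitem__)

def accuracy(pred_path: str) -> float:
    """Fraction of records whose matched option equals the ground-truth index."""
    with open(pred_path) as f:
        records = json.load(f)
    correct = sum(closest_option(r["prediction"], r["options"]) == r["answer"]
                  for r in records)
    return correct / len(records)

if __name__ == "__main__":
    print(f"accuracy: {accuracy('preds.json'):.3f}")
```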
For the VideoChat2-based experiments, follow the installation instructions of VideoChat2 (linked in the acknowledgements below).
1 - Run each script in eval_code with the hyperparameters mentioned in the paper:
Example:
python3 eval_code/run_nextqa_mistral.py --video-folder /NExTQA/videos --data_path /next_qa/val.csv --output-dir nextqa_val --max_int 16 --num_samples 8 --num_basis 64 --tau 0.75 --alpha 1.0
The experiments in this work benefit from the following open-source codebases:
- Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang and Gaoang Wang. MovieChat: From Dense Token to Sparse Memory for Long Video Understanding, CVPR 2024. https://github.com/rese1f/MovieChat
- Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Limin Wang and Yu Qiao. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark, CVPR 2024. https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2
- Pedro Henrique Martins, Zita Marinho, André F. T. Martins, ∞-former: Infinite Memory Transformer, Proc. ACL 2022. https://github.com/deep-spin/infinite-former
