This repository contains the official implementation of the paper Sparse-Dense Side-Tuner for efficient Video Temporal Grounding.
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning, and particularly side-tuning (ST), has emerged as an effective alternative. However, prior ST methods approach this problem from a frame-level refinement perspective, overlooking the inherently sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of deformable attention, a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, showing its profound impact on performance. Overall, our method significantly improves over existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% with respect to existing SOTA methods.
Download the pre-extracted features from here and adjust the data path used in the Docker container initialization command below accordingly.
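For example, a minimal sketch of preparing the data directory (the local path and archive name below are hypothetical; use whatever files the download actually provides):

mkdir -p /home/user/sdst_data
tar -xzf <downloaded_features_archive>.tar.gz -C /home/user/sdst_data  # adapt the extraction command to the actual archive format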
To set up the environment using Docker, follow these steps:
- Build the Docker image:

docker build -t sdst_image:latest .

- Run the Docker container:

docker run --gpus 'all' -it --rm --shm-size 200gb -v ./:/SDST -v ./model_results:/SDST/model_results -v <path_to_data>:/data sdst_image
Replace <path_to_data> with the path where you saved the data (see the Data Preparation instructions above).
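For instance, if the features were extracted to /home/user/sdst_data (a hypothetical path), the run command becomes:

docker run --gpus 'all' -it --rm --shm-size 200gb -v ./:/SDST -v ./model_results:/SDST/model_results -v /home/user/sdst_data:/data sdst_image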
This step covers the installation of additional dependencies such as RoiAlign. See the original repository for more details.
cd models/ops; python setup.py build_ext --inplace; cd ../..

To train the model from scratch, run the following command, replacing CONFIG_PATH with the path to your desired experiment configuration file:
python tools/launch.py -c ./configs/CONFIG_PATH --exp_name <experiment_name>

Concretely, to train on QVHighlights:

python tools/launch.py ./configs/qvhighlights/sdst_qvhighlights.py --exp_name debug

To train on Charades-STA:

python tools/launch.py ./configs/charades/sdst_charades.py --exp_name debug

Or to train on TACoS:

python tools/launch.py ./configs/tacos/sdst_tacos.py --exp_name debug

To evaluate the performance of a given model, run the following command:
python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --eval
For QVHighlights:
python tools/launch.py configs/qvhighlights/sdst_qvhighlights.py --checkpoint /SDST/checkpoints_sdst/checkpoint_qvhighlights.pth --eval

For Charades-STA:

python tools/launch.py configs/charades/sdst_charades.py --checkpoint /SDST/checkpoints_sdst/checkpoint_charades_sta.pth --eval

For TACoS:

python tools/launch.py configs/tacos/sdst_tacos.py --checkpoint /SDST/checkpoints_sdst/checkpoint_tacos.pth --eval

To generate a submission from a trained model, run the following command:

python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --dump

For instance, to do so for QVHighlights:

python tools/launch.py configs/qvhighlights/sdst_qvhighlights.py --checkpoint /SDST/checkpoints_sdst/checkpoint_qvhighlights.pth --dump

For any questions or inquiries, please contact david dot pujolperich at gmail dot com.
This implementation is based on the excellent work of R2-Tuning.
If you find this work useful, please cite our paper:
@inproceedings{pujol2025sparse,
title={Sparse-dense side-tuner for efficient video temporal grounding},
author={Pujol-Perich, David and Escalera, Sergio and Clap{\'e}s, Albert},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={21515--21524},
year={2025}
}
