If our work assists your research, feel free to give us a star ⭐ and cite us using:

```bibtex
@article{zhou2025mettle,
  title={Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation},
  author={Zhou, Jinxing and Li, Zhihui and Yu, Yongqiang and Zhou, Yanghao and Guo, Ruohao and Li, Guangyao and Mao, Yuxin and Han, Mingfei and Chang, Xiaojun and Wang, Meng},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
  pages={1-17},
  doi={10.1109/TPAMI.2025.3642821}
}
```
Comparison with Prior Works. (a) Mainstream parameter-efficient methods for audio-visual adaptation insert learnable adapters within each frozen transformer layer. This alters the original feature outputs and induces heavy memory overhead during backpropagation.
(b) The core idea of our memory-efficient method is to generate compact meta-tokens via parallel distillation from each transformer layer, which prevents gradient propagation through the transformer backbone.
(c) Comparison with prior state-of-the-art methods (e.g., DG-SCT and AVMoE) in terms of accuracy, trainable parameters, memory, and runtime during training on the AVEL, AVVP, and AVS tasks. The multi-source setting of AVS is illustrated. Swin-V2-L and HTS-AT are used as the visual and audio transformer backbones, respectively. Memory and runtime values represent per-sample consumption during model training (i.e., batch size = 1).
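For reference, a minimal PyTorch sketch of how such per-sample memory/runtime numbers can be measured. This is illustrative only, not necessarily the exact profiling code used for the reported values; `model`, `audio`, `frames`, `label`, and `loss_fn` are placeholders.

```python
import time
import torch

def profile_one_step(model, audio, frames, label, loss_fn):
    """Rough per-sample (batch size = 1) GPU memory / runtime probe for one training step.
    Illustrative only; the forward signature is a placeholder."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()

    logits = model(audio, frames)      # hypothetical forward signature
    loss = loss_fn(logits, label)
    loss.backward()                    # the backward pass dominates training memory

    torch.cuda.synchronize()
    runtime_s = time.time() - start
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return peak_mem_gb, runtime_s
```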

Illustration of our Meta-Token Learning (Mettle) framework. We introduce Mettle for both classification tasks, i.e., AVEL and AVVP, and the segmentation task, i.e., AVS. (a) For the classification tasks, we propose the Layer-Centric Distillation (LCD) module to distill features from each pretrained transformer layer into learnable meta-tokens. The distilled meta-tokens across transformer layers are then aggregated with simple average pooling for class prediction. LCD is applied at each timestamp for both the audio and visual modalities. (b) For the segmentation task, LCD is applied only at the final stage of the pretrained transformers to capture high-level semantics. The Meta-Token Injection (MTI) module then injects the distilled audio and visual meta-tokens into the fine-grained visual tokens embedded by earlier transformer layers, better adapting the features to the downstream task.
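For intuition, below is a minimal, single-modality PyTorch sketch of the classification pipeline in (a): learnable meta-tokens query each frozen layer's output, and the distilled meta-tokens are averaged across layers for prediction. This is an illustrative re-implementation of the idea only; class and argument names are hypothetical and do not match the code in this repository, and per-timestamp processing, the audio branch, and the MTI module for segmentation are omitted.

```python
import torch
import torch.nn as nn

class LayerCentricDistillation(nn.Module):
    """Distill one frozen transformer layer's tokens into a small set of learnable
    meta-tokens via cross-attention (meta-tokens = queries, layer tokens = keys/values)."""
    def __init__(self, dim, num_meta_tokens=4, num_heads=8):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.zeros(1, num_meta_tokens, dim))
        nn.init.trunc_normal_(self.meta_tokens, std=0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, layer_tokens):                       # (B, N, dim), detached from the backbone
        queries = self.meta_tokens.expand(layer_tokens.size(0), -1, -1)
        distilled, _ = self.attn(queries, layer_tokens, layer_tokens)
        return self.norm(distilled)                        # (B, num_meta_tokens, dim)


class MettleClassifierSketch(nn.Module):
    """Frozen transformer blocks + one LCD module per block; distilled meta-tokens are
    averaged across layers and pooled for class prediction (single modality, single timestamp)."""
    def __init__(self, backbone_blocks, dim, num_classes, num_meta_tokens=4):
        super().__init__()
        self.blocks = nn.ModuleList(backbone_blocks)       # pretrained transformer blocks
        for p in self.blocks.parameters():                 # keep the backbone frozen
            p.requires_grad_(False)
        self.lcd = nn.ModuleList(
            LayerCentricDistillation(dim, num_meta_tokens) for _ in self.blocks
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                             # (B, N, dim) embedded patch tokens
        per_layer = []
        for block, lcd in zip(self.blocks, self.lcd):
            with torch.no_grad():                          # no gradients flow through the backbone
                tokens = block(tokens)
            per_layer.append(lcd(tokens.detach()))         # distill this layer's output
        meta = torch.stack(per_layer, dim=0).mean(dim=0)   # average meta-tokens across layers
        return self.head(meta.mean(dim=1))                 # pool meta-tokens -> class logits
```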

### Installation

- Install the requirements:

  ```bash
  cd Mettle
  pip install -r requirements.txt
  ```
- Download `checkpoints.zip` from Baidu Disk (pwd: 2025), and extract it into the directory `./Mettle/`.
### AVEL Task

- Download `frames.zip` from Baidu Disk (pwd: 2025) and `wave.zip` from Baidu Disk (pwd: 2025), and extract them into the directory `./data/AVE/`.
- Go to the AVE task directory:

  ```bash
  cd Mettle/AVE
  ```
- Train/Test (modality-shared):

  ```bash
  bash run_mettle_shared.sh
  ```
- Train/Test (modality-specific):

  ```bash
  bash run_mettle_specific_swin_htsat.sh
  ```
### AVVP Task

- Download the extracted features, frames, and waves of the LLP dataset from Baidu Disk (pwd: 2025), and extract them into the directory `./data/AVVP/`.
- Go to the AVVP task directory:

  ```bash
  cd Mettle/AVVP
  ```
- Train/Test (modality-specific):

  ```bash
  bash run_mettle_specific_swin_htsat.sh
  ```
### AVS Task

- Download Dataset: download the AVSBench dataset from here. The downloaded data should be placed in the directory `./data/`.
- Download Wave: download the waves for task S4 (Baidu Disk, pwd: 2025) and task MS3 (Baidu Disk, pwd: 2025), and extract them into the directories `./data/AVSBench_data/Single-source/s4_data/` and `./data/AVSBench_data/Multi-sources/ms3_data/`, respectively.
- Download Backbones: the pretrained ResNet50/PVT-v2-b5 (vision) and VGGish (audio) backbones can be downloaded from Baidu Disk (pwd: 2025) and placed in the directory `./Mettle/AVS/pretrained_backbones/`. A quick check of the expected data layout is sketched after this list.
- Go to the AVS task directory:

  ```bash
  # for S4 task:
  cd Mettle/AVS/avs_scripts/avs_s4
  # for MS3 task:
  cd Mettle/AVS/avs_scripts/avs_ms3
  ```
- Train/Test (modality-shared):

  ```bash
  bash run_mettle_shared_swin.sh
  ```
- Train/Test (modality-specific):

  ```bash
  bash run_mettle_specific_swin_htsat.sh
  ```
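As referenced above, a small sanity check that the AVS data and backbones sit where the scripts expect them. This snippet is illustrative and not part of the repository's scripts; the paths simply mirror the download steps above.

```python
from pathlib import Path

# Paths taken from the AVS download steps above; adjust if you extracted the data elsewhere.
expected_dirs = [
    "data/AVSBench_data/Single-source/s4_data",
    "data/AVSBench_data/Multi-sources/ms3_data",
    "Mettle/AVS/pretrained_backbones",
]
for d in expected_dirs:
    status = "OK" if Path(d).is_dir() else "MISSING"
    print(f"{d}: {status}")
```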
### AVQA Task

- Download `frames.zip` from Baidu Disk (pwd: 2025) and `audio_wave.zip` from Baidu Disk (pwd: 2025), and extract them into the directory `./data/AVQA/`.
- Go to the AVQA task directory:

  ```bash
  cd Mettle/AVQA
  ```
- Audio-Visual Grounding Generation
  - Download `./grounding_gen/models_grounding_gen/lavish_grounding_gen_best.pt` from Baidu Disk (pwd: 2025) to skip the Audio-Visual Grounding Generation process.
  - Or, run the script below:

    ```bash
    python grounding_gen/main_grd_gen.py
    ```
- Train/Test (modality-specific):

  ```bash
  bash run_mettle_specific_swin_htsat.sh
  ```
Notice: If Baidu NetDisk cannot be accessed, you may download the data from AVMOE.
Our code is based on LAVisH, DG-SCT, AVMOE, AVSBench, PSP, MM-Pyr, and MUSIC-AVQA.