If our work assists your research, feel free to give us a star ⭐ and cite us using:

```bibtex
@article{zhou2025mettle,
  title={Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation},
  author={Zhou, Jinxing and Li, Zhihui and Yu, Yongqiang and Zhou, Yanghao and Guo, Ruohao and Li, Guangyao and Mao, Yuxin and Han, Mingfei and Chang, Xiaojun and Wang, Meng},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
  pages={1-17},
  doi={10.1109/TPAMI.2025.3642821}
}
```
Comparison with Prior Works. (a) Mainstream parameter-efficient methods for audio-visual adaptation insert learnable adapters within each frozen transformer layer. This alters the original feature outputs and induces heavy memory overhead during backpropagation.
(b) The core idea of our memory-efficient method is to generate compact meta-tokens via parallel distillation from each transformer layer, which prevents gradient propagation through the transformer backbone.
(c) Comparison with prior state-of-the-art methods (e.g., DG-SCT and AVMoE) in terms of accuracy, trainable parameters, memory, and runtime during training on the AVEL, AVVP, and AVS tasks. The multi-source setting of AVS is illustrated. Swin-V2-L and HTS-AT are used as the visual and audio transformer backbones, respectively. Memory and runtime values represent per-sample consumption during model training (i.e., batch size = 1).
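For reference, a minimal PyTorch sketch of how such per-sample memory/runtime numbers can be measured. This is illustrative only, not necessarily the exact profiling code used for the reported values; `model`, `audio`, `frames`, `label`, and `loss_fn` are placeholders.

```python
import time
import torch

def profile_one_step(model, audio, frames, label, loss_fn):
    """Rough per-sample (batch size = 1) GPU memory / runtime probe for one training step.
    Illustrative only; the forward signature is a placeholder."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()

    logits = model(audio, frames)      # hypothetical forward signature
    loss = loss_fn(logits, label)
    loss.backward()                    # the backward pass dominates training memory

    torch.cuda.synchronize()
    runtime_s = time.time() - start
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return peak_mem_gb, runtime_s
```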

Illustration of our Meta-Token Learning (Mettle) framework. We introduce Mettle for both classification tasks, i.e., AVEL and AVVP, and the segmentation task, i.e., AVS. (a) For the classification tasks, we propose the Layer-Centric Distillation (LCD) module to distill features from each pretrained transformer layer into learnable meta-tokens. The distilled meta-tokens across transformer layers are then aggregated with simple average pooling for class prediction. LCD is applied at each timestamp for both the audio and visual modalities. (b) For the segmentation task, LCD is applied only at the final stage of the pretrained transformers to capture high-level semantics. The Meta-Token Injection (MTI) module then injects the distilled audio and visual meta-tokens into the fine-grained visual tokens embedded by earlier transformer layers, better adapting the features to the downstream task.
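For intuition, below is a minimal, single-modality PyTorch sketch of the classification pipeline in (a): learnable meta-tokens query each frozen layer's output, and the distilled meta-tokens are averaged across layers for prediction. This is an illustrative re-implementation of the idea only; class and argument names are hypothetical and do not match the code in this repository, and per-timestamp processing, the audio branch, and the MTI module for segmentation are omitted.

```python
import torch
import torch.nn as nn

class LayerCentricDistillation(nn.Module):
    """Distill one frozen transformer layer's tokens into a small set of learnable
    meta-tokens via cross-attention (meta-tokens = queries, layer tokens = keys/values)."""
    def __init__(self, dim, num_meta_tokens=4, num_heads=8):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.zeros(1, num_meta_tokens, dim))
        nn.init.trunc_normal_(self.meta_tokens, std=0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, layer_tokens):                       # (B, N, dim), detached from the backbone
        queries = self.meta_tokens.expand(layer_tokens.size(0), -1, -1)
        distilled, _ = self.attn(queries, layer_tokens, layer_tokens)
        return self.norm(distilled)                        # (B, num_meta_tokens, dim)


class MettleClassifierSketch(nn.Module):
    """Frozen transformer blocks + one LCD module per block; distilled meta-tokens are
    averaged across layers and pooled for class prediction (single modality, single timestamp)."""
    def __init__(self, backbone_blocks, dim, num_classes, num_meta_tokens=4):
        super().__init__()
        self.blocks = nn.ModuleList(backbone_blocks)       # pretrained transformer blocks
        for p in self.blocks.parameters():                 # keep the backbone frozen
            p.requires_grad_(False)
        self.lcd = nn.ModuleList(
            LayerCentricDistillation(dim, num_meta_tokens) for _ in self.blocks
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                             # (B, N, dim) embedded patch tokens
        per_layer = []
        for block, lcd in zip(self.blocks, self.lcd):
            with torch.no_grad():                          # no gradients flow through the backbone
                tokens = block(tokens)
            per_layer.append(lcd(tokens.detach()))         # distill this layer's output
        meta = torch.stack(per_layer, dim=0).mean(dim=0)   # average meta-tokens across layers
        return self.head(meta.mean(dim=1))                 # pool meta-tokens -> class logits
```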

### Installation

- Install the requirements:

  ```bash
  cd Mettle
  pip install -r requirements.txt
  ```
- Download `checkpoints.zip` from Baidu Disk (pwd: 2025), and extract it into the directory `./Mettle/`.
### AVEL Task

- Download `frames.zip` from Baidu Disk (pwd: 2025) and `wave.zip` from Baidu Disk (pwd: 2025), and extract them into the directory `./data/AVE/`.
- Go to the AVE task directory:

  ```bash
  cd Mettle/AVE
  ```
- Train/Test (modality-shared):

  ```bash
  bash run_mettle_shared.sh
  ```
- Train/Test (modality-specific):

  ```bash
  bash run_mettle_specific_swin_htsat.sh
  ```
### AVVP Task

- Download the extracted features, frames, and waves of the LLP dataset from Baidu Disk (pwd: 2025), and extract them into the directory `./data/AVVP/`.
- Go to the AVVP task directory:

  ```bash
  cd Mettle/AVVP
  ```
- Train/Test (modality-specific):

  ```bash
  bash run_mettle_specific_swin_htsat.sh
  ```
### AVS Task

- Download Dataset: download the AVSBench dataset from here. The downloaded data should be placed in the directory `./data/`.
- Download Wave: download the waves for task S4 (Baidu Disk, pwd: 2025) and task MS3 (Baidu Disk, pwd: 2025), and extract them into the directories `./data/AVSBench_data/Single-source/s4_data/` and `./data/AVSBench_data/Multi-sources/ms3_data/`, respectively.
- Download Backbones: the pretrained ResNet50/PVT-v2-b5 (vision) and VGGish (audio) backbones can be downloaded from Baidu Disk (pwd: 2025) and placed in the directory `./Mettle/AVS/pretrained_backbones/`. A quick check of the expected data layout is sketched after this list.
- Go to the AVS task directory:

  ```bash
  # for S4 task:
  cd Mettle/AVS/avs_scripts/avs_s4
  # for MS3 task:
  cd Mettle/AVS/avs_scripts/avs_ms3
  ```
- Train/Test (modality-shared):

  ```bash
  bash run_mettle_shared_swin.sh
  ```
- Train/Test (modality-specific):

  ```bash
  bash run_mettle_specific_swin_htsat.sh
  ```
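As referenced above, a small sanity check that the AVS data and backbones sit where the scripts expect them. This snippet is illustrative and not part of the repository's scripts; the paths simply mirror the download steps above.

```python
from pathlib import Path

# Paths taken from the AVS download steps above; adjust if you extracted the data elsewhere.
expected_dirs = [
    "data/AVSBench_data/Single-source/s4_data",
    "data/AVSBench_data/Multi-sources/ms3_data",
    "Mettle/AVS/pretrained_backbones",
]
for d in expected_dirs:
    status = "OK" if Path(d).is_dir() else "MISSING"
    print(f"{d}: {status}")
```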
### AVQA Task

- Download `frames.zip` from Baidu Disk (pwd: 2025) and `audio_wave.zip` from Baidu Disk (pwd: 2025), and extract them into the directory `./data/AVQA/`.
- Go to the AVQA task directory:

  ```bash
  cd Mettle/AVQA
  ```
- Audio-Visual Grounding Generation
  - Download `./grounding_gen/models_grounding_gen/lavish_grounding_gen_best.pt` from Baidu Disk (pwd: 2025) to skip the Audio-Visual Grounding Generation process.
  - Or, run the script below:

    ```bash
    python grounding_gen/main_grd_gen.py
    ```
- Train/Test (modality-specific):

  ```bash
  bash run_mettle_specific_swin_htsat.sh
  ```
Notice: If Baidu NetDisk cannot be accessed, you may download the data from AVMOE.
Our code is based on LAVisH, DG-SCT, AVMOE, AVSBench, PSP, MM-Pyr, and MUSIC-AVQA.