
Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation

Paper link

📄 Citation

If our work assists your research, feel free to give us a star ⭐ and cite us using:

@article{zhou2025mettle,
  title={Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation}, 
  author={Zhou, Jinxing and Li, Zhihui and Yu, Yongqiang and Zhou, Yanghao and Guo, Ruohao and Li, Guangyao and Mao, Yuxin and Han, Mingfei and Chang, Xiaojun and Wang, Meng},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  year={2025},
  pages={1-17},
  doi={10.1109/TPAMI.2025.3642821}
}

📝 Introduction

Comparison with Prior Works. (a) Mainstream parameter-efficient methods for audio-visual adaptation insert learnable adapters within each frozen transformer layer. This alters the original feature outputs and induces heavy memory overhead during backpropagation. (b) The core idea of our memory-efficient method is to generate compact meta-tokens via parallel distillation from each transformer layer, preventing gradient propagation through the transformer backbone. (c) Comparison with prior state-of-the-art methods (e.g., DG-SCT and AVMoE) in terms of accuracy, trainable parameters, memory, and runtime during training on the AVEL, AVVP, and AVS tasks. The multi-source setting of AVS is illustrated. Swin-V2-L and HTS-AT are used as the visual and audio transformer backbones, respectively. Memory and runtime values represent the per-sample consumption during model training (i.e., batch size = 1).
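To make the idea in (b) concrete, below is a minimal PyTorch sketch of layer-centric distillation under our own simplifying assumptions; the class name, token count, and cross-attention layout are hypothetical and not taken from the released code. Learnable meta-tokens attend to the detached output of a frozen transformer layer, so no gradients propagate back through the backbone, and the per-layer meta-tokens are averaged for classification.

# Illustrative sketch only (hypothetical module/argument names, not the released code):
# learnable meta-tokens cross-attend to the detached output of each frozen
# transformer layer, so no gradients flow back through the backbone.
import torch
import torch.nn as nn


class LayerCentricDistillation(nn.Module):
    def __init__(self, dim: int, num_meta_tokens: int = 4, num_heads: int = 4):
        super().__init__()
        # Compact learnable meta-tokens distilled from one transformer layer.
        self.meta_tokens = nn.Parameter(torch.zeros(1, num_meta_tokens, dim))
        nn.init.trunc_normal_(self.meta_tokens, std=0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, layer_tokens: torch.Tensor) -> torch.Tensor:
        # layer_tokens: (B, N, dim) output of a frozen transformer layer.
        # Detaching blocks gradient propagation through the backbone.
        kv = layer_tokens.detach()
        q = self.meta_tokens.expand(kv.size(0), -1, -1)
        distilled, _ = self.attn(q, kv, kv)
        return self.norm(distilled)  # (B, num_meta_tokens, dim)


if __name__ == "__main__":
    lcd_per_layer = nn.ModuleList([LayerCentricDistillation(dim=768) for _ in range(12)])
    frozen_layer_outputs = [torch.randn(2, 196, 768) for _ in range(12)]  # dummy features
    meta = torch.stack([lcd(x) for lcd, x in zip(lcd_per_layer, frozen_layer_outputs)], dim=1)
    # Aggregate meta-tokens across layers by simple average pooling for classification.
    pooled = meta.mean(dim=(1, 2))  # (B, dim)
    print(pooled.shape)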

Illustration of our Meta-Token Learning (Mettle) framework. We introduce Mettle for both the classification tasks (AVEL and AVVP) and the segmentation task (AVS). (a) For the classification tasks, we propose the Layer-Centric Distillation (LCD) module to distill features from each pretrained transformer layer into learnable meta-tokens. The distilled meta-tokens across transformer layers are further aggregated using simple average pooling for class prediction. LCD is applied at each timestamp for the audio and visual modalities. (b) For the segmentation task, LCD is applied only at the final stage of the pretrained transformers to capture high-level semantics. Then, through the Meta-Token Injection (MTI) module, the distilled audio and visual meta-tokens are injected into the fine-grained visual tokens embedded by earlier transformer layers, better adapting the features for the downstream task.
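For the segmentation branch, the sketch below illustrates how meta-token injection could look; again, the class and argument names are hypothetical and this is not the released implementation. Fine-grained visual tokens from an earlier frozen stage cross-attend to the distilled audio and visual meta-tokens and are fused back through a residual connection before the segmentation head.

# Illustrative sketch only (hypothetical names): fine-grained visual tokens
# query the distilled audio/visual meta-tokens and are fused residually.
import torch
import torch.nn as nn


class MetaTokenInjection(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, audio_meta, visual_meta):
        # visual_tokens: (B, N, dim) fine-grained tokens from an earlier frozen stage.
        # audio_meta / visual_meta: (B, M, dim) meta-tokens distilled by LCD.
        meta = torch.cat([audio_meta, visual_meta], dim=1)   # (B, 2M, dim)
        injected, _ = self.attn(self.norm(visual_tokens), meta, meta)
        return visual_tokens + injected  # residual fusion for the downstream head


if __name__ == "__main__":
    mti = MetaTokenInjection(dim=256)
    vis = torch.randn(2, 1024, 256)        # e.g., flattened spatial tokens
    a_meta = torch.randn(2, 4, 256)
    v_meta = torch.randn(2, 4, 256)
    print(mti(vis, a_meta, v_meta).shape)  # torch.Size([2, 1024, 256])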

🤗 Requirements and Installation

  • Getting Started
    cd Mettle
    pip install -r requirements.txt
  • Download HTS-AT Backbone

    Download checkpoints.zip from Baidu Disk (pwd: 2025), and extract it into the directory ./Mettle/.

AVE

  • Download Data

    Download frames.zip from Baidu Disk (pwd: 2025) and wave.zip from Baidu Disk (pwd: 2025), and extract them into the directory ./data/AVE/.

  • Usage

    Go to AVE task directory.

    cd Mettle/AVE
    
  • Train/Test (modality-shared)

    bash run_mettle_shared.sh
  • Train/Test (modality-specific)

    bash run_mettle_specific_swin_htsat.sh

AVVP

  • Download Data

    Download the extracted features, frames, and waveforms of the LLP dataset from Baidu Disk (pwd: 2025), and extract them into the directory ./data/AVVP/.

  • Usage

    Go to AVVP task directory:

    cd Mettle/AVVP
    
  • Train/Test (modality-specific)

    bash run_mettle_specific_swin_htsat.sh
    

AVS

  • Download Data
    • Download Dataset

      Download AVSBench dataset from here.

      The downloaded data should be placed in the directory ./data/.

    • Download Wave

      Download the waveforms for task S4 (Baidu Disk (pwd: 2025)) and task MS3 (Baidu Disk (pwd: 2025)), and extract them into the directories ./data/AVSBench_data/Single-source/s4_data/ and ./data/AVSBench_data/Multi-sources/ms3_data/, respectively.

  • Download pretrained backbones

    The pretrained ResNet50/PVT-v2-b5 (vision) and VGGish (audio) backbones can be downloaded from Baidu Disk (pwd: 2025) and placed in the directory ./Mettle/AVS/pretrained_backbones/.

  • Usage

    Go to AVS task directory.

    # for S4 task:
    cd Mettle/AVS/avs_scripts/avs_s4
    
    # for MS3 task:
    cd Mettle/AVS/avs_scripts/avs_ms3
  • Train/Test (modality-shared)

    bash run_mettle_shared_swin.sh
  • Train/Test (modality-specific)

    bash run_mettle_specific_swin_htsat.sh

AVQA

  • Download Data

    Download frames.zip from Baidu Disk (pwd: 2025) and audio_wave.zip from Baidu Disk (pwd: 2025), and extract them into the directory ./data/AVQA/.

  • Usage

    Go to AVQA task directory.

    cd Mettle/AVQA
    
  • Audio-Visual Grounding Generation

    • Download the ./grounding_gen/models_grounding_gen/lavish_grounding_gen_best.pt from Baidu Disk (pwd: 2025) to skip the Audio-Visual Grounding Generation process.

    • Or, run the script below:

      python grounding_gen/main_grd_gen.py
  • Train/Test (modality-specific)

    bash run_mettle_specific_swin_htsat.sh
    

Notice: If Baidu NetDisk cannot be accessed, you may download the data from AVMOE.

👍 Acknowledgments

Our code is based on LAVisH, DG-SCT, AVMOE, AVSBench, PSP, MM-Pyr, and MUSIC-AVQA.