I’m encountering persistent Out-of-Memory (OOM) errors when training a multimodal model (Qwen2.5-VL-7B-Instruct) on a single A800 GPU (80GB VRAM). Below is the detailed context and troubleshooting steps I’ve tried, followed by my core question on hyperparameter tuning.
- Training Setup
GPU: Single NVIDIA A800 (80GB VRAM)
Model: Qwen2.5-VL-7B-Instruct (multimodal, 7B parameters)
Framework: PyTorch + DeepSpeed (Zero-3 with offload configured)
Key Training Command Snippet:
```bash
# local_scripts/zero3.json: Zero-3 with optimizer offload to CPU
# --num_generations 6, --per_device_train_batch_size 48,
# --gradient_accumulation_steps 1: initial values that hit OOM
torchrun --nproc_per_node="1" \
    --master_addr="127.0.0.1" --master_port="12345" \
    grpo_jsonl.py \
    --deepspeed local_scripts/zero3.json \
    --model_name_or_path Qwen2.5-VL-7B-Instruct \
    --num_generations 6 \
    --per_device_train_batch_size 48 \
    --gradient_accumulation_steps 1 \
    --bf16 \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2
```
- Core Question
For a single A800 (80GB) running Qwen2.5-VL-7B-Instruct with DeepSpeed Zero-3 (optimizer offload) + BF16 + gradient checkpointing + FlashAttention-2: what is the maximum stable combination of the following hyperparameters that avoids OOM?
--num_generations (number of generations per prompt)
--per_device_train_batch_size (batch size per GPU)
--gradient_accumulation_steps (steps to accumulate gradients)
Are there any known benchmarks or rules of thumb for 7B multimodal models on an A800 (80GB) to balance these parameters? (One illustrative lower-memory combination is sketched below.)
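For reference, this is the kind of reduced-memory combination I plan to try next. The specific numbers are guesses rather than a verified maximum; the idea is to shrink the per-device batch and the number of generations, and recover the effective batch size of 48 through gradient accumulation:

```bash
# Illustrative values only, not a verified maximum for this setup.
# per_device_train_batch_size is kept a multiple of num_generations,
# which GRPO implementations typically require.
# Effective batch size: 4 x 12 = 48, same as the original 48 x 1.
torchrun --nproc_per_node="1" \
    --master_addr="127.0.0.1" --master_port="12345" \
    grpo_jsonl.py \
    --deepspeed local_scripts/zero3.json \
    --model_name_or_path Qwen2.5-VL-7B-Instruct \
    --num_generations 4 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 12 \
    --bf16 \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2
```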
- Additional Context
The model includes visual encoders (multimodal), which add extra VRAM overhead compared to text-only 7B models.
I’ve verified no other processes are using GPU memory (nvidia-smi shows 0% usage pre-training).
DeepSpeed Zero-3 config: stage: 3, offload_optimizer: {device: "cpu"}, offload_param: {device: "cpu"}. I tried both with and without parameter offload; parameter offload made training slower but still hit OOM. A rough sketch of the config is below.
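For completeness, local_scripts/zero3.json can be recreated roughly like this (paraphrased from memory; the pin_memory and "auto" values are assumptions, not an exact copy of my file):

```bash
# Roughly what local_scripts/zero3.json contains (paraphrased from memory;
# the pin_memory and "auto" values are assumptions, not an exact copy of my file).
cat > local_scripts/zero3.json <<'EOF'
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF
```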