Fine-tuning Qwen3-VL-2B for chart question answering using SFT and GRPO.
This repository implements a two-stage training pipeline for chart question answering:
- SFT (Supervised Fine-Tuning): Initial fine-tuning on the ChartQA dataset
- GRPO (Group Relative Policy Optimization): Reinforcement learning to improve format compliance and accuracy
- Base Model: Qwen/Qwen3-VL-2B-Instruct
- SFT Model: Nhaass/Qwen3-VL-2B-ChartQA
- GRPO Model: Nhaass/Qwen3-VL-2B-ChartQA-GRPO
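For quick inference with the released checkpoints, a minimal sketch along these lines should work, assuming a recent `transformers` release with Qwen3-VL support and that the checkpoints are standalone (merged) weights; the image path and question are placeholders. If the checkpoints are LoRA adapters rather than merged weights, load the base model first and attach the adapter with `peft` instead.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the GRPO checkpoint; swap in "Nhaass/Qwen3-VL-2B-ChartQA" for the SFT-only model.
model_id = "Nhaass/Qwen3-VL-2B-ChartQA-GRPO"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One chart image plus one question, in chat format ("chart.png" is a placeholder path).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "chart.png"},
        {"type": "text", "text": "What is the highest value shown in the chart?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```

Because the reward described below encourages JSON-formatted answers, the decoded string may need a `json.loads` step to extract the final answer.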
pip install -r requirements.txt

ChartQADataset/
├── train/
│ ├── train_augmented.json
│ └── png/
├── val/
│ ├── val_augmented.json
│ └── png/
└── test/
├── test_augmented.json
└── png/
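Each split pairs an `*_augmented.json` annotation file with a `png/` image folder. A small sketch of iterating over the training split; the `imgname`/`query`/`label` field names follow the public ChartQA release and are an assumption here, so adjust if your copy differs:

```python
import json
from pathlib import Path

from PIL import Image

# Assumed dataset root and ChartQA-style annotation fields ("imgname", "query", "label").
root = Path("ChartQADataset/train")
records = json.loads((root / "train_augmented.json").read_text())

for rec in records[:3]:
    image = Image.open(root / "png" / rec["imgname"])
    print(rec["query"], "->", rec["label"], image.size)
```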
cd sft
python train.py

cd grpo
python train.py

python test_model.py

ChartQA/
├── sft/ # Supervised fine-tuning
│ ├── config.py
│ ├── model.py
│ ├── data_loader.py
│ ├── collator.py
│ ├── callbacks.py
│ ├── trainer.py
│ └── train.py
│
├── grpo/ # GRPO training
│ ├── config.py
│ ├── model.py
│ ├── data_loader.py
│ ├── rewards.py
│ ├── callbacks.py
│ ├── trainer.py
│ └── train.py
│
├── test_model.py # Evaluation script
└── requirements.txt
- Model: Qwen3-VL-2B-Thinking
- LoRA rank: 16
- Batch size: 1 (with gradient accumulation)
- Learning rate: 2e-5
- Reward components: format, accuracy, length
- Lambda weights: configurable
- KL coefficient: 0.1
- Num samples per prompt: 4
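A rough sketch of how these hyperparameters could map onto `peft`/`trl` config objects. The authoritative values live in `sft/config.py` and `grpo/config.py`; `lora_alpha`, the target modules, `output_dir`, and the gradient accumulation value below are illustrative assumptions:

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA adapter settings (rank from the list above; alpha and target modules are assumptions).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# GRPO trainer settings mirroring the values listed above.
grpo_config = GRPOConfig(
    output_dir="outputs/grpo",       # assumption
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # assumption; see grpo/config.py
    beta=0.1,                        # KL coefficient
    num_generations=4,               # samples per prompt
)
```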
GRPO uses a multi-component reward:
reward = λ_format × format × (λ_acc × accuracy + λ_len × length) + λ_format × format - 1
Where:
- format: 1 if the output follows the expected JSON format, 0 otherwise
- accuracy: 1 if the answer matches the ground truth, 0 otherwise
- length: Dynamic reward based on response length
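A minimal sketch of this reward; the JSON `"answer"` field, the exact-match accuracy check, the length-shaping curve, and the default λ values are simplified assumptions, and the actual components live in `grpo/rewards.py`:

```python
import json

def compute_reward(output: str, ground_truth: str,
                   lam_format: float = 1.0, lam_acc: float = 1.0,
                   lam_len: float = 0.1, target_len: int = 128) -> float:
    """Combine format, accuracy, and length terms as in the formula above."""
    # format: 1 if the output parses as JSON with an "answer" field (assumption), else 0.
    try:
        parsed = json.loads(output)
        fmt = 1.0 if isinstance(parsed, dict) and "answer" in parsed else 0.0
    except json.JSONDecodeError:
        parsed, fmt = None, 0.0

    # accuracy: exact string match against the ground truth (simplified).
    answer = str(parsed.get("answer", "")) if fmt else ""
    acc = 1.0 if answer.strip().lower() == ground_truth.strip().lower() else 0.0

    # length: simple shaping that prefers responses near a target length (assumption).
    length = max(0.0, 1.0 - abs(len(output) - target_len) / target_len)

    # reward = λ_format × format × (λ_acc × accuracy + λ_len × length) + λ_format × format - 1
    return lam_format * fmt * (lam_acc * acc + lam_len * length) + lam_format * fmt - 1.0
```

Note that when format is 0 the reward collapses to -1, so non-JSON outputs are penalized regardless of answer content.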
Test the model on the ChartQA test set:

python test_model.py

Results are saved to test_results.json.
You can try the live demo on Hugging Face: ChartQA-Qwen3-VL-2B Demo
MIT License - see LICENSE file for details.
@misc{chartqa-qwen3vl,
  title={ChartQA with Qwen3-VL: SFT and GRPO Training},
  author={ChartQA Project},
  year={2026},
  url={https://github.com/yourusername/ChartQA}
}