Check whether GRPO further improves scores when applied to an SFT fine-tuned model.
- Model: Qwen2.5-1.5B
- Number of Data: 1000(sft),1000(rl),1000(eval)
- Epochs: 20(sft),1by1(rl)
- Methods: SFT,GRPO
Check whether the RL methods further improve scores when applied to an SFT fine-tuned model.
- Model: Qwen2.5-0.5B
- Number of Data: 400(sft,rl),400(eval)
- Epochs: 10(sft),1(rl)
- Methods: SFT,GRPO,CPPO,DRGRPO,DRGRPOCPPO,RAFT,REINFORCE
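
For reference, the two setups above could be written down as plain configuration objects. This is only a sketch: `ExperimentConfig`, `EXPERIMENTS`, and `rl_methods` are hypothetical names that do not come from the experiments themselves, the RL epoch value is kept as the literal string from the notes, and `400(sft,rl)` is read as 400 examples for each stage (whether the same examples are shared is not stated).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container mirroring one experiment entry from the notes."""
    model: str
    n_sft: int        # number of SFT training examples
    n_rl: int         # number of RL training examples
    n_eval: int       # number of evaluation examples
    sft_epochs: int
    rl_epochs: str    # kept verbatim from the notes (e.g. "1by1"), not interpreted
    methods: Tuple[str, ...]


EXPERIMENTS = (
    ExperimentConfig(
        model="Qwen2.5-1.5B",
        n_sft=1000, n_rl=1000, n_eval=1000,
        sft_epochs=20, rl_epochs="1by1",
        methods=("SFT", "GRPO"),
    ),
    ExperimentConfig(
        model="Qwen2.5-0.5B",
        # Assumption: "400(sft,rl)" means 400 examples for each of SFT and RL.
        n_sft=400, n_rl=400, n_eval=400,
        sft_epochs=10, rl_epochs="1",
        methods=("SFT", "GRPO", "CPPO", "DRGRPO", "DRGRPOCPPO", "RAFT", "REINFORCE"),
    ),
)


def rl_methods(cfg: ExperimentConfig) -> Tuple[str, ...]:
    """Return the methods that run on top of the SFT baseline."""
    return tuple(m for m in cfg.methods if m != "SFT")


if __name__ == "__main__":
    for cfg in EXPERIMENTS:
        print(cfg.model, "->", ", ".join(rl_methods(cfg)))
```

Keeping SFT in the method list and filtering it out with `rl_methods` makes it explicit that every RL method is compared against the same SFT baseline.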