Open
Description
Thanks for the great benchmark!
I am trying to reproduce some of the model results on C_simQA with OpenCompass, using Qwen-2.5-72B-instruct as the judge model. However, the scores I obtain are significantly higher than those reported in the paper.
Could the choice of judge model account for such a large difference? After reviewing some cases, I found that the judge's verdicts are generally in line with expectations.
Do you have any suggestions?
