About the evaluation performance

Thanks for the great benchmark!

I am trying to reproduce some model results on C_simQA with OpenCompass, using Qwen-2.5-72B-instruct as the judge model. However, the scores I get are significantly higher than those reported in the paper.

Could the choice of judge model account for such a large difference? After reviewing some cases, I found that the judge's verdicts are generally in line with expectations.

Do you have any suggestions?

![Image](https://github.com/user-attachments/assets/7d5b400e-0895-4b08-8de3-71e4fe8e392f)


About the evaluation performance #3
