I tested TransNormerLLM-385M on the boolq benchmark with lm-eval-harness v0.4.0. However, my result does not match the result you reported. Beyond boolq and the 385M model, I also could not reproduce other benchmarks and models; the results are significantly lower.
Could I have made a mistake in how I ran the benchmark? Could you please share the script you used to measure the reported numbers?
hf (pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 4
|Tasks|Version|Filter|n-shot|Metric|Value | |Stderr|
|-----|-------|------|-----:|------|-----:|---|-----:|
|boolq|Yaml |none | 0|acc |0.4859|± |0.0087|
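For reference, the run above corresponds to an lm-eval-harness v0.4.0 invocation roughly like the following (a sketch reconstructed from the config line printed above; the exact flags behind the reported numbers are what I am asking about):

```shell
# Pin the harness version I tested with
pip install lm-eval==0.4.0

# Evaluate the 385M checkpoint on boolq, zero-shot, batch size 4
lm_eval --model hf \
    --model_args pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True \
    --tasks boolq \
    --batch_size 4
```

If the reported results were produced with different flags (e.g. a non-zero `--num_fewshot`, a different harness version, or custom generation kwargs), that could explain the gap.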