
Benchmark results cannot be reproduced #8

@waneon

Description

I evaluated TransNormerLLM-385M on the BoolQ benchmark with lm-evaluation-harness (v0.4.0), but my result does not match the one you reported. The same is true beyond BoolQ and the 385M model: the other benchmarks and models I tried also cannot be reproduced and come out significantly lower.
Could I have made a mistake in how I ran the evaluation? Could you please share the script you used to measure the benchmarks?

hf (pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 4
|Tasks|Version|Filter|n-shot|Metric|Value |   |Stderr|
|-----|-------|------|-----:|------|-----:|---|-----:|
|boolq|Yaml   |none  |     0|acc   |0.4859|±  |0.0087|
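
For reference, the table above should correspond to an invocation along the following lines. This is a sketch reconstructed from the config header printed by the harness, so the exact flags I used may have differed slightly:

```shell
# lm-evaluation-harness v0.4.0
# Reconstructed from the config line above (pretrained model, trust_remote_code,
# batch_size=4, default num_fewshot); not necessarily the exact command I ran.
lm_eval --model hf \
    --model_args pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True \
    --tasks boolq \
    --batch_size 4
```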
