Description
I’ve discovered a bug in the HumanEval dataset, specifically in Task ID: HumanEval/161, which causes incorrect evaluation of LLM-generated code.
In the prompt for this task, the function name is defined as `solve`, as shown below:

```python
def solve(s):
    """You are given a string s.
    if s[i] is a letter, reverse its case from lower to upper or vise versa,
    otherwise keep it as it is.
    If the string contains no letters, reverse the string.
    The function should return the resulted string.
    Examples
    solve("1234") = "4321"
    solve("ab") = "AB"
    solve("#a@C") = "#A@c"
    """
```
However, the test code for the same task calls a function named `candidate`:

```python
def check(candidate):
    # Check some simple cases
    assert candidate("AsDf") == "aSdF"
    assert candidate("1234") == "4321"
    assert candidate("ab") == "AB"
    assert candidate("#a@C") == "#A@c"
    assert candidate("#AsdfW^45") == "#aSDFw^45"
    assert candidate("#6@2") == "2@6#"

    # Check some edge cases that are easy to work out by hand.
    assert candidate("#$a^D") == "#$A^d"
    assert candidate("#ccc") == "#CCC"

    # Don't remove this line:
    ...
```
Because LLMs typically generate a function named `solve` (as the prompt instructs), this discrepancy leads to false negatives: a logically correct completion fails evaluation purely because the tests reference a different function name.
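To make the failure mode concrete, here is a minimal sketch, assuming a harness that simply concatenates the generated completion with the test code and `exec()`s the result (the variable names and the harness itself are illustrative, not this repository's actual code):

```python
# A plausible LLM completion: per the prompt, the function is named solve.
generated = '''
def solve(s):
    if any(c.isalpha() for c in s):
        return "".join(c.swapcase() if c.isalpha() else c for c in s)
    return s[::-1]
'''

# The dataset's test code, which expects its argument to be named candidate.
test = '''
def check(candidate):
    assert candidate("1234") == "4321"
    assert candidate("#a@C") == "#A@c"
'''

ns = {}
exec(generated + test, ns)
# Both solve and check are now defined, but nothing is named candidate and
# nothing ever calls check, so the asserts above are silently skipped.
print("solve" in ns, "check" in ns, "candidate" in ns)  # True True False
```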
Why this matters:
This inconsistency produces misleading evaluation results and can unfairly understate a model's performance.
Suggested Resolution:
Ensure the function name used in the prompt matches the name the test code expects (`candidate`), or vice versa; ideally, use one consistent name such as `solve` in both places.
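If renaming across the dataset is undesirable, a one-line bridge in the harness achieves the same effect. This is only a sketch under the same assumptions as above, reusing `generated` and `test` from the earlier snippet:

```python
# One appended driver line bridges the two names; in a real harness the
# name "solve" would come from the task's entry-point metadata rather
# than being hard-coded here.
program = generated + test + "\ncheck(solve)\n"
exec(program, {})  # raises AssertionError for a wrong solution, else passes
```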
Additional Issue: Test Code Is Never Executed
Beyond the function name mismatch, I also noticed that the `check()` function in the test code is defined but never called. This causes an even more serious issue during evaluation: the `assert` statements inside `check()` never run, so a submission can appear to pass regardless of correctness.
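A minimal fix on the execution side, again only a sketch and not the repository's actual runner (`run_task` and the `entry_point` parameter are hypothetical names), is to call `check()` explicitly with the completed function and surface the result:

```python
import traceback

def run_task(generated_code: str, test_code: str, entry_point: str = "solve") -> bool:
    """Define the completion and the tests, then actually invoke check()."""
    ns = {}
    try:
        exec(generated_code + "\n" + test_code, ns)
        ns["check"](ns[entry_point])  # executes the asserts; raises on failure
        return True
    except Exception:
        traceback.print_exc()
        return False

# Usage with the earlier sketch: run_task(generated, test) returns True.
```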