Description
I’ve discovered a bug in the HumanEval dataset, specifically in Task ID: HumanEval/161, which causes incorrect evaluation of LLM-generated code.
In the prompt for this task, the function name is defined as `solve`, as shown below:

```python
def solve(s):
    """You are given a string s.
    if s[i] is a letter, reverse its case from lower to upper or vise versa,
    otherwise keep it as it is.
    If the string contains no letters, reverse the string.
    The function should return the resulted string.
    Examples
    solve("1234") = "4321"
    solve("ab") = "AB"
    solve("#a@C") = "#A@c"
    """
```
However, the test code for the same task calls a function named `candidate`:

```python
def check(candidate):
    # Check some simple cases
    assert candidate("AsDf") == "aSdF"
    assert candidate("1234") == "4321"
    assert candidate("ab") == "AB"
    assert candidate("#a@C") == "#A@c"
    assert candidate("#AsdfW^45") == "#aSDFw^45"
    assert candidate("#6@2") == "2@6#"

    # Check some edge cases that are easy to work out by hand.
    assert candidate("#$a^D") == "#$A^d"
    assert candidate("#ccc") == "#CCC"

    # Don't remove this line:
    ...
```
Because LLMs typically generate a function named `solve` (as the prompt instructs), this discrepancy leads to false negatives: a logically correct completion fails evaluation purely because the tests reference a different function name.
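To make the failure mode concrete, here is a minimal sketch, assuming a harness that simply concatenates the generated completion with the test code and `exec()`s the result (the variable names and the harness itself are illustrative, not this repository's actual code):

```python
# A plausible LLM completion: per the prompt, the function is named solve.
generated = '''
def solve(s):
    if any(c.isalpha() for c in s):
        return "".join(c.swapcase() if c.isalpha() else c for c in s)
    return s[::-1]
'''

# The dataset's test code, which expects its argument to be named candidate.
test = '''
def check(candidate):
    assert candidate("1234") == "4321"
    assert candidate("#a@C") == "#A@c"
'''

ns = {}
exec(generated + test, ns)
# Both solve and check are now defined, but nothing is named candidate and
# nothing ever calls check, so the asserts above are silently skipped.
print("solve" in ns, "check" in ns, "candidate" in ns)  # True True False
```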
Why this matters:
This inconsistency produces misleading evaluation results and can unfairly understate a model's performance.
Suggested Resolution:
Ensure the function name used in the prompt matches the name the test code expects (`candidate`), or vice versa; ideally, use one consistent name such as `solve` in both places.
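If renaming across the dataset is undesirable, a one-line bridge in the harness achieves the same effect. This is only a sketch under the same assumptions as above, reusing `generated` and `test` from the earlier snippet:

```python
# One appended driver line bridges the two names; in a real harness the
# name "solve" would come from the task's entry-point metadata rather
# than being hard-coded here.
program = generated + test + "\ncheck(solve)\n"
exec(program, {})  # raises AssertionError for a wrong solution, else passes
```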
Additional Issue: Test Code Is Never Executed
Beyond the function name mismatch, I also noticed that the `check()` function in the test code is defined but never called. This causes an even more serious issue during evaluation: the `assert` statements inside `check()` never run, so a submission can appear to pass regardless of correctness.
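A minimal fix on the execution side, again only a sketch and not the repository's actual runner (`run_task` and the `entry_point` parameter are hypothetical names), is to call `check()` explicitly with the completed function and surface the result:

```python
import traceback

def run_task(generated_code: str, test_code: str, entry_point: str = "solve") -> bool:
    """Define the completion and the tests, then actually invoke check()."""
    ns = {}
    try:
        exec(generated_code + "\n" + test_code, ns)
        ns["check"](ns[entry_point])  # executes the asserts; raises on failure
        return True
    except Exception:
        traceback.print_exc()
        return False

# Usage with the earlier sketch: run_task(generated, test) returns True.
```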