
Updating llava prompt and evaluation parser #3

Open
aarti-cerebras wants to merge 5 commits into jiayuww:main from CerebrasResearch:main

Conversation

@aarti-cerebras

Issue:

  1. The evaluation parser is too specific and has many conditions based on the question id.
    For example:
    https://github.com/jiayuww/SpatialEval/blob/main/evals/evaluation.py#L39
    https://github.com/jiayuww/SpatialEval/blob/main/evals/evaluation.py#L49
    https://github.com/jiayuww/SpatialEval/blob/main/evals/evaluation.py#L110

  2. The correctness check itself is flawed. For example, model outputs like the ones below get marked as correct, inflating the reported scores:

{"ref": "no", "model_output": "none", "eval_result": 1}
{"ref": "0", "model_output": "90", "eval_result": 1}
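
The failure pattern above is consistent with a substring-containment check. The function below is a hypothetical reproduction for illustration (the name `flawed_check` is ours, not from the repo), showing why "no" vs. "none" and "0" vs. "90" are both scored as matches:

```python
def flawed_check(ref: str, model_output: str) -> int:
    # Assumption: the original evaluation tests whether the reference
    # string appears anywhere in the model output, rather than requiring
    # an exact match on the extracted answer.
    return int(ref.lower() in model_output.lower())

# "no" is a substring of "none", and "0" is a substring of "90",
# so both of the failing cases above are rewarded.
```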

Updates in this PR

To fix the above issues:

  1. This PR updates the prompt to instruct the model more precisely to structure its output in the format supported by the evaluation flow.

Previous prompt:

prompt = f"{item['text']}\nFirst, provide a concise answer in one sentence. Then, elaborate on the reasoning behind your answer in a detailed, step-by-step explanation."

Updated prompt:

prompt = f"{item['text']}\nFirst, provide a single letter response that selects the option which best answers the question. Then, elaborate on the reasoning behind your answer in a detailed, step-by-step explanation. Please strictly format the final response as follows -- Answer: <A single letter option that best answers the question>.\nReason: <reasoning behind the answer>\nRead the question again:\n{item['text']}"

Diff between previous and updated prompt above: commit

  2. A robust, generic parser that better handles the variability observed in model outputs and gives a correct reward ONLY if the answer matches the reference answer (addressing the issue highlighted above). The parser handles the following variations:
    • Specified format: Answer before Reason.
    • Direct answer: the model responds with the choice text instead of a single letter.
    • Reason before Answer: we do not penalize the model during evaluation when the format is not strictly followed.

To elaborate: when the model produces a direct response, i.e. the choice text string, we consider it a valid response (if and only if the text is one of the choices). Next, we look for answers in the format "Answer:" as well as answers preceded by the optional phrase "the correct answer is". Support for additional optional phrases is left to the user to add based on the use case.
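
The parsing strategy described above can be sketched as follows. This is an illustrative reimplementation, not the PR's exact code; the function name, `choices` mapping, and the regex are assumptions:

```python
import re


def parse_answer(model_output: str, choices: dict) -> "str | None":
    """Extract a single-letter answer from free-form model output.

    `choices` maps option letters to choice text, e.g. {"A": "yes", "B": "no"}.
    Returns the matched option letter, or None if no answer is found.
    """
    text = model_output.strip()

    # 1. Direct answer: the output is exactly one of the choice texts.
    for letter, choice_text in choices.items():
        if text.lower() == choice_text.lower():
            return letter

    # 2. "Answer: X" or "the correct answer is X", anywhere in the output.
    #    Searching the whole string handles both Answer-before-Reason and
    #    Reason-before-Answer orderings. The trailing \b rejects matches on
    #    the first letter of a longer word.
    match = re.search(
        r"(?:answer\s*:|the correct answer is)\s*\(?([A-Za-z])\)?\b",
        text,
        flags=re.IGNORECASE,
    )
    if match:
        letter = match.group(1).upper()
        if letter in choices:
            return letter
    return None
```

Scoring then reduces to comparing the parsed letter against the reference letter exactly, so outputs like "none" against reference "no" are no longer rewarded.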

Testing:
Ran three evaluations on liuhaotian/llava-v1.6-mistral-7b; the scores are below. All logs were validated and are correct.

| Subset | Score 1 | Score 2 | Score 3 | Avg | Paper (reported avg of three runs) |
| --- | --- | --- | --- | --- | --- |
| SpatialReal | 0.459259259 | 0.459259259 | 0.437037037 | 0.451851852 | 0.368 (Table 6) |
| SpatialMap | 0.284 | 0.291333333 | 0.292 | 0.289111111 | 0.25 (Table 8) |
| MazeNav | 0.254666667 | 0.259333333 | 0.254 | 0.256 | 0.23 (Table 9) |
| SpatialGrid | 0.441333333 | 0.444666667 | 0.449333333 | 0.445111111 | 0.47 (Table 10) |

@aarti-cerebras
Author

Hi @jiayuww
Could you please take a look at this PR at your convenience? Would love to hear your thoughts on it.
Thanks
