
Updating llava prompt and evaluation parser #3

Open
aarti-cerebras wants to merge 5 commits into jiayuww:main from CerebrasResearch:main

Conversation

@aarti-cerebras

Issue:

  1. The evaluation parser is too specific and has many conditions based on the question id.
    For example:
    https://github.com/jiayuww/SpatialEval/blob/main/evals/evaluation.py#L39
    https://github.com/jiayuww/SpatialEval/blob/main/evals/evaluation.py#L49
    https://github.com/jiayuww/SpatialEval/blob/main/evals/evaluation.py#L110

  2. The correctness check itself is flawed. For example, model outputs like the ones below get marked as correct, inflating the reported scores:

{"ref": "no", "model_output": "none", "eval_result": 1}
{"ref": "0", "model_output": "90", "eval_result": 1}
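
The failure pattern above is consistent with a substring-containment check. The function below is a hypothetical reproduction for illustration (the name `flawed_check` is ours, not from the repo), showing why "no" vs. "none" and "0" vs. "90" are both scored as matches:

```python
def flawed_check(ref: str, model_output: str) -> int:
    # Assumption: the original evaluation tests whether the reference
    # string appears anywhere in the model output, rather than requiring
    # an exact match on the extracted answer.
    return int(ref.lower() in model_output.lower())

# "no" is a substring of "none", and "0" is a substring of "90",
# so both of the failing cases above are rewarded.
```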

Updates in this PR

To fix the above issues:

  1. This PR updates the prompt to instruct the model more precisely to structure its output in the format supported by the evaluation flow.

Previous prompt:

prompt = f"{item['text']}\nFirst, provide a concise answer in one sentence. Then, elaborate on the reasoning behind your answer in a detailed, step-by-step explanation."

Updated prompt:

prompt = f"{item['text']}\nFirst, provide a single letter response that selects the option which best answers the question. Then, elaborate on the reasoning behind your answer in a detailed, step-by-step explanation. Please strictly format the final response as follows -- Answer: <A single letter option that best answers the question>.\nReason: <reasoning behind the answer>\nRead the question again:\n{item['text']}"

Diff between previous and updated prompt above: commit

  2. A robust, generic parser that better handles the variability observed in model outputs and gives a correct reward ONLY if the answer matches the reference answer (addressing the issue highlighted above). The parser handles the following variations:
    • Specified format: Answer before Reason.
    • Direct answer: the model responds with the choice text instead of a single letter.
    • Reason before Answer: we do not penalize the model during evaluation when the format is not strictly followed.

To elaborate: when the model produces a direct response, i.e. the choice text string, we consider it a valid response (if and only if the text is one of the choices). Next, we look for answers in the format "Answer:" as well as answers preceded by the optional phrase "the correct answer is". Support for additional optional phrases is left to the user to add based on the use case.
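
The parsing strategy described above can be sketched as follows. This is an illustrative reimplementation, not the PR's exact code; the function name, `choices` mapping, and the regex are assumptions:

```python
import re


def parse_answer(model_output: str, choices: dict) -> "str | None":
    """Extract a single-letter answer from free-form model output.

    `choices` maps option letters to choice text, e.g. {"A": "yes", "B": "no"}.
    Returns the matched option letter, or None if no answer is found.
    """
    text = model_output.strip()

    # 1. Direct answer: the output is exactly one of the choice texts.
    for letter, choice_text in choices.items():
        if text.lower() == choice_text.lower():
            return letter

    # 2. "Answer: X" or "the correct answer is X", anywhere in the output.
    #    Searching the whole string handles both Answer-before-Reason and
    #    Reason-before-Answer orderings. The trailing \b rejects matches on
    #    the first letter of a longer word.
    match = re.search(
        r"(?:answer\s*:|the correct answer is)\s*\(?([A-Za-z])\)?\b",
        text,
        flags=re.IGNORECASE,
    )
    if match:
        letter = match.group(1).upper()
        if letter in choices:
            return letter
    return None
```

Scoring then reduces to comparing the parsed letter against the reference letter exactly, so outputs like "none" against reference "no" are no longer rewarded.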

Testing:
Ran three evaluations on liuhaotian/llava-v1.6-mistral-7b; the scores are below. All logs were validated and are correct.

| Subset | Score 1 | Score 2 | Score 3 | Avg | Paper (reported avg of three runs) |
| --- | --- | --- | --- | --- | --- |
| SpatialReal | 0.459259259 | 0.459259259 | 0.437037037 | 0.451851852 | 0.368 (Table 6) |
| SpatialMap | 0.284 | 0.291333333 | 0.292 | 0.289111111 | 0.25 (Table 8) |
| MazeNav | 0.254666667 | 0.259333333 | 0.254 | 0.256 | 0.23 (Table 9) |
| SpatialGrid | 0.441333333 | 0.444666667 | 0.449333333 | 0.445111111 | 0.47 (Table 10) |

@aarti-cerebras
Author

Hi @jiayuww
Could you please take a look at this PR at your convenience? Would love to hear your thoughts on it.
Thanks
