[For merge][part 2] Support Gedit Evaulate #161
@@ -0,0 +1,118 @@
from ais_bench.benchmark.openicl.icl_prompt_template import PromptTemplate
from ais_bench.benchmark.openicl.icl_retriever import ZeroRetriever
from ais_bench.benchmark.openicl.icl_inferencer import GenInferencer
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content
from ais_bench.benchmark.datasets import (
    Aime2025Dataset,
    Aime2025JDGDataset,
)
from ais_bench.benchmark.datasets.utils.llm_judge import get_a_or_b, LLMJudgeCorrectEvaluator

aime2025_reader_cfg = dict(input_columns=["question"], output_column="answer")

aime2025_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role="HUMAN",
                    prompt="{question}\nRemember to put your final answer within \\boxed{}.",
                ),
            ],
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.

Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
6. If the candidate's answer is semantically incomplete at the end, please judge it as inconsistent.

Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: Means the answer is consistent with the standard answer.
B: Means the answer is inconsistent with the standard answer.
Just return the letters "A" or "B", with no text around it.

Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.

<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{model_answer}\n<Predicted End>\n\n

Judging the correctness of candidates' answers, please return the the letters "A" or "B" first before your thinking:
""".strip()
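A note on how a template like the one above can be filled: it mixes named placeholders ({question}, {answer}, {model_answer}) with literal braces such as \boxed{}, so a naive str.format call would trip over the literal "{}". A minimal brace-safe substitution sketch (illustrative only — this is not how PromptTemplate is actually implemented):

```python
# Sketch: fill named placeholders with plain string replacement, so the
# template's literal "{}" in \boxed{} is left untouched (str.format would
# treat it as an auto-numbered field and raise).
def fill_template(template: str, fields: dict) -> str:
    out = template
    for key, value in fields.items():
        out = out.replace("{" + key + "}", str(value))
    return out

filled = fill_template(
    "<Gold Target Begin>: {answer} <Gold Target End> \\boxed{}",
    {"answer": "42"},
)
```

Unknown placeholders and literal braces simply pass through unchanged, which is the behavior a grader template like this needs.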
aime2025_judge_infer_cfg = dict(
    judge_reader_cfg=dict(input_columns=["question", "answer", "model_answer"], output_column="model_pred_uuid"),
    judge_model=dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="judge",  # appended after the dataset abbr
        path="",
        model="",
        stream=True,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8080,
        url="",
        max_out_len=512,
        batch_size=1,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
    judge_dataset_type=Aime2025JDGDataset,
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            begin=[
                dict(
                    role='SYSTEM',
                    fallback_role='HUMAN',
                    prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
                )
            ],
            round=[
                dict(role='HUMAN', prompt=GRADER_TEMPLATE),
            ],
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

aime2025_eval_cfg = dict(
    evaluator=dict(type=LLMJudgeCorrectEvaluator),
    pred_postprocessor=dict(type=get_a_or_b),
)
aime2025_datasets = [
    dict(
        abbr="aime2025",
        type=Aime2025Dataset,
        path="ais_bench/datasets/aime2025/aime2025.jsonl",
        reader_cfg=aime2025_reader_cfg,
        infer_cfg=aime2025_infer_cfg,
        judge_infer_cfg=aime2025_judge_infer_cfg,
        eval_cfg=aime2025_eval_cfg,
    )
]
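The eval stage above post-processes the judge's reply with get_a_or_b before scoring. That helper's implementation is not part of this diff; a minimal sketch of what such a post-processor might look like (hypothetical — the real ais_bench helper may differ):

```python
import re

# Hypothetical sketch of an "A or B" verdict extractor; the actual
# ais_bench get_a_or_b implementation is not shown in this PR.
def get_a_or_b(judge_reply: str) -> str:
    # Grab the first standalone capital A or B in the judge's reply.
    match = re.search(r"\b([AB])\b", judge_reply)
    # Default to "B" (inconsistent) when no verdict letter is found.
    return match.group(1) if match else "B"
```

Defaulting to "B" on a missing verdict is a conservative choice: an unparseable judge reply counts the prediction as incorrect rather than correct.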
@@ -0,0 +1,143 @@
from ais_bench.benchmark.openicl.icl_prompt_template import PromptTemplate
from ais_bench.benchmark.openicl.icl_prompt_template.icl_prompt_template_mm import MMPromptTemplate
from ais_bench.benchmark.openicl.icl_retriever import ZeroRetriever
from ais_bench.benchmark.openicl.icl_inferencer import GenInferencer
from ais_bench.benchmark.openicl.icl_inferencer.icl_lmm_gen_inferencer import LMMGenInferencer
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content
from ais_bench.benchmark.datasets.g_edit import (
    GEditDataset,
    GEditSCJDGDataset,
    GEditPQJDGDataset,
)
from ais_bench.benchmark.datasets.utils.lmm_judge import get_lmm_point_list, LMMJudgeImageEditEvaluator

SC_GRADER_TEMPLATE = """
RULES:

Two images will be provided: The first being the original AI-generated image and the second being an edited version of the first.
The objective is to evaluate how successfully the editing instruction has been executed in the second image.

Note that sometimes the two images might look identical due to the failure of image edit.

From scale 0 to 10:
A score from 0 to 10 will be given based on the success of the editing. (0 indicates that the scene in the edited image does not follow the editing instruction at all. 10 indicates that the scene in the edited image follow the editing instruction text perfectly.)
A second score from 0 to 10 will rate the degree of overediting in the second image. (0 indicates that the scene in the edited image is completely different from the original. 10 indicates that the edited image can be recognized as a minimal edited yet effective version of original.)
Contributor (comment on lines +24 to +25): There are a couple of grammatical errors in the prompt which could be corrected for clarity.
Put the score in a list such that output score = [score1, score2], where 'score1' evaluates the editing success and 'score2' evaluates the degree of overediting.

Editing instruction: {question}
""".strip()

PQ_GRADER_TEMPLATE = """
RULES:

The image is an AI-generated image.
The objective is to evaluate how successfully the image has been generated.

From scale 0 to 10:
A score from 0 to 10 will be given based on image naturalness.
(
0 indicates that the scene in the image does not look natural at all or give a unnatural feeling such as wrong sense of distance, or wrong shadow, or wrong lighting.
Contributor: There is a grammatical error here: "give a unnatural" should be "gives an unnatural".
10 indicates that the image looks natural.
)
A second score from 0 to 10 will rate the image artifacts.
(
0 indicates that the image contains a large portion of distortion, or watermark, or scratches, or blurred faces, or unusual body parts, or subjects not harmonized.
10 indicates the image has no artifacts.
)
Put the score in a list such that output score = [naturalness, artifacts]
""".strip()
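Both grader templates ask the judge to emit its verdict as output score = [score1, score2]. The get_lmm_point_list post-processor used further down presumably extracts that pair; a rough sketch of such an extractor (an assumption — not the actual ais_bench code):

```python
import re

# Hypothetical extractor for a "[score1, score2]" pair in a judge
# reply; the real get_lmm_point_list may behave differently.
def extract_point_list(judge_reply: str):
    match = re.search(r"\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]", judge_reply)
    if match is None:
        return None  # no score list found in the reply
    return [float(match.group(1)), float(match.group(2))]
```

Returning None for an unparseable reply lets the caller decide whether to retry the judge or drop the sample.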
JDG_DATASETS_CLASS_MAP = {
    "SC": GEditSCJDGDataset,
    "PQ": GEditPQJDGDataset,
}

JDG_TEMPLATE_MAP = {
    "SC": SC_GRADER_TEMPLATE,
    "PQ": PQ_GRADER_TEMPLATE,
}

gedit_reader_cfg = dict(
    input_columns=['question', 'image'],
    output_column='task_type'
)

gedit_infer_cfg = dict(
    prompt_template=dict(
        type=MMPromptTemplate,
        template=dict(
            round=[
                dict(role="HUMAN", prompt_mm={
                    "text": {"type": "text", "text": "{question}"},
                    "image": {"type": "image_url", "image_url": {"url": "data:image/png;base64,{image}"}},
                })
            ]
        )
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=LMMGenInferencer)
)
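The {image} placeholder in the data URL above expects a base64-encoded PNG. How that payload is produced is outside this diff; a typical encoding step looks like this (sketch, assuming the image file already exists on disk):

```python
import base64

# Encode an image file into the base64 payload expected by the
# "data:image/png;base64,{image}" URL in the prompt template.
def image_to_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Example (hypothetical filename):
# data_url = "data:image/png;base64," + image_to_base64("edit_input.png")
```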
gedit_datasets = []

for metric in ["SC", "PQ"]:
    gedit_judge_infer_cfg = dict(
        judge_reader_cfg=dict(input_columns=["question", "model_answer", "image"], output_column="model_pred_uuid"),
        judge_model=dict(
            attr="service",
            type=VLLMCustomAPIChat,
            abbr=f"{metric}_judge",  # appended after the dataset abbr
            path="",
            model="",
            stream=True,
            request_rate=0,
            use_timestamp=False,
            retry=2,
            api_key="",
            host_ip="localhost",
            host_port=8080,
            url="",
            max_out_len=512,
            batch_size=16,
            trust_remote_code=False,
            generation_kwargs=dict(
                temperature=0.01,
                ignore_eos=False,
            ),
            pred_postprocessor=dict(type=extract_non_reasoning_content),
        ),
        judge_dataset_type=JDG_DATASETS_CLASS_MAP[metric],
        prompt_template=dict(
            type=MMPromptTemplate,
            template=dict(
                round=[
                    dict(role='HUMAN', prompt_mm={
                        "text": {"type": "text", "text": JDG_TEMPLATE_MAP[metric]},
                        "image": {"type": "image_url", "image_url": {"url": "data:image/png;base64,{image}"}},
                    })
                ],
            ),
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=GenInferencer),
    )

    gedit_eval_cfg = dict(
        evaluator=dict(type=LMMJudgeImageEditEvaluator, metric=metric),
        pred_postprocessor=dict(type=get_lmm_point_list),
    )

    gedit_datasets.append(
        dict(
            abbr="gedit",
            type=GEditDataset,
            path="ais_bench/datasets/GEdit-Bench",
            reader_cfg=gedit_reader_cfg,
            infer_cfg=gedit_infer_cfg,
            judge_infer_cfg=gedit_judge_infer_cfg,
            eval_cfg=gedit_eval_cfg,
        )
    )
@@ -0,0 +1,18 @@
from ais_bench.benchmark.models.local_models.qwen_image_edit_mindie_sd import QwenImageEditModel

models = [
    dict(
        attr="local",  # local or service
        type=QwenImageEditModel,  # use this for transformers >= 4.33.0; the prompt is built in chat format
        abbr='qwen-image-edit',
        path='/home/yanhe/models/Qwen-Image-Edit-2509/',  # path to the model directory; this value is just an example
        device_kwargs=dict(
        ),
        infer_kwargs=dict(  # for model parameters see huggingface.co/docs/transformers/v4.50.0/en/model_doc/auto#transformers.AutoModel.from_pretrained
            num_inference_steps=50,
            num_images_per_prompt=1,
        ),
        run_cfg=dict(num_gpus=1, num_procs=1),  # multi-GPU / multi-node settings; tasks are launched with torchrun
        batch_size=1,  # batch size per inference call
    )
]
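Config lists like the one above are plain Python dicts, so they can be sanity-checked before a run is launched. A minimal validation sketch (hypothetical helper with assumed required keys — not part of ais_bench):

```python
# Hypothetical sanity check for entries in a `models` list like the one
# above; ais_bench itself may validate configs differently. The set of
# required keys here is an assumption for illustration.
REQUIRED_KEYS = {"attr", "type", "abbr", "path", "batch_size"}

def validate_model_cfg(cfg: dict) -> list:
    """Return a list of human-readable problems (empty when the entry looks OK)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - cfg.keys())]
    if cfg.get("attr") not in ("local", "service"):
        problems.append("attr must be 'local' or 'service'")
    if cfg.get("batch_size", 0) < 1:
        problems.append("batch_size must be >= 1")
    return problems
```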
Contributor: The instruction on line 48, "Simply reply with either CORRECT, INCORRECT", contradicts the instructions on lines 46 and 55, which ask for "A" or "B". This inconsistency could confuse the LLM judge. Since the post-processor get_a_or_b expects "A" or "B", these lines should be removed for consistency.