
[feature] [big PR only for review] Support GEdit evaluate #155

Closed
SJTUyh wants to merge 62 commits into AISBench:master from SJTUyh:edit_dev_eval

Conversation

SJTUyh (Collaborator) commented Mar 3, 2026:

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type / PR类型

  • Feature
  • Bugfix
  • Docs
  • CI/CD
  • Refactor
  • Perf
  • Dependency
  • Test-Cases
  • Other

Related Issue
Fixes #(issue ID) / Relates to #(issue ID)

🔍 Motivation

Please describe the motivation of this PR and the goal you want to achieve through it.

📝 Modification

Please briefly describe what modification is made in this PR.

📐 Associated Test Results

Please provide links to the related test results, such as CI pipelines, test reports, etc.

(Two test-result screenshots were attached here.)

⚠️ BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories? If so, please describe how it breaks compatibility and how downstream projects should modify their code to keep compatibility with this PR.

⚠️ Performance degradation (Optional)

If the modification introduces performance degradation, please describe its impact and the expected performance improvement.

🌟 Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

✅ Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that caused the bug is added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

  • If the modification has a potential influence on downstream or other related projects, this PR should be tested with those projects.
  • The CLA has been signed by all committers in this PR.

👥 Collaboration Info

  • Suggested Reviewers: @xxx
  • Relevant Module Owners: @xxx
  • Other Collaboration Notes:

🌟 Useful CI Commands

Command / Introduction
/gemini review — Performs a Gemini code review of the pull request in its current state.
/gemini summary — Provides a Gemini summary of the pull request in its current state.
/gemini help — Displays the list of available Gemini commands.
/readthedocs build — Triggers a Read the Docs documentation build for the pull request in its current state.

SJTUyh changed the title from "[feature] [sub feature 4] Support GEdit evaluate" to "[feature] [big PR only for review] Support GEdit evaluate" on Mar 4, 2026
Copilot AI (Contributor) left a comment:
Pull request overview

Adds end-to-end support for running and evaluating the GEdit image-editing benchmark within AISBench, including a local Qwen-Image-Edit (MindIE SD) integration and a new judge/eval workflow to score edited images via LMM/LLM judging.

Changes:

  • Introduces GEdit dataset + LMM/LLM judge dataset/evaluator utilities and configs (including multi-device run example).
  • Extends CLI workflow to run judge inference as a first-class step and adjusts output-handling to propagate data_abbr.
  • Adds Qwen-Image-Edit (MindIE SD) third-party pipeline/transformer/scheduler + a local model wrapper.

Reviewed changes

Copilot reviewed 45 out of 53 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/UT/tasks/test_openicl_api_infer.py Updates UT setup to include a default task_state_manager.
tests/UT/openicl/icl_inferencer/output_handler/test_ppl_inferencer_output_handler.py Updates tests for new get_result(..., data_abbr, ...) signature.
tests/UT/openicl/icl_inferencer/output_handler/test_gen_inferencer_output_handler.py Updates tests for new data_abbr plumbing and adjusts mocking.
tests/UT/openicl/icl_inferencer/output_handler/test_bfcl_v3_output_handler.py Updates tests for new data_abbr plumbing.
tests/UT/cli/test_workers.py Adjusts unit tests for worker behavior/config expectations.
ais_bench/tools/dataset_processors/gedit/display_results.py New helper script to parse and tabulate GEdit judge results.
ais_bench/tools/dataset_processors/gedit/convert_preds.py New helper script to convert predictions into GEdit expected folder format.
ais_bench/tools/dataset_processors/__init__.py Package marker for dataset processor tooling.
ais_bench/tools/__init__.py Package marker for tools.
ais_bench/third_party/mindie_sd/qwenimage_edit/transformer_qwenimage.py Adds QwenImage transformer implementation (third-party) with NPU-specific paths.
ais_bench/third_party/mindie_sd/qwenimage_edit/scheduling_flow_match_euler_discrete.py Adds FlowMatch Euler scheduler implementation (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/pipeline_qwenimage_edit_plus.py Adds QwenImage Edit Plus diffusion pipeline (third-party) with CFG/SP support.
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/utils.py Adds distributed rank grouping helpers (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/parallel_mgr.py Adds distributed parallel environment management (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/group_coordinator.py Adds process-group wrapper and collectives helpers (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/all_to_all.py Adds all-to-all utilities for sequence parallel (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/__init__.py Package marker for distributed helpers.
ais_bench/third_party/mindie_sd/qwenimage_edit/attn_layer.py Adds a long-context attention implementation using fused attention + SP.
ais_bench/third_party/mindie_sd/qwenimage_edit/__init__.py Package marker for qwenimage_edit.
ais_bench/third_party/mindie_sd/__init__.py Package marker for mindie_sd third-party integration.
ais_bench/configs/lmm_exmaple/multi_device_run_qwen_image_edit.py New multi-device example config for Qwen-Image-Edit + GEdit.
ais_bench/benchmark/utils/prompt/prompt.py Uses deepcopy for mm template content to avoid shared-mutation bugs.
ais_bench/benchmark/utils/image_process.py Adds PIL-to-base64 utility for image datasets/prompts.
ais_bench/benchmark/utils/file/file.py Adds JSONL load/dump helpers using mmap + orjson.
ais_bench/benchmark/utils/config/build.py Extends dataset builder to pass task_state_manager into dataset configs.
ais_bench/benchmark/tasks/openicl_infer.py Passes task_state_manager into dataset construction.
ais_bench/benchmark/tasks/openicl_eval.py Passes task_state_manager into dataset construction and updates task run signature.
ais_bench/benchmark/tasks/openicl_api_infer.py Passes task_state_manager into dataset construction and stores it in run().
ais_bench/benchmark/openicl/icl_prompt_template/icl_prompt_template_mm.py Minor formatting tweak in mm template generation path.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/ppl_inferencer_output_handler.py Updates handler API to accept data_abbr.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/lmm_gen_inferencer_output_handler.py New output handler for LMM generation outputs (incl. saving images).
ais_bench/benchmark/openicl/icl_inferencer/output_handler/gen_inferencer_output_handler.py Extends handler API to accept data_abbr and adds prompt URL truncation logic.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/bfcl_v3_output_handler.py Extends handler API to accept data_abbr.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/base_handler.py Threads data_abbr through result generation + cache consumer dispatch.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/__init__.py Package marker for output handlers.
ais_bench/benchmark/openicl/icl_inferencer/icl_lmm_gen_inferencer.py New inferencer wiring for LMM generation + new handler usage.
ais_bench/benchmark/models/output.py Adds LMMOutput supporting mixed (image/text) outputs and saving images.
ais_bench/benchmark/models/local_models/qwen_image_edit_mindie_sd.py New local model wrapper for Qwen-Image-Edit MindIE SD pipeline.
ais_bench/benchmark/models/local_models/base.py Refactors base model abstractions and adds BaseLMModel.
ais_bench/benchmark/models/local_models/__init__.py Package marker for local models.
ais_bench/benchmark/datasets/utils/lmm_judge.py Adds LMM judge utilities + datasets for image-edit judging (base64 conversion).
ais_bench/benchmark/datasets/utils/llm_judge.py Adds LLM judge dataset + correctness evaluator utilities.
ais_bench/benchmark/datasets/utils/datasets.py Minor whitespace normalization.
ais_bench/benchmark/datasets/g_edit.py Adds GEdit dataset loader + SC/PQ judge dataset types.
ais_bench/benchmark/datasets/base.py Adds TaskStateManager propagation + JDG dataset base implementation.
ais_bench/benchmark/datasets/aime2025.py Adds Aime2025 judge dataset type and import cleanups.
ais_bench/benchmark/configs/models/lmm_models/qwen_image_edit.py Adds model config for Qwen-Image-Edit local model wrapper.
ais_bench/benchmark/configs/datasets/gedit/gedit_gen_0_shot_llmjudge.py Adds GEdit dataset config + SC/PQ judge configs and evaluator setup.
ais_bench/benchmark/configs/datasets/aime2025/aime2025_gen_0_shot_llmjudge.py Adds AIME2025 0-shot + judge/eval config.
ais_bench/benchmark/cli/workers.py Adds JudgeInfer worker and integrates it into workflows.
ais_bench/benchmark/cli/config_manager.py Minor whitespace normalization.
ais_bench/benchmark/cli/argument_parser.py Adds new CLI modes: judge and infer_judge.


Comment on lines +114 to +119
#self.device_id = device_kwargs.get('device_id', DEFAULT_DEVICE_ID)
# 在这里声明环境变量
self.logger.debug(f"device id from kwargs: {device_kwargs.get('device_id', DEFAULT_DEVICE_ID)}")
os.environ["ASCEND_RT_VISIBLE_DEVICES"] = f"{device_kwargs.get('device_id', DEFAULT_DEVICE_ID)}"
self.device_id = DEFAULT_DEVICE_ID
self.device_str = f"{self.device}:{DEFAULT_DEVICE_ID}"
Copilot AI commented Mar 4, 2026:
Device selection looks inconsistent: you set ASCEND_RT_VISIBLE_DEVICES from device_kwargs['device_id'], but then force self.device_id = DEFAULT_DEVICE_ID and build device_str from the default. This effectively ignores the requested device_id and can route all processes to the same device. Please set self.device_id from device_kwargs (after applying any visibility mapping) and use it consistently for torch.npu.set_device and device_str.

Suggested change
#self.device_id = device_kwargs.get('device_id', DEFAULT_DEVICE_ID)
# 在这里声明环境变量
self.logger.debug(f"device id from kwargs: {device_kwargs.get('device_id', DEFAULT_DEVICE_ID)}")
os.environ["ASCEND_RT_VISIBLE_DEVICES"] = f"{device_kwargs.get('device_id', DEFAULT_DEVICE_ID)}"
self.device_id = DEFAULT_DEVICE_ID
self.device_str = f"{self.device}:{DEFAULT_DEVICE_ID}"
device_id = device_kwargs.get('device_id', DEFAULT_DEVICE_ID)
# 在这里声明环境变量
self.logger.debug(f"device id from kwargs: {device_id}")
os.environ["ASCEND_RT_VISIBLE_DEVICES"] = f"{device_id}"
self.device_id = device_id
self.device_str = f"{self.device}:{self.device_id}"

SJTUyh (Collaborator, Author) replied:
When ASCEND_RT_VISIBLE_DEVICES is set to a single device, the device id visible inside the process is 0.

Comment on lines +52 to +56
def update_task_state(self, state: Dict):
if self.task_state_manager is not None:
self.task_state_manager.update_task_state(state)
else:
self.logger.warning("Task state manager is not initialized, cannot update task state")
Copilot AI commented Mar 4, 2026:

update_task_state assumes self.task_state_manager always exists, but it's only set when a non-None task_state_manager is passed into __init__. If any dataset calls update_task_state() without a manager, this will raise AttributeError. Please initialize self.task_state_manager = None in __init__ (or use getattr(self, 'task_state_manager', None)).
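The defensive initialization the comment asks for can be sketched like this (the class name and logger setup are illustrative, not the actual AISBench code):

```python
import logging
from typing import Dict, Optional


class BaseDataset:
    """Sketch of the None-safe pattern suggested by the review."""

    def __init__(self, task_state_manager: Optional[object] = None):
        # Always assign the attribute, even when no manager is passed,
        # so update_task_state never raises AttributeError.
        self.task_state_manager = task_state_manager
        self.logger = logging.getLogger(self.__class__.__name__)

    def update_task_state(self, state: Dict) -> None:
        if self.task_state_manager is not None:
            self.task_state_manager.update_task_state(state)
        else:
            self.logger.warning(
                "Task state manager is not initialized, cannot update task state"
            )
```

With this shape, a dataset constructed without a manager degrades to a logged warning instead of crashing mid-run.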

Comment on lines +456 to +463
if self.config.stochastic_sampling:
print("ljf 进入采样器,涉及随机")
x0 = sample - current_sigma * model_output
noise = torch.randn_like(sample)
prev_sample = (1.0 - next_sigma) * x0 + next_sigma * noise
else:
print("ljf 进入采样器,无随机")
prev_sample = sample + dt * model_output
Copilot AI commented Mar 4, 2026:

The scheduler step() contains unconditional print(...) statements. This will spam stdout during inference loops and can significantly slow down generation, especially in distributed runs. Please replace with logger.debug(...)/logger.info(...) behind a verbosity flag, or remove entirely.
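A hedged sketch of the suggested fix, using the standard logging module (the logger name and message text are illustrative):

```python
import logging

logger = logging.getLogger("gedit.scheduler")


def log_sampler_path(stochastic: bool) -> None:
    # logger.debug is effectively a no-op unless DEBUG verbosity is
    # enabled, so the inference loop no longer spams stdout.
    if stochastic:
        logger.debug("entering sampler: stochastic path")
    else:
        logger.debug("entering sampler: deterministic path")
```

At the default WARNING level these calls cost almost nothing, and enabling DEBUG restores the trace output when it is actually wanted.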

Comment on lines +223 to +225
print(f"in _generate")
#self.logger.info(f"输入: {input}")
if isinstance(input, str):
Copilot AI commented Mar 4, 2026:

There are raw print(...) calls in _generate (e.g., print(f"in _generate")). This will pollute CLI output and is hard to control in multi-process runs. Please switch these to self.logger.debug/info or remove them.

SJTUyh and others added 4 commits March 4, 2026 14:17
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…inferencer_output_handler.py
data_dict = {key: [example[key]] for key in example.keys()}
return Dataset.from_dict(data_dict)

max_workers = 4 # Adjust based on system resources
A collaborator commented:
[review] Maintain this as a global constant for easier modification and understanding.
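The constant-extraction pattern the review asks for might look like this (the constant names and the toy workload are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Module-level constants, as the review suggests:
MAX_WORKERS = 4        # worker count for parallel dataset processing
JUDGE_BATCH_SIZE = 10  # batch size when attaching model_answer columns


def process_items(items):
    # Call sites read the shared constant instead of a magic number,
    # so tuning for a different machine is a one-line change.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        return list(executor.map(str.upper, items))
```

The same constant can then be reused by the other sites flagged below (the cpu-count-based worker pool and the judge batch size).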


dataset = dataset.select(range(start_idx, end_idx))
else:
dataset = dataset.select(range(GEDIT_COUNT))
A collaborator commented:
[review] When split_count <= 1 this always executes dataset.select(range(GEDIT_COUNT)), and GEDIT_COUNT = 1, so by default only one sample is loaded, which severely distorts the evaluation results. Should this load the full dataset instead?

SJTUyh (Collaborator, Author) replied:
[reply] This was for debugging and needs to be removed.

dataset_batches = []
current_batch = []

if isinstance(dataset_content, Dataset):
A collaborator commented:
[review] This largely duplicates the DatasetDict handling logic below; convert dataset_content to a DatasetDict first, then process both cases with the same logic.
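The de-duplication the review suggests can be sketched as follows, with plain lists and dicts standing in for Dataset and DatasetDict (all names are illustrative):

```python
def batch_all_splits(dataset_content, batch_size):
    """Normalize a bare split to the mapping form first, so a single
    batching loop handles both the Dataset and DatasetDict cases."""
    # Wrap a bare split (a list of examples here) so the code below
    # only ever sees the mapping form.
    if isinstance(dataset_content, list):
        dataset_content = {"default": dataset_content}

    dataset_batches = {}
    for split_name, examples in dataset_content.items():
        dataset_batches[split_name] = [
            examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)
        ]
    return dataset_batches
```

With real Hugging Face objects the same idea applies: wrap a single Dataset into a DatasetDict once, then run one loop over its splits instead of two near-identical branches.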

return pred_item

# Use parallel processing to speed up image handling
max_workers = min(8, os.cpu_count())  # adjust based on the number of CPU cores
A collaborator commented:
[review] Use a global constant for better maintainability and readability.

predictions: list = self._load_from_predictions(predictions_path)

# Add a model_answer column to the dataset
batch_size = 10  # batch size; adjust as needed
A collaborator commented:
[review] Use a global constant.

self.guidance_scale = infer_kwargs.get('guidance_scale', DEFAULT_GUIDANCE_SCALE)
self.seed = infer_kwargs.get('seed', DEFAULT_SEED)
self.num_images_per_prompt = infer_kwargs.get('num_images_per_prompt', DEFAULT_NUM_IMAGES_PER_PROMPT)
self.quant_desc_path = infer_kwargs.get('quant_desc_path', DEFAULT_QUANT_DESC_PATH)
A collaborator commented:
[review] After initialization completes, log the model parameter configuration to make it easy to verify that the parameters took effect.


# If there is no image input, use a default image
if not images:
raise ValueError("QwenImageEditModel requires image input")
A collaborator commented:
[review] Use error codes when raising exceptions.

torch.npu.synchronize()
end_time = time.time()
infer_time = end_time - start_time
self.logger.info(f"推理完成,耗时: {infer_time:.2f}秒")
A collaborator commented:
[review] Change the log message to English to prevent garbled output caused by encoding issues.

@SJTUyh SJTUyh closed this Mar 5, 2026