
[feature] [big PR only for review] Support GEdit evaluate #155

Closed
SJTUyh wants to merge 62 commits into AISBench:master from SJTUyh:edit_dev_eval

Conversation

SJTUyh (Collaborator) commented Mar 3, 2026:

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type / PR类型

  • Feature
  • Bugfix
  • Docs
  • CI/CD
  • Refactor
  • Perf
  • Dependency
  • Test-Cases
  • Other

Related Issue
Fixes #(issue ID) / Relates to #(issue ID)

🔍 Motivation

Please describe the motivation of this PR and the goal you want to achieve through it.

📝 Modification

Please briefly describe what modification is made in this PR.

📐 Associated Test Results

Please provide links to the related test results, such as CI pipelines, test reports, etc.

(Two test-result screenshots were attached here.)

⚠️ BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories? If so, please describe how it breaks compatibility and how downstream projects should modify their code to keep compatibility with this PR.

⚠️ Performance degradation (Optional)

If the modification introduces performance degradation, please describe its impact and the expected performance improvement.

🌟 Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

✅ Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that caused the bug is added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

  • If the modification has a potential influence on downstream or other related projects, this PR should be tested with those projects.
  • The CLA has been signed by all committers in this PR.

👥 Collaboration Info

  • Suggested Reviewers: @xxx
  • Relevant Module Owners: @xxx
  • Other Collaboration Notes:

🌟 Useful CI Commands

Command / Introduction
/gemini review — Performs a Gemini code review of the pull request in its current state.
/gemini summary — Provides a Gemini summary of the pull request in its current state.
/gemini help — Displays the list of available Gemini commands.
/readthedocs build — Triggers a Read the Docs documentation build for the pull request in its current state.

SJTUyh changed the title from "[feature] [sub feature 4] Support GEdit evaluate" to "[feature] [big PR only for review] Support GEdit evaluate" on Mar 4, 2026
Copilot AI (Contributor) left a comment:
Pull request overview

Adds end-to-end support for running and evaluating the GEdit image-editing benchmark within AISBench, including a local Qwen-Image-Edit (MindIE SD) integration and a new judge/eval workflow to score edited images via LMM/LLM judging.

Changes:

  • Introduces GEdit dataset + LMM/LLM judge dataset/evaluator utilities and configs (including multi-device run example).
  • Extends CLI workflow to run judge inference as a first-class step and adjusts output-handling to propagate data_abbr.
  • Adds Qwen-Image-Edit (MindIE SD) third-party pipeline/transformer/scheduler + a local model wrapper.

Reviewed changes

Copilot reviewed 45 out of 53 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/UT/tasks/test_openicl_api_infer.py Updates UT setup to include a default task_state_manager.
tests/UT/openicl/icl_inferencer/output_handler/test_ppl_inferencer_output_handler.py Updates tests for new get_result(..., data_abbr, ...) signature.
tests/UT/openicl/icl_inferencer/output_handler/test_gen_inferencer_output_handler.py Updates tests for new data_abbr plumbing and adjusts mocking.
tests/UT/openicl/icl_inferencer/output_handler/test_bfcl_v3_output_handler.py Updates tests for new data_abbr plumbing.
tests/UT/cli/test_workers.py Adjusts unit tests for worker behavior/config expectations.
ais_bench/tools/dataset_processors/gedit/display_results.py New helper script to parse and tabulate GEdit judge results.
ais_bench/tools/dataset_processors/gedit/convert_preds.py New helper script to convert predictions into GEdit expected folder format.
ais_bench/tools/dataset_processors/__init__.py Package marker for dataset processor tooling.
ais_bench/tools/__init__.py Package marker for tools.
ais_bench/third_party/mindie_sd/qwenimage_edit/transformer_qwenimage.py Adds QwenImage transformer implementation (third-party) with NPU-specific paths.
ais_bench/third_party/mindie_sd/qwenimage_edit/scheduling_flow_match_euler_discrete.py Adds FlowMatch Euler scheduler implementation (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/pipeline_qwenimage_edit_plus.py Adds QwenImage Edit Plus diffusion pipeline (third-party) with CFG/SP support.
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/utils.py Adds distributed rank grouping helpers (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/parallel_mgr.py Adds distributed parallel environment management (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/group_coordinator.py Adds process-group wrapper and collectives helpers (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/all_to_all.py Adds all-to-all utilities for sequence parallel (third-party).
ais_bench/third_party/mindie_sd/qwenimage_edit/distributed/__init__.py Package marker for distributed helpers.
ais_bench/third_party/mindie_sd/qwenimage_edit/attn_layer.py Adds a long-context attention implementation using fused attention + SP.
ais_bench/third_party/mindie_sd/qwenimage_edit/__init__.py Package marker for qwenimage_edit.
ais_bench/third_party/mindie_sd/__init__.py Package marker for mindie_sd third-party integration.
ais_bench/configs/lmm_exmaple/multi_device_run_qwen_image_edit.py New multi-device example config for Qwen-Image-Edit + GEdit.
ais_bench/benchmark/utils/prompt/prompt.py Uses deepcopy for mm template content to avoid shared-mutation bugs.
ais_bench/benchmark/utils/image_process.py Adds PIL-to-base64 utility for image datasets/prompts.
ais_bench/benchmark/utils/file/file.py Adds JSONL load/dump helpers using mmap + orjson.
ais_bench/benchmark/utils/config/build.py Extends dataset builder to pass task_state_manager into dataset configs.
ais_bench/benchmark/tasks/openicl_infer.py Passes task_state_manager into dataset construction.
ais_bench/benchmark/tasks/openicl_eval.py Passes task_state_manager into dataset construction and updates task run signature.
ais_bench/benchmark/tasks/openicl_api_infer.py Passes task_state_manager into dataset construction and stores it in run().
ais_bench/benchmark/openicl/icl_prompt_template/icl_prompt_template_mm.py Minor formatting tweak in mm template generation path.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/ppl_inferencer_output_handler.py Updates handler API to accept data_abbr.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/lmm_gen_inferencer_output_handler.py New output handler for LMM generation outputs (incl. saving images).
ais_bench/benchmark/openicl/icl_inferencer/output_handler/gen_inferencer_output_handler.py Extends handler API to accept data_abbr and adds prompt URL truncation logic.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/bfcl_v3_output_handler.py Extends handler API to accept data_abbr.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/base_handler.py Threads data_abbr through result generation + cache consumer dispatch.
ais_bench/benchmark/openicl/icl_inferencer/output_handler/__init__.py Package marker for output handlers.
ais_bench/benchmark/openicl/icl_inferencer/icl_lmm_gen_inferencer.py New inferencer wiring for LMM generation + new handler usage.
ais_bench/benchmark/models/output.py Adds LMMOutput supporting mixed (image/text) outputs and saving images.
ais_bench/benchmark/models/local_models/qwen_image_edit_mindie_sd.py New local model wrapper for Qwen-Image-Edit MindIE SD pipeline.
ais_bench/benchmark/models/local_models/base.py Refactors base model abstractions and adds BaseLMModel.
ais_bench/benchmark/models/local_models/__init__.py Package marker for local models.
ais_bench/benchmark/datasets/utils/lmm_judge.py Adds LMM judge utilities + datasets for image-edit judging (base64 conversion).
ais_bench/benchmark/datasets/utils/llm_judge.py Adds LLM judge dataset + correctness evaluator utilities.
ais_bench/benchmark/datasets/utils/datasets.py Minor whitespace normalization.
ais_bench/benchmark/datasets/g_edit.py Adds GEdit dataset loader + SC/PQ judge dataset types.
ais_bench/benchmark/datasets/base.py Adds TaskStateManager propagation + JDG dataset base implementation.
ais_bench/benchmark/datasets/aime2025.py Adds Aime2025 judge dataset type and import cleanups.
ais_bench/benchmark/configs/models/lmm_models/qwen_image_edit.py Adds model config for Qwen-Image-Edit local model wrapper.
ais_bench/benchmark/configs/datasets/gedit/gedit_gen_0_shot_llmjudge.py Adds GEdit dataset config + SC/PQ judge configs and evaluator setup.
ais_bench/benchmark/configs/datasets/aime2025/aime2025_gen_0_shot_llmjudge.py Adds AIME2025 0-shot + judge/eval config.
ais_bench/benchmark/cli/workers.py Adds JudgeInfer worker and integrates it into workflows.
ais_bench/benchmark/cli/config_manager.py Minor whitespace normalization.
ais_bench/benchmark/cli/argument_parser.py Adds new CLI modes: judge and infer_judge.


Comment on lines +114 to +119
#self.device_id = device_kwargs.get('device_id', DEFAULT_DEVICE_ID)
# 在这里声明环境变量
self.logger.debug(f"device id from kwargs: {device_kwargs.get('device_id', DEFAULT_DEVICE_ID)}")
os.environ["ASCEND_RT_VISIBLE_DEVICES"] = f"{device_kwargs.get('device_id', DEFAULT_DEVICE_ID)}"
self.device_id = DEFAULT_DEVICE_ID
self.device_str = f"{self.device}:{DEFAULT_DEVICE_ID}"
Copilot AI commented Mar 4, 2026:
Device selection looks inconsistent: you set ASCEND_RT_VISIBLE_DEVICES from device_kwargs['device_id'], but then force self.device_id = DEFAULT_DEVICE_ID and build device_str from the default. This effectively ignores the requested device_id and can route all processes to the same device. Please set self.device_id from device_kwargs (after applying any visibility mapping) and use it consistently for torch.npu.set_device and device_str.

Suggested change
#self.device_id = device_kwargs.get('device_id', DEFAULT_DEVICE_ID)
# 在这里声明环境变量
self.logger.debug(f"device id from kwargs: {device_kwargs.get('device_id', DEFAULT_DEVICE_ID)}")
os.environ["ASCEND_RT_VISIBLE_DEVICES"] = f"{device_kwargs.get('device_id', DEFAULT_DEVICE_ID)}"
self.device_id = DEFAULT_DEVICE_ID
self.device_str = f"{self.device}:{DEFAULT_DEVICE_ID}"
device_id = device_kwargs.get('device_id', DEFAULT_DEVICE_ID)
# 在这里声明环境变量
self.logger.debug(f"device id from kwargs: {device_id}")
os.environ["ASCEND_RT_VISIBLE_DEVICES"] = f"{device_id}"
self.device_id = device_id
self.device_str = f"{self.device}:{self.device_id}"

SJTUyh (Collaborator, Author) replied:
When ASCEND_RT_VISIBLE_DEVICES is set to a single device, the device id visible inside the process is 0.

Comment on lines +52 to +56
def update_task_state(self, state: Dict):
if self.task_state_manager is not None:
self.task_state_manager.update_task_state(state)
else:
self.logger.warning("Task state manager is not initialized, cannot update task state")
Copilot AI commented Mar 4, 2026:

update_task_state assumes self.task_state_manager always exists, but it's only set when a non-None task_state_manager is passed into __init__. If any dataset calls update_task_state() without a manager, this will raise AttributeError. Please initialize self.task_state_manager = None in __init__ (or use getattr(self, 'task_state_manager', None)).
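The defensive initialization the comment asks for can be sketched like this (the class name and logger setup are illustrative, not the actual AISBench code):

```python
import logging
from typing import Dict, Optional


class BaseDataset:
    """Sketch of the None-safe pattern suggested by the review."""

    def __init__(self, task_state_manager: Optional[object] = None):
        # Always assign the attribute, even when no manager is passed,
        # so update_task_state never raises AttributeError.
        self.task_state_manager = task_state_manager
        self.logger = logging.getLogger(self.__class__.__name__)

    def update_task_state(self, state: Dict) -> None:
        if self.task_state_manager is not None:
            self.task_state_manager.update_task_state(state)
        else:
            self.logger.warning(
                "Task state manager is not initialized, cannot update task state"
            )
```

With this shape, a dataset constructed without a manager degrades to a logged warning instead of crashing mid-run.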

Comment on lines +456 to +463
if self.config.stochastic_sampling:
print("ljf 进入采样器,涉及随机")
x0 = sample - current_sigma * model_output
noise = torch.randn_like(sample)
prev_sample = (1.0 - next_sigma) * x0 + next_sigma * noise
else:
print("ljf 进入采样器,无随机")
prev_sample = sample + dt * model_output
Copilot AI commented Mar 4, 2026:

The scheduler step() contains unconditional print(...) statements. This will spam stdout during inference loops and can significantly slow down generation, especially in distributed runs. Please replace with logger.debug(...)/logger.info(...) behind a verbosity flag, or remove entirely.
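A hedged sketch of the suggested fix, using the standard logging module (the logger name and message text are illustrative):

```python
import logging

logger = logging.getLogger("gedit.scheduler")


def log_sampler_path(stochastic: bool) -> None:
    # logger.debug is effectively a no-op unless DEBUG verbosity is
    # enabled, so the inference loop no longer spams stdout.
    if stochastic:
        logger.debug("entering sampler: stochastic path")
    else:
        logger.debug("entering sampler: deterministic path")
```

At the default WARNING level these calls cost almost nothing, and enabling DEBUG restores the trace output when it is actually wanted.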

Comment on lines +223 to +225
print(f"in _generate")
#self.logger.info(f"输入: {input}")
if isinstance(input, str):
Copilot AI commented Mar 4, 2026:

There are raw print(...) calls in _generate (e.g., print(f"in _generate")). This will pollute CLI output and is hard to control in multi-process runs. Please switch these to self.logger.debug/info or remove them.

SJTUyh and others added 4 commits March 4, 2026 14:17
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…inferencer_output_handler.py
data_dict = {key: [example[key]] for key in example.keys()}
return Dataset.from_dict(data_dict)

max_workers = 4 # Adjust based on system resources
A collaborator commented:
[review] Maintain this as a global constant for easier modification and understanding.
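The constant-extraction pattern the review asks for might look like this (the constant names and the toy workload are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Module-level constants, as the review suggests:
MAX_WORKERS = 4        # worker count for parallel dataset processing
JUDGE_BATCH_SIZE = 10  # batch size when attaching model_answer columns


def process_items(items):
    # Call sites read the shared constant instead of a magic number,
    # so tuning for a different machine is a one-line change.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        return list(executor.map(str.upper, items))
```

The same constant can then be reused by the other sites flagged below (the cpu-count-based worker pool and the judge batch size).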


dataset = dataset.select(range(start_idx, end_idx))
else:
dataset = dataset.select(range(GEDIT_COUNT))
A collaborator commented:
[review] When split_count <= 1 this always executes dataset.select(range(GEDIT_COUNT)), and GEDIT_COUNT = 1, so by default only one sample is loaded, which severely distorts the evaluation results. Should this load the full dataset instead?

SJTUyh (Collaborator, Author) replied:
[reply] This was for debugging and needs to be removed.

dataset_batches = []
current_batch = []

if isinstance(dataset_content, Dataset):
A collaborator commented:
[review] This largely duplicates the DatasetDict handling logic below; convert dataset_content to a DatasetDict first, then process both cases with the same logic.
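The de-duplication the review suggests can be sketched as follows, with plain lists and dicts standing in for Dataset and DatasetDict (all names are illustrative):

```python
def batch_all_splits(dataset_content, batch_size):
    """Normalize a bare split to the mapping form first, so a single
    batching loop handles both the Dataset and DatasetDict cases."""
    # Wrap a bare split (a list of examples here) so the code below
    # only ever sees the mapping form.
    if isinstance(dataset_content, list):
        dataset_content = {"default": dataset_content}

    dataset_batches = {}
    for split_name, examples in dataset_content.items():
        dataset_batches[split_name] = [
            examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)
        ]
    return dataset_batches
```

With real Hugging Face objects the same idea applies: wrap a single Dataset into a DatasetDict once, then run one loop over its splits instead of two near-identical branches.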

return pred_item

# Use parallel processing to speed up image handling
max_workers = min(8, os.cpu_count())  # adjust based on the number of CPU cores
A collaborator commented:
[review] Use a global constant for better maintainability and readability.

predictions: list = self._load_from_predictions(predictions_path)

# Add a model_answer column to the dataset
batch_size = 10  # batch size; adjust as needed
A collaborator commented:
[review] Use a global constant.

self.guidance_scale = infer_kwargs.get('guidance_scale', DEFAULT_GUIDANCE_SCALE)
self.seed = infer_kwargs.get('seed', DEFAULT_SEED)
self.num_images_per_prompt = infer_kwargs.get('num_images_per_prompt', DEFAULT_NUM_IMAGES_PER_PROMPT)
self.quant_desc_path = infer_kwargs.get('quant_desc_path', DEFAULT_QUANT_DESC_PATH)
A collaborator commented:
[review] After initialization completes, log the model parameter configuration to make it easy to verify that the parameters took effect.


# If there is no image input, use a default image
if not images:
raise ValueError("QwenImageEditModel requires image input")
A collaborator commented:
[review] Use error codes when raising exceptions.

torch.npu.synchronize()
end_time = time.time()
infer_time = end_time - start_time
self.logger.info(f"推理完成,耗时: {infer_time:.2f}秒")
A collaborator commented:
[review] Change the log message to English to prevent garbled output caused by encoding issues.

@SJTUyh SJTUyh closed this Mar 5, 2026