[UT] [part 1] Add UT for GEdit evaluate #165
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the test coverage for the benchmark system by introducing new unit tests.
Code Review
This pull request adds a comprehensive suite of unit tests for new and existing components, significantly improving test coverage. The new tests for JudgeInfer, BaseJDGDataset, and various dataset utilities are well-structured. I've identified a few opportunities to enhance the new tests by addressing code duplication and correcting a minor issue with logger mocking. Overall, this is a valuable contribution to the project's stability.
```python
assert result["accuracy"] == 0.0

# ...

class TestGEditDataset:
```
The setup logic for creating a GEditDataset instance (ds = GEditDataset.__new__(...) and mocking its attributes) is duplicated across test_load_basic, test_load_with_split, and test_load_use_raw. To improve maintainability and reduce code duplication, consider extracting this common setup into a setup_method for the TestGEditDataset class.
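For example, a minimal sketch of that refactor, using a stand-in class here since the real `GEditDataset` is not importable in this snippet (the `path`/`reader` attributes are illustrative, not the repo's actual attribute names):

```python
from unittest.mock import MagicMock


class GEditDataset:
    # Stand-in for the real class under test; only the attributes
    # the example tests touch are modeled here.
    def load(self, split="train"):
        return self.reader.read(self.path, split)


class TestGEditDataset:
    def setup_method(self, method):
        # pytest runs this before every test method, so each test gets
        # a fresh instance without repeating the __new__ + mock boilerplate.
        self.ds = GEditDataset.__new__(GEditDataset)
        self.ds.path = "/test/data.jsonl"
        self.ds.reader = MagicMock()

    def test_load_basic(self):
        self.ds.reader.read.return_value = [{"id": 0}]
        assert self.ds.load() == [{"id": 0}]
        self.ds.reader.read.assert_called_once_with("/test/data.jsonl", "train")
```

Each test then only states what is specific to it (return values and assertions), and new tests added to the class pick up the shared setup for free.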
```python
def test_load_from_predictions_file_not_exists(self):
    """Test the case where the file does not exist."""
    ds = LLMJudgeDataset.__new__(LLMJudgeDataset)
    ds.logger = MagicMock()

    with patch('os.path.exists', return_value=False):
        result = ds._load_from_predictions('/test/nonexistent.jsonl')
        assert result == []
```
The _load_from_predictions method uses a module-level logger, but this test mocks an instance-level logger (ds.logger), which is not used. To correctly test the logging behavior, you should patch the module-level logger ais_bench.benchmark.datasets.utils.llm_judge.logger and assert that its warning method is called.
Current:

```python
def test_load_from_predictions_file_not_exists(self):
    """Test the case where the file does not exist."""
    ds = LLMJudgeDataset.__new__(LLMJudgeDataset)
    ds.logger = MagicMock()
    with patch('os.path.exists', return_value=False):
        result = ds._load_from_predictions('/test/nonexistent.jsonl')
        assert result == []
```

Suggested:

```python
@patch('ais_bench.benchmark.datasets.utils.llm_judge.logger')
def test_load_from_predictions_file_not_exists(self, mock_logger):
    """Test the case where the file does not exist."""
    ds = LLMJudgeDataset.__new__(LLMJudgeDataset)
    with patch('os.path.exists', return_value=False):
        result = ds._load_from_predictions('/test/nonexistent.jsonl')
        assert result == []
    mock_logger.warning.assert_called_once_with('Prediction file does not exist: /test/nonexistent.jsonl')
```
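The distinction can be demonstrated standalone. Below is a minimal sketch with a toy module-level logger; the class and function names are illustrative, not from the repo, but the patching mechanics are the same:

```python
import logging
import os
from unittest.mock import MagicMock, patch

# Module-level logger, analogous to llm_judge's logger.
logger = logging.getLogger("toy_judge")


class ToyJudgeDataset:
    def _load_from_predictions(self, path):
        if not os.path.exists(path):
            # Uses the module-level logger, never a self.logger attribute.
            logger.warning("Prediction file does not exist: %s", path)
            return []
        return [path]


def instance_mock_is_ignored():
    ds = ToyJudgeDataset()
    ds.logger = MagicMock()  # the method never reads this attribute
    with patch("os.path.exists", return_value=False):
        assert ds._load_from_predictions("/missing.jsonl") == []
    # Passes, proving the instance-level mock captured nothing.
    ds.logger.warning.assert_not_called()


def module_logger_is_called():
    ds = ToyJudgeDataset()
    # Patch where the logger is looked up: the defining module's namespace.
    with patch(f"{__name__}.logger") as mock_logger, \
         patch("os.path.exists", return_value=False):
        assert ds._load_from_predictions("/missing.jsonl") == []
    mock_logger.warning.assert_called_once()


instance_mock_is_ignored()
module_logger_is_called()
```

The general rule is to patch the name where it is *looked up* (the `llm_judge` module's namespace), not where the `logging` object was created.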
```python
assert result == "[10, 20, 30, 40]"

# ...

class TestLMMImgJDGDataset:
```
The setup logic for creating an LMMImgJDGDataset instance is duplicated across the tests in this class (test_load_from_predictions_file_not_exists, test_load_from_predictions_success, test_load_from_predictions_with_nonexistent_image). To improve code clarity and reduce duplication, consider extracting this common setup into a setup_method.
Pull request overview
This PR adds unit tests for the GEdit evaluate functionality, covering LMM judge utilities, LLM judge utilities, the GEdit dataset and evaluator classes, the BaseJDGDataset class, and the JudgeInfer worker class. It is part 1 of a test coverage improvement effort for the evaluate pipeline.
Changes:
- Added new test files for `lmm_judge` (image editing evaluator, dataset classes, point list extraction) and `llm_judge` (A/B extraction, dataset loading, correctness evaluation)
- Added a new test file for `g_edit` (GEdit dataset, evaluator, SC/PQ judge dataset classes) and extended `test_base.py` with tests for `BaseJDGDataset`, `task_state_manager`, parameter validation, and `DatasetDict` handling
- Extended `test_workers.py` with tests for the `JudgeInfer` worker (config updates, preprocessing, dataset merging, result post-processing, task config updates) and `WORK_FLOW` dict assertions
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| `tests/UT/datasets/utils/test_lmm_judge.py` | New tests for `get_lmm_point_list`, `LMMImgJDGDataset`, `ImgSCJDGDataset`, `ImgPQJDGDataset`, and `LMMJudgeImageEditEvaluator` |
| `tests/UT/datasets/utils/test_llm_judge.py` | New tests for `get_a_or_b`, `LLMJudgeDataset`, and `LLMJudgeCorrectEvaluator` |
| `tests/UT/datasets/test_g_edit.py` | New tests for `GEditEvaluator`, `GEditDataset`, `GEditSCJDGDataset`, and `GEditPQJDGDataset` |
| `tests/UT/datasets/test_base.py` | Extended with tests for `BaseJDGDataset`, task state management, parameter validation, and the `DatasetDict` path |
| `tests/UT/cli/test_workers.py` | Added the `JudgeInfer` import, `WORK_FLOW` assertions for judge workflows, and a comprehensive `TestJudgeInfer` class |
```python
import os
import pytest
from unittest.mock import patch, MagicMock, call, mock_open
```

`mock_open` is imported but never used in this file.

```python
from unittest.mock import patch, MagicMock, call
```
```python
import unittest
from unittest.mock import patch, MagicMock
import pytest
```

`pytest` is imported but never used in this file.

Suggested change: delete the `import pytest` line.
```python
import base64

# Add the project root directory to the Python path
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../../../../')))
```

The `sys.path.insert` path traverses one level too many. This file is at `tests/UT/datasets/test_g_edit.py` (3 directories deep), so `'../../../../'` goes one level above the project root. It should be `'../../../'` to reach the project root, consistent with other test files at the same directory depth (e.g., `tests/UT/datasets/test_base.py` doesn't need this at all, and files at depth 4 like `tests/UT/datasets/utils/test_llm_judge.py` correctly use `'../../../../'`).

```python
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../../../')))
```
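A quick standalone check of the depth arithmetic (the `/repo` path below is illustrative; it only mirrors the `tests/UT/datasets` layout):

```python
import os

# A file three directories below the project root, as test_g_edit.py is.
test_file = "/repo/tests/UT/datasets/test_g_edit.py"
base = os.path.dirname(test_file)  # /repo/tests/UT/datasets

root = os.path.abspath(os.path.join(base, "../../.."))        # three hops up
too_far = os.path.abspath(os.path.join(base, "../../../.."))  # four hops up

print(root)     # /repo
print(too_far)  # /  (one level above the project root)
```

In practice the extra `..` is usually harmless (adding a directory above the repo to `sys.path` rarely shadows anything), which is why the tests still pass, but it silently fails to put the project root on the path.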
```python
import pytest
from unittest.mock import patch, MagicMock, mock_open
from io import BytesIO
from PIL import Image
import base64
```

Several imports are unused: `pytest`, `mock_open`, `BytesIO`, and `base64` are imported but never used in this file. Remove unused imports to keep the code clean.

```python
from unittest.mock import patch, MagicMock
from PIL import Image
```
```python
from unittest.mock import patch, MagicMock, mock_open
import tempfile
import json
```

Unused imports: `mock_open`, `tempfile`, and `json` are imported but never used in this file.

```python
from unittest.mock import patch, MagicMock
```
```python
from unittest.mock import patch, MagicMock, mock_open
import tempfile
import json
```

Unused imports: `mock_open` and `json` are imported but never used in this file.

```python
from unittest.mock import patch, MagicMock
import tempfile
```
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
PR Type / PR类型
Related Issue | 关联 Issue
Fixes #(issue ID / issue 编号) / Relates to #(issue ID / issue 编号)
🔍 Motivation / 变更动机
Please describe the motivation of this PR and the goal you want to achieve through this PR.
📝 Modification / 修改内容
Please briefly describe what modification is made in this PR.
📐 Associated Test Results / 关联测试结果
Please provide links to the related test results, such as CI pipelines, test reports, etc.
related scripts coverage: (coverage screenshots attached)

coverage sum: (coverage summary screenshot attached)
Does the modification introduce changes that break the backward compatibility of the downstream repositories? If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
If the modification introduces performance degradation, please describe the impact of the performance degradation and the expected performance improvement.
🌟 Use cases (Optional) / 使用案例(可选)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
✅ Checklist / 检查列表
Before PR:
After PR:
👥 Collaboration Info / 协作信息
🌟 Useful CI Command / 实用的CI命令
- `/gemini review`
- `/gemini summary`
- `/gemini help`
- `/readthedocs build`