From 3ef8edf8c1f6b954fcc41eca6ba31dc33aa61a43 Mon Sep 17 00:00:00 2001
From: SJTUyh
Date: Thu, 5 Mar 2026 18:26:50 +0800
Subject: [PATCH 1/2] add en docs for Judge Model

---
 .../configs/datasets/aime2025/README_en.md    |   3 +-
 .../judge_model_evaluate.md                   | 313 ++++++++++++++++++
 .../lmm_generate/gedit_bench.md               | 222 +++++++++++++
 .../extended_benchmark/lmm_generate/index.rst |   6 +
 docs/source_en/index.rst                      |   8 +
 5 files changed, 551 insertions(+), 1 deletion(-)
 create mode 100644 docs/source_en/advanced_tutorials/judge_model_evaluate.md
 create mode 100644 docs/source_en/extended_benchmark/lmm_generate/gedit_bench.md
 create mode 100644 docs/source_en/extended_benchmark/lmm_generate/index.rst

diff --git a/ais_bench/benchmark/configs/datasets/aime2025/README_en.md b/ais_bench/benchmark/configs/datasets/aime2025/README_en.md
index da5177b7..273dc9fb 100644
--- a/ais_bench/benchmark/configs/datasets/aime2025/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/aime2025/README_en.md
@@ -25,4 +25,5 @@ rm aime2025.zip
 ## Available Dataset Tasks
 | Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
 | --- | --- | --- | --- | --- | --- |
-| aime2025_gen | Generative task for the AIME2025 dataset | Accuracy | 0-shot | Chat format | aime2025_gen_0_shot_chat_prompt.py |
\ No newline at end of file
+| aime2025_gen | Generative task for the AIME2025 dataset | Accuracy | 0-shot | Chat format | aime2025_gen_0_shot_chat_prompt.py |
+| aime2025_gen_0_shot_llmjudge | Generative task for the AIME2025 dataset | Accuracy evaluated by judge model | 0-shot | Chat format | aime2025_gen_0_shot_llmjudge.py |

diff --git a/docs/source_en/advanced_tutorials/judge_model_evaluate.md b/docs/source_en/advanced_tutorials/judge_model_evaluate.md
new file mode 100644
index 00000000..934ff188
--- /dev/null
+++ b/docs/source_en/advanced_tutorials/judge_model_evaluate.md
@@ -0,0 +1,313 @@
+# Evaluation Using Judge Model
+ 
+## Why Use a Judge Model for Evaluation
+
+In a conventional evaluation task, a model's inference results are scored by extracting answers from them (for example, with regular expressions), comparing the extracted answers with the ground truth answers to decide whether each inference result is correct, and finally calculating a total score. The overall process is as follows:
+
+```mermaid
+graph LR;
+    A[Execute inference based on given dataset] --> B((Inference Results))
+    B --> C[Evaluate based on inference results]
+    C --> D((Accuracy Data))
+    D --> E[Generate summary report based on accuracy data]
+    E --> F((Present Results))
+```
+
+In some evaluation scenarios, however, there is no ground truth answer, or checking the final answer alone is not enough: the reasoning process that produced it must also be verified. Conventional answer extraction cannot meet these requirements, so a judge model is introduced to evaluate the tested model's inference results. The overall evaluation process with a judge model is as follows:
+
+```mermaid
+graph LR;
+    A[Execute inference based on given dataset] --> B((Tested Model's Inference Results))
+    B --> C[Judge model evaluates tested model's inference results]:::green
+    C --> D((Judge Model's Evaluation Results)):::green
+    D --> E[Extract relevant metric scores from judge model's evaluation results]
+    E --> F((Accuracy Data))
+    F --> G[Present Results]
+
+    classDef green fill:#90EE90,stroke:#228B22,stroke-width:2px;
+```
+
+## Quick Start
+
+Taking the aime2025 dataset evaluation as an example, usage is largely the same as in the [AISBench Quick Start](https://ais-bench-benchmark-rf.readthedocs.io/en/latest/get_started/quick_start.html#); this section covers only the differences. 
### Command Meaning
+
+In the AISBench command, specify the judge model dataset task `aime2025_gen_0_shot_llmjudge` through `--datasets`.
+
+```shell
+ais_bench --models vllm_api_general_chat --datasets aime2025_gen_0_shot_llmjudge
+```
+
+> Note: Judge model dataset tasks are configured differently from regular dataset tasks, but both types can be mixed in a single evaluation command.
+
+### Task Meaning Query (Optional)
+
+Same as in Quick Start; not repeated here.
+
+### Pre-run Preparation
+
+- `--models`: The `vllm_api_general_chat` model task requires an inference service that supports the `v1/chat/completions` endpoint. You can refer to 🔗 [VLLM Launch OpenAI Compatible Server](https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#openai-compatible-server) to start the inference service (the tested model and the judge model are two separate inference services; for a quick start, they can also share one service).
+- `--datasets`: The `aime2025_gen_0_shot_llmjudge` dataset task requires the aime2025 dataset, which can be downloaded from the 🔗 [aime2025 dataset archive provided by opencompass](http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime2025.zip). Deploy the extracted `aime2025/` folder to the `ais_bench/datasets` folder under the AISBench evaluation tool root path.
+
+### Modify Task Configuration Files
+
+Each model task, dataset task, and result presentation task corresponds to a configuration file. Before running the command, modify the contents of these configuration files. Their paths can be queried by adding `--search` to the original AISBench command, for example:
+
+```shell
+ais_bench --models vllm_api_general_chat --datasets aime2025_gen_0_shot_llmjudge --search
+```
+
+> ⚠️ **Note**: Executing the command with `--search` will print the absolute paths of the task configuration files. 
+ +Executing the query command will produce the following results: + +```shell +โ•’โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•คโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•คโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•• +โ”‚ Task Type โ”‚ Task Name โ”‚ Config File Path โ”‚ +โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก +โ”‚ --models โ”‚ vllm_api_general_chat โ”‚ /your_workspace/benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ --datasets โ”‚ aime2025_gen_0_shot_llmjudge โ”‚ 
/your_workspace/benchmark/ais_bench/benchmark/configs/datasets/aime2025/aime2025_gen_0_shot_llmjudge.py โ”‚ +โ•˜โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•› + +``` + +- The configuration method for `vllm_api_general_chat` corresponding to the tested model task configuration file is the same as in Quick Start, not repeated here. +- In the `aime2025_gen_0_shot_llmjudge` corresponding judge model dataset task configuration file, you need to modify the judge model configuration: + ```python + judge_model=dict( + attr="service", + type=VLLMCustomAPIChat, + abbr="judge", # abbr identifies the uniqueness of the judge model + path="", # Specify the absolute path of the model serialization vocabulary file (generally not needed for accuracy test scenarios) + model="", # Specify the model name loaded on the server side, configure according to the actual model name pulled by the VLLM inference service (configuring an empty string will automatically obtain it) + stream=False, + request_rate=0, # Request sending frequency, send 1 request per 1/request_rate seconds to the server, if less than 0.1, send all requests at once + use_timestamp=False, # Whether to schedule requests according to timestamp in the dataset, applicable to datasets containing timestamp (such as Mooncake Trace) + retry=2, # Maximum retry times for each request + api_key="", # Custom API key, default is empty string + host_ip="localhost", # Specify the IP of the judge model inference service + host_port=8080, # 
Specify the port of the judge model inference service
+        url="",  # Custom URL path to access the judge model inference service (required when the base URL is not the combination http://host_ip:host_port; once configured, host_ip and host_port are ignored)
+        max_out_len=512,  # Maximum number of tokens output by the inference service
+        batch_size=1,  # Maximum concurrency of request sending
+        trust_remote_code=False,  # Whether the tokenizer trusts remote code, default False
+        generation_kwargs=dict(  # Model inference parameters; refer to the VLLM documentation. The AISBench evaluation tool does not process them and attaches them to each sent request
+            temperature=0.01,
+            ignore_eos=False,
+        ),
+        pred_postprocessor=dict(type=extract_non_reasoning_content),
+    ),
+  ```
+The judge model configuration fields have exactly the same meanings as those in the tested model task configuration.
+
+
+### Execute Command
+
+After modifying the configuration files, execute the command to start the service-based accuracy evaluation (with judge model evaluation):
+
+```bash
+ais_bench --models vllm_api_general_chat --datasets aime2025_gen_0_shot_llmjudge
+```
+
+### View Task Execution Details
+
+Full Process Progress Display
+
+After executing the AISBench command, the status of the currently executing task is displayed on a real-time refreshing dashboard in the command line (press the "P" key to pause refreshing so dashboard information can be copied, and press "P" again to resume). For example, the tested model inference process:
+
+```shell
+Base path of result&log : outputs/default/20260305_153318
+Task Progress Table (Updated at: 2026-03-05 15:34:33)
+Page: 1/1 Total 2 rows of data
+Press Up/Down arrow to page, 'P' to PAUSE/RESUME screen refresh, 'Ctrl + C' to exit
++--------------------------------+-----------+---------------------------------------------------+-------------+----------+-----------------------------------------------+---------------------------------------------------+ 
+| Task Name | Process | Progress | Time Cost | Status | Log Path | Extend Parameters | ++================================+===========+===================================================+=============+==========+===============================================+===================================================+ +| vllm-api-general-chat/aime2025 | 2818438 | [##############################] 30/30 [2.0 it/s] | 0:00:12 | finish | logs/infer/vllm-api-general-chat/aime2025.out | {'POST': 30, 'RECV': 30, 'FINISH': 30, 'FAIL': 0} | ++--------------------------------+-----------+---------------------------------------------------+-------------+----------+-----------------------------------------------+---------------------------------------------------+ + +``` + +Judge Model Inference Process: + +```shell +Base path of result&log : outputs/default/20260305_153318 +Task Progress Table (Updated at: 2026-03-05 15:34:33) +Page: 1/1 Total 2 rows of data +Press Up/Down arrow to page, 'P' to PAUSE/RESUME screen refresh, 'Ctrl + C' to exit + ++--------------------------------------+-----------+---------------------------------------------------+-------------+----------+-----------------------------------------------------+---------------------------------------------------+ +| Task Name | Process | Progress | Time Cost | Status | Log Path | Extend Parameters | ++======================================+===========+===================================================+=============+==========+=====================================================+===================================================+ +| vllm-api-general-chat/aime2025-judge | 2821633 | [##############################] 30/30 [2.0 it/s] | 0:00:58 | finish | logs/infer/vllm-api-general-chat/aime2025-judge.out | {'POST': 30, 'RECV': 30, 'FINISH': 30, 'FAIL': 0} | 
++--------------------------------------+-----------+---------------------------------------------------+-------------+----------+-----------------------------------------------------+---------------------------------------------------+ + +``` + +Process of Extracting Answers from Judge Model Inference Results: + +```shell +Base path of result&log : outputs/default/20260305_153318 +Task Progress Table (Updated at: 2026-03-05 15:34:33) +Page: 1/1 Total 2 rows of data +Press Up/Down arrow to page, 'P' to PAUSE/RESUME screen refresh, 'Ctrl + C' to exit + ++--------------------------------------+-----------+------------+-------------+----------+----------------------------------------------------+---------------------+ +| Task Name | Process | Progress | Time Cost | Status | Log Path | Extend Parameters | ++======================================+===========+============+=============+==========+====================================================+=====================+ +| vllm-api-general-chat/aime2025-judge | 2826026 | NA | 0:00:00 | finish | logs/eval/vllm-api-general-chat/aime2025-judge.out | None | ++--------------------------------------+-----------+------------+-------------+----------+----------------------------------------------------+---------------------+ + +``` + +The task execution detail logs will be continuously written to the default output path, which is displayed on the real-time refreshing dashboard, i.e., `Log Path`. `Log Path` (`logs/infer/vllm-api-general-chat/aime2025.out`) is a path under `Base path` (`outputs/default/20260305_153318`). 
Taking the above dashboard information as an example, the detailed log paths for task execution are:
+
+```shell
+# {Base path}/{Log Path}
+# Tested model inference process log
+outputs/default/20260305_153318/logs/infer/vllm-api-general-chat/aime2025.out
+# Judge model inference process log
+outputs/default/20260305_153318/logs/infer/vllm-api-general-chat/aime2025-judge.out
+# Log of extracting answers from judge model inference results
+outputs/default/20260305_153318/logs/eval/vllm-api-general-chat/aime2025-judge.out
+```
+
+> 💡 If you want to print detailed logs directly during execution, add `--debug` to the command:
+`ais_bench --models vllm_api_general_chat --datasets aime2025_gen_0_shot_llmjudge --debug`
+
+
+`Base path` (`outputs/default/20260305_153318`) contains all task execution details. After the command finishes, the execution details are laid out as follows:
+
+```shell
+20260305_153318/
+├── configs   # Configuration file synthesized from the model task, dataset task, and result presentation task configurations
+│   └── 20260305_153318_2833762.py
+├── logs   # Logs written during execution (if --debug is added to the command, process logs are printed directly instead of being written to disk)
+│   ├── eval
+│   │   └── vllm-api-general-chat
+│   │       └── aime2025-judge.out   # Log of the accuracy evaluation based on the judge model inference results under the predictions/ folder
+│   └── infer
+│       └── vllm-api-general-chat
+│           ├── aime2025-judge.out   # Judge model inference process log
+│           └── aime2025.out   # Tested model inference process log
+├── predictions
+│   └── vllm-api-general-chat
+│       ├── aime2025.jsonl   # Tested model inference results
+│       └── aime2025-judge.jsonl   # Judge model inference results (all outputs returned by the inference service)
+├── results
+│   └── vllm-api-general-chat
+│       └── aime2025-judge.json   # Raw scores calculated 
from accuracy evaluation
+└── summary
+    ├── summary_20260305_153318.csv   # Final accuracy score presentation (table format)
+    ├── summary_20260305_153318.md    # Final accuracy score presentation (markdown format)
+    └── summary_20260305_153318.txt   # Final accuracy score presentation (text format)
+```
+
+> ⚠️ **Note**: Different evaluation scenarios write different task execution details to disk. Please refer to the specific evaluation scenario guide for details.
+
+
+### Output Results
+
+The result display example is as follows:
+
+```bash
+| dataset | version | metric | mode | vllm-api-general-chat |
+|----- | ----- | ----- | ----- | -----|
+| aime2025-judge | 3fb7e8 | accuracy | gen | 100.00 |
+```
+
+## Other Accuracy Evaluation Function Scenarios
+
+As the quick start above shows, apart from additionally configuring the judge model in the dataset configuration file, a judge-model evaluation is executed exactly like a conventional evaluation. The other accuracy evaluation scenarios below therefore work the same way as well. 
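The pipeline's metric-extraction step — turning the judge model's free-form evaluation text into accuracy scores — can be sketched in miniature. The verdict format (a final standalone letter "A"/"B") and the parsing rule below are illustrative assumptions; the real prompt and extraction logic are defined in the dataset task configuration (e.g. `aime2025_gen_0_shot_llmjudge.py`):

```python
import re

def judge_verdict_to_score(judge_output: str) -> int:
    """Map a judge model's free-form evaluation text to a 0/1 correctness score.

    Assumes the judge was prompted to end its reply with a final verdict
    letter: "A" (correct) or "B" (incorrect). This format is illustrative;
    the real logic is defined by the dataset task configuration.
    """
    verdicts = re.findall(r"\b([AB])\b", judge_output)
    # Use the last standalone A/B, since judges often reason before concluding.
    return 1 if verdicts and verdicts[-1] == "A" else 0

def accuracy(judge_outputs: list) -> float:
    """Aggregate per-sample verdicts into an accuracy percentage."""
    scores = [judge_verdict_to_score(o) for o in judge_outputs]
    return 100.0 * sum(scores) / len(scores)

outputs = [
    "The candidate answer 204 matches the reference. Verdict: A",
    "The reasoning is flawed and the final value differs. Verdict: B",
    "Both values simplify to 70. Verdict: A",
]
print(f"{accuracy(outputs):.2f}")  # 66.67
```

Because the judge's raw replies are kept in `predictions/.../aime2025-judge.jsonl`, this extraction step can be re-run without repeating any model inference.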
### Multi-task Evaluation
+
+Refer to [Accuracy Evaluation Scenario Multi-task Evaluation](../base_tutorials/scenes_intro/accuracy_benchmark.md#multi-task-evaluation)
+
+### Multi-task Parallel Evaluation
+
+Refer to [Accuracy Evaluation Scenario Multi-task Parallel Evaluation](../base_tutorials/scenes_intro/accuracy_benchmark.md#multi-task-parallel-evaluation)
+
+### Interrupted Evaluation & Failed Case Re-evaluation
+
+Refer to [Accuracy Evaluation Scenario Interrupted Evaluation & Failed Case Re-evaluation](../base_tutorials/scenes_intro/accuracy_benchmark.md#interrupted-evaluation-failed-case-re-evaluation)
+
+> ⚠️ Note: After `--reuse` completes the missing tested model inference results, the judge model re-evaluates all of the completed inference results from scratch; previously judged results are not reused.
+
+### Merged Sub-dataset Inference
+
+Refer to [Accuracy Evaluation Scenario Merged Sub-dataset Inference](../base_tutorials/scenes_intro/accuracy_benchmark.md#merged-sub-dataset-inference)
+
+### Fixed Request Count Evaluation
+
+Refer to [Accuracy Evaluation Scenario Fixed Request Count Evaluation](../base_tutorials/scenes_intro/accuracy_benchmark.md#fixed-request-count-evaluation)
+
+### Multiple Independent Repetitions Inference
+
+Refer to [Accuracy Evaluation Scenario Multiple Independent Repetitions Inference](../base_tutorials/scenes_intro/accuracy_benchmark.md#multiple-independent-repetitions-inference)
+
+> ⚠️ Note: the parameters for multiple independent repetitions only need to be configured for the tested model; they do not need to be set in the judge model configuration. 
### Inference Results Re-evaluation
+
+Refer to [Accuracy Evaluation Scenario Inference Results Re-evaluation](../base_tutorials/scenes_intro/accuracy_benchmark.md#inference-results-re-evaluation)
+
+> ⚠️ Note: in this scenario, re-evaluation starts from judge model inference.
+
+```mermaid
+graph LR;
+
+    B((Tested Model's Inference Results)) --> C[Judge model evaluates tested model's inference results]:::green
+    C --> D((Judge Model's Evaluation Results)):::green
+    D --> E[Extract relevant metric scores from judge model's evaluation results]
+    E --> F((Accuracy Data))
+    F --> G[Present Results]
+
+    classDef green fill:#90EE90,stroke:#228B22,stroke-width:2px;
+```
+
+> By default, if judge model inference results already exist, the judge model re-inference step is skipped and relevant metric scores are extracted directly from the existing judge model inference results.
+> If you want the judge model to re-infer, manually delete the judge model inference result file `aime2025-judge.jsonl` under the `predictions/` folder, and then re-execute the command.
+
+## Other Running Mode Extensions
+
+On the basis of the [Default Running Modes](../base_tutorials/all_params/mode.md), several other running modes are provided.
+
+### Only Output Judge Model Inference Results
+
+Through `--mode infer_judge`, the pipeline runs from tested model inference through judge model inference, stopping before metric extraction. 
```bash
+ais_bench --models vllm_api_general_chat --datasets aime2025_gen_0_shot_llmjudge --mode infer_judge
+```
+
+```mermaid
+graph LR;
+    A[Execute inference based on given dataset] --> B((Tested Model's Inference Results))
+    B --> C[Judge model evaluates tested model's inference results]:::green
+    C --> D((Judge Model's Evaluation Results)):::green
+    D --> E[Extract relevant metric scores from judge model's evaluation results]
+    E --> F((Accuracy Data))
+    F --> G[Present Results]
+
+    classDef green fill:#90EE90,stroke:#228B22,stroke-width:2px;
+```
+
+### Only Perform Judge Model Inference Based on Tested Model Inference Results
+
+Through `--mode judge`, only judge model inference is performed on existing tested model inference results; metric extraction is skipped (if judge model inference results already exist, the judge model re-inference step is also skipped).
+
+```bash
+ais_bench --models vllm_api_general_chat --datasets aime2025_gen_0_shot_llmjudge --reuse 20260305_153318 --mode judge
+```
+
+```mermaid
+graph LR;
+
+    B((Tested Model's Inference Results)) --> C[Judge model evaluates tested model's inference results]:::green
+    C --> D((Judge Model's Evaluation Results)):::green
+
+    classDef green fill:#90EE90,stroke:#228B22,stroke-width:2px;
+```
diff --git a/docs/source_en/extended_benchmark/lmm_generate/gedit_bench.md b/docs/source_en/extended_benchmark/lmm_generate/gedit_bench.md
new file mode 100644
index 00000000..670c65ca
--- /dev/null
+++ b/docs/source_en/extended_benchmark/lmm_generate/gedit_bench.md
@@ -0,0 +1,222 @@
+# GEdit-Bench
+
+## Introduction to GEdit-Bench
+
+[**GEdit-Bench (Genuine Edit-Bench)**](https://github.com/stepfun-ai/Step1X-Edit/blob/main/GEdit-Bench/) is an authoritative benchmark for **real-world instruction-based image editing** launched by StepFun in April 2025. Its core value is to test the practical capabilities of models using real user requirements. 
### Core Positioning and Background
+
+- **Full Name**: Genuine Edit-Bench
+- **Developer**: StepFun AI, released together with their image editing model **Step1X-Edit**
+- **Core Objective**: To address the limitations of existing benchmarks that rely on synthetic instructions detached from real-world scenarios, providing **evaluation standards closer to actual user usage**
+
+### Core Dataset Information
+
+- **Data Source**: Collected **over 1000 real user editing requests** from communities like Reddit, after deduplication, privacy removal, and manual annotation
+- **Final Scale**: **606 test samples** each for English (GEdit-Bench-EN) and Chinese (GEdit-Bench-CN), totaling 1212 samples in the entire dataset
+- **Task Coverage**: 11 categories of high-frequency real editing scenarios
+  1. Background replacement/modification (background_change)
+  2. Color/tone adjustment (color_alter)
+  3. Material/texture transformation (material_alter)
+  4. Action/pose editing (motion_change)
+  5. Portrait beautification/retouching (ps_human)
+  6. Style transfer (style_change)
+  7. Object addition (subject-add)
+  8. Text editing (text_change)
+  9. Object removal (subject-remove)
+  10. Object replacement (subject-replace)
+  11. Tone transfer (tone_transfer)
+
+### Evaluation Metrics (MLLM Automatic Scoring, Full Score 10 Points)
+
+- **G_SC, Q_SC (Semantic Consistency)**: How well the editing result matches the instruction
+- **G_PQ, Q_PQ (Perceptual Quality)**: Clarity, detail preservation, absence of artifacts
+- **G_O, Q_O (Overall Score)**: Weighted combination of SC and PQ
+
+> Note: the `G_` prefix indicates scoring with the GPT-4o API as the judge model; the `Q_` prefix indicates scoring with Qwen-2.5-VL-72B-Instruct as the judge model. 
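The overall score combines SC and PQ. The exact weighting is defined by the benchmark's judge tooling; VIEScore-style evaluations commonly use a geometric mean, which is what the sketch below assumes (an illustrative assumption, not GEdit-Bench's official formula):

```python
import math

def overall_score(sc: float, pq: float) -> float:
    """Combine semantic consistency (SC) and perceptual quality (PQ),
    each on a 0-10 scale, into one overall score via a geometric mean.

    NOTE: the geometric mean is an illustrative assumption; the official
    GEdit-Bench overall score is produced by its own judge tooling.
    """
    if not (0.0 <= sc <= 10.0 and 0.0 <= pq <= 10.0):
        raise ValueError("SC and PQ must be in [0, 10]")
    return math.sqrt(sc * pq)

# A geometric mean penalizes imbalance: a low SC caps the overall score
# even when PQ is high.
print(overall_score(9.0, 4.0))  # 6.0
```

Under this convention, an edit that ignores the instruction cannot be rescued by image quality alone, which matches the benchmark's emphasis on instruction following.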
## AISBench GEdit-Bench Evaluation Practice
+
+### Evaluating Qwen-Image-Edit Model Based on MindIE Framework
+
+#### Hardware Requirements
+
+Ascend Server:
+- 800I A2 (64 GB of memory per chip)
+- 800I A3
+
+#### Environment Preparation (Taking 800I A2 Hardware as Example)
+
+Complete the evaluation using the image provided by MindIE.
+
+1. **Pull MindIE Image**
+
+```shell
+docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.3.0-800I-A2-py311-openeuler24.03-lts
+```
+
+2. **Run Container**
+
+```shell
+docker run --name ${NAME} -it -d --net=host --shm-size=500g \
+    --privileged=true \
+    -w /home \
+    --device=/dev/davinci_manager \
+    --device=/dev/hisi_hdc \
+    --device=/dev/devmm_svm \
+    --entrypoint=bash \
+    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+    -v /usr/local/dcmi:/usr/local/dcmi \
+    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+    -v /etc/ascend_install.info:/etc/ascend_install.info \
+    -v /usr/local/sbin:/usr/local/sbin \
+    -v ${PATH_TO_WORKSPACE}:${PATH_TO_WORKSPACE} \
+    -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
+    ${IMAGES_ID}
+```
+
+Where:
+- `${NAME}`: Container name
+- `${PATH_TO_WORKSPACE}`: Local workspace directory path
+- `${IMAGES_ID}`: MindIE image ID
+
+3. **Install Latest Version of AISBench**
+
+Clone the latest AISBench code in the container-mounted `${PATH_TO_WORKSPACE}` directory:
+
+```bash
+git clone https://github.com/AISBench/benchmark.git
+```
+
+Enter the container:
+
+```bash
+docker exec -it ${NAME} bash
+```
+
+In the container, refer to AISBench's [Installation Instructions](../../get_started/install.md) to install the latest AISBench tool.
+
+4. **Prepare Model Weights and Dataset**
+
+Refer to [Qwen-Image-Edit-2509](https://huggingface.co/Qwen/Qwen-Image-Edit-2509) to obtain the model weights.
+Refer to [GEdit-Bench Dataset](https://huggingface.co/datasets/stepfun-ai/GEdit-Bench) to obtain the dataset. 
Place the dataset in the `${PATH_TO_WORKSPACE}/benchmark/ais_bench/datasets` directory (using symbolic links is also acceptable).
+
+#### Evaluation Configuration Preparation
+
+In the container, navigate to the `${PATH_TO_WORKSPACE}/benchmark/ais_bench/configs/lmm_example` directory, open the `multi_device_run_qwen_image_edit.py` file, and edit the following content to set the model configuration:
+
+```python
+# ......
+# ====== User configuration parameters =========
+qwen_image_edit_models[0]["path"] = "/path/to/Qwen-Image-Edit-2509/" # Modify to the actual model weight path
+qwen_image_edit_models[0]["infer_kwargs"]["num_inference_steps"] = 50 # Modify to the required number of inference steps
+device_list = [0] # [0, 1, 2, 3] Modify to the actually available NPU device ID list (not necessarily in order); each device loads its own copy of the weights
+# ====== User configuration parameters =========
+# ......
+```
+
+Note: This configuration file supports evenly splitting the GEdit-Bench dataset into multiple parts and distributing them to multiple model instances for inference to improve inference efficiency.
+
+Execute the following command to find the path of the `gedit_gen_0_shot_llmjudge.py` dataset configuration:
+
+```bash
+ais_bench --datasets gedit_gen_0_shot_llmjudge --search
+```
+
+Edit the judge model related configuration in the `gedit_gen_0_shot_llmjudge.py` file. The judge model configuration is the same as a regular API model configuration (see the configuration tutorial in Quick Start [Model Configuration Introduction](../../get_started/quick_start.md#task-corresponding-configuration-file-modification)), but it is placed in the `judge_model` field:
+
+```python
+# ...... 
 judge_model=dict(
+        attr="service",
+        type=VLLMCustomAPIChat,
+        abbr=f"{metric}_judge",  # Appended after the dataset abbr
+        path="",
+        model="",
+        stream=True,
+        request_rate=0,
+        use_timestamp=False,
+        retry=2,
+        api_key="",
+        host_ip="localhost",
+        host_port=8080,
+        url="",
+        max_out_len=512,
+        batch_size=16,
+        trust_remote_code=False,
+        generation_kwargs=dict(
+            temperature=0.01,
+            ignore_eos=False,
+        ),
+        pred_postprocessor=dict(type=extract_non_reasoning_content),
+    ),
+# ......
+```
+
+
+#### Start Evaluation
+
+In the container, navigate to the `${PATH_TO_WORKSPACE}/benchmark/ais_bench/configs/lmm_example` directory and execute the following command to start the evaluation:
+
+```bash
+ais_bench multi_device_run_qwen_image_edit.py --max-num-workers {MAX_NUM_WORKERS}
+```
+
+Where `{MAX_NUM_WORKERS}` is the maximum number of concurrent workers. It is recommended to set it to twice the number of devices used. For example, if `device_list = [0, 1, 2, 3]`, use `--max-num-workers 8`.
+
+After the evaluation command completes (taking 4 devices as an example), logs similar to the following will be printed:
+
+```shell
+
+The markdown format results is as below:
+
+| dataset | version | metric | mode | qwen-image-edit-0 | qwen-image-edit-1 | qwen-image-edit-2 | qwen-image-edit-3 |
+|----- | ----- | ----- | ----- | ----- | ----- | ----- | -----|
+| gedit-0-SC_judge | 16dd59 | SC | gen | 7.20 | - | - | - |
+| gedit-0-PQ_judge | 16dd59 | PQ | gen | 7.08 | - | - | - |
+| gedit-1-SC_judge | 16dd59 | SC | gen | - | 6.63 | - | - |
+| gedit-1-PQ_judge | 16dd59 | PQ | gen | - | 6.73 | - | - |
+| gedit-2-SC_judge | 16dd59 | SC | gen | - | - | 7.37 | - |
+| gedit-2-PQ_judge | 16dd59 | PQ | gen | - | - | 7.22 | - |
+| gedit-3-SC_judge | 16dd59 | SC | gen | - | - | - | 7.31 |
+| gedit-3-PQ_judge | 16dd59 | PQ | gen | - | - | - | 7.24 |
+
+[2026-03-04 15:40:45,583] [ais_bench] [INFO] write markdown summary to 
/workplace/benchmark/ais_bench/configs/lmm_example/outputs/default/20260213_150110/summary/summary_20260304_152835.md
+```
+
+This log shows the per-device intermediate results of the multi-device evaluation. In the `/workplace/benchmark/ais_bench/configs/lmm_example` path, you need to further call the following command-line tool to aggregate these intermediate results:
+
+```bash
+# python3 -m ais_bench.tools.dataset_processors.gedit.display_results --config_path {CONFIG_PATH} --timestamp_path {TIMESTAMP_PATH}
+python3 -m ais_bench.tools.dataset_processors.gedit.display_results --config_path ./multi_device_run_qwen_image_edit.py --timestamp_path outputs/default/20260213_150110/
+```
+
+Where `{CONFIG_PATH}` is the path of the configuration used to start the ais_bench command (i.e., the `multi_device_run_qwen_image_edit.py` file),
+and `{TIMESTAMP_PATH}` is the timestamp path where the ais_bench command results are written (i.e., `outputs/default/20260213_150110/`).
+
+After this command executes, logs similar to the following will be printed, showing the final GEdit-Bench evaluation metric results:
+
+```shell
+[2026-03-04 15:57:52,522] [__main__] [INFO] Finish dumping csv to: outputs/default/20260213_150110/results/gedit_gathered_result.csv
+language    SC_point    PQ_point    O_point
+----------  ----------  ----------  ---------
+zh          7.1230      7.0694      6.9896
+en          7.1280      7.0623      6.9983
+all case    7.1254      7.0660      6.9937
+
+```
+
+In the `outputs/default/20260213_150110/results/gedit_gathered_result.csv` file, the specific accuracy score for each case is saved. 
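The gathered CSV can also be post-processed directly if you want to slice the per-case scores further. The sketch below recomputes per-language mean SC/PQ from per-case rows; the column names (`language`, `SC_point`, `PQ_point`) are assumptions about the CSV layout, and the rows here are made-up illustrative data rather than real results:

```python
import csv
import io
from statistics import mean

# Hypothetical per-case rows mimicking gedit_gathered_result.csv; the column
# names (language, SC_point, PQ_point) are assumptions, not a guaranteed schema.
raw = """language,SC_point,PQ_point
en,8.0,7.0
en,6.0,7.5
zh,7.0,6.5
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Recompute per-language averages, mirroring the zh/en rows of the summary table.
for lang in sorted({row["language"] for row in rows}):
    subset = [row for row in rows if row["language"] == lang]
    sc = mean(float(row["SC_point"]) for row in subset)
    pq = mean(float(row["PQ_point"]) for row in subset)
    print(f"{lang}: SC={sc:.4f} PQ={pq:.4f}")
```

To run it against real results, replace the in-memory `raw` string with `open(".../gedit_gathered_result.csv")` and adjust the column names to match the actual file header.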
#### (Optional Extension) Using AISBench Inference Results in the GEdit-Bench Tool
+
+Execute the following command:
+
+```bash
+# python3 -m ais_bench.tools.dataset_processors.gedit.convert_results --config_path {CONFIG_PATH} --timestamp_path {TIMESTAMP_PATH}
+python3 -m ais_bench.tools.dataset_processors.gedit.convert_results --config_path ./multi_device_run_qwen_image_edit.py --timestamp_path outputs/default/20260213_150110/
+```
+
+After this command executes, a `fullset` folder will be generated in the `outputs/default/20260213_150110/results/` directory. This folder can be used directly for evaluation in the [GEdit-Bench Tool](https://github.com/stepfun-ai/Step1X-Edit/blob/main/GEdit-Bench/EVAL.md).
diff --git a/docs/source_en/extended_benchmark/lmm_generate/index.rst b/docs/source_en/extended_benchmark/lmm_generate/index.rst
new file mode 100644
index 00000000..cc2ef8fc
--- /dev/null
+++ b/docs/source_en/extended_benchmark/lmm_generate/index.rst
@@ -0,0 +1,6 @@
+Extended Multimodal Generation Benchmarks
+=========================================
+.. toctree::
+   :maxdepth: 2
+
+   gedit_bench
diff --git a/docs/source_en/index.rst b/docs/source_en/index.rst
index ba2e3f01..b49ef71d 100644
--- a/docs/source_en/index.rst
+++ b/docs/source_en/index.rst
@@ -54,6 +54,14 @@ To help you quickly get started with AISBench Benchmark Tool, we recommend learn
    advanced_tutorials/multiturn_benchmark
    advanced_tutorials/synthetic_dataset
    advanced_tutorials/custom_dataset
+   advanced_tutorials/judge_model_evaluate
+
+.. toctree::
+   :maxdepth: 2
+   :caption: 📝 Extended Benchmarks
+   :hidden:
+
+   extended_benchmark/lmm_generate/index

 .. toctree::
    :maxdepth: 2

From cc37b106a578c7d90c29f6ad1c97d96fc14dd231 Mon Sep 17 00:00:00 2001
From: SJTUyh
Date: Thu, 5 Mar 2026 20:24:50 +0800
Subject: [PATCH 2/2] fix docs

---
 docs/source_en/advanced_tutorials/judge_model_evaluate.md | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/docs/source_en/advanced_tutorials/judge_model_evaluate.md b/docs/source_en/advanced_tutorials/judge_model_evaluate.md
index 934ff188..f3ce088f 100644
--- a/docs/source_en/advanced_tutorials/judge_model_evaluate.md
+++ b/docs/source_en/advanced_tutorials/judge_model_evaluate.md
@@ -288,9 +288,6 @@ graph LR;
     A[Execute inference based on given dataset] --> B((Tested Model's Inference Results))
     B --> C[Judge model evaluates tested model's inference results]:::green
     C --> D((Judge Model's Evaluation Results)):::green
-    D --> E[Extract relevant metric scores from judge model's evaluation results]
-    E --> F((Accuracy Data))
-    F --> G[Present Results]
 
     classDef green fill:#90EE90,stroke:#228B22,stroke-width:2px;
 ```