Skip to content

[Bug] textvqa多模态测试预测结果为空,acc为0 #166

@zy-charon

Description

@zy-charon

操作系统及版本

openEuler24.03

安装工具的python环境

docker容器中的python环境

python版本

3.11

AISBench工具版本

3.1.20260211

AISBench执行命令

ais_bench --models vllm_api_general_chat --datasets textvqa_gen --debug

模型配置文件或自定义配置文件内容

from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content

models = [
dict(
attr="service",
type=VLLMCustomAPIChat,
abbr="vllm-api-general-chat",
path="",
model="qwen3.5",
stream=False,
request_rate=0,
use_timestamp=False,
retry=2,
api_key="",
host_ip="100.100.135.166",
host_port=8010,
url="",
max_out_len=512,
batch_size=1,
trust_remote_code=False,
generation_kwargs=dict(
temperature=0.6,
ignore_eos=False,
),
pred_postprocessor=dict(type=extract_non_reasoning_content),
)
]

预期行为

之前已用纯文本数据集进行测试,确定服务端模型可正常提供服务;预期测试textvqa也应输出预测结果

实际行为

acc为0,预测结果为空。
执行结果:
[root@fb1f86dde726 train_images]# ais_bench --models vllm_api_general_chat --datasets textvqa_gen --debug
[2026-03-05 02:04:39,255] [ais_bench] [INFO] Loading vllm_api_general_chat: /workspace/benchmark/ais_bench/benchmark/configs/./models/vllm_api/vllm_api_general_chat.py
[2026-03-05 02:04:39,262] [ais_bench] [INFO] Loading textvqa_gen: /workspace/benchmark/ais_bench/benchmark/configs/./datasets/textvqa/textvqa_gen.py
[2026-03-05 02:04:39,265] [ais_bench] [INFO] Loading example: /workspace/benchmark/ais_bench/benchmark/configs/./summarizers/example.py
[2026-03-05 02:04:39,304] [ais_bench] [INFO] Current exp folder: outputs/default/20260305_020428
[2026-03-05 02:04:39,379] [ais_bench] [INFO] Starting inference tasks...
[2026-03-05 02:04:39,383] [ais_bench] [INFO] Partitioned into 1 tasks.
[2026-03-05 02:04:39,398] [ais_bench] [INFO] Launch TasksMonitor, PID: 2360, Refresh interval: 0.5, Run in background: True
[2026-03-05 02:04:51,910] [ais_bench] [INFO] Debug mode, print progress directly
[2026-03-05 02:04:51,912] [ais_bench] [INFO] Task [vllm-api-general-chat/textvqa]
[2026-03-05 02:04:52,992] [ais_bench] [INFO] Zero Retriever initialized, returning empty shot case for all queries
[2026-03-05 02:04:55,348] [ais_bench] [INFO] Apply ice template finished
[2026-03-05 02:04:55,833] [ais_bench] [INFO] Start warmup, run with concurrency: 1
Warmup: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 198.55case/s]
[2026-03-05 02:04:55,841] [ais_bench] [INFO] Warmup finished Total Count: 1 Success Count: 1 Failed Count: 0
[2026-03-05 02:04:56,175] [ais_bench] [INFO] Dataset needed memory size: 5.24575233 MB
[2026-03-05 02:04:56,175] [ais_bench] [INFO] Memory usage check passed: 2.84% < 80% (Available: 1957.29 GB)
[2026-03-05 02:04:56,184] [ais_bench] [WARNING] The request rate is below 0.1, resulting in an excessively long interval between two consecutive requests.
[2026-03-05 02:04:56,184] [ais_bench] [INFO] Traffic request rate: 0 RPS with burstiness 1.0.
[2026-03-05 02:04:56,188] [ais_bench] [INFO] Request rate (0.0) or ramp end rps (None) < 0.001, sending all requests simultaneously
[2026-03-05 02:04:56,190] [ais_bench] [INFO] Debug mode, run with concurrency: 1
[2026-03-05 02:04:56,290] [ais_bench] [INFO] All subprocesses have finished deserializing the first batch of data
[2026-03-05 02:04:56,389] [ais_bench] [INFO] Starting progress bar Total data num: 4984 Finished data num: 0 Left data num: 4984
Progress: 100%|███████████████████████████████████████████████████████████████████████████| 4984/4984 [00:31<00:00, 158.92case/s]
POST=4984 (0.0/s) RECV=4984 (0.0/s) FAIL=0 (0.0/s) FINISH=4984 (0.0/s)
[2026-03-05 02:05:27,774] [ais_bench] [INFO] Api infer task time elapsed: 35.86s
[2026-03-05 02:05:29,445] [ais_bench] [INFO] Inference tasks completed.
[2026-03-05 02:05:29,447] [ais_bench] [INFO] Starting evaluation tasks...
[2026-03-05 02:05:29,451] [ais_bench] [INFO] Partitioned into 1 tasks.
[2026-03-05 02:05:29,465] [ais_bench] [INFO] Launch TasksMonitor, PID: 2378, Refresh interval: 0.5, Run in background: True
[2026-03-05 02:05:41,899] [ais_bench] [INFO] Debug mode, print progress directly
[2026-03-05 02:05:44,130] [ais_bench] [INFO] Running 1-th replica of evaluation
[2026-03-05 02:05:46,859] [ais_bench] [INFO] Task vllm-api-general-chat/textvqa: {'accuracy': 0.0}
[2026-03-05 02:05:46,888] [ais_bench] [INFO] Evaluation task time elapsed: 4.99s
[2026-03-05 02:05:48,447] [ais_bench] [INFO] Evaluation tasks completed.
[2026-03-05 02:05:48,447] [ais_bench] [INFO] Summarizing evaluation results...
dataset version metric mode vllm-api-general-chat


textvqa 4005f4 accuracy gen 0.00
[2026-03-05 02:05:48,451] [ais_bench] [INFO] write summary to /workspace/benchmark/ais_bench/datasets/textvqa/train_images/outputs/default/20260305_020428/summary/summary_20260305_020428.txt
[2026-03-05 02:05:48,452] [ais_bench] [INFO] write csv to /workspace/benchmark/ais_bench/datasets/textvqa/train_images/outputs/default/20260305_020428/summary/summary_20260305_020428.csv

The markdown format results is as below:

dataset version metric mode vllm-api-general-chat
textvqa 4005f4 accuracy gen 0.00

[2026-03-05 02:05:48,452] [ais_bench] [INFO] write markdown summary to /workspace/benchmark/ais_bench/datasets/textvqa/train_images/outputs/default/20260305_020428/summary/summary_20260305_020428.md

具体预测值:
{"data_abbr": "textvqa", "id": 0, "success": true, "uuid": "f072c2c71fd3495e9bd3d49f0646d830", "origin_prompt": [{"role": "H UMAN", "prompt": [{"image_url": {"url": "file:///workspace/benchmark/ais_bench/datasets/textvqa/train_images/003a8ae2ef43b90 1.jpg"}, "type": "image_url"}, {"text": "what is the brand of this camera? Answer the question using a single word or phrase .", "type": "text"}]}], "prediction": "", "gold": [{"answer": "nous les gosses", "answer_confidence": "yes", "answer_id": 0} , {"answer": "dakota", "answer_confidence": "yes", "answer_id": 1}, {"answer": "clos culombu", "answer_confidence": "yes", " answer_id": 2}, {"answer": "dakota digital", "answer_confidence": "yes", "answer_id": 3}, {"answer": "dakota", "answer_confi dence": "yes", "answer_id": 4}, {"answer": "dakota", "answer_confidence": "yes", "answer_id": 5}, {"answer": "dakota digital ", "answer_confidence": "yes", "answer_id": 6}, {"answer": "dakota digital", "answer_confidence": "yes", "answer_id": 7}, {" answer": "dakota", "answer_confidence": "yes", "answer_id": 8}, {"answer": "dakota", "answer_confidence": "yes", "answer_id" : 9}]}
2 {"data_abbr": "textvqa", "id": 1, "success": true, "uuid": "34f79be48244405e81e046bcfb86ed80", "origin_prompt": [{"role": "H UMAN", "prompt": [{"image_url": {"url": "file:///workspace/benchmark/ais_bench/datasets/textvqa/train_images/b9dc400eb20bad6 4.jpg"}, "type": "image_url"}, {"text": "what does the small white text spell? Answer the question using a single word or ph rase.", "type": "text"}]}], "prediction": "", "gold": [{"answer": "copenhagen", "answer_confidence": "yes", "answer_id": 0}, {"answer": "copenhagen", "answer_confidence": "yes", "answer_id": 1}, {"answer": "copenhagen", "answer_confidence": "yes", "answer_id": 2}, {"answer": "copenhagen", "answer_confidence": "yes", "answer_id": 3}, {"answer": "copenhagen", "answer_conf idence": "yes", "answer_id": 4}, {"answer": "thursday", "answer_confidence": "yes", "answer_id": 5}, {"answer": "copenhagen" , "answer_confidence": "yes", "answer_id": 6}, {"answer": "copenhagen", "answer_confidence": "yes", "answer_id": 7}, {"answe r": "copenhagen", "answer_confidence": "yes", "answer_id": 8}, {"answer": "copenhagen", "answer_confidence": "yes", "answer_ id": 9}]}

前置检查

  • 我已读懂主页文档的快速入门,无法解决问题
  • 我已检索过FAQ,无重复问题
  • 我已搜索过现有Issue,无重复问题
  • 我已更新到最新版本,问题仍存在

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcontent_check_passedissue content check passed

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions