Chenxi Zhang1,2*, Ziliang Gan1,3*, Liyun Zhu1*, Youwei Pang4, Qing Zhang5, Rongjunchen Zhang1♠
1HiThink Research  2Wuhan University  3Zhejiang University  4Nanyang Technological University  5Shanghai Institute of Technology
*Equal Contribution ♠Corresponding Author
Correspondence: zhangrongjunchen@myhexin.com
📖 [Paper] | 🏠 [Project Page] | 🤗 [Huggingface]
Overview of FinMTM: task types and capability coverage.
- 2026-01: Initial release of benchmark dataset and paper.
- TBD: Online leaderboard opens for submissions.
Financial reasoning is challenging for VLMs due to specialized chart formats, dense domain knowledge, long-horizon dependencies, and evidence-grounded tool use. Existing benchmarks are mostly single-turn and do not sufficiently measure multi-turn dialogue stability, session-level memory, or agentic planning and execution.
FinMTM addresses this gap by providing:
- Objective questions: single- and multiple-choice questions grounded in financial visuals.
- Open-ended questions: multi-turn conversations that stress compositional reasoning, multi-step calculation, self-correction, and memory.
- Financial agent tasks: tool-augmented, multi-source workflows that require long-horizon planning and evidence-grounded answers.
Data Construction Pipeline
We propose a novel multi-stage data construction pipeline to scale multi-turn financial sessions, ensuring alignment with targeted cognitive requirements and traceability to verifiable evidence.
Our multi-stage construction pipeline. We progressively build (i) objective visual-grounded items, (ii) multi-turn open-ended sessions emphasizing composition/calculation/self-correction/memory, and (iii) agentic workflows with tool planning, tool execution, and evidence-grounded responses.
We benchmark 22 leading VLMs on FinMTM. The final score is the average of the three task scores: Objective Questions, Open-Ended Questions, and Financial Agent.
Comparison of leading VLMs on FinMTM. Final score is the average of Objective, Open-Ended, and Agent tasks.
Benchmark Results (Click to Expand)
Column Definitions
- Objective Questions: Single-choice (Obj-Single), Multiple-choice (Obj-Multi)
- Open-Ended Questions: Comprehension (Open-Com.), Calculation (Open-Cal.), Self-Correction (Open-SelfCorr.), Memory (Open-Mem.)
- Financial Agent Tasks: With fuzzing (Agent-w fuzz), Without fuzzing (Agent-w/o fuzz)
| Method | Obj-Single | Obj-Multi | Open-Com. | Open-Cal. | Open-SelfCorr. | Open-Mem. | Agent-w fuzz | Agent-w/o fuzz |
|---|---|---|---|---|---|---|---|---|
| Proprietary Models | ||||||||
| ChatGPT-4o | 79.3 | 49.1 | 77.2 | 76.8 | 46.2 | 38.9 | 29.7 | 34.8 |
| ChatGPT-o3* | 85.8 | 73.3 | 83.8 | 78.6 | 52.8 | 43.6 | 31.4 | 35.2 |
| ChatGPT-5* | 89.0 | 79.6 | 86.9 | 80.7 | 56.9 | 46.7 | 35.9 | 49.7 |
| Gemini 3 Flash | 91.9 | 78.1 | 82.2 | 76.0 | 55.4 | 41.6 | 53.6 | 62.6 |
| Grok-4-fast-non-reasoning* | 71.0 | 46.8 | 66.0 | 61.2 | 39.9 | 24.8 | 30.2 | 39.7 |
| Gemini 3 Pro | 92.1 | 78.4 | 87.5 | 82.8 | 58.8 | 48.5 | 48.3 | 54.3 |
| InternVL Series | ||||||||
| InternVL2.5-8B | 63.8 | 25.7 | 55.1 | 49.2 | 26.5 | 16.7 | 8.4 | 10.5 |
| InternVL2.5-26B | 70.5 | 31.3 | 61.7 | 57.7 | 32.3 | 22.8 | 11.2 | 14.0 |
| InternVL2.5-40B | 72.3 | 35.2 | 66.1 | 64.6 | 36.2 | 26.7 | 13.5 | 16.8 |
| InternVL3-78B | 75.6 | 42.4 | 76.2 | 77.6 | 43.6 | 32.6 | 18.2 | 22.8 |
| Other VL Series | ||||||||
| MiMo-VL-7B | 61.1 | 21.4 | 75.1 | 75.4 | 47.2 | 39.9 | 20.2 | 25.5 |
| GLM4.5V-108B | 73.7 | 51.0 | 85.4 | 79.6 | 51.1 | 42.2 | 26.5 | 32.4 |
| Qwen VL Series | ||||||||
| Qwen2.5-VL-3B | 64.5 | 16.4 | 68.2 | 67.7 | 40.5 | 27.6 | 9.4 | 11.9 |
| Qwen2.5-VL-7B | 73.4 | 24.1 | 74.3 | 73.4 | 43.1 | 33.9 | 11.1 | 14.2 |
| Qwen3-VL-4B-Instruct | 73.3 | 34.2 | 74.5 | 71.2 | 39.5 | 25.9 | 15.1 | 19.1 |
| Qwen3-VL-4B-Thinking | 66.1 | 24.3 | 71.2 | 68.5 | 42.5 | 31.0 | 12.8 | 15.6 |
| Qwen3-VL-30B-A3B-Instruct | 77.2 | 47.3 | 82.1 | 76.5 | 42.5 | 33.7 | 16.2 | 20.8 |
| Qwen3-VL-30B-A3B-Thinking | 71.5 | 49.4 | 80.7 | 67.1 | 44.2 | 35.1 | 18.9 | 23.3 |
| Qwen3-VL-32B-Instruct | 84.5 | 39.9 | 84.3 | 80.7 | 50.8 | 40.3 | 19.6 | 25.1 |
| Qwen3-VL-32B-Thinking | 83.4 | 46.5 | 80.3 | 68.6 | 43.5 | 33.7 | 23.2 | 28.6 |
| Qwen3-VL-235B-A22B-Instruct | 81.3 | 48.5 | 85.5 | 80.9 | 54.5 | 41.5 | 32.1 | 38.7 |
| Qwen3-VL-235B-A22B-Thinking | 80.5 | 42.3 | 84.5 | 79.4 | 52.5 | 43.0 | 35.2 | 41.5 |
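For orientation, the sketch below recomputes a final score from one row of the table above (values for ChatGPT-4o). Averaging the sub-columns uniformly within each task is our assumption; only the final average over the three task scores is stated in the text.

```python
# Recompute task-level and final scores from a table row (ChatGPT-4o above).
# Uniform averaging of sub-columns within each task is an assumption;
# only the final average over the three task scores is stated in the text.
row = {
    "Obj-Single": 79.3, "Obj-Multi": 49.1,
    "Open-Com.": 77.2, "Open-Cal.": 76.8, "Open-SelfCorr.": 46.2, "Open-Mem.": 38.9,
    "Agent-w fuzz": 29.7, "Agent-w/o fuzz": 34.8,
}

objective = (row["Obj-Single"] + row["Obj-Multi"]) / 2
open_ended = (row["Open-Com."] + row["Open-Cal."] + row["Open-SelfCorr."] + row["Open-Mem."]) / 4
agent = (row["Agent-w fuzz"] + row["Agent-w/o fuzz"]) / 2
final = (objective + open_ended + agent) / 3
print(round(final, 1))
```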
- Agentic settings expose larger gaps between models than reasoning-only settings.
- Removing identifiable entities increases difficulty and stresses evidence-grounded reasoning.
- Scaling helps, but robust tool planning and execution remain a major bottleneck for open-source models.
FinMTM uses task-aware evaluation protocols for each of the three tasks.
Objective questions
- Exact-match scoring over the predicted option(s).
- Multiple-choice items use a set-overlap rule (precision/recall/F1 style) to penalize missing or spurious selections, as sketched below.
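A minimal sketch of such a set-overlap rule, assuming a plain F1 over option sets (`score_multi_choice` is a hypothetical helper, not FinMTM's released scorer):

```python
def score_multi_choice(pred: set[str], gold: set[str]) -> float:
    """F1-style overlap between predicted and gold option sets.

    Missing selections lower recall; spurious selections lower precision.
    """
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: gold options {A, C}, prediction {A, B} -> precision 0.5, recall 0.5, F1 0.5
print(score_multi_choice({"A", "B"}, {"A", "C"}))
```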
Open-ended questions
We score dialogues with a weighted combination of (see the sketch below):
- turn-level quality (per-turn correctness, grounding, reasoning quality)
- session-level quality (cross-turn consistency, long-context stability, memory correctness)
Notably, the level taxonomy is defined at the session level: each level characterizes the overall cognitive requirement of an entire multi-turn conversation rather than any single turn in isolation.
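A minimal sketch of one way to combine the two levels, assuming equal-weight averages within each level and illustrative 0.6/0.4 weights between them (not FinMTM's exact protocol):

```python
def score_open_ended_session(turn_scores: list[float],
                             session_scores: dict[str, float],
                             turn_weight: float = 0.6,
                             session_weight: float = 0.4) -> float:
    """Combine per-turn quality with session-level quality for one dialogue.

    turn_scores: per-turn scores in [0, 1] (correctness, grounding, reasoning).
    session_scores: session-level scores in [0, 1], e.g. cross-turn consistency,
    long-context stability, memory correctness. The 0.6/0.4 split is illustrative.
    """
    turn_quality = sum(turn_scores) / len(turn_scores)
    session_quality = sum(session_scores.values()) / len(session_scores)
    return turn_weight * turn_quality + session_weight * session_quality

# Example with made-up scores for a three-turn session
print(score_open_ended_session(
    [0.9, 0.7, 0.8],
    {"consistency": 0.8, "stability": 0.9, "memory": 0.6},
))
```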
Financial agent tasks
We evaluate:
- planning quality (step ordering, tool selection, decomposition)
- tool execution (tool name and core argument correctness; evidence sufficiency)
- final outcome (answer correctness and evidence-grounded summarization)
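A minimal sketch of how these three components could be aggregated; the per-call credit split and the equal weighting of the components are assumptions for illustration, not FinMTM's released scorer:

```python
def score_tool_call(pred: dict, gold: dict) -> float:
    """Score one tool call: correct tool name plus core-argument matches.

    pred/gold have the form {"tool": str, "args": {name: value}}.
    The 0.5/0.5 credit split between tool name and arguments is an assumption.
    """
    if pred.get("tool") != gold.get("tool"):
        return 0.0
    gold_args = gold.get("args", {})
    if not gold_args:
        return 1.0
    pred_args = pred.get("args", {})
    matched = sum(1 for k, v in gold_args.items() if pred_args.get(k) == v)
    return 0.5 + 0.5 * matched / len(gold_args)

def score_agent_task(planning: float,
                     tool_calls_pred: list[dict],
                     tool_calls_gold: list[dict],
                     outcome: float) -> float:
    """Equal-weight average of planning quality, tool execution, and final outcome."""
    execution = sum(score_tool_call(p, g)
                    for p, g in zip(tool_calls_pred, tool_calls_gold))
    execution /= max(len(tool_calls_gold), 1)
    return (planning + execution + outcome) / 3
```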
Download the dataset from the Hugging Face link above. For evaluation, run the following commands to set up the environment:
cd finmtm
conda create -n finmtm_env python=3.10 -y
conda activate finmtm_env
pip install -r requirements.txt

To run inference for the objective (single/multiple-choice) questions:

cd ./inference/SC_MC
chmod +x etest.sh
./etest.sh

To run inference for the multi-turn QA tasks:

cd ./inference/MTQA
chmod +x etest.sh
./etest.sh

To customize inference parameters, run the command below directly:
python inference.py \
--backend qwen3vl \
--api-base http://localhost:8000/v1 \
--model qwen3vl-4b-instruct \
--input-dir ./inputs \
--output-dir ./outputs \
--include "*.jsonl"

For results of the multi-turn QA tasks, run the following commands to start evaluation:
# --dirs: directory of data to evaluate
# --client: client type
# --api_base: API service address
# --model: model used for evaluation
python -m eval_runner.main \
  --dirs /path/to/data \
  --client qwen \
  --api_base http://127.0.0.1:8000/v1 \
  --model Qwen3-VL-30B-A3B-Instruct
# Alternatively, run via the script (optional)
chmod +x etest.sh
./etest.sh

Code: Apache 2.0. Dataset: CC BY-NC 4.0, research use only. Use must comply with: https://openai.com/policies/terms-of-use.
If you find our work useful, please consider citing:
@article{
Coming Soon!
}

