FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation

Chenxi Zhang1,2*, Ziliang Gan1,3*, Liyun Zhu1*, Youwei Pang4, Qing Zhang5, Rongjunchen Zhang1♠

1HiThink Research   2Wuhan University   3Zhejiang University   4Nanyang Technological University   5Shanghai Institute of Technology
*Equal Contribution   ♠Corresponding Author
Correspondence: zhangrongjunchen@myhexin.com

📖 [Paper] | 🏠 [Project Page] | 🤗 [Huggingface]


Overview of FinMTM: task types and capability coverage.


🔥 Updates

  • 2026-01: Initial release of benchmark dataset and paper.
  • TBD: Online leaderboard opens for submissions.



🧭 Overview

Financial reasoning is challenging for VLMs due to specialized chart formats, dense domain knowledge, long-horizon dependencies, and evidence-grounded tool use. Existing benchmarks are mostly single-turn and do not sufficiently measure multi-turn dialogue stability, session-level memory, or agentic planning and execution.

FinMTM addresses this gap by providing three task families (an illustrative data record follows the list):

  • Objective questions: single- and multiple-choice questions grounded in financial visuals.

  • Open-ended questions: multi-turn conversations that stress compositional reasoning, multi-step calculation, self-correction, and memory.

  • Financial agent task: tool-augmented multi-source workflows with long-horizon planning and evidence-grounded answers.
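
For a concrete sense of the data, a single open-ended session could be stored as a JSONL record along the following lines. This is a hypothetical sketch only; field names such as session_id, capability, and turns are assumptions, not the released schema:

# Hypothetical shape of one multi-turn record (illustrative, not the released format).
example_session = {
    "session_id": "open_0001",                # assumed identifier field
    "image": "charts/candlestick_0001.png",   # financial visual the turns are grounded in
    "capability": "calculation",              # targeted cognitive requirement of the session
    "turns": [
        {"role": "user", "content": "What is the closing price on the last day shown?"},
        {"role": "assistant", "content": "..."},
        {"role": "user", "content": "What is the percentage change from the first day's close?"},
    ],
}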


Data Construction Pipeline

We propose a multi-stage data construction pipeline to scale up the creation of multi-turn financial sessions while ensuring alignment with targeted cognitive requirements and traceability to verifiable evidence.

Fig. X. Multi-stage construction pipeline of FinMTM: we progressively build (i) objective visually grounded items, (ii) multi-turn open-ended sessions emphasizing composition, calculation, self-correction, and memory, and (iii) agentic workflows with tool planning, tool execution, and evidence-grounded responses.

📊 Results

We benchmark 22 leading VLMs on FinMTM. The final score is the average of the three task scores: Objective Questions, Open-Ended Questions, and Financial Agent, aggregated as sketched below.

Comparison of leading VLMs on FinMTM. Final score is the average of Objective, Open-Ended, and Agent tasks.
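
As a worked example of this aggregation, here is a minimal sketch; it assumes each task score is the unweighted mean of its sub-columns, which may differ from the exact weighting used in the paper:

def final_score(objective: list, open_ended: list, agent: list) -> float:
    """Mean of the three task scores; each task score is the mean of its sub-columns."""
    task_means = [sum(cols) / len(cols) for cols in (objective, open_ended, agent)]
    return sum(task_means) / len(task_means)

# Illustration with ChatGPT-4o's row from the table below:
# final_score([79.3, 49.1], [77.2, 76.8, 46.2, 38.9], [29.7, 34.8]) ≈ 52.1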

Benchmark Results


Column Definitions

  • Objective Questions: Single-choice (Obj-Single), Multiple-choice (Obj-Multi)
  • Open-Ended Questions: Comprehension (Open-Com.), Calculation (Open-Cal.), Self-Correction (Open-SelfCorr.), Memory (Open-Mem.)
  • Financial Agent Tasks: with fuzzing, i.e., identifiable entities removed (Agent-w fuzz); without fuzzing (Agent-w/o fuzz)
| Method | Obj-Single | Obj-Multi | Open-Com. | Open-Cal. | Open-SelfCorr. | Open-Mem. | Agent-w fuzz | Agent-w/o fuzz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | | | |
| ChatGPT-4o | 79.3 | 49.1 | 77.2 | 76.8 | 46.2 | 38.9 | 29.7 | 34.8 |
| ChatGPT-o3* | 85.8 | 73.3 | 83.8 | 78.6 | 52.8 | 43.6 | 31.4 | 35.2 |
| ChatGPT-5* | 89.0 | 79.6 | 86.9 | 80.7 | 56.9 | 46.7 | 35.9 | 49.7 |
| Gemini 3 Flash | 91.9 | 78.1 | 82.2 | 76.0 | 55.4 | 41.6 | 53.6 | 62.6 |
| Grok-4-fast-non-reasoning* | 71.0 | 46.8 | 66.0 | 61.2 | 39.9 | 24.8 | 30.2 | 39.7 |
| Gemini 3 Pro | 92.1 | 78.4 | 87.5 | 82.8 | 58.8 | 48.5 | 48.3 | 54.3 |
| **InternVL Series** | | | | | | | | |
| InternVL2.5-8B | 63.8 | 25.7 | 55.1 | 49.2 | 26.5 | 16.7 | 8.4 | 10.5 |
| InternVL2.5-26B | 70.5 | 31.3 | 61.7 | 57.7 | 32.3 | 22.8 | 11.2 | 14.0 |
| InternVL2.5-40B | 72.3 | 35.2 | 66.1 | 64.6 | 36.2 | 26.7 | 13.5 | 16.8 |
| InternVL3-78B | 75.6 | 42.4 | 76.2 | 77.6 | 43.6 | 32.6 | 18.2 | 22.8 |
| **Other VL Series** | | | | | | | | |
| MiMo-VL-7B | 61.1 | 21.4 | 75.1 | 75.4 | 47.2 | 39.9 | 20.2 | 25.5 |
| GLM4.5V-108B | 73.7 | 51.0 | 85.4 | 79.6 | 51.1 | 42.2 | 26.5 | 32.4 |
| **Qwen VL Series** | | | | | | | | |
| Qwen2.5-VL-3B | 64.5 | 16.4 | 68.2 | 67.7 | 40.5 | 27.6 | 9.4 | 11.9 |
| Qwen2.5-VL-7B | 73.4 | 24.1 | 74.3 | 73.4 | 43.1 | 33.9 | 11.1 | 14.2 |
| Qwen3-VL-4B-Instruct | 73.3 | 34.2 | 74.5 | 71.2 | 39.5 | 25.9 | 15.1 | 19.1 |
| Qwen3-VL-4B-Thinking | 66.1 | 24.3 | 71.2 | 68.5 | 42.5 | 31.0 | 12.8 | 15.6 |
| Qwen3-VL-30B-A3B-Instruct | 77.2 | 47.3 | 82.1 | 76.5 | 42.5 | 33.7 | 16.2 | 20.8 |
| Qwen3-VL-30B-A3B-Thinking | 71.5 | 49.4 | 80.7 | 67.1 | 44.2 | 35.1 | 18.9 | 23.3 |
| Qwen3-VL-32B-Instruct | 84.5 | 39.9 | 84.3 | 80.7 | 50.8 | 40.3 | 19.6 | 25.1 |
| Qwen3-VL-32B-Thinking | 83.4 | 46.5 | 80.3 | 68.6 | 43.5 | 33.7 | 23.2 | 28.6 |
| Qwen3-VL-235B-A22B-Instruct | 81.3 | 48.5 | 85.5 | 80.9 | 54.5 | 41.5 | 32.1 | 38.7 |
| Qwen3-VL-235B-A22B-Thinking | 80.5 | 42.3 | 84.5 | 79.4 | 52.5 | 43.0 | 35.2 | 41.5 |

💡 Key Observations

  • Agentic settings expose larger gaps between models than reasoning-only settings.
  • Removing identifiable entities (the fuzzing setting) increases difficulty and stresses evidence-grounded reasoning.
  • Scaling helps, but robust tool planning and execution remain a major bottleneck for open-source models.

📏 Evaluation

FinMTM uses task-aware evaluation protocols across the three tasks.

1) Objective Questions

  • Single-choice: exact-match scoring over the predicted option.
  • Multiple-choice: a set-overlap rule (precision/recall/F-score style) that penalizes missing or spurious selections, as sketched below.
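
A minimal sketch of the set-overlap rule, assuming a standard F1 between the predicted and gold option sets (the released scorer may weight precision and recall differently):

def multi_choice_f1(pred: set, gold: set) -> float:
    """F1 between predicted and gold option sets, e.g. {'A', 'C'} vs {'A', 'B', 'C'}."""
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision = overlap / len(pred)  # spurious selections lower precision
    recall = overlap / len(gold)     # missing selections lower recall
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# multi_choice_f1({'A', 'C'}, {'A', 'B', 'C'}) -> 0.8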

2) Open-Ended Dialogues (Multi-turn)

We score dialogues with a weighted combination of the following (see the sketch after this list):

  • turn-level quality (per-turn correctness, grounding, reasoning quality)
  • session-level quality (cross-turn consistency, long-context stability, memory correctness)
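
A minimal sketch of this combination, assuming equal weights and sub-scores in [0, 1]; the actual weights are an assumption here:

def dialogue_score(turn_scores: list, session_score: float,
                   w_turn: float = 0.5, w_session: float = 0.5) -> float:
    """Combine per-turn quality with session-level quality for one dialogue."""
    turn_avg = sum(turn_scores) / len(turn_scores)        # per-turn correctness, grounding, reasoning
    return w_turn * turn_avg + w_session * session_score  # cross-turn consistency, stability, memory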

Notably, the level taxonomy is defined at the session level, i.e., each level characterizes the overall cognitive requirement of an entire multi-turn conversation rather than any single turn in isolation.

3) Financial Agent Tasks

We evaluate the following components, combined as sketched after the list:

  • planning quality (step ordering, tool selection, decomposition)
  • tool execution (tool name + core args correctness; evidence sufficiency)
  • final outcome (answer correctness + evidence-grounded summarization)
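
A minimal sketch of how these components might be combined, assuming each is scored in [0, 1] and weighted equally; exact tool-call matching and the equal weighting are assumptions:

def agent_score(plan_quality: float, tool_calls: list,
                gold_calls: list, outcome: float) -> float:
    """Combine planning quality, tool-execution accuracy, and final-outcome score."""
    # Count a predicted (tool_name, core_args) pair as correct only if it
    # exactly matches a gold call; the released evaluator may be more lenient.
    correct = sum(1 for call in tool_calls if call in gold_calls)
    execution = correct / len(gold_calls) if gold_calls else 0.0
    return (plan_quality + execution + outcome) / 3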

⚡ Quickstart

1. Environment Setup

Download the dataset from the Hugging Face link above. For evaluation, run the following commands to set up the environment:

cd finmtm
conda create -n finmtm_env python=3.10 -y
conda activate finmtm_env
pip install -r requirements.txt

2. Inference

2.1 Inference for Objective Questions (Single/Multiple Choice)

cd ./inference/SC_MC
chmod +x etest.sh
./etest.sh

2.2 Inference for Multi-Turn QA

cd ./inference/MTQA
chmod +x etest.sh
./etest.sh

2.3 General Inference Command (Optional)

To customize inference parameters, run the command below directly:

python inference.py \
  --backend qwen3vl \
  --api-base http://localhost:8000/v1 \
  --model qwen3vl-4b-instruct \
  --input-dir ./inputs \
  --output-dir ./outputs \
  --include "*.jsonl"

3. Evaluation

To evaluate multi-turn QA outputs, run the following commands:

# --dirs: directory of data to evaluate
# --client: client type
# --api_base: API service address
# --model: model used as the evaluator
python -m eval_runner.main \
  --dirs /path/to/data \
  --client qwen \
  --api_base http://127.0.0.1:8000/v1 \
  --model Qwen3-VL-30B-A3B-Instruct

# Alternatively, run via the script (optional)
chmod +x etest.sh
./etest.sh

📄 License


  • Code: Apache 2.0
  • Dataset: CC BY-NC 4.0 (research use only); use must also comply with https://openai.com/policies/terms-of-use.

📚 Citation

If you find our work useful, please consider citing:

@article{
  Coming Soon!
}
