This repository contains the data and resources accompanying our paper: “Context is Ubiquitous, but Rarely Changes Judgments: Revisiting Document-Level MT Evaluation”
📄 @WMT 2025 Research Paper
Figure 1: The evaluation process of FALCON (Kim, 2025), consisting of (1) labeling relevant contextual knowledge and (2) assessing translation skills, followed by (3) rating. In H-FALCON, this dual-phase process is streamlined by conducting labeling and rating simultaneously for all sentences.

- H-FALCON Protocol: A reproducible document-level human evaluation method for MT aligned with human preferences.
- Empirical Findings: Contextual information is universal but exerts limited impact on holistic MT judgments.
- Call for Richer Metrics: Moving beyond narrow sentence-bounded metrics toward document-level, pragmatic evaluation.
As sentence-level performance in modern Machine Translation (MT) has plateaued, reliable document-level evaluation is increasingly needed. While the recent FALCON (Kim, 2025) framework with pragmatic features offers a promising direction, its reliability and reproducibility are unclear. We address this gap through human evaluation, analyzing sources of low inter-annotator agreement and identifying key factors. Based on these findings, we introduce H-FALCON, a Human-centered refinement of FALCON. Our experiments show that, even with limited annotator consensus, H-FALCON achieves correlations comparable to or better than standard sentence-level protocols.

Furthermore, we find that contextual information is inherent in all sentences, challenging the view that only some require it. This suggests that prior estimates such as “n% of sentences require context” may stem from methodological artifacts. At the same time, we show that while context is pervasive, not all of it directly influences human judgment.
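For readers who want to reproduce agreement numbers of the kind discussed above, here is a minimal sketch of pairwise Cohen's kappa, one common chance-corrected agreement statistic. The annotator names and ratings are hypothetical, and the paper's exact agreement measure may differ.

```python
# Minimal sketch: pairwise Cohen's kappa between annotators on the same
# segments. Hypothetical data; not the paper's exact analysis.
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] / n * cb[l] / n for l in set(a) | set(b))  # chance agreement
    return (observed - expected) / (1 - expected)             # assumes expected < 1

# Hypothetical 1-5 ratings from three annotators on five segments.
ratings = {
    "A1": [3, 2, 5, 4, 1],
    "A2": [3, 3, 5, 4, 2],
    "A3": [2, 2, 4, 4, 1],
}
for (n1, r1), (n2, r2) in combinations(ratings.items(), 2):
    print(f"kappa({n1}, {n2}) = {cohen_kappa(r1, r2):.3f}")
```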
- `data/`: Evaluation dataset of WMT24++ (en–ko) (Deutsch et al., 2025).
- `model/`: Model judgments from GPT-4o-mini, o3, and o4-mini.
- `human/`: Human judgments of FALCON and H-FALCON.
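As a starting point for working with these files, the sketch below correlates one model's judgments with human judgments at the segment level. The file names and the `segment_id`/`score` column schema are assumptions for illustration; check the actual files in the repository.

```python
# Minimal sketch, assuming CSV files with "segment_id" and "score" columns.
# The paths below are hypothetical placeholders.
import pandas as pd
from scipy.stats import kendalltau, pearsonr

human = pd.read_csv("human/h-falcon_scores.csv")     # hypothetical path
model = pd.read_csv("model/gpt-4o-mini_scores.csv")  # hypothetical path

# Align the two score sets on shared segment IDs, then correlate.
merged = human.merge(model, on="segment_id", suffixes=("_human", "_model"))
print("Pearson:", pearsonr(merged["score_human"], merged["score_model"])[0])
print("Kendall:", kendalltau(merged["score_human"], merged["score_model"])[0])
```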
A demo of the H-FALCON human evaluation environment will be released soon. Stay tuned!
If you use this repository, please cite our paper:
```bibtex
@InProceedings{kim:2025:WMT2,
  author    = {Kim, Ahrii},
  title     = {A Preliminary Study of AI Agent Model in Machine Translation},
  booktitle = {Proceedings of the Tenth Conference on Machine Translation (WMT 2025)},
  month     = {November},
  year      = {2025},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  pages     = {583--586},
  abstract  = {We present IR\_Multi-agentMT, our submission to the WMT25 General Shared Task. The system adopts an AI-agent paradigm implemented through a multi-agent workflow, Prompt Chaining, in combination with RUBRIC-MQM, an automatic MQM-based error annotation metric. Our primary configuration follows the Translate–Postedit–Proofread paradigm, where each stage progressively enhances translation quality. We conduct a preliminary study to investigate (i) the impact of initial translation quality and (ii) the effect of enforcing explicit responses from the Postedit Agent. Our findings highlight the importance of both factors in shaping the overall performance of multi-agent translation systems.},
  url       = {https://aclanthology.org/2025.wmt-1.32}
}
```
The data is licensed under CC BY 4.0.
