H-FALCON: Human-centered Functional Assessment of Language and Contextuality in Narratives

This repository contains the data and resources accompanying our paper: “Context is Ubiquitous, but Rarely Changes Judgments: Revisiting Document-Level MT Evaluation”

📄 @WMT 2025 Research Paper

Figure 1: The evaluation process of FALCON (Kim, 2025), consisting of (1) labeling relevant contextual knowledge and (2) assessing translation skills, followed by (3) rating. In H-FALCON, this dual-phase process is streamlined by simultaneously conducting labeling and rating for all sentences.
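
To make the single-pass design concrete, here is a minimal sketch of what one H-FALCON annotation record might look like when context labeling and rating are collected together. The field names (doc_id, context_labels, rating) are our own illustrative assumptions, not the repository's actual schema.

```python
# Illustrative sketch of a single-pass H-FALCON annotation record,
# where context labeling and rating happen in one step.
# All field names are hypothetical, not the repository's schema.
from dataclasses import dataclass, field


@dataclass
class SentenceAnnotation:
    doc_id: str                  # document the sentence belongs to
    sent_id: int                 # position of the sentence within the document
    context_labels: list[str] = field(default_factory=list)  # e.g. ["coreference"]
    rating: float = 0.0          # holistic quality score for this sentence


# The annotator labels relevant context and rates the sentence in one pass.
ann = SentenceAnnotation(
    doc_id="wmt24pp_enko_0001",  # hypothetical document identifier
    sent_id=3,
    context_labels=["coreference"],
    rating=85.0,
)
print(ann)
```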

🔑 Key Contributions

  1. H-FALCON Protocol: A reproducible document-level human evaluation method for MT that aligns with human preferences.
  2. Empirical Findings: Contextual information is ubiquitous but exerts limited impact on holistic MT judgments.
  3. Call for Richer Metrics: Moving beyond narrow sentence-bounded metrics toward document-level, pragmatic evaluation.

Abstract

As sentence-level performance in modern Machine Translation (MT) has plateaued, reliable document-level evaluation is increasingly needed. While the recent FALCON (Kim, 2025) framework with pragmatic features offers a promising direction, its reliability and reproducibility are unclear. We address this gap through human evaluation, analyzing sources of low inter-annotator agreement and identifying key factors. Based on these findings, we introduce H-FALCON, a Human-centered refinement of FALCON. Our experiments show that, even with limited annotator consensus, H-FALCON achieves correlations comparable to or better than standard sentence-level protocols.

Furthermore, we find that contextual information is inherent in all sentences, challenging the view that only some require it. This suggests that prior estimates such as “n% of sentences require context” may stem from methodological artifacts. At the same time, we show that while context is pervasive, not all of it directly influences human judgment.

Content

  • data/ : Evaluation dataset of WMT24++ (en–ko) (Deutsch et al., 2025).
  • model/ : Model judgments from GPT-4o-mini, o3, and o4-mini.
  • human/ : Human judgments of FALCON and H-FALCON.
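
As a rough illustration of how the released judgments could be compared, the sketch below computes segment-level correlations between one model's scores and the H-FALCON human ratings. The file paths and column names (model/o3.csv, human/h_falcon.csv, seg_id, score) are assumptions for illustration only; adapt them to the actual layout of the model/ and human/ folders.

```python
# Hypothetical sketch: correlate model judgments with human judgments.
# File paths and column names are assumptions, not the repository's layout.
import pandas as pd
from scipy.stats import kendalltau, pearsonr

model = pd.read_csv("model/o3.csv")        # assumed: one score per segment
human = pd.read_csv("human/h_falcon.csv")  # assumed: H-FALCON ratings per segment

# Align model and human scores on a shared segment identifier.
merged = model.merge(human, on="seg_id", suffixes=("_model", "_human"))

r, _ = pearsonr(merged["score_model"], merged["score_human"])
tau, _ = kendalltau(merged["score_model"], merged["score_human"])
print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
```

Kendall's tau is shown alongside Pearson's r because rank correlation is the more common choice in MT metric evaluation, where absolute score scales differ across annotators and systems.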

🚀 Demo

A demo of the H-FALCON human evaluation environment will be released soon. Stay tuned!

Citation

If you use this repository, please cite our paper:

@InProceedings{kim:2025:WMT2,
  author    = {Kim, Ahrii},
  title     = {A Preliminary Study of AI Agent Model in Machine Translation},
  booktitle = {Proceedings of the Tenth Conference on Machine Translation (WMT 2025)},
  month     = {November},
  year      = {2025},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  pages     = {583--586},
  abstract  = {We present IR\_Multi-agentMT, our submission to the WMT25 General Shared Task. The system adopts an AI-agent paradigm implemented through a multi-agent workflow, Prompt Chaining, in combination with RUBRIC-MQM, an automatic MQM-based error annotation metric. Our primary configuration follows the Translate–Postedit–Proofread paradigm, where each stage progressively enhances translation quality. We conduct a preliminary study to investigate (i) the impact of initial translation quality and (ii) the effect of enforcing explicit responses from the Postedit Agent. Our findings highlight the importance of both factors in shaping the overall performance of multi-agent translation systems.},
  url       = {https://aclanthology.org/2025.wmt-1.32}
}

License

The data is licensed under CC BY 4.0.
