# Learning High-Quality Machine Translation Evaluation from Human Preferences with Reward Modeling
ReMedy is a new state-of-the-art machine translation (MT) evaluation framework that reframes the task as reward modeling rather than direct regression. Instead of relying on noisy human scores, ReMedy learns from pairwise human preferences, leading to better alignment with human judgments.
- State-of-the-art accuracy on WMT22–24 (39 language pairs, 111 systems)
- Segment- and system-level evaluation, outperforming GPT-4, PaLM-540B, Finetuned-PaLM2, MetricX-13B, and XCOMET
- More robust on low-quality and out-of-domain translations (ACES, MSLC benchmarks)
- Can be used as a reward model in RLHF pipelines to improve MT systems
ReMedy demonstrates that reward modeling with pairwise preferences offers a more reliable and human-aligned approach for MT evaluation.
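The exact training objective is described in the paper, but reward models learned from pairwise preferences are commonly trained with a Bradley–Terry-style loss: the model is pushed to assign a higher scalar reward to the translation humans preferred. A minimal sketch (illustrative only; the function name and the precise loss used by ReMedy are assumptions):

```python
import math

def pairwise_reward_loss(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    The probability that the preferred translation outranks the rejected
    one is modeled as sigmoid(r_preferred - r_rejected); minimizing this
    loss widens the reward margin between the two translations.
    """
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal rewards the loss is ln 2 (the model is indifferent), and it shrinks toward zero as the preferred translation's reward pulls ahead.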
- Quick Installation
- Requirements
- Usage
- Full Argument List
- Model Variants
- Reproducing WMT Results
- Citation
## Quick Installation

ReMedy requires Python ≥ 3.12 and leverages vLLM for fast inference.

Install from PyPI:

```bash
pip install --upgrade pip
pip install remedy-mt-eval
```

Or install from source:

```bash
git clone https://github.com/Smu-Tan/Remedy
cd Remedy
pip install -e .
```

Or with Poetry:

```bash
git clone https://github.com/Smu-Tan/Remedy
cd Remedy
poetry install
```

## Requirements

- Python ≥ 3.12
- transformers < 4.54.0
- vllm == 0.9.2
- torch ≥ 2.6.0
- See `pyproject.toml` for full dependencies
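In practice pip or Poetry resolves these pins from `pyproject.toml`, but the comparison logic behind pins like `transformers<4.54.0` is just tuple-wise version comparison. A small illustrative sketch (helper names are made up for this example):

```python
def version_tuple(v: str) -> tuple[int, ...]:
    """Parse a dotted version string like '4.53.2' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def satisfies(installed: str, op: str, required: str) -> bool:
    """Check one simple pin, e.g. satisfies('4.53.2', '<', '4.54.0')."""
    a, b = version_tuple(installed), version_tuple(required)
    return {"<": a < b, ">=": a >= b, "==": a == b}[op]
```

Real resolvers handle pre-releases and epochs (see PEP 440); this sketch covers only plain dotted versions.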
## Usage

First, download a model from HuggingFace:

```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download ShaomuTan/ReMedy-9B-22 --local-dir Models/ReMedy-9B-22
```

You can replace ReMedy-9B-22 with other variants such as ReMedy-9B-23.
Reference-based scoring:

```bash
remedy-score \
    --model ShaomuTan/ReMedy-9B-22 \
    --src_file ./testcase/en.src \
    --mt_file ./testcase/en-de.hyp \
    --ref_file ./testcase/de.ref \
    --src_lang en --tgt_lang de \
    --cache_dir Models \
    --save_dir testcase \
    --num_gpus 4 \
    --calibrate \
    --gpu_memory_utilization 0.9
```

Reference-free (QE) scoring:

```bash
remedy-score \
    --model ShaomuTan/ReMedy-9B-22 \
    --src_file ./testcase/en.src \
    --mt_file ./testcase/en-de.hyp \
    --no_ref \
    --src_lang en --tgt_lang de \
    --cache_dir Models \
    --save_dir ./testcase \
    --num_gpus 4 \
    --calibrate \
    --gpu_memory_utilization 0.9
```

Each run writes the following files to the save directory:

- `src-tgt_raw_scores.txt`
- `src-tgt_sigmoid_scores.txt`
- `src-tgt_calibration_scores.txt`
- `src-tgt_detailed_results.tsv`
- `src-tgt_result.json`
Inspired by SacreBLEU, ReMedy provides JSON-style results to ensure transparency and comparability.
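Because the summary is plain JSON, it is easy to consume programmatically, e.g. when aggregating scores across language pairs. A minimal sketch (the inline string stands in for reading a `src-tgt_result.json` file; fields mirror the example output below):

```python
import json

# Stand-in for: json.loads(open("testcase/en-de_result.json").read())
raw = ('{"metric_name": "remedy-9B-22", "calibration_score": 0.9029, '
       '"language_pair": "en-de", "segments": 2037}')
result = json.loads(raw)

# Summarize one run in a single line
print(f'{result["metric_name"]} on {result["language_pair"]}: '
      f'{result["calibration_score"]:.3f} over {result["segments"]} segments')
```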
**Example JSON Output**

```json
{
    "metric_name": "remedy-9B-22",
    "raw_score": 4.502863049214531,
    "sigmoid_score": 0.9613502018042875,
    "calibration_score": 0.9029647169507162,
    "calibration_temp": 1.7999999999999998,
    "signature": "metric_name:remedy-9B-22|lp:en-de|ref:yes|version:0.1.1",
    "language_pair": "en-de",
    "source_language": "en",
    "target_language": "de",
    "segments": 2037,
    "version": "0.1.1",
    "args": {
        "src_file": "testcase/en.src",
        "mt_file": "testcase/en-de.hyp",
        "src_lang": "en",
        "tgt_lang": "de",
        "model": "Models/remedy-9B-22",
        "cache_dir": "Models",
        "save_dir": "testcase",
        "ref_file": "testcase/de.ref",
        "no_ref": false,
        "calibrate": true,
        "num_gpus": 4,
        "num_seqs": 256,
        "max_length": 4096,
        "enable_truncate": false,
        "version": false,
        "list_languages": false
    }
}
```

## Full Argument List
```
--src_file          # Path to source file
--mt_file           # Path to MT output file
--src_lang          # Source language code
--tgt_lang          # Target language code
--model             # Model path or HuggingFace ID
--save_dir          # Output directory
--ref_file          # Reference file path
--no_ref            # Reference-free mode
--cache_dir         # Cache directory
--calibrate         # Enable calibration
--num_gpus          # Number of GPUs
--num_seqs          # Number of sequences (default: 256)
--max_length        # Max token length (default: 4096)
--enable_truncate   # Truncate sequences
--version           # Print version
--list_languages    # List supported languages
```

## Model Variants

| Model | Size | Base Model | Ref/QE | Download |
|---|---|---|---|---|
| ReMedy-2B | 2B | Gemma-2-2B | Both | HuggingFace |
| ReMedy-9B-22 | 9B | Gemma-2-9B | Both | HuggingFace |
| ReMedy-9B-23 | 9B | Gemma-2-9B | Both | HuggingFace |
| ReMedy-9B-24 | 9B | Gemma-2-9B | Both | HuggingFace |
We recommend ReMedy-2B and ReMedy-9B-22, as we found that ReMedy-9B-24 tends to produce more compressed score distributions.
## Reproducing WMT Results

Instructions for reproducing the WMT22–24 evaluation:

```bash
# Clone the repository
git clone https://github.com/Smu-Tan/Remedy
cd Remedy

# Install MTME and download WMT data
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
cd ..
python3 -m mt_metrics_eval.mtme --download

# Run the evaluation scripts
sbatch wmt/wmt22.sh
sbatch wmt/wmt23.sh
sbatch wmt/wmt24.sh
```

Results will be comparable with other metrics reported in the WMT shared tasks.
## Citation

If you use ReMedy, please cite the following paper:

```bibtex
@inproceedings{tan-monz-2025-remedy,
    title = "{R}e{M}edy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling",
    author = "Tan, Shaomu and
      Monz, Christof",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.217/",
    doi = "10.18653/v1/2025.emnlp-main.217",
    pages = "4370--4387",
    ISBN = "979-8-89176-332-6"
}
```