
ReMedy Logo

🚀 ReMedy: Machine Translation Evaluation via Reward Modeling

Learning High-Quality Machine Translation Evaluation from Human Preferences with Reward Modeling



✨ About ReMedy

ReMedy is a new state-of-the-art machine translation (MT) evaluation framework that reframes the task as reward modeling rather than direct regression. Instead of relying on noisy human scores, ReMedy learns from pairwise human preferences, leading to better alignment with human judgments.

  • 📈 State-of-the-art accuracy on WMT22–24 (39 language pairs, 111 systems)
  • ⚖️ Segment- and system-level evaluation, outperforming GPT-4, PaLM-540B, Finetuned-PaLM2, MetricX-13B, and XCOMET
  • 🔍 More robust on low-quality and out-of-domain translations (ACES, MSLC benchmarks)
  • 🧠 Can be used as a reward model in RLHF pipelines to improve MT systems

ReMedy demonstrates that reward modeling with pairwise preferences offers a more reliable and human-aligned approach for MT evaluation.
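The pairwise-preference idea can be illustrated with a minimal, self-contained sketch (plain Python, not ReMedy's actual training code; the function name and reward values are illustrative):

```python
import math

def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry style objective: -log P(preferred ranked above rejected),
    where P is the sigmoid of the reward margin."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that scores the human-preferred translation higher
# incurs a small loss; an inverted ranking is penalized heavily.
good = pairwise_preference_loss(4.0, 1.0)   # correct ranking, wide margin
bad = pairwise_preference_loss(1.0, 4.0)    # inverted ranking
```

Because only the margin between the two translations matters, the model never has to fit absolute (and often noisy) human scores, only relative preferences.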




📦 Quick Installation

ReMedy requires Python ≥ 3.12 and leverages vLLM for fast inference.

✅ Recommended: Install via pip

pip install --upgrade pip
pip install remedy-mt-eval

🛠️ Install from Source

git clone https://github.com/Smu-Tan/Remedy
cd Remedy
pip install -e .

📜 Install via Poetry

git clone https://github.com/Smu-Tan/Remedy
cd Remedy
poetry install

βš™οΈ Requirements

  • Python β‰₯ 3.12
  • transformers <4.54.0
  • vllm == 0.9.2
  • torch β‰₯ 2.6.0
  • (See pyproject.toml for full dependencies)

🚀 Usage

💾 Download ReMedy Models

Before using ReMedy, download a model from the Hugging Face Hub:

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download ShaomuTan/ReMedy-9B-22 --local-dir Models/ReMedy-9B-22

You can replace ReMedy-9B-22 with other variants like ReMedy-9B-23.


🔹 Basic Usage

remedy-score \
    --model ShaomuTan/ReMedy-9B-22 \
    --src_file ./testcase/en.src \
    --mt_file ./testcase/en-de.hyp \
    --ref_file ./testcase/de.ref \
    --src_lang en --tgt_lang de \
    --cache_dir Models \
    --save_dir testcase \
    --num_gpus 4 \
    --calibrate \
    --gpu_memory_utilization 0.9
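The --src_file, --mt_file, and --ref_file arguments presumably take plain-text, line-aligned files — one segment per line, with the i-th lines of source, hypothesis, and reference corresponding to each other, as is conventional for MT metrics. A sketch building a toy test case under that assumption (the file contents are illustrative):

```python
from pathlib import Path

# Build a toy, line-aligned test case: one segment per line,
# same number of lines in the source, hypothesis, and reference files.
Path("testcase").mkdir(exist_ok=True)
Path("testcase/en.src").write_text("Hello world.\nHow are you?\n")
Path("testcase/en-de.hyp").write_text("Hallo Welt.\nWie geht es dir?\n")
Path("testcase/de.ref").write_text("Hallo Welt.\nWie geht es Ihnen?\n")
```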

🔹 Reference-Free Mode (Quality Estimation)

remedy-score \
    --model ShaomuTan/ReMedy-9B-22 \
    --src_file ./testcase/en.src \
    --mt_file ./testcase/en-de.hyp \
    --no_ref \
    --src_lang en --tgt_lang de \
    --cache_dir Models \
    --save_dir ./testcase \
    --num_gpus 4 \
    --calibrate \
    --gpu_memory_utilization 0.9

📄 Output Files

  • src-tgt_raw_scores.txt
  • src-tgt_sigmoid_scores.txt
  • src-tgt_calibration_scores.txt
  • src-tgt_detailed_results.tsv
  • src-tgt_result.json

Inspired by SacreBLEU, ReMedy provides JSON-style results to ensure transparency and comparability.

📘 Example JSON Output
{
  "metric_name": "remedy-9B-22",
  "raw_score": 4.502863049214531,
  "sigmoid_score": 0.9613502018042875,
  "calibration_score": 0.9029647169507162,
  "calibration_temp": 1.7999999999999998,
  "signature": "metric_name:remedy-9B-22|lp:en-de|ref:yes|version:0.1.1",
  "language_pair": "en-de",
  "source_language": "en",
  "target_language": "de",
  "segments": 2037,
  "version": "0.1.1",
  "args": {
    "src_file": "testcase/en.src",
    "mt_file": "testcase/en-de.hyp",
    "src_lang": "en",
    "tgt_lang": "de",
    "model": "Models/remedy-9B-22",
    "cache_dir": "Models",
    "save_dir": "testcase",
    "ref_file": "testcase/de.ref",
    "no_ref": false,
    "calibrate": true,
    "num_gpus": 4,
    "num_seqs": 256,
    "max_length": 4096,
    "enable_truncate": false,
    "version": false,
    "list_languages": false
  }
}
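Downstream tooling can consume this JSON directly; for instance, the signature string decomposes into key:value fields separated by "|". A sketch using a trimmed copy of the example above:

```python
import json

# Trimmed copy of the example result above.
raw = '''{
  "metric_name": "remedy-9B-22",
  "calibration_score": 0.9029647169507162,
  "signature": "metric_name:remedy-9B-22|lp:en-de|ref:yes|version:0.1.1"
}'''
result = json.loads(raw)

# The signature encodes metric, language pair, reference mode, and version,
# so reported scores stay comparable across papers (as with SacreBLEU).
fields = dict(kv.split(":", 1) for kv in result["signature"].split("|"))
```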

βš™οΈ Full Argument List


🔸 Required

--src_file           # Path to source file
--mt_file            # Path to MT output file
--src_lang           # Source language code
--tgt_lang           # Target language code
--model              # Model path or HuggingFace ID
--save_dir           # Output directory

🔸 Optional

--ref_file           # Reference file path
--no_ref             # Reference-free mode
--cache_dir          # Cache directory
--calibrate          # Enable calibration
--num_gpus           # Number of GPUs
--num_seqs           # Number of sequences (default: 256)
--max_length         # Max token length (default: 4096)
--enable_truncate    # Truncate sequences
--version            # Print version
--list_languages     # List supported languages

🧠 Model Variants

| Model        | Size | Base Model | Ref/QE | Download        |
| ------------ | ---- | ---------- | ------ | --------------- |
| ReMedy-2B    | 2B   | Gemma-2-2B | Both   | 🤗 HuggingFace |
| ReMedy-9B-22 | 9B   | Gemma-2-9B | Both   | 🤗 HuggingFace |
| ReMedy-9B-23 | 9B   | Gemma-2-9B | Both   | 🤗 HuggingFace |
| ReMedy-9B-24 | 9B   | Gemma-2-9B | Both   | 🤗 HuggingFace |

We recommend ReMedy-2B and ReMedy-9B-22, as we found that ReMedy-9B-24 tends to produce more compressed score distributions.


πŸ” Reproducing WMT Results


1. Clone ReMedy repo

git clone https://github.com/Smu-Tan/Remedy
cd Remedy

2. Install mt-metrics-eval

# Install MTME and download WMT data
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
cd ..
python3 -m mt_metrics_eval.mtme --download

3. Run ReMedy on WMT data

sbatch wmt/wmt22.sh
sbatch wmt/wmt23.sh
sbatch wmt/wmt24.sh

📄 Results will be comparable with other metrics reported in the WMT shared tasks.


📚 Citation

If you use ReMedy, please cite the following paper:

@inproceedings{tan-monz-2025-remedy,
    title = "{R}e{M}edy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling",
    author = "Tan, Shaomu  and
      Monz, Christof",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.217/",
    doi = "10.18653/v1/2025.emnlp-main.217",
    pages = "4370--4387",
    ISBN = "979-8-89176-332-6"
}

About

[EMNLP 2025] ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
