Description
This is a follow-up to #21. I tried to reproduce the results on Robust04 but failed to do so using the code in this repository. In the following I report my results on test fold f1, obtained in 3 experiments:
Experiment 1: Use provided CEDR-KNRM weights and .run files.
When evaluating the provided cedrknrm-robust-f1.run file in #18 with
bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.run

I'm getting P@20 = 0.4470 and nDCG@20 = 0.5177. When using a .run file generated with the provided weights cedrknrm-robust-f1.p:
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
--run data/robust/f1.test.run --model_weights cedrknrm-robust-f1.p --out_path cedrknrm-robust-f1.extra.run
bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.extra.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.extra.run

I'm getting P@20 = 0.4290 and nDCG@20 = 0.5038. I'd expect these metrics to be equal to those of the provided cedrknrm-robust-f1.run file. What is the reason for this difference?
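In case it helps with debugging, here is a small, self-contained Python sketch (my own, not part of this repo) that I'd use to quantify how far the two run files actually diverge per query, e.g. by top-20 overlap. If the overlap is close to 1.0 and only the scores differ slightly, the gap could just be nondeterminism during reranking; if whole result lists differ, something else is going on.

from collections import defaultdict

# Read a TREC run file (qid Q0 docid rank score tag) into {qid: {docid: score}}.
def read_run(path):
    run = defaultdict(dict)
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run[qid][docid] = float(score)
    return run

# Print the fraction of shared top-k documents for every query present in both runs.
def compare_runs(path_a, path_b, k=20):
    run_a, run_b = read_run(path_a), read_run(path_b)
    for qid in sorted(run_a.keys() & run_b.keys()):
        top_a = set(sorted(run_a[qid], key=run_a[qid].get, reverse=True)[:k])
        top_b = set(sorted(run_b[qid], key=run_b[qid].get, reverse=True)[:k])
        print(f"{qid}\ttop-{k} overlap: {len(top_a & top_b) / k:.2f}")

compare_runs("cedrknrm-robust-f1.run", "cedrknrm-robust-f1.extra.run")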
Experiment 2: Train my own BERT and CEDR-KNRM models.
This is where I'm getting results that are far below the expected ones (only for CEDR-KNRM, not for Vanilla BERT). I started by training and evaluating a Vanilla BERT ranker:
python train.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
--qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run --model_out_dir trained_bert
python rerank.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
--run data/robust/f1.test.run --model_weights trained_bert/weights.p --out_path trained_bert/test.run
bin/trec_eval -m P.20 data/robust/qrels trained_bert/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_bert/test.run

I'm getting P@20 = 0.3690 and nDCG@20 = 0.4231, which is consistent with evaluating the provided vbert-robust-f1.run file:
bin/trec_eval -m P.20 data/robust/qrels vbert-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels vbert-robust-f1.run

This gives P@20 = 0.3550 and nDCG@20 = 0.4219, which comes quite close. I understand that here I simply ignored the inconsistencies reported in #21, but it is at least a coarse cross-check of model performance on a single fold. When training a CEDR-KNRM model with this BERT model as initialization:
python train.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
--qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run \
--initial_bert_weights trained_bert/weights.p --model_out_dir trained_cedr
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
--run data/robust/f1.test.run --model_weights trained_cedr/weights.p --out_path trained_cedr/test.run
bin/trec_eval -m P.20 data/robust/qrels trained_cedr/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_cedr/test.run

I'm getting P@20 = 0.3790 and nDCG@20 = 0.4347. This is slightly better than a Vanilla BERT ranker but far below the performance obtained in Experiment 1. I also repeated Experiment 2 with f1.test.run, f1.valid.run and f1.train.pairs files that I generated myself from Anserini runs with a default BM25 configuration and still get results very close to those above.
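For anyone trying the same thing, a minimal sketch of how fold-specific files can be derived from a full Anserini BM25 run is shown below. The bm25.robust04.run and *.qids file names (one query ID per line) are made up for illustration, and I'm assuming train_pairs is simply whitespace-separated qid/docid lines, so treat this as a sketch rather than the repo's canonical tooling.

# Load a set of query IDs (one per line) for a fold split.
def load_qids(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# Keep only the TREC run lines whose query belongs to the given split.
def filter_run(run_path, qids, out_path):
    with open(run_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.split()[0] in qids:
                fout.write(line)

# Emit "qid docid" candidate pairs for the training queries.
def write_train_pairs(run_path, qids, out_path):
    with open(run_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            qid, _, docid = line.split()[:3]
            if qid in qids:
                fout.write(f"{qid} {docid}\n")

filter_run("bm25.robust04.run", load_qids("f1.test.qids"), "data/robust/f1.test.run")
filter_run("bm25.robust04.run", load_qids("f1.valid.qids"), "data/robust/f1.valid.run")
write_train_pairs("bm25.robust04.run", load_qids("f1.train.qids"), "data/robust/f1.train.pairs")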
Has anyone been able to get results similar to those in Experiment 1 by training BERT and CEDR-KNRM models as explained in the project's README?
Experiment 3: Use provided vbert-robust-f1.p weights as initialization to CEDR-KNRM training
I ran this experiment in an attempt to debug the performance gap found in the previous experiment. I'm fully aware that training and evaluating a CEDR-KNRM model on fold 1 (i.e. f1) with the provided vbert-robust-f1.p is invalid because of the inconsistencies reported in #21: the folds used for training/validating/testing vbert-robust-f1.p differ from those in data/robust/f[1-5]*.
In other words, validation and evaluation of the trained CEDR-KNRM model is done with queries that have been used for training the provided vbert-robust-f1.p. So this setup partially uses training data for evaluation, which of course gives better evaluation results (a quick check of this overlap with the available run files is sketched at the end of this post). I was surprised to see that with this invalid setup I'm able to reproduce the numbers obtained in Experiment 1, or at least come very close. Here's what I did:
python train.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
--qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run \
--initial_bert_weights vbert-robust-f1.p --model_out_dir trained_cedr_invalid
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
--run data/robust/f1.test.run --model_weights trained_cedr_invalid/weights.p --out_path trained_cedr_invalid/test.run
bin/trec_eval -m P.20 data/robust/qrels trained_cedr_invalid/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_cedr_invalid/test.run

With this setup I'm getting a CEDR-KNRM performance of P@20 = 0.4400 and nDCG@20 = 0.5050. Given these results and the inconsistencies reported in #21, I wonder if the performance of the provided cedrknrm-robust-f[1-5].run files is the result of an invalid CEDR-KNRM training and evaluation setup or, more likely, if I did something wrong? Any hints appreciated!
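As a side note on the overlap behind Experiment 3: with the files that are actually available, a quick consistency check between the repo's fold 1 test split and the provided fold 1 run could look like the following sketch (assuming vbert-robust-f1.run covers exactly the queries of the checkpoint's test fold).

# Collect the query IDs occurring in a TREC run file.
def qids_in_run(path):
    with open(path) as f:
        return {line.split()[0] for line in f}

repo_f1 = qids_in_run("data/robust/f1.test.run")
provided_f1 = qids_in_run("vbert-robust-f1.run")

print(f"only in data/robust/f1.test.run: {len(repo_f1 - provided_f1)}")
print(f"only in vbert-robust-f1.run:     {len(provided_f1 - repo_f1)}")
print(f"in both:                         {len(repo_f1 & provided_f1)}")

Under 5-fold cross-validation, any data/robust/f1 test query that does not appear in the provided run must have been in one of the checkpoint's training folds, which is exactly the leak described above.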