
Difficulties reproducing results on Robust 04 #22


Description

@krasserm

This is a follow-up to #21. I tried to reproduce the results on Robust 04 but failed to do so using the code in this repository. In the following, I report my results on test fold f1, obtained in three experiments:

Experiment 1: Use provided CEDR-KNRM weights and .run files.

When evaluating the cedrknrm-robust-f1.run file provided in #18 with

bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.run

I'm getting P@20 = 0.4470 and nDCG@20 = 0.5177. When using a .run file generated with the provided weights cedrknrm-robust-f1.p

python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
  --run data/robust/f1.test.run --model_weights cedrknrm-robust-f1.p --out_path cedrknrm-robust-f1.extra.run

bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.extra.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.extra.run

I'm getting P@20 = 0.4290 and nDCG@20 = 0.5038. I'd expect these metrics to be equal to those of the provided cedrknrm-robust-f1.run file. What is the reason for this difference?
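
To narrow this down, the following sketch is the kind of quick check I'd run to see whether the two files differ only in score scale or also in the actual rankings. It is only a sketch: it assumes the standard 6-column TREC run format (qid Q0 docid rank score tag) and uses the file names from the commands above:

from collections import defaultdict

def read_run(path, depth=20):
    # Collect (score, docid) per query, then keep the `depth` highest-scoring documents.
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run[qid].append((float(score), docid))
    return {qid: {docid for _, docid in sorted(docs, reverse=True)[:depth]}
            for qid, docs in run.items()}

provided = read_run('cedrknrm-robust-f1.run')
regenerated = read_run('cedrknrm-robust-f1.extra.run')

for qid in sorted(provided):
    overlap = len(provided[qid] & regenerated.get(qid, set()))
    if overlap < 20:
        print(f'{qid}: only {overlap}/20 documents shared in the top 20')

If the top-20 sets agree for (almost) all queries, the metric difference must come from the ordering within the top 20; if they don't, the provided run file was probably not produced with exactly these weights or inputs.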

Experiment 2: Train my own BERT and CEDR-KNRM models.

This is where I get results that are far below expectations (only for CEDR-KNRM, not for Vanilla BERT). I started by training and evaluating a Vanilla BERT ranker:

python train.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run --model_out_dir trained_bert
python rerank.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_bert/weights.p --out_path trained_bert/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_bert/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_bert/test.run

I'm getting P@20 = 0.3690 and nDCG@20 = 0.4231, which is consistent with evaluating the provided vbert-robust-f1.run file:

bin/trec_eval -m P.20 data/robust/qrels vbert-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels vbert-robust-f1.run

This gives P@20 = 0.3550 and nDCG@20 = 0.4219, which is quite close. I understand that here I simply ignored the inconsistencies reported in #21, but it is at least a coarse cross-check of model performance on a single fold. When training a CEDR-KNRM model with this BERT model as initialization

python train.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run \
    --initial_bert_weights trained_bert/weights.p --model_out_dir trained_cedr
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_cedr/weights.p --out_path trained_cedr/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_cedr/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_cedr/test.run

I'm getting P@20 = 0.3790 and nDCG@20 = 0.4347. This is slightly better than the Vanilla BERT ranker but far below the performance obtained in Experiment 1. I also repeated Experiment 2 with f1.test.run, f1.valid.run and f1.train.pairs files that I generated myself from Anserini runs with a default BM25 configuration and still got results very close to those above.
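
To rule out a problem with the fold files themselves (both the provided ones and the ones I regenerated from Anserini), a minimal sanity check could look like the sketch below. It assumes the .run files are standard 6-column TREC runs and that the first whitespace-separated token of each line in f1.train.pairs is the query ID:

def qids(path):
    # Query IDs are the first whitespace-separated token of each non-empty line.
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

test = qids('data/robust/f1.test.run')
valid = qids('data/robust/f1.valid.run')
train = qids('data/robust/f1.train.pairs')

# For a valid fold, the three query sets should be pairwise disjoint.
print('test/train overlap:', len(test & train))
print('valid/train overlap:', len(valid & train))
print('test/valid overlap:', len(test & valid))

If all three overlaps are zero for both my regenerated files and the provided data/robust/f1.* files, the gap is unlikely to be caused by the fold definitions themselves.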

Has anyone been able to get results similar to those in Experiment 1 by training a BERT and CEDR-KNRM model as explained in the project's README?

Experiment 3: Use provided vbert-robust-f1.p weights as initialization for CEDR-KNRM training

I ran this experiment in an attempt to debug the performance gap found in the previous experiment. I'm fully aware that training and evaluating a CEDR-KNRM model on fold 1 (i.e. f1) with the provided vbert-robust-f1.p weights is invalid because of the inconsistencies reported in #21: the folds used for training/validating/testing vbert-robust-f1.p differ from those in data/robust/f[1-5]*.
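
To make that mismatch concrete, here is a rough way to quantify it (again only a sketch, with the same format assumptions as above). Assuming the usual 5-fold split over all queries, any query in data/robust/f1.test.run that does not appear in vbert-robust-f1.run, the provided weights' own test fold, would have been a train or validation query for those weights:

def qids(path):
    # Same helper as in the previous sketch: first token of each non-empty line.
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

local_test = qids('data/robust/f1.test.run')
provided_test = qids('vbert-robust-f1.run')

# Queries in the local f1 test fold that are absent from the provided weights'
# own test fold would have been train/valid queries for those weights.
leaked = local_test - provided_test
print(f'{len(leaked)} of {len(local_test)} f1 test queries were likely seen '
      f'during training of the provided vbert-robust-f1.p weights')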

The practical consequence is that validation and evaluation of the trained CEDR-KNRM model are done with queries that were used to train the provided vbert-robust-f1.p, i.e. this setup partially evaluates on training data, which of course inflates the results. I was surprised to see that with this invalid setup I'm able to reproduce the numbers obtained in Experiment 1, or at least come very close. Here's what I did:

python train.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run \
    --initial_bert_weights vbert-robust-f1.p --model_out_dir trained_cedr_invalid
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_cedr_invalid/weights.p --out_path trained_cedr_invalid/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_cedr_invalid/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_cedr_invalid/test.run

With this setup I'm getting a CEDR-KNRM performance of P@20 = 0.4400 and nDCG@20 = 0.5050. Given these results and the inconsistencies reported in #21, I wonder whether the performance of the provided cedrknrm-robust-f[1-5].run files is the result of an invalid CEDR-KNRM training and evaluation setup or, more likely, whether I did something wrong. Any hints appreciated!
