
Difficulties reproducing results on Robust 04 #22


Description

@krasserm

This is a follow-up to #21. I tried to reproduce the results on Robust 04 but failed to do so using the code in this repository. In the following, I report my results on test fold f1, obtained in three experiments:

Experiment 1: Use provided CEDR-KNRM weights and .run files.

When evaluating the cedrknrm-robust-f1.run file provided in #18 with

bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.run

I'm getting P@20 = 0.4470 and nDCG@20 = 0.5177. When using a .run file generated with the provided weights cedrknrm-robust-f1.p

python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
  --run data/robust/f1.test.run --model_weights cedrknrm-robust-f1.p --out_path cedrknrm-robust-f1.extra.run

bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.extra.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.extra.run

I'm getting P@20 = 0.4290 and nDCG@20 = 0.5038. I'd expect these metrics to be equal to those of the provided cedrknrm-robust-f1.run file. What is the reason for this difference?
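
To narrow this down, the following sketch is the kind of quick check I'd run to see whether the two files differ only in score scale or also in the actual rankings. It is only a sketch: it assumes the standard 6-column TREC run format (qid Q0 docid rank score tag) and uses the file names from the commands above:

from collections import defaultdict

def read_run(path, depth=20):
    # Collect (score, docid) per query, then keep the `depth` highest-scoring documents.
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run[qid].append((float(score), docid))
    return {qid: {docid for _, docid in sorted(docs, reverse=True)[:depth]}
            for qid, docs in run.items()}

provided = read_run('cedrknrm-robust-f1.run')
regenerated = read_run('cedrknrm-robust-f1.extra.run')

for qid in sorted(provided):
    overlap = len(provided[qid] & regenerated.get(qid, set()))
    if overlap < 20:
        print(f'{qid}: only {overlap}/20 documents shared in the top 20')

If the top-20 sets agree for (almost) all queries, the metric difference must come from the ordering within the top 20; if they don't, the provided run file was probably not produced with exactly these weights or inputs.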

Experiment 2: Train my own BERT and CEDR-KNRM models.

This is where I get results that are far below expectations (only for CEDR-KNRM, not for Vanilla BERT). I started by training and evaluating a Vanilla BERT ranker:

python train.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run --model_out_dir trained_bert
python rerank.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_bert/weights.p --out_path trained_bert/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_bert/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_bert/test.run

I'm getting P@20 = 0.3690 and nDCG@20 = 0.4231, which is consistent with evaluating the provided vbert-robust-f1.run file:

bin/trec_eval -m P.20 data/robust/qrels vbert-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels vbert-robust-f1.run

This gives P@20 = 0.3550 and nDCG@20 = 0.4219, which is quite close. I understand that here I simply ignored the inconsistencies reported in #21, but it is at least a coarse cross-check of model performance on a single fold. When training a CEDR-KNRM model with this BERT model as initialization

python train.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run \
    --initial_bert_weights trained_bert/weights.p --model_out_dir trained_cedr
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_cedr/weights.p --out_path trained_cedr/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_cedr/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_cedr/test.run

I'm getting P@20 = 0.3790 and nDCG@20 = 0.4347. This is slightly better than the Vanilla BERT ranker but far below the performance obtained in Experiment 1. I also repeated Experiment 2 with f1.test.run, f1.valid.run and f1.train.pairs files that I generated myself from Anserini runs with a default BM25 configuration and still got results very close to those above.
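
To rule out a problem with the fold files themselves (both the provided ones and the ones I regenerated from Anserini), a minimal sanity check could look like the sketch below. It assumes the .run files are standard 6-column TREC runs and that the first whitespace-separated token of each line in f1.train.pairs is the query ID:

def qids(path):
    # Query IDs are the first whitespace-separated token of each non-empty line.
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

test = qids('data/robust/f1.test.run')
valid = qids('data/robust/f1.valid.run')
train = qids('data/robust/f1.train.pairs')

# For a valid fold, the three query sets should be pairwise disjoint.
print('test/train overlap:', len(test & train))
print('valid/train overlap:', len(valid & train))
print('test/valid overlap:', len(test & valid))

If all three overlaps are zero for both my regenerated files and the provided data/robust/f1.* files, the gap is unlikely to be caused by the fold definitions themselves.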

Has anyone been able to get results similar to those in Experiment 1 by training a BERT and CEDR-KNRM model as explained in the project's README?

Experiment 3: Use provided vbert-robust-f1.p weights as initialization for CEDR-KNRM training

I ran this experiment in an attempt to debug the performance gap found in the previous experiment. I'm fully aware that training and evaluating a CEDR-KNRM model on fold 1 (i.e. f1) with the provided vbert-robust-f1.p weights is invalid because of the inconsistencies reported in #21: the folds used for training/validating/testing vbert-robust-f1.p differ from those in data/robust/f[1-5]*.
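
To make that mismatch concrete, here is a rough way to quantify it (again only a sketch, with the same format assumptions as above). Assuming the usual 5-fold split over all queries, any query in data/robust/f1.test.run that does not appear in vbert-robust-f1.run, the provided weights' own test fold, would have been a train or validation query for those weights:

def qids(path):
    # Same helper as in the previous sketch: first token of each non-empty line.
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

local_test = qids('data/robust/f1.test.run')
provided_test = qids('vbert-robust-f1.run')

# Queries in the local f1 test fold that are absent from the provided weights'
# own test fold would have been train/valid queries for those weights.
leaked = local_test - provided_test
print(f'{len(leaked)} of {len(local_test)} f1 test queries were likely seen '
      f'during training of the provided vbert-robust-f1.p weights')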

The practical consequence is that validation and evaluation of the trained CEDR-KNRM model are done with queries that were used to train the provided vbert-robust-f1.p, i.e. this setup partially evaluates on training data, which of course inflates the results. I was surprised to see that with this invalid setup I'm able to reproduce the numbers obtained in Experiment 1, or at least come very close. Here's what I did:

python train.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run \
    --initial_bert_weights vbert-robust-f1.p --model_out_dir trained_cedr_invalid
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_cedr_invalid/weights.p --out_path trained_cedr_invalid/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_cedr_invalid/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_cedr_invalid/test.run

With this setup I'm getting a CEDR-KNRM performance of P@20 = 0.4400 and nDCG@20 = 0.5050. Given these results and the inconsistencies reported in #21, I wonder whether the performance of the provided cedrknrm-robust-f[1-5].run files is the result of an invalid CEDR-KNRM training and evaluation setup or, more likely, whether I did something wrong. Any hints appreciated!
