This repository was archived by the owner on Oct 31, 2023. It is now read-only.

Difficulty in achieving similar improvements in FIQA for few-shot learning as reported in table 3 #16

@xhluca


I found Contriever quite interesting based on Table 3 of the paper (few-shot retrieval): Contriever-MSMARCO achieves a score of 38.1 when finetuned on FiQA, which is much higher than BERT-MSMARCO at ~31. The difference is even bigger when comparing Contriever and BERT (the checkpoints that were not first finetuned on MS MARCO), where Contriever achieves a 10-point improvement:

[Screenshot of Table 3 (few-shot retrieval) from the paper]

I've tried a similar setup (similar to DPR), with the following differences:

  1. Trained for 20 epochs instead of 500
  2. AdamW instead of ASAM
  3. Included BM25 hard negatives (i.e. top-ranked results that are not gold labels) in addition to in-batch negative sampling
  4. Batch size of 128 instead of 256 (though the number of negatives should be the same due to the hard negatives)
  5. Instead of early stopping, I trained for the full 20 epochs and saved the checkpoint from the epoch with the best dev NDCG@10
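To make point 3 concrete, here is a minimal sketch of the contrastive (InfoNCE) loss I'm using, where each query is scored against its gold passage, all other in-batch passages, and the BM25 hard negatives. This is NumPy for illustration only; the function name and shapes are my own, not from the Contriever repo:

```python
import numpy as np

def contrastive_loss(q, p, hn, temp=0.05):
    """InfoNCE loss with in-batch negatives plus BM25 hard negatives.
    q:  (B, d) query embeddings
    p:  (B, d) gold passage embeddings (p[i] is the positive for q[i])
    hn: (B*k, d) BM25 hard-negative embeddings, shared across the batch
    """
    cand = np.concatenate([p, hn], axis=0)       # (B + B*k, d) candidate pool
    scores = q @ cand.T / temp                   # (B, B + B*k) scaled similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    # log-softmax over candidates; the positive for query i is column i
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(q)), np.arange(len(q))].mean()
```

With a batch of 128 and the hard negatives appended, each query ends up with the same total number of negatives as a 256 in-batch-only setup, which is why I expected point 4 to be roughly neutral.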

It seems that under those settings, the improvement isn't as large as the difference reported in the paper:

| split | epoch | metric | model_name | learning_rate | k=1 | k=3 | k=5 | k=10 | k=100 | k=1000 |
|-------|-------|--------|------------|---------------|-----|-----|-----|------|-------|--------|
| test | 7 | ndcg | facebook/contriever-msmarco | 1e-05 | 0.24383 | 0.25005 | 0.2608 | 0.28715 | 0.36118 | 0.39975 |
| test | 16 | ndcg | facebook/contriever | 3e-05 | 0.25 | 0.23583 | 0.24952 | 0.2732 | 0.35149 | 0.39019 |
| test | 12 | ndcg | roberta-base | 5e-05 | 0.25309 | 0.22701 | 0.24416 | 0.26293 | 0.33809 | 0.37927 |
| test | 16 | ndcg | bert-base-uncased | 2e-05 | 0.21451 | 0.20465 | 0.21947 | 0.23826 | 0.31088 | 0.35118 |

Note that the NDCG@10 of the Contriever model is 3.49 points higher than bert-base-uncased (I tried learning rates between 1e-5 and 5e-5), which is smaller than the 10.3-point improvement shown in the screenshot (26.1 -> 36.4). I am not surprised that the results themselves are lower given the differences in hyperparameters, but the gap in the deltas surprises me. Is Contriever harder to finetune when using the Adam optimizer? Or are we expected to use a batch size of 256 and/or avoid BM25 hard negatives?
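In case the metric itself is a source of discrepancy: I compute NDCG@k with the linear gain rel / log2(rank + 1), which is, as far as I know, what trec_eval uses. A minimal sketch of my own helper (not from either codebase):

```python
import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k for one query, linear gain rel / log2(rank + 1).
    ranked_ids: doc ids ordered by descending retrieval score
    qrels: {doc_id: graded relevance} for this query
    """
    dcg = sum(qrels.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

The reported numbers are the mean of this per-query value over the FiQA test queries.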

Is it possible to either:

  1. provide the HuggingFace checkpoints of contriever and contriever-msmarco finetuned on FiQA, or
  2. share scripts that let me reproduce the process of finetuning contriever or contriever-msmarco on FiQA and saving the checkpoint as a HuggingFace model?

Thank you!
