This repository was archived by the owner on Oct 31, 2023. It is now read-only.

Difficulty in achieving similar improvements in FIQA for few-shot learning as reported in table 3 #16

@xhluca


I found Contriever quite interesting based on Table 3 of the paper (few-shot retrieval): Contriever-MSMARCO achieves a score of 38.1 when finetuned on FiQA, which is much higher than BERT-MSMARCO at ~31. The difference is even bigger when comparing Contriever and BERT (the checkpoints that were not first finetuned on MS MARCO), where Contriever achieves a 10-point improvement:

[Screenshot of Table 3 (few-shot retrieval) from the paper]

I've tried a similar setup (similar to DPR), with the following differences:

  1. Trained for 20 epochs instead of 500
  2. AdamW instead of ASAM
  3. Included BM25 hard negatives (i.e. top-ranked results that are not gold labels) in addition to in-batch negative sampling
  4. Batch size of 128 instead of 256 (though the number of negatives should be the same due to the hard negatives)
  5. Instead of early stopping, I trained for the full 20 epochs and saved the checkpoint from the epoch with the best dev NDCG@10
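To make point 3 concrete, here is a minimal sketch of the contrastive (InfoNCE) loss I'm using, where each query is scored against its gold passage, all other in-batch passages, and the BM25 hard negatives. This is NumPy for illustration only; the function name and shapes are my own, not from the Contriever repo:

```python
import numpy as np

def contrastive_loss(q, p, hn, temp=0.05):
    """InfoNCE loss with in-batch negatives plus BM25 hard negatives.
    q:  (B, d) query embeddings
    p:  (B, d) gold passage embeddings (p[i] is the positive for q[i])
    hn: (B*k, d) BM25 hard-negative embeddings, shared across the batch
    """
    cand = np.concatenate([p, hn], axis=0)       # (B + B*k, d) candidate pool
    scores = q @ cand.T / temp                   # (B, B + B*k) scaled similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    # log-softmax over candidates; the positive for query i is column i
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(q)), np.arange(len(q))].mean()
```

With a batch of 128 and the hard negatives appended, each query ends up with the same total number of negatives as a 256 in-batch-only setup, which is why I expected point 4 to be roughly neutral.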

It seems that under those settings, the improvement isn't as large as the difference reported in the paper:

| split | epoch | metric | model_name | learning_rate | k=1 | k=3 | k=5 | k=10 | k=100 | k=1000 |
|-------|-------|--------|------------|---------------|-----|-----|-----|------|-------|--------|
| test | 7 | ndcg | facebook/contriever-msmarco | 1e-05 | 0.24383 | 0.25005 | 0.2608 | 0.28715 | 0.36118 | 0.39975 |
| test | 16 | ndcg | facebook/contriever | 3e-05 | 0.25 | 0.23583 | 0.24952 | 0.2732 | 0.35149 | 0.39019 |
| test | 12 | ndcg | roberta-base | 5e-05 | 0.25309 | 0.22701 | 0.24416 | 0.26293 | 0.33809 | 0.37927 |
| test | 16 | ndcg | bert-base-uncased | 2e-05 | 0.21451 | 0.20465 | 0.21947 | 0.23826 | 0.31088 | 0.35118 |

Note that the NDCG@10 of the Contriever model is 3.49 points higher than bert-base-uncased (I tried learning rates between 1e-5 and 5e-5), which is smaller than the 10.3-point improvement shown in the screenshot (26.1 -> 36.4). I am not surprised that the results themselves are lower given the differences in hyperparameters, but the gap in the deltas surprises me. Is Contriever harder to finetune when using the Adam optimizer? Or are we expected to use a batch size of 256 and/or avoid BM25 hard negatives?
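In case the metric itself is a source of discrepancy: I compute NDCG@k with the linear gain rel / log2(rank + 1), which is, as far as I know, what trec_eval uses. A minimal sketch of my own helper (not from either codebase):

```python
import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k for one query, linear gain rel / log2(rank + 1).
    ranked_ids: doc ids ordered by descending retrieval score
    qrels: {doc_id: graded relevance} for this query
    """
    dcg = sum(qrels.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

The reported numbers are the mean of this per-query value over the FiQA test queries.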

Is it possible to either:

  1. provide the HuggingFace checkpoints of contriever and contriever-msmarco finetuned on FiQA, or
  2. share scripts that let me reproduce the process of finetuning contriever or contriever-msmarco on FiQA and saving the checkpoint as a HuggingFace model?

Thank you!
