This project explores and benchmarks multiple retrieval techniques on the HotpotQA corpus. The goal was to maximize retrieval quality, measured by mean nDCG@10, and to improve on standard dense retrievers and cross-encoder rerankers. After iterative experimentation with dense retrieval, BM25, and reranking models, the best performance came from a LambdaRank-based ranker trained on a custom feature set that includes token counts and scores from other reranking models.
- Dense Retriever: Vector-based retrieval using pretrained (bge-large-en-v1.5) embeddings.
- Dense Retriever + Cross Encoder: Dense top-50 retrieval followed by cross-encoder (bge-reranker-large) reranking.
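The baseline pipelines above can be sketched in two small functions. This is a minimal, runnable NumPy illustration of the pipeline shape only: in the real system the document and query vectors come from bge-large-en-v1.5 and the pair scores from bge-reranker-large, while here they are plain arrays.

```python
import numpy as np

def dense_top_k(query_vec, doc_vecs, k=50):
    """Stage 1: cosine-similarity retrieval over precomputed embeddings.

    query_vec / doc_vecs stand in for bge-large-en-v1.5 embeddings.
    """
    sims = doc_vecs @ query_vec
    sims = sims / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    top = np.argsort(-sims)[:k]
    return top, sims[top]

def rerank(candidates, pair_scores):
    """Stage 2: reorder the dense top-k by cross-encoder scores.

    pair_scores stands in for bge-reranker-large run on (query, doc) pairs.
    """
    order = np.argsort(-np.asarray(pair_scores))
    return [candidates[i] for i in order]
```

The key design point is that the expensive cross-encoder only scores the k candidates the cheap dense stage surfaces, not the whole corpus.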
The final system achieved superior results through a two-stage pipeline:
- Retrieve top 50 candidate documents using a dense retriever.
- Use a LambdaRank model to optimize ranking based on multiple informative features:
  - BM25 Score: lexical similarity between query and document.
  - Cross Encoder Score: semantic similarity between query and document.
  - LLM Score: contextual relevance estimated by a large language model (Mistral-7B-Instruct-v0.3.Q8_0).
  - Document Length: number of tokens in the document.
  - Query Length: number of tokens in the query.
- The model was trained on 200,000 query-document pairs in total: 10,000 queries × 20 documents retrieved per query by the dense retriever.
- The model reranks the top 50 candidates to produce the final top 10 ranking.
- The model was evaluated using the first 4000 queries from the HotpotQA validation set.
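The LambdaRank idea behind the reranker can be shown in a compact form. In practice one would typically train a gradient-boosted ranker (e.g. LightGBM's `lambdarank` objective) on the five features; the sketch below instead implements the core lambda computation for a linear scorer on toy data, purely to illustrate how pairwise |ΔnDCG|-weighted gradients drive the ranking objective. Feature column order and all data here are illustrative assumptions.

```python
import numpy as np

def dcg_at_k(rels, k=10):
    rels = np.asarray(rels, dtype=float)[:k]
    return float(np.sum((2.0 ** rels - 1.0) / np.log2(np.arange(2, rels.size + 2))))

def ndcg_at_k(rels_in_ranked_order, k=10):
    ideal = dcg_at_k(sorted(rels_in_ranked_order, reverse=True), k)
    return dcg_at_k(rels_in_ranked_order, k) / ideal if ideal > 0 else 0.0

def lambdarank_step(X, rels, w, lr=0.1):
    """One LambdaRank gradient step for a linear scorer over feature rows X.

    Columns of X would hold e.g. BM25 score, cross-encoder score, LLM score,
    document length, query length (illustrative order, not the actual setup).
    """
    scores = X @ w
    order = np.argsort(-scores, kind="stable")      # current ranking
    pos = np.empty(len(order), dtype=int)
    pos[order] = np.arange(len(order))              # position of each doc
    base = ndcg_at_k(rels[order])
    grad = np.zeros_like(w)
    for i in range(len(rels)):
        for j in range(len(rels)):
            if rels[i] <= rels[j]:
                continue                            # only pairs where i > j in relevance
            swapped = order.copy()                  # |nDCG change| if i and j swapped
            swapped[pos[i]], swapped[pos[j]] = j, i
            delta = abs(ndcg_at_k(rels[swapped]) - base)
            rho = 1.0 / (1.0 + np.exp(scores[i] - scores[j]))
            grad += rho * delta * (X[i] - X[j])     # push i above j, weighted by delta
    return w + lr * grad
```

Because the pairwise gradients are weighted by the nDCG swap delta, the model spends its capacity on mistakes near the top of the ranking, which is exactly what nDCG@10 rewards.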
| Model | Mean nDCG@10 |
|---|---|
| Dense Retriever | 0.86235 |
| Dense Retriever + Cross Encoder | 0.93665 |
| Dense Retriever + LambdaRank Reranker | 0.94159 |
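The metric in the table is a per-query average. A sketch of how mean nDCG@10 can be computed with scikit-learn's `ndcg_score` (the function and array names here are illustrative, not the project's evaluation code):

```python
import numpy as np
from sklearn.metrics import ndcg_score

def mean_ndcg_at_10(all_labels, all_scores):
    """all_labels[q] and all_scores[q] are aligned per-document arrays for query q."""
    return float(np.mean([
        ndcg_score([labels], [scores], k=10)   # nDCG@10 for one query
        for labels, scores in zip(all_labels, all_scores)
    ]))
```

Averaging per query (rather than pooling all pairs) is what makes the metric comparable across systems that retrieve different candidate sets.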
- Combining semantic, lexical, and structural features significantly improves retrieval quality.
- LambdaRank provides a flexible framework for leveraging diverse signals without retraining large encoder models.
- Perform an ablation study to investigate which features are truly significant.
- Incorporate multi-hop retrieval signals to better handle reasoning chains.
- Experiment with pairwise LLM preference data for improved LambdaRank supervision.
- Extend the pipeline to end-to-end QA generation using retrieved contexts.