QA Retrieval

This project explores and benchmarks multiple retrieval techniques on the HotpotQA corpus. The goal was to maximize retrieval quality, measured by mean nDCG@10, and in particular to improve on standard dense retrievers and cross-encoder rerankers. After iterative experimentation with dense retrieval, BM25, and reranking models, the best performance came from a LambdaRank-based ranker trained on a custom feature set that includes token counts and scores from other reranking models.

Methodology

1. Baselines

  • Dense Retriever: Vector-based retrieval using pretrained embeddings (bge-large-en-v1.5).
  • Dense Retriever + Cross Encoder: Dense top-50 retrieval followed by cross-encoder (bge-reranker-large) reranking. A minimal sketch of both baselines follows.
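
The sketch below shows how the two baselines could be wired up, assuming the models are loaded through sentence-transformers from their Hugging Face checkpoints; the repo's actual loading and batching code may differ, and the function names here are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
reranker = CrossEncoder("BAAI/bge-reranker-large")

def dense_top_k(query: str, docs: list[str], k: int = 50) -> list[int]:
    """Baseline 1: rank documents by cosine similarity of dense embeddings."""
    doc_emb = encoder.encode(docs, normalize_embeddings=True)
    q_emb = encoder.encode(query, normalize_embeddings=True)
    scores = doc_emb @ q_emb  # cosine similarity, since embeddings are unit-norm
    return np.argsort(-scores)[:k].tolist()

def cross_encoder_rerank(query: str, docs: list[str], candidates: list[int]) -> list[int]:
    """Baseline 2: rescore the dense top-50 with the cross-encoder."""
    pairs = [(query, docs[i]) for i in candidates]
    ce_scores = reranker.predict(pairs)
    return [candidates[i] for i in np.argsort(-ce_scores)]
```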

2. Proposed System

The final system achieved superior results through a two-stage pipeline:

Stage 1: Dense Retrieval

  • Retrieve the top 50 candidate documents using the dense retriever from the baseline (bge-large-en-v1.5).

Stage 2: LambdaRank Reranking

  • Use a LambdaRank model to optimize ranking based on multiple informative features:
    • BM25 Score: lexical similarity between query and document.
    • Cross Encoder Score: semantic similarity between query and document.
    • LLM Score: contextual relevance estimated by a large language model (Mistral-7B-Instruct-v0.3.Q8_0).
    • Document Length: number of tokens in the document.
    • Query Length: number of tokens in the query.
  • The model is trained on 200,000 query-document pairs in total: 10,000 queries × 20 documents retrieved per query by the dense retriever.
  • At inference time, the model reranks the top 50 candidates to produce the final top 10 ranking (see the sketches after this list).
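
A sketch of the Stage 2 feature assembly and ranker follows, assuming rank_bm25 for the lexical score and LightGBM's lambdarank objective. The cross-encoder and LLM scores are treated as precomputed inputs, and file names such as train_features.npy are hypothetical placeholders, not the repo's actual API.

```python
import numpy as np
import lightgbm as lgb
from rank_bm25 import BM25Okapi

def build_features(query: str, docs: list[str], ce_scores, llm_scores) -> np.ndarray:
    """One 5-dimensional feature row per candidate document."""
    bm25 = BM25Okapi([d.split() for d in docs])
    bm25_scores = bm25.get_scores(query.split())
    q_len = len(query.split())
    return np.array([
        [bm25_scores[i],        # BM25: lexical similarity
         ce_scores[i],          # cross-encoder: semantic similarity
         llm_scores[i],         # LLM-estimated contextual relevance
         len(docs[i].split()),  # document length in tokens
         q_len]                 # query length in tokens
        for i in range(len(docs))
    ])

# Training data: 10,000 queries x 20 dense-retrieved candidates each.
X_train = np.load("train_features.npy")  # shape (200_000, 5), hypothetical file
y_train = np.load("train_labels.npy")    # relevance label per pair, hypothetical file
ranker = lgb.LGBMRanker(objective="lambdarank", metric="ndcg", n_estimators=500)
ranker.fit(X_train, y_train, group=[20] * 10_000)

def rerank_top_k(features: np.ndarray, candidates: list[int], k: int = 10) -> list[int]:
    """Score the dense top-50 candidates and keep the final top 10."""
    order = np.argsort(-ranker.predict(features))[:k]
    return [candidates[i] for i in order]
```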
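The LLM score feature could be produced with llama-cpp-python and the quantized GGUF checkpoint; the prompt wording and the 0-10 scale below are illustrative assumptions, not the repo's exact setup.

```python
from llama_cpp import Llama

llm = Llama(model_path="Mistral-7B-Instruct-v0.3.Q8_0.gguf", n_ctx=4096, verbose=False)

def llm_relevance_score(query: str, doc: str) -> float:
    """Ask the model to grade query-document relevance as a single number."""
    prompt = (
        "[INST] Rate how relevant the document is to the query on a scale "
        "from 0 to 10. Reply with a single number.\n"
        f"Query: {query}\nDocument: {doc} [/INST]"
    )
    out = llm(prompt, max_tokens=4, temperature=0.0)
    text = out["choices"][0]["text"].strip()
    try:
        return float(text.split()[0])
    except (ValueError, IndexError):
        return 0.0  # fall back when the model replies non-numerically
```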

Results

  • All models were evaluated on the first 4,000 queries of the HotpotQA validation set; a sketch of the metric computation follows the results table.
Model                                    Mean nDCG@10
Dense Retriever                          0.86235
Dense Retriever + Cross Encoder          0.93665
Dense Retriever + LambdaRank Reranker    0.94159
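
For reference, mean nDCG@10 can be computed per query with scikit-learn's ndcg_score as sketched below; the repo's actual evaluation code may differ.

```python
import numpy as np
from sklearn.metrics import ndcg_score

def mean_ndcg_at_10(true_relevance: list[np.ndarray], pred_scores: list[np.ndarray]) -> float:
    """Average nDCG@10 over queries; each list entry holds one query's
    relevance labels and model scores over its candidate documents."""
    return float(np.mean([
        ndcg_score([t], [p], k=10) for t, p in zip(true_relevance, pred_scores)
    ]))
```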

Key Insights

  • Combining semantic, lexical, and structural features significantly improves retrieval quality.
  • LambdaRank provides a flexible framework for leveraging diverse signals without retraining large encoder models.

Future Work

  • Perform an ablation study to investigate which features are truly significant.
  • Incorporate multi-hop retrieval signals to better handle reasoning chains.
  • Experiment with pairwise LLM preference data for improved LambdaRank supervision.
  • Extend the pipeline to end-to-end QA generation using retrieved contexts.
