This repository contains the source code for the paper "Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking", published in EMNLP 2025.
SemRank is an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query's information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy.
Please refer to our paper for more details.
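The idea of matching a query to papers through shared concepts can be illustrated with a toy sketch. This is not the scoring function from the paper or from this repository: the `Paper` class, `concept_overlap`, `rerank`, the `alpha` weight, and all example data are hypothetical, and the query concepts are hard-coded where SemRank would obtain them from an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical index entry holding multi-granular concepts for one paper."""
    paper_id: str
    topics: set = field(default_factory=set)       # general research topics
    key_phrases: set = field(default_factory=set)  # detailed key phrases


def concept_overlap(query_concepts, paper):
    """Fraction of query concepts that appear among the paper's topics or key phrases."""
    if not query_concepts:
        return 0.0
    paper_concepts = paper.topics | paper.key_phrases
    return len(query_concepts & paper_concepts) / len(query_concepts)


def rerank(base_scores, papers, query_concepts, alpha=0.5):
    """Add a weighted concept-overlap bonus to each base retriever score and sort.

    Assumption: a simple additive combination, chosen only for illustration.
    """
    scored = [
        (p.paper_id, base_scores.get(p.paper_id, 0.0) + alpha * concept_overlap(query_concepts, p))
        for p in papers
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy corpus with made-up concepts.
papers = [
    Paper("p1", {"information retrieval"}, {"dense retrieval", "query expansion"}),
    Paper("p2", {"computer vision"}, {"image segmentation"}),
    Paper("p3", {"information retrieval"}, {"learning to rank"}),
]

# Made-up scores standing in for a dense retriever's output.
base_scores = {"p1": 0.60, "p2": 0.65, "p3": 0.55}

# Stand-in for the core concepts an LLM would identify for the query.
query_concepts = {"information retrieval", "dense retrieval"}

ranking = rerank(base_scores, papers, query_concepts)
print([paper_id for paper_id, _ in ranking])  # ['p1', 'p3', 'p2']
```

In this toy example the base retriever alone would rank p2 first, while the concept bonus moves the two information retrieval papers above it.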
Our experiments use CSFCube, DORISMAE, and LitSearch. The processed versions of CSFCube and DORISMAE are available here, and LitSearch is available from HuggingFace.
Run the following commands to build the semantic index.
# Predict candidate topic labels (GPU needed)
python eval_classifier.py
# Get LLM-assigned topic labels (OpenAI key needed)
python llm-topic.py
# Encode corpus + semantic labels (GPU needed)
python encoding.py
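If you prefer to run the three steps above from a single script, a small driver along the following lines may help. It is only a sketch: `build_commands` and `run_pipeline` are hypothetical helpers that are not part of this repository, the scripts are called without arguments exactly as in the commands above, and the check for the `OPENAI_API_KEY` environment variable assumes that `llm-topic.py` relies on the OpenAI client's default key lookup.

```python
import os
import subprocess
import sys

# Script names and order are taken from the commands above.
STEPS = [
    ("Predict candidate topic labels (GPU needed)", "eval_classifier.py"),
    ("Get LLM-assigned topic labels (OpenAI key needed)", "llm-topic.py"),
    ("Encode corpus + semantic labels (GPU needed)", "encoding.py"),
]


def build_commands(python=sys.executable):
    """Return one argument list per step, in execution order."""
    return [[python, script] for _, script in STEPS]


def run_pipeline(dry_run=True):
    """Print each step and, unless dry_run is set, execute it.

    Execution stops at the first failing step because of check=True.
    """
    # Assumption: the LLM step reads its key from OPENAI_API_KEY.
    if not dry_run and not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set OPENAI_API_KEY before running the LLM labeling step.")
    for (description, _), command in zip(STEPS, build_commands()):
        print(f"# {description}")
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Dry run: only prints the commands that would be executed.
run_pipeline()
```

Call `run_pipeline(dry_run=False)` from the repository root to actually execute the steps.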
By default, our code loads and processes LitSearch with gpt-4.1-mini and specter2. Please check the detailed arguments in eval_classifier.py to switch to a different encoder or LLM, or to load a local corpus.
We provide a topic classifier checkpoint trained on the CSRanking domain using MAPLE. The checkpoint can be downloaded here. Please place it in the ./classifier folder, which also contains the complete label space.
If you want to use semantic indexing in domains other than Computer Science, we recommend looking at the other corpora available from MAPLE and at the text classifier training code from TELEClass, which also supports training a hierarchical text classifier without labeled data.
Please see SemRank.ipynb, which walks through running SemRank step by step.
If you find our work useful for your research, please cite the following paper:
@inproceedings{zhang2025semrank,
title={Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking},
author={Yunyi Zhang and Ruozhen Yang and Siqi Jiao and SeongKu Kang and Jiawei Han},
booktitle={Findings of EMNLP},
year={2025}
}
