The source code used for paper "Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking", published in Findings of EMNLP 2025.

yzhan238/SemRank

SemRank

The source code for the paper "Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking", published in Findings of EMNLP 2025.

Overview

SemRank is an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query's information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy.
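To make the idea of concept-based matching concrete, here is a minimal sketch that re-scores papers by adding a concept-overlap bonus to a dense similarity score. It is not the SemRank implementation: the `Paper` class, the exact-string concept overlap, the additive scoring, and the `weight` parameter are all assumptions chosen for illustration. See the paper and SemRank.ipynb for the actual method.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical index entry: a dense vector plus multi-granular concepts."""
    paper_id: str
    embedding: list[float]
    topics: set[str] = field(default_factory=set)       # general research topics
    key_phrases: set[str] = field(default_factory=set)  # detailed key phrases


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def concept_overlap(query_concepts, paper):
    """Fraction of the query's concepts that also appear on the paper."""
    if not query_concepts:
        return 0.0
    paper_concepts = paper.topics | paper.key_phrases
    return len(query_concepts & paper_concepts) / len(query_concepts)


def rank(query_embedding, query_concepts, papers, weight=0.5):
    """Rank papers by dense similarity plus a weighted concept-overlap bonus.

    Assumption: a simple additive combination; `weight` is a made-up knob.
    """
    scored = [
        (cosine(query_embedding, p.embedding)
         + weight * concept_overlap(query_concepts, p), p.paper_id)
        for p in papers
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [paper_id for _, paper_id in scored]


# Toy corpus: p1 is slightly closer in embedding space, p2 shares the query's concepts.
papers = [
    Paper("p1", [1.0, 0.0], {"information retrieval"}, {"dense retrieval"}),
    Paper("p2", [0.9, 0.1], {"information retrieval"},
          {"query expansion", "large language models"}),
]
# In SemRank the query concepts would come from an LLM; here they are hard-coded.
query_concepts = {"query expansion", "large language models"}
ranking = rank([1.0, 0.0], query_concepts, papers)
```

In this toy setup, the dense score alone slightly prefers `p1`, while the concept bonus moves `p2` to the top because it carries both of the query's concepts.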

Please refer to our paper for more details.

Datasets

Our experiments use CSFCube, DORISMAE, and LitSearch. We use the processed versions of CSFCube and DORISMAE available here, and LitSearch from HuggingFace.

Build Index

Run the following commands to build the semantic index.

# Predict candidate topic labels (GPU needed)
python eval_classifier.py

# Get LLM-assigned topic labels (OpenAI key needed)
python llm-topic.py

# Encode corpus + semantic labels (GPU needed)
python encoding.py
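The three steps above can also be chained from a small Python wrapper. This is only a convenience sketch: the script names are taken from the commands above, but the wrapper itself, the `dry_run` flag, and the assumption that the OpenAI key is supplied through the `OPENAI_API_KEY` environment variable are not part of the repository. Check llm-topic.py for how the key is actually read.

```python
import os
import subprocess
import sys

# Steps in the order the README lists them: (description, script name).
STEPS = [
    ("Predict candidate topic labels (GPU needed)", "eval_classifier.py"),
    ("Get LLM-assigned topic labels (OpenAI key needed)", "llm-topic.py"),
    ("Encode corpus + semantic labels (GPU needed)", "encoding.py"),
]


def build_commands(python=sys.executable):
    """Return one argv list per step, using default script arguments."""
    return [[python, script] for _, script in STEPS]


def run_pipeline(dry_run=True):
    """Print each step and, unless dry_run is set, execute it in order.

    Assumption: the OpenAI key is provided via OPENAI_API_KEY.
    """
    if not dry_run and not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set OPENAI_API_KEY before running the LLM step.")
    executed = []
    for (description, _), command in zip(STEPS, build_commands()):
        print(f"{description}: {' '.join(command)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)
        executed.append(command)
    return executed


if __name__ == "__main__":
    # Dry run by default; call run_pipeline(dry_run=False) to execute for real.
    run_pipeline(dry_run=True)
```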

By default, our code loads and processes LitSearch with gpt-4.1-mini and specter2. See the arguments in eval_classifier.py for how to switch to a different encoder or LLM and how to load a local corpus.

We provide a topic classifier checkpoint trained on the CSRanking domain using MAPLE. Download the checkpoint here and place it in the ./classifier folder, which also contains the complete label space.

To use semantic indexing in domains other than Computer Science, we recommend looking at the other corpora available in MAPLE and the text classifier training code from TELEClass, which also supports training a hierarchical text classifier without labeled data.

Run SemRank Retrieval

See SemRank.ipynb for a step-by-step walkthrough of running SemRank.

Citations

If you find our work useful for your research, please cite the following paper:

@inproceedings{zhang2025semrank,
    title={Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking},
    author={Yunyi Zhang and Ruozhen Yang and Siqi Jiao and SeongKu Kang and Jiawei Han},
    booktitle={Findings of EMNLP},
    year={2025}
}
