SemRank

Source code for the paper "Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking", published in EMNLP 2025.

Overview

SemRank is an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query's information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy.
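
To make the idea of concept-based matching concrete, here is a minimal toy sketch. It is not the repository's implementation: the `Paper` class, `concept_score`, `rank`, and the weights are all hypothetical names introduced for illustration, and exact string overlap stands in for whatever semantic matching the actual code performs. It only shows how a paper indexed with topics and key phrases can be scored against the concepts identified for a query.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical index entry: one paper with multi-granular concepts."""
    paper_id: str
    topics: set = field(default_factory=set)       # general research topics
    key_phrases: set = field(default_factory=set)  # detailed key phrases


def concept_score(query_concepts, paper, topic_weight=1.0, phrase_weight=1.0):
    """Weighted count of query concepts found in the paper's index entry.

    Assumption: exact set overlap replaces real semantic similarity.
    """
    topic_hits = len(query_concepts & paper.topics)
    phrase_hits = len(query_concepts & paper.key_phrases)
    return topic_weight * topic_hits + phrase_weight * phrase_hits


def rank(query_concepts, papers):
    """Return paper ids sorted by descending score, ties broken by id."""
    scored = [(concept_score(query_concepts, p), p.paper_id) for p in papers]
    return [pid for _, pid in sorted(scored, key=lambda x: (-x[0], x[1]))]


papers = [
    Paper("p1", {"information retrieval"}, {"dense retrieval", "query expansion"}),
    Paper("p2", {"computer vision"}, {"image segmentation"}),
    Paper("p3", {"information retrieval"}, {"image segmentation"}),
]
# In SemRank the query concepts come from an LLM; here they are hard-coded.
query_concepts = {"information retrieval", "dense retrieval"}
ranking = rank(query_concepts, papers)
```

In this toy setup, `p1` matches one topic and one key phrase, `p3` matches only the topic, and `p2` matches nothing, so the ranking orders them accordingly.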

Please refer to our paper for more details (paper).

Datasets

Our experiments use CSFCube, DORISMAE, and LitSearch. For CSFCube and DORISMAE we use the processed versions available here; LitSearch is loaded from HuggingFace.

Build Index

Run the following commands to build the semantic index.

# Predict candidate topic labels (GPU needed)
python eval_classifier.py

# Get LLM-assigned topic labels (OpenAI key needed)
python llm-topic.py

# Encode corpus + semantic labels (GPU needed)
python encoding.py

By default, the code loads and processes LitSearch with gpt-4.1-mini and specter2. See the arguments in eval_classifier.py for how to switch to a different encoder or LLM and how to load a local corpus.
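
If you prefer to run the three indexing steps from a single entry point, the following is a minimal sketch of a driver. It assumes the scripts are run from the repository root with their default arguments, exactly as in the commands listed earlier; `build_index` and its `runner` parameter are hypothetical names introduced here and are not part of the repository.

```python
import subprocess
import sys

# Script names taken from the build commands, in the order they are listed.
STEPS = ["eval_classifier.py", "llm-topic.py", "encoding.py"]


def build_index(steps=STEPS, runner=subprocess.run):
    """Run each indexing script in order, stopping at the first failure.

    `runner` is injectable so the function can be exercised without
    actually launching the scripts.
    """
    completed = []
    for script in steps:
        # check=True makes subprocess.run raise CalledProcessError
        # when the script exits with a non-zero status.
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

Calling `build_index()` runs the steps sequentially with the current Python interpreter. The first and third steps still need a GPU and the second still needs an OpenAI key, as noted in the commands.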

We provide a topic classifier checkpoint trained on the CSRanking domain using MAPLE. Download the checkpoint here and place it in the ./classifier folder, which also contains the complete label space.

To apply semantic indexing to domains other than Computer Science, we recommend looking at the other corpora available from MAPLE and at the text classifier training code from TELEClass, which also supports training a hierarchical text classifier without labeled data.

Run SemRank Retrieval

Please see SemRank.ipynb, which walks through running SemRank step by step.

Citations

If you find our work useful for your research, please cite the following paper:

@inproceedings{zhang2025semrank,
    title={Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking},
    author={Yunyi Zhang and Ruozhen Yang and Siqi Jiao and SeongKu Kang and Jiawei Han},
    booktitle={Findings of EMNLP},
    year={2025}
}
