Description
Hi @AkariAsai and team,
First, thank you for your incredible work on this project!
I've run into a small technical question at the final inference step. Here is the process:
Steps to Reproduce:
1. Generate the input data for Self-CRAG (`pubqa_selfcrag_cleaned.json`) using the CRAG project's scripts.
2. Run `retrieval_lm/run_short_form.py` with this data.
3. The script runs successfully for most of the data but eventually crashes.
Observation:
The crash is always preceded by warnings from vllm, such as:
`WARNING ... Input prompt (5238 tokens) is too long and exceeds limit of 4096`
The final error is an `IndexError: list index out of range` inside the `call_model_rerank_w_scores_batch` function. My analysis suggests this happens because vllm returns an empty `logprobs` list for these oversized inputs, and the code indexes into it without a safety check.
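For clarity, the missing check I have in mind is roughly the following. This is a sketch only, assuming vllm's standard `RequestOutput` layout (`outputs[0].logprobs` is a per-generated-token list); the helper name is purely illustrative:

```python
# Sketch of a guard before indexing into vllm logprobs.
# Assumes `pred` is a vllm RequestOutput; helper name is illustrative only.
def first_token_logprobs(pred):
    """Return the logprob dict of the first generated token, or None when
    vllm produced no tokens (e.g. the prompt exceeded the context limit)."""
    if not pred.outputs or not pred.outputs[0].logprobs:
        return None
    return pred.outputs[0].logprobs[0]
```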
I was able to work around this by patching my local code to skip these instances.
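Concretely, my patch amounts to something like the following pre-filter (a sketch, not the exact code: `build_prompt` is a hypothetical stand-in for however `run_short_form.py` assembles the final prompt, and the 4096 limit is taken from the warning above):

```python
# Sketch: skip instances whose formatted prompt would exceed the model's
# context window before they ever reach vllm.
from transformers import AutoTokenizer

MAX_PROMPT_TOKENS = 4096  # limit reported in the vllm warning
tokenizer = AutoTokenizer.from_pretrained("selfrag/selfrag_llama2_7b")

def fits_in_context(instance, build_prompt) -> bool:
    # `build_prompt` is a hypothetical stand-in for the repo's prompt formatting.
    prompt = build_prompt(instance)
    return len(tokenizer(prompt)["input_ids"]) <= MAX_PROMPT_TOKENS

# kept = [ex for ex in data if fits_in_context(ex, build_prompt)]
```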
This leads to my main question about the intended methodology:
For the original experiments, what was the standard procedure for handling these few samples that exceed the model's context length?
I was considering a few possibilities:
a) Pre-filtering: Were these oversized samples identified and removed from the test set before the final inference run?
b) Truncation: Were the prompts simply truncated to the 4096-token limit on-the-fly? (A rough sketch of this option follows after this list.)
c) Or another method to handle this gracefully at runtime?
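To make option (b) concrete, here is a rough sketch of on-the-fly truncation. The names are illustrative, and which end of the prompt to keep (plus how much room to reserve for `max_new_tokens`) would depend on how the prompt is laid out:

```python
# Sketch of option (b): trim an over-long prompt to the context limit before
# handing it to vllm. Head truncation shown; keeping the tail instead may be
# preferable depending on where the retrieved passage and question sit.
from transformers import AutoTokenizer

MAX_PROMPT_TOKENS = 4096
tokenizer = AutoTokenizer.from_pretrained("selfrag/selfrag_llama2_7b")

def truncate_prompt(prompt: str, reserve_for_generation: int = 100) -> str:
    budget = MAX_PROMPT_TOKENS - reserve_for_generation
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) <= budget:
        return prompt
    return tokenizer.decode(ids[:budget])
```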
Knowing the official approach would be very helpful for ensuring my reproduction is as accurate as possible.
Thank you again for your time and for this fantastic research!