
Oversized inputs in Self-RAG #35

@Hongwuwu

Description


Hi @AkariAsai and team,

First, thank you for your incredible work on this project!

I've run into a small technical issue during the final inference step.

Steps to Reproduce:

1. Generate the input data for Self-CRAG (pubqa_selfcrag_cleaned.json) using the CRAG project's scripts.
2. Run retrieval_lm/run_short_form.py with this data.
3. The script runs successfully for most of the data but eventually crashes.

Observation:

The crash is always preceded by warnings from vllm, such as:
WARNING ... Input prompt (5238 tokens) is too long and exceeds limit of 4096

The final error is an IndexError: list index out of range inside the call_model_rerank_w_scores_batch function. My analysis suggests that vllm returns an empty logprobs list for these oversized inputs, and the code indexes into it without a safety check.

I was able to work around this by patching my local code to skip these instances.
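For reference, here is a minimal sketch of the kind of guard I added (the helper name is mine, and I'm only assuming vllm's RequestOutput layout of pred.outputs[0].logprobs; this is not the exact upstream code):

```python
def get_logprobs_or_none(pred):
    """Hypothetical helper: return the first output's logprobs from a vllm
    RequestOutput, or None when there is nothing usable (e.g. the prompt
    exceeded the model's context window)."""
    if not pred.outputs:
        return None
    logprobs = pred.outputs[0].logprobs
    if not logprobs:  # empty list / None for oversized prompts
        return None
    return logprobs


# Inside call_model_rerank_w_scores_batch, instances with no logprobs are
# then skipped instead of indexed into, e.g.:
#
#   logprobs = get_logprobs_or_none(preds[0])
#   if logprobs is None:
#       skipped_ids.append(example_id)  # hypothetical bookkeeping
#       continue
```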

This leads to my main question about the intended methodology:

For the original experiments, what was the standard procedure for handling these few samples that exceed the model's context length?

I was considering a few possibilities:

a) Pre-filtering: Were these oversized samples identified and removed from the test set before the final inference run?
b) Truncation: Were the prompts simply truncated to the 4096-token limit on the fly?
c) Or another method to handle this gracefully at runtime?
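To make options (a) and (b) concrete, here is a rough sketch of what I have in mind, assuming a Hugging Face tokenizer for the Self-RAG checkpoint (the repo id selfrag/selfrag_llama2_7b, the 100-token reserve, and the helper names are my own assumptions, not something taken from the repo):

```python
from transformers import AutoTokenizer

MAX_LEN = 4096  # context limit reported in the vllm warning

# Assumed checkpoint; substitute whatever model run_short_form.py is given.
tokenizer = AutoTokenizer.from_pretrained("selfrag/selfrag_llama2_7b")


def prefilter(prompts, max_len=MAX_LEN):
    """Option (a): drop samples whose prompt already exceeds the limit."""
    return [
        p for p in prompts
        if len(tokenizer(p, add_special_tokens=False)["input_ids"]) <= max_len
    ]


def truncate(prompt, max_len=MAX_LEN, reserve=100):
    """Option (b): hard-truncate the prompt, keeping `reserve` tokens of
    room for generation (the value 100 is an arbitrary placeholder)."""
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    budget = max_len - reserve
    if len(ids) <= budget:
        return prompt
    return tokenizer.decode(ids[:budget])
```

Either variant avoids the crash, but they affect the evaluation differently: pre-filtering shrinks the test set, while truncation may cut off retrieved passages partway, which is exactly why I'd like to know what was done for the reported numbers.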

Knowing the official approach would be very helpful for ensuring my reproduction is as accurate as possible.

Thank you again for your time and for this fantastic research!
