Hi, I found that at the link you provided (https://huggingface.co/datasets/MiniLLM/roberta-corpus-processed) there is a huge size gap between 'llama/512/20M/train_0.bin' (200MB) and 'opt/512/20M/train_0.idx' (20.5GB). I also get a traceback that says:
ValueError: offset must be non-negative and no greater than buffer length (200060928
when loading the LM dataset. Is there an error in your HF dataset link, and how can I fix it?
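
For anyone trying to reproduce this: the message looks like the one `numpy.frombuffer` raises when the requested offset lies past the end of the buffer, and 200060928 bytes is roughly the 200MB size of the .bin file, so the .bin may simply be shorter than the .idx expects. Below is a minimal sketch of a sanity check under that assumption. `offset_fits` is a hypothetical helper, not part of the repo, and the offset to test would have to come from the index file.

```python
import os


def offset_fits(bin_path: str, offset: int) -> bool:
    """Return True if a byte offset lies within the file at bin_path.

    Hypothetical helper: mirrors the bounds rule in the error message
    (offset must be non-negative and no greater than the buffer length).
    """
    size = os.path.getsize(bin_path)  # on-disk size in bytes
    return 0 <= offset <= size
```

If this returns False for an offset read from the .idx file, the .bin on disk is likely truncated or does not belong to that index.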