Hello, thank you for your excellent work.
I would like to reproduce your work on a new model, but the OpenWebText download link you provided is no longer valid.
The repository does provide processed data, but only in GPT-2 tokenized form, which cannot be used to train models from other families.
Could you please provide the original, untokenized training corpus?