Thank you for your excellent work. I'm trying to reproduce the KD baseline on GPT-2.
I noticed in the appendix of your paper that the training hyperparameters were obtained through a search.
Did you train for a smaller number of epochs and then select the learning rate based on the validation results? I ask because a complete training run is very costly.
Could you also provide the training hyperparameter settings used for the final results in the paper?
I tried training gpt2-base with the hyperparameters provided in your code, modifying only the batch size, but I obtained results far below those reported in the paper:
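For clarity, the kind of cheap search I'm asking about can be sketched as follows. This is only an illustration of the idea (short training runs per candidate learning rate, then selecting by validation loss); `short_run_val_loss` is a toy stand-in, not the repo's actual training or evaluation code.

```python
# Sketch of a cheap learning-rate search: train each candidate LR for only
# a few epochs, then keep the one with the best validation loss before
# committing to a full (expensive) training run.

def short_run_val_loss(lr: float, epochs: int = 2) -> float:
    """Toy stand-in for "train briefly, then evaluate on validation".

    In a real setup this would launch a truncated training run with the
    given learning rate and return the resulting validation loss.
    """
    best_lr = 5e-4  # hypothetical optimum for this toy loss surface
    return abs(lr - best_lr) / best_lr + 0.1 * epochs  # smaller is better


def search_lr(candidates):
    # Rank candidate learning rates by their short-run validation loss.
    return min(candidates, key=short_run_val_loss)


if __name__ == "__main__":
    grid = [1e-4, 2e-4, 5e-4, 1e-3]
    print(search_lr(grid))  # under the toy loss above, selects 5e-4
```

The question is whether the paper's hyperparameters were chosen this way, and if so, how short the search runs were.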
