Thank you for your excellent work. I'm trying to reproduce the KD baseline on GPT-2.
I noticed in the appendix of your paper that the training hyperparameters were obtained through a search.
Did you train for a smaller number of epochs and then select the learning rate based on the validation results? I ask because a complete training run is very costly.
Could you also provide the training hyperparameter settings used for the final results in the paper?
I tried training gpt2-base with the hyperparameters provided in your code, modifying only the batch size, but I obtained results far below those reported in the paper:
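For clarity, the kind of cheap search I'm asking about can be sketched as follows. This is only an illustration of the idea (short training runs per candidate learning rate, then selecting by validation loss); `short_run_val_loss` is a toy stand-in, not the repo's actual training or evaluation code.

```python
# Sketch of a cheap learning-rate search: train each candidate LR for only
# a few epochs, then keep the one with the best validation loss before
# committing to a full (expensive) training run.

def short_run_val_loss(lr: float, epochs: int = 2) -> float:
    """Toy stand-in for "train briefly, then evaluate on validation".

    In a real setup this would launch a truncated training run with the
    given learning rate and return the resulting validation loss.
    """
    best_lr = 5e-4  # hypothetical optimum for this toy loss surface
    return abs(lr - best_lr) / best_lr + 0.1 * epochs  # smaller is better


def search_lr(candidates):
    # Rank candidate learning rates by their short-run validation loss.
    return min(candidates, key=short_run_val_loss)


if __name__ == "__main__":
    grid = [1e-4, 2e-4, 5e-4, 1e-3]
    print(search_lr(grid))  # under the toy loss above, selects 5e-4
```

The question is whether the paper's hyperparameters were chosen this way, and if so, how short the search runs were.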
