Thank you for releasing the source code and dataset.
I have a question about the training budget used in the script stage_one.sh. The training batch size is 128 with 32 samples per prompt, and your dataset (sparkle-reasoning/dsr40k) contains around 40k prompts. With total_epochs=30, does this mean the model is trained on 40k * 30 = 1200k prompts in total? I am training the same Qwen model on the same dataset, and the performance barely changes after 400 training steps with a batch size of 256. Could this be because the amount of training is not yet enough? Should I continue training to improve the in-domain performance? I have attached my training dynamics for the reward:
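For reference, here is a minimal sketch of the budget comparison I am describing, assuming the 40k-prompt dataset and the batch sizes mentioned above (the variable names are mine, not from stage_one.sh):

```python
# Hypothetical back-of-the-envelope budget comparison, assuming
# dataset size ~40k prompts (sparkle-reasoning/dsr40k).
dataset_size = 40_000

# Budget implied by stage_one.sh: 30 epochs over the dataset.
total_epochs = 30
paper_prompts = dataset_size * total_epochs  # 1,200,000 prompts

# My run so far: 400 training steps at batch size 256.
my_steps, my_batch = 400, 256
my_prompts = my_steps * my_batch             # 102,400 prompts

# Fraction of the stage-one budget I have covered.
fraction = my_prompts / paper_prompts        # ~8.5%
epochs_covered = my_prompts / dataset_size   # ~2.56 epochs

print(f"my run: {my_prompts:,} prompts = {epochs_covered:.2f} epochs "
      f"({fraction:.1%} of the 30-epoch budget)")
```

By this count my 400 steps correspond to under 3 epochs, i.e. less than a tenth of the 30-epoch budget, which is why I suspect undertraining.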
