Description
Hi, you did a great job!
While running the training scripts, I got stuck with an OOM issue right after step 1:
(main_task pid=6705) collected 1100 / 1024 rollouts and each prompt has 4 responses
(main_task pid=6705) rollout batch size: 1024
(main_task pid=6705) reward: 511.8 seconds
(main_task pid=6705) adv: 1.4 seconds
Error .... ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
It's a bit strange because I'm using the same resources and parameters as yours (the 7B model).
I noticed that the 8 GPUs correspond to 8 PIDs, and when I ran top to monitor memory usage, each PID was at 12.1% memory just before the crash.
I also tried the methods mentioned in the Ray docs, such as controlling num_cpus to limit the number of concurrently running tasks. I set num_cpus=4 (a rough sketch of what I tried is below), but that didn't work: the task stays pending forever because its resource demands can't be met. It seems there is a strict pairing of 1 GPU and 1 CPU per task.
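For reference, this is roughly what I tried, following the Ray docs on preventing node OOM. The task name and body here are just placeholders, not the actual code in this repo:

```python
import ray

ray.init()

# Hypothetical sketch: request more CPUs per rollout task so the Ray
# scheduler runs fewer of them concurrently (the approach suggested in
# the Ray out-of-memory docs). `rollout_task` and its body are
# placeholders standing in for the rollout workers in the training script.
@ray.remote(num_cpus=4, num_gpus=1)
def rollout_task(prompts):
    # ... generate responses for `prompts` ...
    return prompts
```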
Finally, I switched to 4 GPUs, but then I hit a CUDA out-of-memory error, even though I changed train_batch_size to 64.
I'm really confused by this. I'd like to get the 8-GPU setup working. Could you help me with that?