OOM with 8 A800 80G #59

@Fayeww

Hi, you did a great job!

While running the training scripts, I got stuck on an OOM issue right after step 1:

(main_task pid=6705) collected 1100 / 1024 rollouts and each prompt has 4 responses
(main_task pid=6705) rollout batch size: 1024
(main_task pid=6705) reward: 511.8 seconds
(main_task pid=6705) adv: 1.4 seconds
Error ... ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

It's a bit strange because the resources and parameters I'm using are the same as yours (7B model).
I noticed that the 8 GPUs correspond to 8 PIDs, and when I ran top to monitor memory usage, each PID was at about 12.1% before the crash.
I also tried the approaches mentioned in the Ray docs, like setting num_cpus to limit the number of concurrently running tasks. I set num_cpus=4, but that didn't work: the task stays pending forever due to unmet resource demands. It looks like the tasks are strictly packed as 1 GPU with 1 CPU; the sketch below shows roughly what I tried.
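For reference, this is roughly what I tried on the Ray side. The function name and resource numbers below are just placeholders to illustrate the API calls, not your actual code:

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Attempt 1: cap the CPUs each remote task claims so fewer run at once.
# `rollout_task` is a placeholder name, not a function from this repo.
@ray.remote(num_cpus=4, num_gpus=1)
def rollout_task(prompts):
    return len(prompts)  # stand-in for the real rollout work

# Attempt 2: a placement group that bundles 1 CPU + 1 GPU per worker,
# which matches the strict 1-GPU/1-CPU packing I seem to be hitting.
pg = placement_group([{"CPU": 1, "GPU": 1}] * 8, strategy="PACK")
ray.get(pg.ready())  # blocks (pending forever) if the bundles can't be scheduled
```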

Finally, I switched to 4 GPUs, but now I get a CUDA out-of-memory error, even though I changed train_batch_size to 64.
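In case it helps, this is the kind of throwaway check I've been running to watch per-GPU memory before the CUDA OOM hits (plain PyTorch, nothing from your repo):

```python
import torch

# Debugging aid only: print free/total memory for every visible GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")
```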

I'm really confused by this. I'd like to get the 8-GPU setup working. Could you help me with that?
