Description
In the config YAML files I see that `num_worker` is set to 1 everywhere. But during my testing it tries to launch 4 workers and gets stuck, since there is only one GPU on my server. Does anyone know where the problem is?
Log output:
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_eslpage1/none_0vzsx60e/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_eslpage1/none_0vzsx60e/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_eslpage1/none_0vzsx60e/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_eslpage1/none_0vzsx60e/attempt_0/3/error.json
It gets stuck here.
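For context: the four workers in the log are torch.distributed.elastic workers, whose count is set by the launcher's `--nproc_per_node` argument, not by the `num_worker` values in the config YAML (those typically control data-loading workers). A minimal sketch of pinning the launch to a single worker, assuming the job is started with `torchrun` and that `train.py` stands in for the project's actual entry script:

```shell
# Restrict the elastic launcher to one worker process (one GPU).
# "train.py" is a placeholder for the real training entry point.
torchrun --nproc_per_node=1 train.py

# Optionally also hide all but one GPU from the process:
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 train.py
```

If the project uses the older `python -m torch.distributed.launch` entry point, the equivalent flag is also `--nproc_per_node=1`.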