Starting worker group stuck #16

@l9761116

In the config YAML files I see that num_worker is always set to 1. But during my testing, it tries to start 4 workers and gets stuck, since there is only one GPU on my server. Does anyone know where the problem is?
Log output:

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_eslpage1/none_0vzsx60e/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_eslpage1/none_0vzsx60e/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_eslpage1/none_0vzsx60e/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_eslpage1/none_0vzsx60e/attempt_0/3/error.json

Stuck here.
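
A possible direction to check, offered as an assumption rather than something confirmed for this repository: the four workers in the log are processes started by the torch.distributed launcher, and their count comes from the launcher's `--nproc_per_node` argument, not necessarily from `num_worker` in the YAML (which may control something else, such as data-loading workers). If the launch script requests 4 processes on a single-GPU machine, capping the process count at the number of visible GPUs might avoid the hang. The sketch below only builds the command line; `train.py`, `visible_gpu_count`, and `build_launch_command` are hypothetical names, not part of this project.

```python
import os
import shlex


def visible_gpu_count(env=None, fallback=1):
    """Count the GPUs listed in CUDA_VISIBLE_DEVICES.

    Returns `fallback` when the variable is unset, because this sketch
    does not query the driver (assumption: one GPU, as on the server above).
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return fallback
    return len([d for d in value.split(",") if d.strip()])


def build_launch_command(script, requested_procs, gpu_count):
    """Build a torchrun command with the process count capped at gpu_count."""
    # Never request more processes than GPUs, and never fewer than one.
    nproc = max(1, min(requested_procs, gpu_count))
    return ["torchrun", "--standalone", f"--nproc_per_node={nproc}", script]


# Hypothetical usage: the launcher asked for 4 processes, but only 1 GPU exists.
command = build_launch_command("train.py", requested_procs=4, gpu_count=1)
print(shlex.join(command))  # prints: torchrun --standalone --nproc_per_node=1 train.py
```

If the project ships its own launch script, the equivalent check would be to look there for a hard-coded process count of 4 and lower it to 1.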
