Master node not working when using distributed queue #72

@forxyum

I'm running a 4-GPU cluster and use this node package for load balancing and ease of use, so that I don't have to manage four instances myself.
When I use the Distributed Queue node, the master never assigns a job to itself, even though it is NOT set to orchestrator-only.
My current workaround is to run a fifth instance on cuda0 and set the master to orchestrator-only. Because both instances share cuda0, though, if the master fails to assign a job to a worker and falls back to local execution, both instances can end up working on cuda0 at the same time and run out of memory (OOM).
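
The expected behavior can be sketched as follows. This is a hypothetical illustration, not the package's actual code: the `Instance` class, the `eligible_instances` and `assign_jobs` helpers, and the round-robin policy are all assumptions made up for the example. The point is only that the master should be in the assignment pool unless orchestrator-only is enabled.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List


@dataclass(frozen=True)
class Instance:
    """Hypothetical description of one running instance."""
    name: str
    device: str
    is_master: bool = False


def eligible_instances(instances: List[Instance], orchestrator_only: bool) -> List[Instance]:
    """Return the instances that may receive jobs.

    The master is excluded only when orchestrator-only is enabled.
    """
    return [i for i in instances if not (i.is_master and orchestrator_only)]


def assign_jobs(jobs: List[str], instances: List[Instance], orchestrator_only: bool) -> Dict[str, str]:
    """Assign jobs round-robin (an assumed policy) over the eligible instances."""
    pool = eligible_instances(instances, orchestrator_only)
    if not pool:
        raise ValueError("no instance is eligible to run jobs")
    return {job: inst.name for job, inst in zip(jobs, cycle(pool))}


# Hypothetical 4-GPU layout: the master on cuda0 plus three workers.
cluster = [
    Instance("master", "cuda0", is_master=True),
    Instance("worker1", "cuda1"),
    Instance("worker2", "cuda2"),
    Instance("worker3", "cuda3"),
]

# With orchestrator-only disabled, the master should receive a share of the jobs.
print(assign_jobs(["job0", "job1", "job2", "job3"], cluster, orchestrator_only=False))
```

With `orchestrator_only=False` the master takes one of the four jobs in this sketch; with `orchestrator_only=True` it takes none, which matches what I currently see even with the option disabled.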
