Distributed emulation stuck with >= 12 ranks (2+ nodes) #125

@PedrooHR

Description

I'm working on the integration of ACCL and OMPC. I'm currently using ACCL's distributed emulation approach to start testing offloading computation to Alveo boards in a distributed system, with ACCL as the communication backend.

I've tried some scenarios:

  • 4 nodes: With 3 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
  • 3 nodes: With 4 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
  • 2 nodes: With 6 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
  • 1 node (not distributed): Tested with up to 20 ACCL instances; this works with no problem.

Every scenario with 10 or fewer instances in total works fine (I can't test with 11 instances due to some integration constraints).
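
To make the pattern in the scenarios above easier to see, here is a small Python sketch that tabulates the multi-node thresholds from the list. The dictionary and function names are hypothetical and only for illustration; the numbers are the ones reported above, and the script does not interact with ACCL in any way.

```python
# Observations copied from the scenarios above:
# number of nodes -> smallest instances-per-node count at which the run gets stuck.
hang_thresholds = {4: 3, 3: 4, 2: 6}


def total_ranks(nodes: int, per_node: int) -> int:
    """Total number of emulated ACCL instances across all nodes."""
    return nodes * per_node


# Total rank count at the point where each multi-node setup starts hanging.
totals = {
    nodes: total_ranks(nodes, per_node)
    for nodes, per_node in hang_thresholds.items()
}

for nodes in sorted(totals):
    print(f"{nodes} nodes x {hang_thresholds[nodes]} per node = {totals[nodes]} ranks")
```

In every multi-node case the hang starts at the same total, which suggests the trigger is the total rank count combined with distribution across nodes rather than the per-node instance count, although I have not confirmed the cause.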
