Question about synthesizing Allreduce #36

@JASUEXIII

Hi. Thanks for the prompt response earlier. I'm currently trying to synthesize Allreduce for a custom topology (for example, a ring with 4 or 8 nodes). Some strange problems occur when doing so, and I wonder if you can help.

My code:

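import os
from msccl.collectives import allgather, allreduce, reduce_scatter, reduce, alltoall
from msccl.serialization import save_msccl_object  # assumed import path
# Ring: custom topology constructor (not defined in this snippet)
topology = Ring(num_Nodes=4)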
collective_allgather = allgather(topology.num_nodes())
collective_reduce_scatter = reduce_scatter(topology.num_nodes())
save_msccl_object(topology,'SG2260_topo_Ring4.json')
save_msccl_object(collective_allgather,'coll_allgather.json')
save_msccl_object(collective_reduce_scatter,'coll_reducescatter.json')
assert 0 == os.system('msccl solve pareto-optimal custom custom --topology-file SG2260_topo_Ring4.json --collective-file coll_allgather.json')
assert 0 == os.system('msccl solve pareto-optimal custom custom --topology-file SG2260_topo_Ring4.json --collective-file coll_reducescatter.json')
assert 0 == os.system('msccl compose allreduce ReduceScatter.n4-MYTP-steps2.rounds3.chunks2.msccl.json Allgather.n4-MYTP-steps2.rounds3.chunks2.msccl.json -o allreduce_ring4.json')
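For debugging, the same solver invocations can be run through `subprocess` so that a failure raises an exception carrying the captured output, rather than only a non-zero exit code. This is a minimal sketch, not part of msccl: the helper names `solve_command` and `run` are hypothetical, and the CLI arguments are copied verbatim from the commands above.

```python
import subprocess


def solve_command(topology_file, collective_file):
    """Build the argv list for the `msccl solve` call used above (hypothetical helper)."""
    return [
        "msccl", "solve", "pareto-optimal", "custom", "custom",
        "--topology-file", topology_file,
        "--collective-file", collective_file,
    ]


def run(cmd):
    """Run a command; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Usage (requires the msccl CLI to be installed):
#   result = run(solve_command("SG2260_topo_Ring4.json", "coll_allgather.json"))
#   print(result.stdout)
```

Passing the arguments as a list avoids shell quoting issues if a file name ever contains spaces.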

I also saved the collectives to JSON files to make debugging easier. The resulting allreduce JSON has a strange input and output map, as follows:

  "input_map": { "0": [0, 1], "1": [0, 1], "2": [0, 1], "3": [0, 1] },
  "output_map": { "0": [0, 1], "1": [0, 1], "2": [0, 1], "3": [0, 1] },
  "steps": [
    {
      "msccl_type": "step",
      "rounds": 1,
      "sends": [
        [0, 2, 1],
        [1, 2, 3],
        [2, 3, 0],
        [3, 3, 2],
        [4, 0, 3],
        [5, 0, 1],
        [6, 1, 0],
        [7, 1, 2]
      ]
    },
    {
      "msccl_type": "step",
      "rounds": 2,
      "sends": [
        [0, 1, 0],
        [0, 3, 0],
        [1, 1, 0],
        [1, 3, 0],
        [2, 0, 1],
        [2, 2, 1],
        [3, 0, 1],
        [3, 2, 1],
        [4, 1, 2],
        [4, 3, 2],
        [5, 1, 2],
        [5, 3, 2],
        [6, 0, 3],
        [6, 2, 3],
        [7, 0, 3],
        [7, 2, 3]
      ]
    },
    {
      "msccl_type": "step",
      "rounds": 2,
      "sends": [
        [0, 0, 1],
        [0, 0, 3],
        [1, 0, 1],
        [1, 0, 3],
        [2, 1, 0],
        [2, 1, 2],
        [3, 1, 0],
        [3, 1, 2],
        [4, 2, 1],
        [4, 2, 3],
        [5, 2, 1],
        [5, 2, 3],
        [6, 3, 0],
        [6, 3, 2],
        [7, 3, 0],
        [7, 3, 2]
      ]
    },
    {
      "msccl_type": "step",
      "rounds": 1,
      "sends": [
        [0, 1, 2],
        [1, 3, 2],
        [2, 0, 3],
        [3, 2, 3],
        [4, 3, 0],
        [5, 1, 0],
        [6, 0, 1],
        [7, 2, 1]
      ]
    }
  ],
  "collective": {
    "msccl_type": "collective",
    "name": "Allreduce(n=4)",
    "nodes": 4,
    "chunks": [
      { "msccl_type": "chunk", "pre": [0], "post": [0, 1, 2, 3], "addr": 0 },
      { "msccl_type": "chunk", "pre": [1], "post": [0, 1, 2, 3], "addr": 0 },
      { "msccl_type": "chunk", "pre": [2], "post": [0, 1, 2, 3], "addr": 0 },
      { "msccl_type": "chunk", "pre": [3], "post": [0, 1, 2, 3], "addr": 0 }

I'd like to know how to interpret this output. The chunk IDs do not seem to match each other, and the input/output maps do not look like a proper Allreduce solution.
I would really appreciate any help, and I'm happy to provide other trial logs.
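
One way to read the steps is to follow a single chunk through them. The sketch below assumes each entry in "sends" is a [chunk, source rank, destination rank] triple; that reading is an assumption and is not confirmed in this thread. The data covers only chunks 0 and 2, copied from the four steps in the JSON above, and `trace_chunk` is a hypothetical helper.

```python
# Sends for chunks 0 and 2 only, copied from the four steps in the JSON above.
# Assumption: each triple is [chunk, source_rank, destination_rank].
steps = [
    [[0, 2, 1], [2, 3, 0]],
    [[0, 1, 0], [0, 3, 0], [2, 0, 1], [2, 2, 1]],
    [[0, 0, 1], [0, 0, 3], [2, 1, 0], [2, 1, 2]],
    [[0, 1, 2], [2, 0, 3]],
]


def trace_chunk(steps, chunk):
    """Return, for each step, the (src, dst) pairs that move the given chunk."""
    return [
        [(src, dst) for c, src, dst in sends if c == chunk]
        for sends in steps
    ]


for chunk in (0, 2):
    for index, moves in enumerate(trace_chunk(steps, chunk)):
        print(f"chunk {chunk}, step {index}: {moves}")
```

Under that reading, chunk 0 is forwarded toward rank 0 in the first two steps and sent back out from rank 0 in the last two, while chunk 2 does the same around rank 1. That would be consistent with a reduce-scatter followed by an allgather over 8 chunks, although it does not by itself explain the input and output maps.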
