Hi. Thanks for the previous prompt response. I'm currently trying to synthesize Allreduce for a custom topology (for example, a ring with 4 or 8 nodes). Some strange problems occur when I do so, and I wonder if you can help.
My code:
import os
from msccl.collectives import allgather, allreduce, reduce_scatter, reduce, alltoall
from msccl.serialization import save_msccl_object

topology = Ring(num_Nodes=4)
collective_allgather = allgather(topology.num_nodes())
collective_reduce_scatter = reduce_scatter(topology.num_nodes())

# Save the topology and both collectives as JSON for the solver (and for debugging)
save_msccl_object(topology, 'SG2260_topo_Ring4.json')
save_msccl_object(collective_allgather, 'coll_allgather.json')
save_msccl_object(collective_reduce_scatter, 'coll_reducescatter.json')

# Synthesize Allgather and ReduceScatter separately on the custom topology
assert 0 == os.system('msccl solve pareto-optimal custom custom --topology-file SG2260_topo_Ring4.json --collective-file coll_allgather.json')
assert 0 == os.system('msccl solve pareto-optimal custom custom --topology-file SG2260_topo_Ring4.json --collective-file coll_reducescatter.json')

# Compose the two synthesized algorithms into an Allreduce
assert 0 == os.system('msccl compose allreduce ReduceScatter.n4-MYTP-steps2.rounds3.chunks2.msccl.json Allgather.n4-MYTP-steps2.rounds3.chunks2.msccl.json -o allreduce_ring4.json')

I also stored the collectives in JSON files to make debugging easier. The logged Allreduce JSON has a strange input map and output map, as follows:
"input_map": { "0": [0, 1], "1": [0, 1], "2": [0, 1], "3": [0, 1] },
"output_map": { "0": [0, 1], "1": [0, 1], "2": [0, 1], "3": [0, 1] },
"steps": [
{
"msccl_type": "step",
"rounds": 1,
"sends": [
[0, 2, 1],
[1, 2, 3],
[2, 3, 0],
[3, 3, 2],
[4, 0, 3],
[5, 0, 1],
[6, 1, 0],
[7, 1, 2]
]
},
{
"msccl_type": "step",
"rounds": 2,
"sends": [
[0, 1, 0],
[0, 3, 0],
[1, 1, 0],
[1, 3, 0],
[2, 0, 1],
[2, 2, 1],
[3, 0, 1],
[3, 2, 1],
[4, 1, 2],
[4, 3, 2],
[5, 1, 2],
[5, 3, 2],
[6, 0, 3],
[6, 2, 3],
[7, 0, 3],
[7, 2, 3]
]
},
{
"msccl_type": "step",
"rounds": 2,
"sends": [
[0, 0, 1],
[0, 0, 3],
[1, 0, 1],
[1, 0, 3],
[2, 1, 0],
[2, 1, 2],
[3, 1, 0],
[3, 1, 2],
[4, 2, 1],
[4, 2, 3],
[5, 2, 1],
[5, 2, 3],
[6, 3, 0],
[6, 3, 2],
[7, 3, 0],
[7, 3, 2]
]
},
{
"msccl_type": "step",
"rounds": 1,
"sends": [
[0, 1, 2],
[1, 3, 2],
[2, 0, 3],
[3, 2, 3],
[4, 3, 0],
[5, 1, 0],
[6, 0, 1],
[7, 2, 1]
]
}
],
"collective": {
"msccl_type": "collective",
"name": "Allreduce(n=4)",
"nodes": 4,
"chunks": [
{ "msccl_type": "chunk", "pre": [0], "post": [0, 1, 2, 3], "addr": 0 },
{ "msccl_type": "chunk", "pre": [1], "post": [0, 1, 2, 3], "addr": 0 },
{ "msccl_type": "chunk", "pre": [2], "post": [0, 1, 2, 3], "addr": 0 },
{ "msccl_type": "chunk", "pre": [3], "post": [0, 1, 2, 3], "addr": 0 }
I would like to know how to interpret this output. The chunk IDs do not seem to match each other, and the input/output maps do not look like a proper solution for Allreduce.
I would really appreciate any help, and I'm happy to provide further trial logs.
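In case it helps with reproducing this, here is a minimal sketch of how I am currently reading the "steps" section of the log. It rests on two assumptions that the output itself does not confirm: that each entry in "sends" is a (chunk, source, destination) triple, and that the 4-node ring is bidirectional. `summarize_steps` is a hypothetical helper of my own, not part of msccl, and only the first two steps of the log are embedded here.

```python
import json

# First two steps copied from the log above. The full list could instead be
# read from allreduce_ring4.json with json.load.
algo = json.loads("""
{"steps": [
  {"rounds": 1, "sends": [[0,2,1],[1,2,3],[2,3,0],[3,3,2],
                          [4,0,3],[5,0,1],[6,1,0],[7,1,2]]},
  {"rounds": 2, "sends": [[0,1,0],[0,3,0],[1,1,0],[1,3,0],
                          [2,0,1],[2,2,1],[3,0,1],[3,2,1],
                          [4,1,2],[4,3,2],[5,1,2],[5,3,2],
                          [6,0,3],[6,2,3],[7,0,3],[7,2,3]]}
]}
""")


def summarize_steps(steps, num_nodes):
    """Hypothetical helper: list the chunk ids used in each step and any
    send that does not travel along a link of a bidirectional ring.

    Assumption: every send is a [chunk, src, dst] triple.
    """
    # Links i -> i+1 (mod n), plus the reverse direction.
    ring_links = {(i, (i + 1) % num_nodes) for i in range(num_nodes)}
    ring_links |= {(dst, src) for src, dst in ring_links}

    summary = []
    for index, step in enumerate(steps):
        sends = step["sends"]
        summary.append({
            "step": index,
            "rounds": step["rounds"],
            "chunk_ids": sorted({chunk for chunk, _, _ in sends}),
            "off_ring_sends": [s for s in sends
                               if (s[1], s[2]) not in ring_links],
        })
    return summary


for row in summarize_steps(algo["steps"], num_nodes=4):
    print(row)
```

Under those assumptions, the sends in these two steps reference chunk ids 0 through 7 and all stay on ring links, which is part of why the chunk list and the input/output maps confuse me.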