Description
I generated two allreduce algorithms on a DGX1. One works (C48-S14-R14), but the other (C8-S4-R4) produces wrong results.
msccl also fails to generate allreduce directly: the solver reports the instance as unsatisfiable, so the user has to generate reducescatter and allgather separately and compose them.
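As a rough sketch of how the C/S/R labels relate to the solver arguments used in the steps to reproduce: the label appears to be the per-rank `--chunks` value multiplied by the 8 ranks, with steps and rounds added across the ReduceScatter and Allgather halves. That reading is an assumption inferred from the commands in this report, and `Instance` and `composed_label` are hypothetical helper names, not part of msccl-tools.

```python
from dataclasses import dataclass

NUM_RANKS = 8  # a DGX1 has 8 GPUs


@dataclass(frozen=True)
class Instance:
    """Arguments passed to `msccl solve instance` for one collective."""
    chunks: int  # --chunks (assumed to be per rank)
    steps: int   # --steps
    rounds: int  # --rounds


def composed_label(rs: Instance, ag: Instance, ranks: int = NUM_RANKS) -> str:
    """Label of an Allreduce composed as ReduceScatter followed by Allgather."""
    if rs.chunks != ag.chunks:
        raise ValueError("both halves are expected to use the same --chunks value")
    # Assumption: the C number counts chunks across all ranks.
    total_chunks = rs.chunks * ranks
    return f"C{total_chunks}-S{rs.steps + ag.steps}-R{rs.rounds + ag.rounds}"


print(composed_label(Instance(1, 2, 2), Instance(1, 2, 2)))  # C8-S4-R4
print(composed_label(Instance(6, 7, 7), Instance(6, 7, 7)))  # C48-S14-R14
```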
environment
Server: DGX1
OS: Ubuntu 22.04.3 LTS
Driver Version: 535.161.07
Docker Engine - Community, Version: 27.5.1
Docker image: nvcr.io/nvidia/cuda:12.2.2-devel-ubuntu22.04 (b56b435576e8)
steps to reproduce
compile runtime
# start container
docker run -dt --gpus all --hostname msccl-azure --name msccl-azure nvcr.io/nvidia/cuda:12.2.2-devel-ubuntu22.04
docker exec -it msccl-azure bash
apt update && apt install git libopenmpi-dev python3 python3-pip -y
adduser azure
su - azure
cd
git clone https://github.com/Azure/msccl.git --recurse-submodules
cd ~/msccl/executor/msccl-executor-nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"
cd ~/msccl/tests/msccl-tests-nccl/
make MPI=1 MPI_HOME=/usr/include/x86_64-linux-gnu/mpi,/ NCCL_HOME=~/msccl/executor/msccl-executor-nccl/build/ -j
synthesize algorithms
I generated a latency-optimal C8-S4-R4 algorithm and a bandwidth-optimal C48-S14-R14 algorithm.
See attachments Allreduce.n8-DGX1-steps4.msccl.xml.txt and Allreduce.n8-DGX1-steps14.chunks6.msccl.xml.txt.
cd
pip install git+https://github.com/azure/msccl-tools.git
# failed to generate allreduce directly!
azure@msccl-azure:~$ msccl solve instance DGX1 Allreduce --chunks 8 --steps 4 --rounds 4
Solving instance steps=4,chunks=8... unsatisfiable. (94.0s)
# generate C8-S4-R4 Allreduce by composing ReduceScatter & Allgather
azure@msccl-azure:~$ msccl solve instance DGX1 ReduceScatter --chunks 1 --steps 2 --rounds 2
Solving instance steps=2... synthesized! (0.2s)
Wrote to ReduceScatter.n8-DGX1-steps2.msccl.json
azure@msccl-azure:~$ msccl solve instance DGX1 Allgather --chunks 1 --steps 2 --rounds 2
Solving instance steps=2... synthesized! (0.2s)
Wrote to Allgather.n8-DGX1-steps2.msccl.json
azure@msccl-azure:~$ msccl compose allreduce ReduceScatter.n8-DGX1-steps2.msccl.json Allgather.n8-DGX1-steps2.msccl.json
Wrote to Allreduce.n8-DGX1-steps4.msccl.json
azure@msccl-azure:~$ msccl ncclize Allreduce.n8-DGX1-steps4.msccl.json
Wrote to Allreduce.n8-DGX1-steps4.msccl.xml
# C48-S14-R14
azure@msccl-azure:~$ msccl solve instance DGX1 ReduceScatter --chunks 6 --steps 7 --rounds 7
Solving instance steps=7,chunks=6... synthesized! (16.6s)
Wrote to ReduceScatter.n8-DGX1-steps7.chunks6.msccl.json
azure@msccl-azure:~$ msccl solve instance DGX1 Allgather --chunks 6 --steps 7 --rounds 7
Solving instance steps=7,chunks=6... synthesized! (16.6s)
Wrote to Allgather.n8-DGX1-steps7.chunks6.msccl.json
azure@msccl-azure:~$ msccl compose allreduce ReduceScatter.n8-DGX1-steps7.chunks6.msccl.json Allgather.n8-DGX1-steps7.chunks6.msccl.json
Wrote to Allreduce.n8-DGX1-steps14.chunks6.msccl.json
azure@msccl-azure:~$ msccl ncclize Allreduce.n8-DGX1-steps14.chunks6.msccl.json
Wrote to Allreduce.n8-DGX1-steps14.chunks6.msccl.xml
bench algorithm
bench C8-S4-R4
According to the first line of the XML file (inplace="0" outofplace="1"), the out-of-place runs use the MSCCL algorithm and the in-place runs use NCCL.
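The same check can be done programmatically. The snippet below is a minimal sketch that reads the `inplace` and `outofplace` attributes from the root `<algo>` element. The sample string is the first line of the generated XML with the body removed and a closing tag added so it parses on its own; `algo_modes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def algo_modes(xml_text: str) -> dict:
    """Return which buffer modes the <algo> root element enables."""
    root = ET.fromstring(xml_text)
    return {
        "inplace": root.get("inplace") == "1",
        "outofplace": root.get("outofplace") == "1",
    }


# First line of Allreduce.n8-DGX1-steps4.msccl.xml, closed for illustration.
SAMPLE = (
    '<algo name="Allreduce(n=8)-DGX1-steps=4" proto="Simple" nchannels="2" '
    'ngpus="8" inplace="0" outofplace="1" minBytes="0" maxBytes="0" '
    'coll="allreduce" nchunksperloop="1"></algo>'
)

print(algo_modes(SAMPLE))
```

For a complete file, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.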
azure@msccl-azure:~$ head Allreduce.n8-DGX1-steps4.msccl.xml
<algo name="Allreduce(n=8)-DGX1-steps=4" proto="Simple" nchannels="2" ngpus="8" inplace="0" outofplace="1" minBytes="0" maxBytes="0" coll="allreduce" nchunksperloop="1">
<gpu id="0" i_chunks="1" o_chunks="1" s_chunks="7">
<tb id="0" send="-1" recv="1" chan="0">
<step s="0" type="rrc" srcbuf="s" srcoff="3" dstbuf="s" dstoff="3" cnt="1" depid="2" deps="0" hasdep="1"/>
<step s="1" type="rrc" srcbuf="i" srcoff="0" dstbuf="i" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="1"/>
<step s="2" type="r" srcbuf="s" srcoff="4" dstbuf="s" dstoff="4" cnt="1" depid="4" deps="0" hasdep="0"/>
</tb>
<tb id="1" send="-1" recv="2" chan="0">
<step s="0" type="r" srcbuf="s" srcoff="3" dstbuf="s" dstoff="3" cnt="1" depid="-1" deps="-1" hasdep="1"/>
<step s="1" type="r" srcbuf="s" srcoff="1" dstbuf="s" dstoff="1" cnt="1" depid="5" deps="1" hasdep="0"/>
The output of all_reduce_perf shows that the MSCCL result is wrong (nonzero #wrong in the out-of-place column).
mpi_out_azure_4-rank.0-stdout.txt
cp Allreduce.n8-DGX1-steps4.msccl.xml msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms
azure@msccl-azure:~$ mpirun -np 8 --output-filename mpi_out_azure_4 --merge-stderr-to-stdout -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3
# nThread 1 nGpus 1 minBytes 128 maxBytes 1073741824 step: 2(factor) warmup iters: 3 iters: 5 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 14179 on msccl-azure device 0 [0x1a] Tesla V100-SXM2-32GB
# Rank 1 Group 0 Pid 14180 on msccl-azure device 1 [0x1b] Tesla V100-SXM2-32GB
# Rank 2 Group 0 Pid 14181 on msccl-azure device 2 [0x3d] Tesla V100-SXM2-32GB
# Rank 3 Group 0 Pid 14182 on msccl-azure device 3 [0x3e] Tesla V100-SXM2-32GB
# Rank 4 Group 0 Pid 14183 on msccl-azure device 4 [0x88] Tesla V100-SXM2-32GB
# Rank 5 Group 0 Pid 14184 on msccl-azure device 5 [0x89] Tesla V100-SXM2-32GB
# Rank 6 Group 0 Pid 14185 on msccl-azure device 6 [0xb1] Tesla V100-SXM2-32GB
# Rank 7 Group 0 Pid 14187 on msccl-azure device 7 [0xb2] Tesla V100-SXM2-32GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
128 32 float sum -1 51.31 0.00 0.00 220 16.50 0.01 0.01 0
256 64 float sum -1 54.38 0.00 0.01 426 14.83 0.02 0.03 0
512 128 float sum -1 51.79 0.01 0.02 857 14.84 0.03 0.06 0
1024 256 float sum -1 52.50 0.02 0.03 1699 15.37 0.07 0.12 0
2048 512 float sum -1 52.21 0.04 0.07 3411 17.07 0.12 0.21 0
4096 1024 float sum -1 52.19 0.08 0.14 6844 17.67 0.23 0.41 0
8192 2048 float sum -1 53.48 0.15 0.27 13689 18.92 0.43 0.76 0
16384 4096 float sum -1 58.86 0.28 0.49 27356 19.31 0.85 1.48 0
32768 8192 float sum -1 74.37 0.44 0.77 54774 23.33 1.40 2.46 0
65536 16384 float sum -1 100.8 0.65 1.14 109458 24.07 2.72 4.76 0
131072 32768 float sum -1 137.4 0.95 1.67 219235 24.76 5.29 9.26 0
262144 65536 float sum -1 218.9 1.20 2.10 438106 26.49 9.90 17.32 0
524288 131072 float sum -1 371.7 1.41 2.47 876530 33.24 15.77 27.61 0
1048576 262144 float sum -1 669.3 1.57 2.74 1.75311e+06 67.05 15.64 27.37 0
2097152 524288 float sum -1 1269.0 1.65 2.89 3.50607e+06 90.05 23.29 40.75 0
4194304 1048576 float sum -1 2509.7 1.67 2.92 7.01167e+06 133.4 31.44 55.02 0
8388608 2097152 float sum -1 4960.7 1.69 2.96 1.40248e+07 223.4 37.54 65.70 0
16777216 4194304 float sum -1 9824.3 1.71 2.99 2.8049e+07 314.5 53.34 93.35 0
33554432 8388608 float sum -1 19505 1.72 3.01 5.60985e+07 560.9 59.83 104.69 0
67108864 16777216 float sum -1 38952 1.72 3.01 1.12195e+08 1007.5 66.61 116.56 0
134217728 33554432 float sum -1 77785 1.73 3.02 2.24396e+08 1910.7 70.24 122.93 0
268435456 67108864 float sum -1 155343 1.73 3.02 4.48791e+08 3773.8 71.13 124.48 0
536870912 134217728 float sum -1 310660 1.73 3.02 8.97575e+08 7430.9 72.25 126.43 0
1073741824 268435456 float sum -1 620955 1.73 3.03 1.79515e+09 14741 72.84 127.47 0
# Out of bounds values : 168 FAILED
# Avg bus bandwidth : 23.1468
#
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[61481,1],4]
Exit code: 1
--------------------------------------------------------------------------
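To compare the two #wrong columns without reading the table by eye, the report can be parsed with a few lines of Python. This is a minimal sketch that assumes the 13-column row layout shown in the output above; `wrong_counts` is a hypothetical helper name, and the sample rows are copied from the failing run.

```python
def wrong_counts(report: str):
    """Yield (size_bytes, out_of_place_wrong, in_place_wrong) per data row."""
    for line in report.splitlines():
        fields = line.split()
        # Data rows have 13 fields and start with the message size;
        # header and summary lines start with '#' and are skipped.
        if len(fields) != 13 or not fields[0].isdigit():
            continue
        # Large counts are printed in scientific notation (e.g. 1.75311e+06),
        # so go through float; the printed value is already rounded.
        yield int(fields[0]), int(float(fields[8])), int(float(fields[12]))


SAMPLE = """\
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
128 32 float sum -1 51.31 0.00 0.00 220 16.50 0.01 0.01 0
1048576 262144 float sum -1 669.3 1.57 2.74 1.75311e+06 67.05 15.64 27.37 0
"""

for size, oop_wrong, ip_wrong in wrong_counts(SAMPLE):
    status = "FAILED" if oop_wrong or ip_wrong else "OK"
    print(size, oop_wrong, ip_wrong, status)
```

Fed the contents of mpi_out_azure_4-rank.0-stdout.txt, this would flag every row where either column is nonzero.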
debug output (added -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV options) (see attachment mpi_out_azure_4_debug-rank.0-stdout.txt )
mpirun -np 8 --output-filename mpi_out_azure_4_debug --merge-stderr-to-stdout -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3
bench C48-S14-R14
The MSCCL allreduce result is correct (#wrong is zero in every out-of-place row).
mpi_out_azure_14-rank.0-stdout.txt
rm msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/*
cp Allreduce.n8-DGX1-steps14.chunks6.msccl.xml msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms
azure@msccl-azure:~$ mpirun -np 8 --output-filename mpi_out_azure_14 --merge-stderr-to-stdout -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3
# nThread 1 nGpus 1 minBytes 128 maxBytes 1073741824 step: 2(factor) warmup iters: 3 iters: 5 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 123 on msccl-azure device 0 [0x1a] Tesla V100-SXM2-32GB
# Rank 1 Group 0 Pid 124 on msccl-azure device 1 [0x1b] Tesla V100-SXM2-32GB
# Rank 2 Group 0 Pid 125 on msccl-azure device 2 [0x3d] Tesla V100-SXM2-32GB
# Rank 3 Group 0 Pid 126 on msccl-azure device 3 [0x3e] Tesla V100-SXM2-32GB
# Rank 4 Group 0 Pid 127 on msccl-azure device 4 [0x88] Tesla V100-SXM2-32GB
# Rank 5 Group 0 Pid 128 on msccl-azure device 5 [0x89] Tesla V100-SXM2-32GB
# Rank 6 Group 0 Pid 129 on msccl-azure device 6 [0xb1] Tesla V100-SXM2-32GB
# Rank 7 Group 0 Pid 130 on msccl-azure device 7 [0xb2] Tesla V100-SXM2-32GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
128 32 float sum -1 16.74 0.01 0.01 0 16.27 0.01 0.01 0
256 64 float sum -1 17.04 0.02 0.03 0 16.34 0.02 0.03 0
512 128 float sum -1 16.37 0.03 0.05 0 16.26 0.03 0.06 0
1024 256 float sum -1 17.08 0.06 0.10 0 16.72 0.06 0.11 0
2048 512 float sum -1 18.17 0.11 0.20 0 17.57 0.12 0.20 0
4096 1024 float sum -1 19.77 0.21 0.36 0 19.11 0.21 0.38 0
8192 2048 float sum -1 21.27 0.39 0.67 0 20.42 0.40 0.70 0
16384 4096 float sum -1 23.28 0.70 1.23 0 22.16 0.74 1.29 0
32768 8192 float sum -1 27.61 1.19 2.08 0 27.01 1.21 2.12 0
65536 16384 float sum -1 28.70 2.28 4.00 0 27.54 2.38 4.16 0
131072 32768 float sum -1 29.07 4.51 7.89 0 27.77 4.72 8.26 0
262144 65536 float sum -1 29.84 8.79 15.37 0 28.77 9.11 15.95 0
524288 131072 float sum -1 36.89 14.21 24.87 0 36.17 14.49 25.36 0
1048576 262144 float sum -1 78.43 13.37 23.40 0 77.08 13.60 23.81 0
2097152 524288 float sum -1 102.4 20.47 35.83 0 101.8 20.60 36.06 0
4194304 1048576 float sum -1 153.4 27.35 47.86 0 137.9 30.41 53.22 0
8388608 2097152 float sum -1 232.8 36.03 63.05 0 229.6 36.53 63.93 0
16777216 4194304 float sum -1 318.5 52.67 92.18 0 318.8 52.63 92.10 0
33554432 8388608 float sum -1 563.6 59.54 104.19 0 567.8 59.09 103.42 0
67108864 16777216 float sum -1 1020.4 65.77 115.09 0 1008.9 66.51 116.40 0
134217728 33554432 float sum -1 1904.4 70.48 123.34 0 1909.5 70.29 123.01 0
268435456 67108864 float sum -1 3770.3 71.20 124.60 0 3763.7 71.32 124.81 0
536870912 134217728 float sum -1 7428.9 72.27 126.47 0 7408.3 72.47 126.82 0
1073741824 268435456 float sum -1 14755 72.77 127.35 0 14750 72.80 127.40 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 43.5381
#
debug output (see attachment mpi_out_azure_14_debug-rank.0-stdout.txt )
mpirun -np 8 --output-filename mpi_out_azure_14_debug --merge-stderr-to-stdout -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3