failed to run some allreduce algorithm in msccl executor #42

@banjiaojuhao

I generated two allreduce algorithms for DGX1. One works (C48-S14-R14), but the other does not (C8-S4-R4).
Synthesizing Allreduce directly fails; the workaround is to generate ReduceScatter and Allgather separately and compose them.

environment

Server: DGX1
OS: Ubuntu 22.04.3 LTS
Driver Version: 535.161.07
Docker Engine - Community, Version: 27.5.1
Docker image: nvcr.io/nvidia/cuda:12.2.2-devel-ubuntu22.04 (b56b435576e8)

steps to reproduce

compile runtime

# start container
docker run -dt --gpus all --hostname msccl-azure --name msccl-azure nvcr.io/nvidia/cuda:12.2.2-devel-ubuntu22.04
docker exec -it msccl-azure bash
apt update && apt install git libopenmpi-dev python3 python3-pip -y
adduser azure
su - azure

cd
git clone https://github.com/Azure/msccl.git --recurse-submodules

cd ~/msccl/executor/msccl-executor-nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"

cd ~/msccl/tests/msccl-tests-nccl/
make MPI=1 MPI_HOME=/usr/include/x86_64-linux-gnu/mpi,/ NCCL_HOME=~/msccl/executor/msccl-executor-nccl/build/ -j

synthesis algorithm

Generated a latency-optimal C8-S4-R4 algorithm and a bandwidth-optimal C48-S14-R14 algorithm.
See attachments Allreduce.n8-DGX1-steps4.msccl.xml.txt and Allreduce.n8-DGX1-steps14.chunks6.msccl.xml.txt.

cd
pip install git+https://github.com/azure/msccl-tools.git

# failed to generate allreduce directly!
azure@msccl-azure:~$ msccl solve instance DGX1 Allreduce --chunks 8 --steps 4 --rounds 4
Solving instance steps=4,chunks=8... unsatisfiable. (94.0s)

# generate C8-S4-R4 Allreduce by composing ReduceScatter & Allgather
azure@msccl-azure:~$ msccl solve instance DGX1 ReduceScatter --chunks 1 --steps 2 --rounds 2
Solving instance steps=2... synthesized! (0.2s)
Wrote to ReduceScatter.n8-DGX1-steps2.msccl.json
azure@msccl-azure:~$ msccl solve instance DGX1 Allgather --chunks 1 --steps 2 --rounds 2
Solving instance steps=2... synthesized! (0.2s)
Wrote to Allgather.n8-DGX1-steps2.msccl.json
azure@msccl-azure:~$ msccl compose allreduce ReduceScatter.n8-DGX1-steps2.msccl.json Allgather.n8-DGX1-steps2.msccl.json
Wrote to Allreduce.n8-DGX1-steps4.msccl.json
azure@msccl-azure:~$ msccl ncclize Allreduce.n8-DGX1-steps4.msccl.json
Wrote to Allreduce.n8-DGX1-steps4.msccl.xml

# C48-S14-R14
azure@msccl-azure:~$ msccl solve instance DGX1 ReduceScatter --chunks 6 --steps 7 --rounds 7
Solving instance steps=7,chunks=6... synthesized! (16.6s)
Wrote to ReduceScatter.n8-DGX1-steps7.chunks6.msccl.json
azure@msccl-azure:~$ msccl solve instance DGX1 Allgather --chunks 6 --steps 7 --rounds 7
Solving instance steps=7,chunks=6... synthesized! (16.6s)
Wrote to Allgather.n8-DGX1-steps7.chunks6.msccl.json
azure@msccl-azure:~$ msccl compose allreduce ReduceScatter.n8-DGX1-steps7.chunks6.msccl.json Allgather.n8-DGX1-steps7.chunks6.msccl.json
Wrote to Allreduce.n8-DGX1-steps14.chunks6.msccl.json
azure@msccl-azure:~$ msccl ncclize Allreduce.n8-DGX1-steps14.chunks6.msccl.json
Wrote to Allreduce.n8-DGX1-steps14.chunks6.msccl.xml
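
The labels C8-S4-R4 and C48-S14-R14 appear to follow from the solver arguments above: the per-phase chunk count multiplied by the 8 ranks, and the steps and rounds of the two phases added together. This reading is inferred from the commands and file names in this report, not confirmed by the msccl-tools documentation. A small sketch of that arithmetic, where `composed_label` is a hypothetical helper and both phases are assumed to use the same arguments:

```shell
# Hypothetical helper: derive the C<chunks>-S<steps>-R<rounds> label of an
# Allreduce composed from a ReduceScatter and an Allgather that were both
# solved with the same --chunks/--steps/--rounds (assumption of this sketch).
composed_label() {
  ranks="$1"; chunks="$2"; steps="$3"; rounds="$4"
  # chunks scale with the rank count; steps and rounds add across both phases
  echo "C$((chunks * ranks))-S$((steps * 2))-R$((rounds * 2))"
}

composed_label 8 1 2 2   # prints C8-S4-R4
composed_label 8 6 7 7   # prints C48-S14-R14
```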

bench algorithm

bench C8-S4-R4

According to the first line of the XML file (inplace="0" outofplace="1"), the out-of-place run uses MSCCL and the in-place run uses NCCL.

azure@msccl-azure:~$ head Allreduce.n8-DGX1-steps4.msccl.xml
<algo name="Allreduce(n=8)-DGX1-steps=4" proto="Simple" nchannels="2" ngpus="8" inplace="0" outofplace="1" minBytes="0" maxBytes="0" coll="allreduce" nchunksperloop="1">
  <gpu id="0" i_chunks="1" o_chunks="1" s_chunks="7">
    <tb id="0" send="-1" recv="1" chan="0">
      <step s="0" type="rrc" srcbuf="s" srcoff="3" dstbuf="s" dstoff="3" cnt="1" depid="2" deps="0" hasdep="1"/>
      <step s="1" type="rrc" srcbuf="i" srcoff="0" dstbuf="i" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="1"/>
      <step s="2" type="r" srcbuf="s" srcoff="4" dstbuf="s" dstoff="4" cnt="1" depid="4" deps="0" hasdep="0"/>
    </tb>
    <tb id="1" send="-1" recv="2" chan="0">
      <step s="0" type="r" srcbuf="s" srcoff="3" dstbuf="s" dstoff="3" cnt="1" depid="-1" deps="-1" hasdep="1"/>
      <step s="1" type="r" srcbuf="s" srcoff="1" dstbuf="s" dstoff="1" cnt="1" depid="5" deps="1" hasdep="0"/>
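
To check which buffer mode a generated plan applies to without reading the header by eye, the attributes can be extracted from the `<algo>` tag. A minimal sketch: `algo_attr` is a hypothetical helper name, and it assumes the `<algo ...>` tag sits entirely on the first line, as in the output above.

```shell
# Hypothetical helper: print the value of one attribute from the <algo> header
# of an MSCCL XML file. Assumes the whole <algo ...> tag is on line 1.
algo_attr() {
  file="$1"; attr="$2"
  # The leading space keeps the match anchored at the start of an attribute name.
  head -n 1 "$file" | sed -n "s/.* ${attr}=\"\([^\"]*\)\".*/\1/p"
}

# Usage, with the file from this report:
#   algo_attr Allreduce.n8-DGX1-steps4.msccl.xml inplace      # 0
#   algo_attr Allreduce.n8-DGX1-steps4.msccl.xml outofplace   # 1
```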

The output of all_reduce_perf shows that the MSCCL result is wrong (nonzero #wrong in the out-of-place column).

mpi_out_azure_4-rank.0-stdout.txt

cp Allreduce.n8-DGX1-steps4.msccl.xml msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms

azure@msccl-azure:~$ mpirun -np 8 --output-filename mpi_out_azure_4 --merge-stderr-to-stdout -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE  /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3
# nThread 1 nGpus 1 minBytes 128 maxBytes 1073741824 step: 2(factor) warmup iters: 3 iters: 5 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  14179 on msccl-azure device  0 [0x1a] Tesla V100-SXM2-32GB
#  Rank  1 Group  0 Pid  14180 on msccl-azure device  1 [0x1b] Tesla V100-SXM2-32GB
#  Rank  2 Group  0 Pid  14181 on msccl-azure device  2 [0x3d] Tesla V100-SXM2-32GB
#  Rank  3 Group  0 Pid  14182 on msccl-azure device  3 [0x3e] Tesla V100-SXM2-32GB
#  Rank  4 Group  0 Pid  14183 on msccl-azure device  4 [0x88] Tesla V100-SXM2-32GB
#  Rank  5 Group  0 Pid  14184 on msccl-azure device  5 [0x89] Tesla V100-SXM2-32GB
#  Rank  6 Group  0 Pid  14185 on msccl-azure device  6 [0xb1] Tesla V100-SXM2-32GB
#  Rank  7 Group  0 Pid  14187 on msccl-azure device  7 [0xb2] Tesla V100-SXM2-32GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
         128            32     float     sum      -1    51.31    0.00    0.00    220    16.50    0.01    0.01      0
         256            64     float     sum      -1    54.38    0.00    0.01    426    14.83    0.02    0.03      0
         512           128     float     sum      -1    51.79    0.01    0.02    857    14.84    0.03    0.06      0
        1024           256     float     sum      -1    52.50    0.02    0.03   1699    15.37    0.07    0.12      0
        2048           512     float     sum      -1    52.21    0.04    0.07   3411    17.07    0.12    0.21      0
        4096          1024     float     sum      -1    52.19    0.08    0.14   6844    17.67    0.23    0.41      0
        8192          2048     float     sum      -1    53.48    0.15    0.27  13689    18.92    0.43    0.76      0
       16384          4096     float     sum      -1    58.86    0.28    0.49  27356    19.31    0.85    1.48      0
       32768          8192     float     sum      -1    74.37    0.44    0.77  54774    23.33    1.40    2.46      0
       65536         16384     float     sum      -1    100.8    0.65    1.14  109458    24.07    2.72    4.76      0
      131072         32768     float     sum      -1    137.4    0.95    1.67  219235    24.76    5.29    9.26      0
      262144         65536     float     sum      -1    218.9    1.20    2.10  438106    26.49    9.90   17.32      0
      524288        131072     float     sum      -1    371.7    1.41    2.47  876530    33.24   15.77   27.61      0
     1048576        262144     float     sum      -1    669.3    1.57    2.74  1.75311e+06    67.05   15.64   27.37      0
     2097152        524288     float     sum      -1   1269.0    1.65    2.89  3.50607e+06    90.05   23.29   40.75      0
     4194304       1048576     float     sum      -1   2509.7    1.67    2.92  7.01167e+06    133.4   31.44   55.02      0
     8388608       2097152     float     sum      -1   4960.7    1.69    2.96  1.40248e+07    223.4   37.54   65.70      0
    16777216       4194304     float     sum      -1   9824.3    1.71    2.99  2.8049e+07    314.5   53.34   93.35      0
    33554432       8388608     float     sum      -1    19505    1.72    3.01  5.60985e+07    560.9   59.83  104.69      0
    67108864      16777216     float     sum      -1    38952    1.72    3.01  1.12195e+08   1007.5   66.61  116.56      0
   134217728      33554432     float     sum      -1    77785    1.73    3.02  2.24396e+08   1910.7   70.24  122.93      0
   268435456      67108864     float     sum      -1   155343    1.73    3.02  4.48791e+08   3773.8   71.13  124.48      0
   536870912     134217728     float     sum      -1   310660    1.73    3.02  8.97575e+08   7430.9   72.25  126.43      0
  1073741824     268435456     float     sum      -1   620955    1.73    3.03  1.79515e+09    14741   72.84  127.47      0
# Out of bounds values : 168 FAILED
# Avg bus bandwidth    : 23.1468
#

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[61481,1],4]
  Exit code:    1
--------------------------------------------------------------------------
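
The failing message sizes can be pulled out of a saved copy of this output rather than read from the table. A minimal sketch: `failed_sizes` is a hypothetical helper, and it assumes the 13-column row layout shown above, where field 9 is the out-of-place #wrong count.

```shell
# Hypothetical helper: list the message sizes (bytes) whose out-of-place result
# failed validation in all_reduce_perf output.
# Assumes data rows have 13 fields and field 9 is the out-of-place #wrong count.
failed_sizes() {
  awk '!/^#/ && NF == 13 && $1 ~ /^[0-9]+$/ && ($9 + 0) > 0 { print $1 }' "$1"
}

# Usage: failed_sizes <saved rank-0 stdout file>
```

Adding 0 to the field forces a numeric comparison, so counts printed in scientific notation such as 1.75311e+06 are handled as well.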

Debug output, with the -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV options added (see attachment mpi_out_azure_4_debug-rank.0-stdout.txt):

mpirun -np 8 --output-filename mpi_out_azure_4_debug --merge-stderr-to-stdout -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE  /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3

bench C48-S14-R14: the MSCCL allreduce result is correct (#wrong is zero for every out-of-place row)

mpi_out_azure_14-rank.0-stdout.txt

rm msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/*
cp Allreduce.n8-DGX1-steps14.chunks6.msccl.xml msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms

azure@msccl-azure:~$ mpirun -np 8 --output-filename mpi_out_azure_14 --merge-stderr-to-stdout -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE  /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3
# nThread 1 nGpus 1 minBytes 128 maxBytes 1073741824 step: 2(factor) warmup iters: 3 iters: 5 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    123 on msccl-azure device  0 [0x1a] Tesla V100-SXM2-32GB
#  Rank  1 Group  0 Pid    124 on msccl-azure device  1 [0x1b] Tesla V100-SXM2-32GB
#  Rank  2 Group  0 Pid    125 on msccl-azure device  2 [0x3d] Tesla V100-SXM2-32GB
#  Rank  3 Group  0 Pid    126 on msccl-azure device  3 [0x3e] Tesla V100-SXM2-32GB
#  Rank  4 Group  0 Pid    127 on msccl-azure device  4 [0x88] Tesla V100-SXM2-32GB
#  Rank  5 Group  0 Pid    128 on msccl-azure device  5 [0x89] Tesla V100-SXM2-32GB
#  Rank  6 Group  0 Pid    129 on msccl-azure device  6 [0xb1] Tesla V100-SXM2-32GB
#  Rank  7 Group  0 Pid    130 on msccl-azure device  7 [0xb2] Tesla V100-SXM2-32GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
         128            32     float     sum      -1    16.74    0.01    0.01      0    16.27    0.01    0.01      0
         256            64     float     sum      -1    17.04    0.02    0.03      0    16.34    0.02    0.03      0
         512           128     float     sum      -1    16.37    0.03    0.05      0    16.26    0.03    0.06      0
        1024           256     float     sum      -1    17.08    0.06    0.10      0    16.72    0.06    0.11      0
        2048           512     float     sum      -1    18.17    0.11    0.20      0    17.57    0.12    0.20      0
        4096          1024     float     sum      -1    19.77    0.21    0.36      0    19.11    0.21    0.38      0
        8192          2048     float     sum      -1    21.27    0.39    0.67      0    20.42    0.40    0.70      0
       16384          4096     float     sum      -1    23.28    0.70    1.23      0    22.16    0.74    1.29      0
       32768          8192     float     sum      -1    27.61    1.19    2.08      0    27.01    1.21    2.12      0
       65536         16384     float     sum      -1    28.70    2.28    4.00      0    27.54    2.38    4.16      0
      131072         32768     float     sum      -1    29.07    4.51    7.89      0    27.77    4.72    8.26      0
      262144         65536     float     sum      -1    29.84    8.79   15.37      0    28.77    9.11   15.95      0
      524288        131072     float     sum      -1    36.89   14.21   24.87      0    36.17   14.49   25.36      0
     1048576        262144     float     sum      -1    78.43   13.37   23.40      0    77.08   13.60   23.81      0
     2097152        524288     float     sum      -1    102.4   20.47   35.83      0    101.8   20.60   36.06      0
     4194304       1048576     float     sum      -1    153.4   27.35   47.86      0    137.9   30.41   53.22      0
     8388608       2097152     float     sum      -1    232.8   36.03   63.05      0    229.6   36.53   63.93      0
    16777216       4194304     float     sum      -1    318.5   52.67   92.18      0    318.8   52.63   92.10      0
    33554432       8388608     float     sum      -1    563.6   59.54  104.19      0    567.8   59.09  103.42      0
    67108864      16777216     float     sum      -1   1020.4   65.77  115.09      0   1008.9   66.51  116.40      0
   134217728      33554432     float     sum      -1   1904.4   70.48  123.34      0   1909.5   70.29  123.01      0
   268435456      67108864     float     sum      -1   3770.3   71.20  124.60      0   3763.7   71.32  124.81      0
   536870912     134217728     float     sum      -1   7428.9   72.27  126.47      0   7408.3   72.47  126.82      0
  1073741824     268435456     float     sum      -1    14755   72.77  127.35      0    14750   72.80  127.40      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 43.5381
#
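
For reading the bandwidth columns: as documented by nccl-tests, algbw is the message size divided by the time, and for allreduce busbw is algbw scaled by 2(n-1)/n, with n the number of ranks. A small sketch that recomputes both for one row of the table above; `allreduce_bw` is a hypothetical helper name.

```shell
# Hypothetical helper: recompute algbw and busbw (GB/s) for an allreduce from
# the message size in bytes, the time in microseconds, and the rank count.
allreduce_bw() {
  awk -v size="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = size / us / 1000          # bytes per us to GB/s (1 GB = 1e9 B)
    busbw = algbw * 2 * (n - 1) / n   # allreduce correction factor
    printf "%.2f %.2f\n", algbw, busbw
  }'
}

# Last out-of-place row of the C48-S14-R14 run: 1073741824 B in 14755 us on 8 ranks
allreduce_bw 1073741824 14755 8   # prints 72.77 127.35
```

The result matches the 72.77 and 127.35 reported in that row.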

debug output (see attachment mpi_out_azure_14_debug-rank.0-stdout.txt )

mpirun -np 8 --output-filename mpi_out_azure_14_debug --merge-stderr-to-stdout -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE  /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3
