Description
How is this issue impacting you?
Lower performance than expected
Share Your Debug Logs
Latency increases up until 131072 bytes, then drops sharply before increasing again. I would expect latency to only increase as the transfer size increases. Overall, these latencies also seem rather high compared to NCCL.
```
#reduction_on_stream
    size(B)    count  type  redop  latency(us)  min_lat(us)  max_lat(us)  algbw(GB/s)  busbw(GB/s)
          8        2   int    sum     7.193600        7.194        7.194        0.001        0.002
         16        4   int    sum     7.374836        7.375        7.375        0.002        0.004
         32        8   int    sum     7.259927        7.260        7.260        0.004        0.008
         64       16   int    sum     7.203200        7.203        7.203        0.009        0.016
        128       32   int    sum     7.167418        7.167        7.167        0.018        0.031
        256       64   int    sum     7.167418        7.167        7.167        0.036        0.063
        512      128   int    sum     7.522909        7.523        7.523        0.068        0.119
       1024      256   int    sum     8.881746        8.882        8.882        0.115        0.202
       2048      512   int    sum     9.792873        9.793        9.793        0.209        0.366
       4096     1024   int    sum    20.711856       20.712       20.712        0.198        0.346
       8192     2048   int    sum    22.723781       22.724       22.724        0.361        0.631
      16384     4096   int    sum    66.774108       66.774       66.774        0.245        0.429
      32768     8192   int    sum   117.724799      117.725      117.725        0.278        0.487
      65536    16384   int    sum   231.608436      231.608      231.608        0.283        0.495
     131072    32768   int    sum   459.331214      459.331      459.331        0.285        0.499
     262144    65536   int    sum    81.685528       81.686       81.686        3.209        5.616
     524288   131072   int    sum   139.331207      139.331      139.331        3.763        6.585
    1048576   262144   int    sum   273.026317      273.026      273.026        3.841        6.721
    2097152   524288   int    sum   542.235911      542.236      542.236        3.868        6.768
    4194304  1048576   int    sum  1099.375129     1099.375     1099.375        3.815        6.677
```
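For reference, this is how I am reading the bandwidth columns: assuming the usual allreduce accounting where algbw = size / latency and busbw = algbw * 2(n-1)/n, the numbers are self-consistent for n = 8 ranks. A quick sanity check in C (my own illustration, values copied from the 4 MiB row):

```c
#include <stdio.h>

/* Sanity check of the bandwidth columns above, assuming the common
 * allreduce convention: algbw = size / latency, busbw = algbw * 2*(n-1)/n.
 * Values are copied from the 4194304-byte row (n = 8 ranks). */
int main(void) {
    const double size_bytes = 4194304.0;    /* 4 MiB row             */
    const double latency_us = 1099.375129;  /* measured latency (us) */
    const int    n_ranks    = 8;

    double algbw = size_bytes / (latency_us * 1e-6) / 1e9;   /* ~3.815 GB/s */
    double busbw = algbw * 2.0 * (n_ranks - 1) / n_ranks;    /* ~6.68 GB/s  */

    printf("algbw = %.3f GB/s, busbw = %.3f GB/s\n", algbw, busbw);
    return 0;
}
```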
Attached rank0 logs below; all 8 ranks have very similar-looking logs. rank0.log
Steps to Reproduce the Issue
- Build the perftest binaries.
- Run (for context, a rough sketch of the call pattern this exercises follows after the command):

```
mpirun \
    -x NVSHMEMTEST_USE_MPI_LAUNCHER=1 \
    -x NVSHMEM_DEBUG=TRACE \
    -x NVSHMEM_DEBUG_SUBSYS=ALL \
    -x NVSHMEM_REMOTE_TRANSPORT=None \
    -n 8 -npernode 8 \
    /path/to/perftest/install/host/coll/reduction_on_stream \
    -n 100 -w 10 -b 8 -e 4194304 --cudagraph
```
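For context, this is roughly the kind of call pattern reduction_on_stream measures. This is my own illustration, not the perftest source; it assumes the nvshmemx_int_sum_reduce_on_stream host API and a launcher-provided bootstrap, and the buffer names/sizes are placeholders:

```c
// Illustrative sketch only, not the perftest source (compile as .cu with nvcc).
// Error checking is omitted for brevity.
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    nvshmem_init();  // bootstrap (the perftest uses the MPI launcher via NVSHMEMTEST_USE_MPI_LAUNCHER)
    int mype = nvshmem_my_pe();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));  // one GPU per PE on the node

    const size_t nelems = 4194304 / sizeof(int);  // largest message size from the sweep above
    int *src = (int *)nvshmem_malloc(nelems * sizeof(int));
    int *dst = (int *)nvshmem_malloc(nelems * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // One on-stream sum reduction across NVSHMEM_TEAM_WORLD; the perftest times
    // loops of this (optionally captured into a CUDA graph with --cudagraph).
    nvshmemx_int_sum_reduce_on_stream(NVSHMEM_TEAM_WORLD, dst, src, nelems, stream);
    cudaStreamSynchronize(stream);

    if (mype == 0) printf("reduction complete\n");

    nvshmem_free(src);
    nvshmem_free(dst);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```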
NVSHMEM Version
3.4.5-1+cuda12.9
Your platform details
This is on an 8xB200 box with NVLink.
`nvidia-smi topo -m`
```
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  CPU Affinity     NUMA Affinity  GPU NUMA ID
GPU0     X    NV18  NV18  NV18  NV18  NV18  NV18  NV18  NODE  SYS   SYS   SYS   SYS   0-23,144-167     0              N/A
GPU1    NV18   X    NV18  NV18  NV18  NV18  NV18  NV18  NODE  SYS   SYS   SYS   SYS   0-23,144-167     0              N/A
GPU2    NV18  NV18   X    NV18  NV18  NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   48-71,192-215    2              N/A
GPU3    NV18  NV18  NV18   X    NV18  NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   48-71,192-215    2              N/A
GPU4    NV18  NV18  NV18  NV18   X    NV18  NV18  NV18  SYS   NODE  NODE  NODE  NODE  72-95,216-239    3              N/A
GPU5    NV18  NV18  NV18  NV18  NV18   X    NV18  NV18  SYS   NODE  NODE  NODE  NODE  72-95,216-239    3              N/A
GPU6    NV18  NV18  NV18  NV18  NV18  NV18   X    NV18  SYS   SYS   SYS   SYS   SYS   120-143,264-287  5              N/A
GPU7    NV18  NV18  NV18  NV18  NV18  NV18  NV18   X    SYS   SYS   SYS   SYS   SYS   120-143,264-287  5              N/A
NIC0    NODE  NODE  SYS   SYS   SYS   SYS   SYS   SYS    X    SYS   SYS   SYS   SYS
NIC1    SYS   SYS   SYS   SYS   NODE  NODE  SYS   SYS   SYS    X    PIX   PIX   PIX
NIC2    SYS   SYS   SYS   SYS   NODE  NODE  SYS   SYS   SYS   PIX    X    PIX   PIX
NIC3    SYS   SYS   SYS   SYS   NODE  NODE  SYS   SYS   SYS   PIX   PIX    X    PIX
NIC4    SYS   SYS   SYS   SYS   NODE  NODE  SYS   SYS   SYS   PIX   PIX   PIX    X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
```
`ibstatus`
```
Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:bae9:24ff:fec7:0a32
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            400 Gb/sec (4X NDR)
        link_layer:      Ethernet

Infiniband device 'mlx5_1' port 1 status:
        default gid:     fe80:0000:0000:0000:e09d:7303:0044:c6c4
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_2' port 1 status:
        default gid:     fe80:0000:0000:0000:e09d:7303:0044:c6c5
        base lid:        0x3
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_3' port 1 status:
        default gid:     fe80:0000:0000:0000:e09d:7303:0044:c6c6
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_4' port 1 status:
        default gid:     fe80:0000:0000:0000:e09d:7303:0044:c6c7
        base lid:        0x6
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand
```
Error Message & Behavior
I would expect latency to increase with transfer size, but there seems to be a two-tiered behavior: latency increases up to 131072 bytes, drops back down at 262144 bytes, and then increases again.