
[Issue]: Unexpected perftest results for reduction_on_stream #39

@JamesMBartlett

Description


How is this issue impacting you?

Lower performance than expected

Share Your Debug Logs

Latency increases up until 131072 bytes and then drops again. I would expect latency to only increase as the transfer size increases. Also, overall these latencies seem rather high compared to NCCL.

#reduction_on_stream
size(B)     count     type      redop     latency(us)       min_lat(us)       max_lat(us)       algbw(GB/s)   busbw(GB/s)
8           2         int       sum       7.193600          7.194             7.194             0.001         0.002
16          4         int       sum       7.374836          7.375             7.375             0.002         0.004
32          8         int       sum       7.259927          7.260             7.260             0.004         0.008
64          16        int       sum       7.203200          7.203             7.203             0.009         0.016
128         32        int       sum       7.167418          7.167             7.167             0.018         0.031
256         64        int       sum       7.167418          7.167             7.167             0.036         0.063
512         128       int       sum       7.522909          7.523             7.523             0.068         0.119
1024        256       int       sum       8.881746          8.882             8.882             0.115         0.202
2048        512       int       sum       9.792873          9.793             9.793             0.209         0.366
4096        1024      int       sum       20.711856         20.712            20.712            0.198         0.346
8192        2048      int       sum       22.723781         22.724            22.724            0.361         0.631
16384       4096      int       sum       66.774108         66.774            66.774            0.245         0.429
32768       8192      int       sum       117.724799        117.725           117.725           0.278         0.487
65536       16384     int       sum       231.608436        231.608           231.608           0.283         0.495
131072      32768     int       sum       459.331214        459.331           459.331           0.285         0.499
262144      65536     int       sum       81.685528         81.686            81.686            3.209         5.616
524288      131072    int       sum       139.331207        139.331           139.331           3.763         6.585
1048576     262144    int       sum       273.026317        273.026           273.026           3.841         6.721
2097152     524288    int       sum       542.235911        542.236           542.236           3.868         6.768
4194304     1048576   int       sum       1099.375129       1099.375          1099.375          3.815         6.677
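As a quick check on the numbers above (assuming the usual allreduce conventions, i.e. algbw = size / latency and busbw = algbw x 2*(n-1)/n with n = 8 PEs, which matches the table's 1.75x busbw/algbw ratio): at 131072 B, 131072 B / 459.331 us ~= 0.285 GB/s, while at 262144 B, 262144 B / 81.686 us ~= 3.21 GB/s. So the effective bandwidth jumps by more than 10x right where the latency drops.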

Rank 0 logs are attached below (rank0.log); all 8 ranks have very similar-looking logs.

Steps to Reproduce the Issue

  1. Build the perftest binaries.
  2. Run
mpirun \
  -x NVSHMEMTEST_USE_MPI_LAUNCHER=1 \
  -x NVSHMEM_DEBUG=TRACE \
  -x NVSHMEM_DEBUG_SUBSYS=ALL \
  -x NVSHMEM_REMOTE_TRANSPORT=None \
  -n 8 -npernode 8 \
  /path/to/perftest/install/host/coll/reduction_on_stream \
  -n 100 -w 10 -b 8 -e 4194304 --cudagraph
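For context on what the benchmark is timing, here is a minimal sketch of the collective being measured. This is not the perftest source: it assumes the default nvshmem_init() bootstrap instead of the MPI launcher used above, NVSHMEM_TEAM_WORLD as the team, a single fixed size (32768 ints = 131072 bytes, where the latency peaks), and a single graph launch instead of the timed -n/-w loop.

/* reduce_on_stream_sketch.cu -- illustrative only; names and sizes are assumptions */
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();                                           /* perftest uses the MPI bootstrap instead */
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));    /* one GPU per PE on the node */

    size_t nreduce = 32768;                                   /* 32768 ints = 131072 bytes */
    int *src = (int *)nvshmem_malloc(nreduce * sizeof(int));  /* symmetric buffers */
    int *dst = (int *)nvshmem_malloc(nreduce * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* With --cudagraph the collective is captured once and replayed as a graph. */
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    nvshmemx_int_sum_reduce_on_stream(NVSHMEM_TEAM_WORLD, dst, src, nreduce, stream);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, 0);              /* CUDA 12.x three-argument form */

    cudaGraphLaunch(graph_exec, stream);                      /* perftest times repeated launches */
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    nvshmem_free(dst);
    nvshmem_free(src);
    nvshmem_finalize();
    return 0;
}

Compile with nvcc and link against libnvshmem; exact include and library paths depend on the install.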

NVSHMEM Version

3.4.5-1+cuda12.9

Your platform details

This is on an 8xB200 box with NVLink.

`nvidia-smi topo -m`

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    SYS     SYS     SYS     SYS     0-23,144-167    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    SYS     SYS     SYS     SYS     0-23,144-167    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     48-71,192-215   2               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     48-71,192-215   2               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     NODE    NODE    NODE    NODE    72-95,216-239   3               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     NODE    NODE    NODE    NODE    72-95,216-239   3               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     120-143,264-287 5               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     120-143,264-287 5               N/A
NIC0    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC1    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS      X      PIX     PIX     PIX
NIC2    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     PIX      X      PIX     PIX
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     PIX     PIX      X      PIX
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     PIX     PIX     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4

`ibstatus`

Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:bae9:24ff:fec7:0a32
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            400 Gb/sec (4X NDR)
        link_layer:      Ethernet

Infiniband device 'mlx5_1' port 1 status:
        default gid:     fe80:0000:0000:0000:e09d:7303:0044:c6c4
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_2' port 1 status:
        default gid:     fe80:0000:0000:0000:e09d:7303:0044:c6c5
        base lid:        0x3
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_3' port 1 status:
        default gid:     fe80:0000:0000:0000:e09d:7303:0044:c6c6
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_4' port 1 status:
        default gid:     fe80:0000:0000:0000:e09d:7303:0044:c6c7
        base lid:        0x6
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand

Error Message & Behavior

I would expect latency to increase as the transfer size increases, but there seems to be some sort of two-tiered behavior where latency increases up to a certain point, then drops back down for larger sizes and increases again.
