Commits (65)
4dc402e
set init flag
lawirz Mar 18, 2024
a8d1619
Adapt to constructor interface changes and removal of ROCE
lawirz Mar 23, 2024
7b6a18b
initialize private design_ parameter
lawirz Mar 24, 2024
7486550
created generic(xrt+coyote) test script
lawirz Mar 24, 2024
7da1f6b
test-generic fix
lawirz Mar 25, 2024
5d37c0f
Set up runscripts
lawirz Apr 20, 2024
e4b0904
Neural net added to test cases + adapted runscript
lawirz Apr 23, 2024
b788417
Fixed initalization errors, added logging to python part
lawirz Apr 26, 2024
c607564
Fixed rank initialization errors
lawirz Apr 27, 2024
cb092f4
Small pg fix
lawirz Apr 27, 2024
84131ba
Redefining exchange_qp fixes most errors
lawirz May 1, 2024
9cc61bd
Added barriers, waits and removed get-retcode.
lawirz May 2, 2024
cb229c6
Restructured tests and incorporated torchvision test
lawirz May 3, 2024
6feceed
Simulator fix
lawirz May 14, 2024
a94a5e2
Introduced initialization helper functions
lawirz May 17, 2024
756e342
Refactored Scatter, Gather and Allgather
lawirz May 18, 2024
c7cbb7e
Refactored rest(send, recv, alltoall)
lawirz May 19, 2024
52b4433
Added MNIST test
lawirz May 25, 2024
6f95caa
Added test for multi-dimensional tensors
lawirz May 27, 2024
f930995
Compatibility changes for new dev branch
lawirz May 31, 2024
fd62eef
Adapted tests to not stop on error
lawirz Jun 4, 2024
f70049d
Added Resnet50 imagenet testcase
lawirz Jun 7, 2024
fe7e507
Added option for sidestepping using OpenMPI (gather, scatter, bcast)
lawirz Jun 16, 2024
e78e2fe
Added Performance measurements
lawirz Jun 16, 2024
3b9465d
Added Allgather and Allreduce Sidestep
lawirz Jun 19, 2024
fa3202a
Added Waiting to SIM
lawirz Jun 19, 2024
f9623d2
Updated README
lawirz Jun 23, 2024
c166a1f
Further README additions
lawirz Jun 25, 2024
75da95e
Added Benchmarking
lawirz Jun 28, 2024
d3a8d5c
Added multidimensional support for Gather and Allgather
lawirz Jun 28, 2024
d353764
MNIST fixes
lawirz Jun 29, 2024
fe43a32
Removed ACCL measured duration to avoid wait time
lawirz Jul 1, 2024
496c13b
Added multidim tensor segment for AlltoAll
lawirz Jul 1, 2024
90e80b2
Furter timestamping on bcast
lawirz Jul 1, 2024
f5a9508
Using new change buffer type to reuse buffers
lawirz Jul 21, 2024
1c422db
Added model saving to the resnet50 test
lawirz Jul 21, 2024
b9b2410
Switch to network_utils initialization
lawirz Jul 29, 2024
736c582
Enabled sidstepping of eager allreduce
lawirz Jul 29, 2024
b3eb6cb
Adapted PG to simplified ACCL interface
lawirz Jul 29, 2024
8665bc6
Disabled bcast and ar sidestepping. added ar segmenting
lawirz Aug 2, 2024
8730249
Added rounding to allreduce. dlrm works
lawirz Aug 3, 2024
c6782f4
Added more fine-grained benchmarking to ar and a2a
lawirz Aug 6, 2024
747961c
Added sidestep bcast with allreduce for resnet50
lawirz Aug 11, 2024
e8ffe0e
Replaced tensor copies with memcpy(solves performance issues)
lawirz Aug 11, 2024
8aaae98
Implemented constant message sizes for resnet50
lawirz Aug 18, 2024
3f00f33
Parametrized splitting
lawirz Aug 20, 2024
2533a4e
Improved Microbenchmarking
lawirz Aug 25, 2024
30e8feb
Added plot scripts
lawirz Aug 25, 2024
36efec4
Added ACCL device measurement to bench
lawirz Aug 25, 2024
39d00e5
Added sleep measurement
lawirz Aug 25, 2024
0e3cf79
Initialization fix and cleanup
lawirz Aug 26, 2024
98a7527
Cleanup of debug statements
lawirz Aug 29, 2024
16ea0f6
Added more naming to bench
lawirz Aug 30, 2024
4b91f0e
Added pytorch side bench to all collectives
lawirz Aug 30, 2024
c8cbb74
plotting support for all collectives
lawirz Aug 30, 2024
fcad6ca
attempt to fix resnet
lawirz Aug 30, 2024
9185e60
fixed segmentation bug
lawirz Aug 31, 2024
3424514
Added measurements to all collectives
lawirz Aug 31, 2024
9c4ef13
Attempt to replace tensor copies
lawirz Sep 1, 2024
ae9072a
Cleaned up test-generic
lawirz Sep 11, 2024
142a437
Main code cleanup
lawirz Sep 11, 2024
e95ac22
Removed old coyote initialization
lawirz Sep 11, 2024
8f52d63
Merge branch 'pytorch_ddp' of https://gitlab.ethz.ch/lawirz/accl into…
lawirz Sep 11, 2024
3ce5185
Removed some comments
lawirz Sep 11, 2024
17c8c56
cleanup + documentation
lawirz Sep 29, 2024
53 changes: 53 additions & 0 deletions integrations/pytorch_ddp/DEVELOPMENT.md
@@ -0,0 +1,53 @@
This document explains the current state of development and tries to document some of the decisions made.

## Structure

The integration consists of:

- wrapper, bindings and helper functionality in ./accl_process_group
- the main C++ files in ./src
- the ACCL repository that the process group builds on top of, in ./accl . It is replicated here so that you can try different versions
- test scripts in ./test

## Build process

Check the ./install.py helper for dependency versions

./setup.py sets up the build

See the README section on how to avoid the long pip build.

## Basics

- Currently only runs via Coyote RDMA. XRT and GPU support were dropped. The simulator still runs over XRT UDP, though
- Needs an MPI library to work (set in setup.py). Tested only with MPICH
- The test setup in run.sh is for the HACC cluster
- Use ACCL_DEBUG=1 both during builds and runs
- Everything runs in rendezvous mode
- If you call collectives directly, they run synchronously, but e.g. the allreduce used internally by DDP executes asynchronously (see the sketch after this list)
- The PG allocates 2 buffers and reuses them to avoid reallocation. This is supposed to be replaced with a host buffer constructor that takes an existing memory region. To change the buffer type you need to use the change_buffer_type branch (maybe already pulled in) at https://github.com/lawirz/ACCL
- The torch profiler can see the overall execution time, but setting it up to measure sub-operations within the worker thread was attempted and failed.
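
For illustration, a minimal Python sketch of that synchronous/asynchronous difference, assuming the ACCL process group has already been created and `torch.distributed` initialized as described in the README; the model and tensor shapes are arbitrary placeholders.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

t = torch.ones(8)
dist.all_reduce(t)                    # direct call: executes synchronously

model = DDP(torch.nn.Linear(8, 1))    # DDP buckets gradients for allreduce
out = model(torch.randn(4, 8)).sum()
out.backward()                        # DDP's internal allreduce here runs asynchronously
```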

## ProcessGroupACCL.cpp

### ProcessGroup structure

A lot of the design comes from ProcessGroupMPI. There is a concept of WorkEntries, which schedule work on a separate worker thread. This is currently done with a single worker thread, as in the MPI PG. There is still a lock, probably only relevant for a few management operations from the DDP side. With async execution in ACCL, we could try a different structure with AsyncWork, as is done in the Gloo PG, I think.
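
As a rough illustration of that pattern, here is a Python-only sketch; `WorkEntry`, `enqueue` and the queue are illustrative names, not the actual C++ types.

```python
import queue
import threading

class WorkEntry:
    """Illustrative stand-in for the C++ WorkEntry: a deferred collective."""
    def __init__(self, run):
        self.run = run                  # callable that executes the collective
        self.done = threading.Event()   # completion signal (cf. Work::wait)

work_queue: "queue.Queue[WorkEntry]" = queue.Queue()

def worker_loop():
    # Single worker thread, as in the MPI process group: entries execute in order.
    while True:
        entry = work_queue.get()
        try:
            entry.run()
        finally:
            entry.done.set()
            work_queue.task_done()

threading.Thread(target=worker_loop, daemon=True).start()

def enqueue(run) -> WorkEntry:
    entry = WorkEntry(run)
    work_queue.put(entry)
    return entry  # caller may entry.done.wait() for synchronous semantics
```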

### Collectives

- There are small wrappers that do a few checks (mostly copied from the MPI PG), perform the sidestep, and then set up the WorkEntry
- The WorkEntries manage the segmentation, which is not yet correctly implemented everywhere. Some collectives still use a version that relies on the input having a one-dimensional shape. Others that require multiple segmentations, such as Scatter, have similar limitations
- Input is copied to the pre-allocated buffer. Copies using memcpy generally seem to be much faster than tensor.copy_, for some reason
- ACCL does a host-to-host call. The driver figures out that it is host-to-host from the buffer type. The compressed type should be added as an argument to make that work again
- The result is copied back to the original tensor (the whole copy-in / call / copy-back flow is sketched below)
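
A minimal Python sketch of this flow, assuming pre-allocated staging buffers and a placeholder `accl_allreduce` callable; the real implementation is C++ and uses memcpy plus ACCL's buffer objects.

```python
import torch

# Pre-allocated, reused host staging buffers (sizes are arbitrary here).
in_buf = torch.empty(2**20, dtype=torch.float32)
out_buf = torch.empty_like(in_buf)

def run_allreduce(tensor: torch.Tensor, accl_allreduce) -> None:
    flat = tensor.reshape(-1)
    n = flat.numel()
    # Copy the input into the reused staging buffer (the C++ code uses memcpy,
    # which was observed to be faster than tensor.copy_).
    in_buf[:n].copy_(flat)
    # Host-to-host ACCL call on the staging buffers (placeholder callable).
    accl_allreduce(in_buf[:n], out_buf[:n])
    # Copy the result back into the caller's tensor.
    flat.copy_(out_buf[:n])
```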

## Hardware issues

A lot of collectives still fail in hardware. The following can produce issues:

- Mixing datatypes, especially ints
- High variability in length
- MPI sidestepping (can't explain why this causes issues)

If you run test-resnet50, you will encounter them.
36 changes: 36 additions & 0 deletions integrations/pytorch_ddp/README.md
@@ -12,6 +12,8 @@ python3 -m venv venv
source venv/bin/activate
```

Activate an XRT 21 version; later versions have led to issues in the past.

<details><summary>Installation without GPU support</summary>
To install the plugin without GPU support, simply run the following from within the venv:

@@ -42,8 +44,42 @@ source venv/bin/activate
</details>

## Running the plugin

Make sure to source the `setup.sh` script in this directory to load the ACCL plugin before starting a Python script.
Example usage can be found in the various test files under [`test/`](test).

Do make sure not to run python from within the root directory of `pytorch_ddp`, because Python will try to import the
local incomplete [`accl_process_group/`](accl_process_group) folder instead of the actual installation.

The provided `test/run.sh` will launch a test script via mpirun.
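
As a rough sketch of what a script launched this way might do (not a drop-in example: the `Rank` fields, IP addresses and rendezvous settings shown here are placeholders; consult the test scripts for working values):

```python
import os
import torch
import torch.distributed as dist
import accl_process_group as accl

rank = int(os.environ.get("RANK", "0"))

# Placeholder ranks; the Rank fields (ip, port, session id, buffer size) are assumptions.
ranks = [accl.Rank("10.0.0.1", 5500, 0, 1024),
         accl.Rank("10.0.0.2", 5500, 0, 1024)]

# Registers the "ACCL" backend (see accl_process_group/process_group_wrapper.py).
accl.create_process_group(ranks, accl.ACCLDesign.cyt_rdma, bufsize=1024)

# Requires the usual rendezvous environment (MASTER_ADDR, MASTER_PORT, ...).
dist.init_process_group(backend="ACCL", rank=rank, world_size=len(ranks))

t = torch.ones(16)
dist.all_reduce(t)   # runs synchronously when called directly
```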

## Setup overview

- The whole ProcessGroup is wrapped in OpenMPI, which is used for initialization
- You can use the OpenMPI implementation of certain collectives via the "sidestep" flags in ProcessGroupACCL.cpp
- Recompilation using `./install` or `pip install .` can be very slow; instead you can run `python setup.py build_ext --inplace` and then copy the binary or other files directly, e.g. `cp accl_process_group/_c/ProcessGroupACCL.cpython-38-x86_64-linux-gnu.so ~/.local/lib/python3.8/site-packages/accl_process_group/_c/`
- The `install.py` script will not reinstall the driver when ACCL is updated; you will need to rebuild it yourself
- Set `ACCL_DEBUG=1` if you want more output (also set it during the build). Stdout is sometimes incomplete (in the simulator), so it is best to log most things to stderr
- The run script currently just prints the command to run (better not to use the `&` at the end), which you then run manually. This is because of bad experiences with missing output (maybe coinciding with the issues mentioned above) and with termination on multiple machines; it should also work if you comment out the `exit 0` and the `&` at the end of mpirun. Don't forget that you should still run the script to clear the log files
- ACCL only supports sizes up to 4 MB. If you give it larger tensors, the PG will try to segment them along the first dimension (see the sketch after this list). Not all collectives correctly handle multi-dimensional tensors yet
- Setting up the simulator with 4 MB buffers takes a long time; set it lower for quick tests
- You can init the process group as if it were UDP and run on a `cyt_rdma` simulator
- There is no reason not to support the RDMA + SIM initialization; it just hasn't been implemented yet. Certain case-splits assume no-sim if cyt_rdma is given...
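
A hedged sketch of that first-dimension segmentation; the 4 MB limit comes from the note above, while the function name and chunking strategy are illustrative, not the PG's actual code.

```python
import torch

MAX_BYTES = 4 * 1024 * 1024  # ACCL message size limit mentioned above

def segment_first_dim(tensor: torch.Tensor):
    """Yield slices of `tensor` along dim 0 that each fit within MAX_BYTES."""
    assert tensor.dim() >= 1
    bytes_per_row = tensor[0].numel() * tensor.element_size()
    rows_per_chunk = max(1, MAX_BYTES // bytes_per_row)  # a single oversized row is not handled here
    for start in range(0, tensor.shape[0], rows_per_chunk):
        yield tensor[start:start + rows_per_chunk]

# Usage: run the collective on each chunk in turn, e.g.
# for chunk in segment_first_dim(t):
#     dist.all_reduce(chunk)
```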

### How to install torchvision

- Install torch using the script
- Clone vision and check out the matching version v0.16.0
- Clone libpng and configure it with the prefix set to a local directory
- Add its bin directory to the PATH
- Not sure if needed: supply the library and include paths to torchvision as in their development doc
- Disable the version check in the torchvision setup.py, because it doesn't correctly parse the version
- Run the vision setup.py with the debug, include, library and use-png flags

### Tests available
Check `test/run.sh` for ACCL_SCRIPT examples

- `test-generic.py` tests everything in isolation, plus a small two-layer model learning a linear function
- `test-mnist.py` should also be able to run non-distributed (check the arguments)
- `test-imagenet.py` fine-tunes Resnet50 following <https://docs.ray.io/en/latest/train/examples/pytorch/pytorch_resnet_finetune.html> and should also be able to run non-distributed
- For DLRM you will need a small fork of the DLRM repo with ACCL support, hosted at <https://gitlab.ethz.ch/lawirz/dlrm>. It contains a `run.sh`
2 changes: 1 addition & 1 deletion integrations/pytorch_ddp/accl_process_group/__init__.py
@@ -17,5 +17,5 @@

from ._c.ProcessGroupACCL import ProcessGroupACCL, Rank, DataType, ACCLDesign
from .process_group_wrapper import create_process_group, \
create_process_group_coyote, create_simulate_process_group, initialize, \
initialize, \
set_compression, get_compression, get_local_qp, set_remote_qp
109 changes: 28 additions & 81 deletions integrations/pytorch_ddp/accl_process_group/process_group_wrapper.py
@@ -19,117 +19,58 @@
from typing import Optional
from . import ProcessGroupACCL, Rank, DataType, ACCLDesign
import torch
import logging
from torch.distributed import Backend
from torch.distributed.distributed_c10d import ProcessGroup, Store

import sys
import os

process_group: Optional[ProcessGroupACCL] = None

#Configure logging
logger = logging.getLogger(__name__)
if "ACCL_DEBUG" in os.environ and os.environ["ACCL_DEBUG"]=="1":
logger.setLevel(logging.DEBUG)
else:
logger.setLevel(logging.WARNING)

def create_process_group(
ranks: list[Rank],
xclbin: str, device_index: int, design: ACCLDesign,
ranks: list[Rank], design: ACCLDesign,
*, nbufs: int = 16, bufsize: int = 1024,
compression: Optional[dict[DataType, DataType]] = None,
p2p_enabled: bool = False, profiling_ranks: Optional[list[int]] = None,
profiling_timeout: float = 0.0, rsfec: bool = False,
simulation: bool = False,
initialize: bool = True) -> ProcessGroup:
if design == ACCLDesign.cyt_rdma or design == ACCLDesign.cyt_tcp:
raise RuntimeError(f"{design} is an incompatible design for XRT")

if compression is None:
compression = {}
else:
# Copy compression since it will be used later in the lambda function
compression = compression.copy()

logger.debug(f'Compression: {compression}')

if profiling_ranks is None:
profiling_ranks = []
else:
profiling_ranks = profiling_ranks.copy()

def create_process_group_wrapper(store, rank, size, _timeout):
global process_group
if process_group is not None:
raise RuntimeError("ACCL ProcessGroup already created, "
"can only create one.")

pg = ProcessGroupACCL(store, rank, size, ranks, False, design,
xclbin=xclbin, device_index=device_index,
bufsize=bufsize, rsfec=rsfec, nbufs=nbufs,
compression=compression,
p2p_enabled=p2p_enabled,
profiling_ranks=profiling_ranks,
profiling_timeout=profiling_timeout)

process_group = pg
if initialize:
pg.initialize()

return pg

Backend.register_backend("ACCL", create_process_group_wrapper)

def create_simulate_process_group(ranks: list[Rank], *,
nbufs: int = 16, udp: bool = False,
compression: Optional[dict[DataType,
DataType]] = None,
bufsize: int = 1024,
initialize: bool = True) -> ProcessGroup:
if compression is None:
compression = {}
else:
# Copy compression since it will be used later in the lambda function
compression = compression.copy()
logger.debug(f'Profiling_ranks: {profiling_ranks}')

def create_process_group_wrapper(store, rank, size, _timeout):
global process_group
if process_group is not None:
raise RuntimeError("ACCL ProcessGroup already created, "
"can only create one.")

design = ACCLDesign.udp if udp else ACCLDesign.tcp
# if simulation:
#overwrite the design choice in simulation
# design = ACCLDesign.udp

pg = ProcessGroupACCL(store, rank, size, ranks, True, design,
compression=compression, nbufs=nbufs,
bufsize=bufsize)

process_group = pg
if initialize:
pg.initialize()

return pg

Backend.register_backend("ACCL", create_process_group_wrapper)

def create_process_group_coyote(
ranks: list[Rank], design: ACCLDesign,
*, nbufs: int = 16, bufsize: int = 1024,
compression: Optional[dict[DataType, DataType]] = None,
p2p_enabled: bool = False, profiling_ranks: Optional[list[int]] = None,
profiling_timeout: float = 0.0, rsfec: bool = False,
initialize: bool = False) -> ProcessGroup:
if design != ACCLDesign.cyt_rdma and design != ACCLDesign.cyt_tcp:
raise RuntimeError(f"{design} is an incompatible design for coyote")

if compression is None:
compression = {}
else:
# Copy compression since it will be used later in the lambda function
compression = compression.copy()

if profiling_ranks is None:
profiling_ranks = []
else:
profiling_ranks = profiling_ranks.copy()

def create_process_group_wrapper(store, rank, size, _timeout):
global process_group
if process_group is not None:
raise RuntimeError("ACCL ProcessGroup already created, "
"can only create one.")

pg = ProcessGroupACCL(store, rank, size, ranks, False, design,
logger.debug(f'Creating ProcessGroupACCL for: rank {rank}')

pg = ProcessGroupACCL(store, rank, size, ranks, simulation, design,
bufsize=bufsize, rsfec=rsfec, nbufs=nbufs,
compression=compression,
p2p_enabled=p2p_enabled,
@@ -138,31 +79,37 @@ def create_process_group_wrapper(store, rank, size, _timeout):

process_group = pg
if initialize:
logger.debug('Initializing Process Group')
pg.initialize()

return pg

Backend.register_backend("ACCL", create_process_group_wrapper)
#CPU only for now
logger.debug('Registering ACCL Backend')
Backend.register_backend("ACCL", create_process_group_wrapper, devices='cpu')

def initialize() -> None:
logger.debug('Initialize called')
if process_group is None:
raise RuntimeError("Cannot initialize before ACCL ProcessGroup "
"is created.")
process_group.initialize()

def get_local_qp(rank: int) -> list[int]:
logger.debug('Get_local_qp called')
if process_group is None:
raise RuntimeError("Cannot get local qp before ACCL ProcessGroup "
"is created.")
return process_group.get_local_qp(rank)

def set_remote_qp(rank: int, qp: list[int]) -> None:
logger.debug('Set_remote_qp called')
if process_group is None:
raise RuntimeError("Cannot set remote qp before ACCL ProcessGroup "
"is created.")
return process_group.set_remote_qp(rank, qp)

def set_compression(compression: dict[DataType, DataType]):
logger.debug(f'Setting compression to {compression}')
if process_group is None:
raise RuntimeError("Cannot set compression before ACCL ProcessGroup "
"is initialized.")
33 changes: 26 additions & 7 deletions integrations/pytorch_ddp/include/ProcessGroupACCL.hpp
@@ -266,17 +266,21 @@ class TORCH_API ProcessGroupACCL : public ProcessGroup {

void run_send(at::Tensor tensor, int dstRank, int tag);
void run_recv(at::Tensor tensor, int rcvRank, int tag);
void run_broadcast(at::Tensor tensor, const BroadcastOptions &opts);
void run_allreduce(at::Tensor tensor, const AllreduceOptions &opts);
void run_reduce(at::Tensor tensor, const ReduceOptions &opts);
void run_allgather(at::Tensor srctensor,
void run_broadcast(at::Tensor in_tensor, const BroadcastOptions &opts);
void run_allreduce(at::Tensor in_tensor, const AllreduceOptions &opts);
void run_reduce(at::Tensor in_tensor, const ReduceOptions &opts);
void run_allgather(at::Tensor in_tensor,
const std::vector<at::Tensor> &dsttensors);
void run_gather(at::Tensor srctensor,
void run_gather(at::Tensor in_tensor,
const std::vector<at::Tensor> &dsttensors,
const GatherOptions &opts);
void run_scatter(std::vector<at::Tensor> &srctensors, at::Tensor dsttensor,
void run_scatter(std::vector<at::Tensor> &in_tensors, at::Tensor dsttensor,
const ScatterOptions &opts);
void run_alltoall(at::Tensor srctensor, at::Tensor dsttensor, const AllToAllOptions &opts);

void run_alltoall(at::Tensor in_tensor, at::Tensor dsttensor, const AllToAllOptions &opts);

void run_alltoall_vec(std::vector<at::Tensor> &in_tensor_vec,
std::vector<at::Tensor> &out_tensor_vec, const AllToAllOptions &opts);

ACCL::dataType get_compressed_type(c10::ScalarType datatype);

@@ -292,6 +296,17 @@ class TORCH_API ProcessGroupACCL : public ProcessGroup {
// Global states
static void initACCLOnce();
static void acclExit();

void init_input_tensor(at::Tensor &tensor, std::unique_ptr<ACCL::BaseBuffer> &data, bool do_on_root, bool do_on_others, int opts_root_rank = 0);

void init_input_tensor_new(at::Tensor &tensor, ACCL::BaseBuffer *data, bool do_on_root, bool do_on_others, int opts_root_rank = 0);

void init_input_data_vec(std::vector<at::Tensor> &tensor_vec, std::unique_ptr<ACCL::BaseBuffer> &data, const at::TensorOptions &options, bool do_on_root, bool do_on_others, int opts_root_rank = 0);

void copy_back_tensor(at::Tensor tensor_original, std::unique_ptr<ACCL::BaseBuffer> &data, bool do_on_root, bool do_on_others, int opts_root_rank = 0);

void copy_back_tensorvec(const std::vector<at::Tensor> &dsttensorvec, std::unique_ptr<ACCL::BaseBuffer> &data, at::Tensor &dsttensor, int numel, int offset, bool do_on_root, bool do_on_others, int opts_root_rank = 0);

static std::once_flag onceFlagInitACCL;

static std::mutex pgGlobalMutex_;
@@ -309,6 +324,7 @@ class TORCH_API ProcessGroupACCL : public ProcessGroup {

ACCL::CoyoteDevice *cyt_device;
std::vector<fpga::ibvQpConn*> ibvQpConn_vec;
xrt::device xrt_device;

std::unique_ptr<ACCL::ACCL> accl;
uint64_t bufsize;
@@ -318,6 +334,9 @@ class TORCH_API ProcessGroupACCL : public ProcessGroup {
bool initialized;
xrt::bo buf0;
xrt::bo buf1;

std::unique_ptr<ACCL::BaseBuffer> in_buf;
std::unique_ptr<ACCL::BaseBuffer> out_buf;
};

} // namespace c10d
12 changes: 0 additions & 12 deletions integrations/pytorch_ddp/include/coyote_init.hpp

This file was deleted.
